Performance Analysis of KNN on Different Types of Attributes

R. Nancy Beaulah
Date Published: 25 February 2018 | Page No.: 23628-23631

Abstract

Data mining is a promising interdisciplinary field, closely tied to machine learning, that focuses on extracting information useful for high-level decisions. Data miners evaluate and filter data, convert it into useful information, and then turn that information into knowledge by applying suitable techniques. The k-nearest neighbor (k-NN) algorithm is an instance-based learning method that is widely used in pattern classification tasks because of its simplicity, effectiveness, and robustness. This paper compares the performance of the k-NN algorithm on several data sets that contain different types of attributes. The results were obtained using the WEKA tool.

Literature Review

In [1], Tina R. Patil and S. S. Sherekar made a comparative evaluation of the Naïve Bayes and J48 classifiers. Performance was compared on the basis of classification accuracy, sensitivity, and specificity. In [2], Satish Kumar David, Amr T.M. Saeb, and Khalid Al Rubeaan compared the J4.8 decision tree classification algorithm, a Bayesian network, and the Naïve Bayes algorithm in medical bioinformatics. The evaluation was based on accuracy, learning time, and error rate. In [3], Mahendra Tiwari, Manu Bhai Jha, and OmPrakash Yadav proposed a methodology for comparing the accuracy of different data mining algorithms on various data sets. The performance analysis depends on many factors, including the test mode, the nature of the data sets, and the size of the data set. In [4], Yogita Rani, Manju, and Harish Rohil used the BIRCH and CURE data mining algorithms for a comparative analysis on the Iris Plant data set. In [5], Kavitha C.R and Mahalekshmi T used a toxicity data set of aliphatic carboxylic acids to compare five classification algorithms and identify the one that gives the most accurate result.

Proposed Methodology

For this study, the experiments and observations are carried out using the data mining tool WEKA (Waikato Environment for Knowledge Analysis), developed by the University of Waikato, New Zealand. WEKA supports many data mining tasks, such as data preprocessing, classification, clustering, regression, and visualization. The workflow of WEKA is as follows:

Figure: Workflow of WEKA.

In this paper, the k-nearest neighbor classification algorithm is tested on six data sets with different types of attributes. Before classification, the data sets are preprocessed to clean the data and to remove noise and inconsistency. In WEKA, missing values in a data set are handled with the ReplaceMissingValues filter. The k-nearest neighbor algorithm is run using WEKA's IBk classifier with k = 5 neighbors. The test mode is percentage split: 50% of each data set is used as the training set and the remaining 50% as the test set.
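The procedure described above (a 50% percentage split followed by k-NN classification with k = 5) can be illustrated with a minimal Python sketch. This is not WEKA's IBk implementation: the function names `percentage_split`, `knn_predict`, and `accuracy`, the plain Euclidean distance, the shuffling seed, and the toy data are all assumptions made for illustration only.

```python
import math
import random
from collections import Counter


def percentage_split(rows, train_fraction=0.5, seed=1):
    """Shuffle the rows and split them into a training set and a test set."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def euclidean(a, b):
    """Euclidean distance between two numeric feature tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train, query, k=5, distance=euclidean):
    """Return the majority class among the k training rows nearest to query."""
    nearest = sorted(train, key=lambda row: distance(row[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def accuracy(train, test, k=5):
    """Fraction of test rows whose predicted class matches the true class."""
    correct = sum(knn_predict(train, feats, k) == label for feats, label in test)
    return correct / len(test)


# Hypothetical toy data: two well-separated groups on a line (not from the paper).
data = [((i * 0.1, 0.0), "a") for i in range(40)]
data += [((10 + i * 0.1, 0.0), "b") for i in range(40)]

train, test = percentage_split(data, train_fraction=0.5)
print(len(train), len(test), accuracy(train, test, k=5))
```

Each row is a `(features, label)` pair, so a different distance function can be passed to `knn_predict` without changing the voting logic.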

2.1 Data Selection:

The data sets used in this paper were collected from the UCI Machine Learning Repository. They differ in their attributes and number of instances. A complete description of the data sets is given in Table 1.

Table 1 – Description of the data sets

| S. No | Data Set | No. of Instances | No. of Attributes | Attribute Types |
|---|---|---|---|---|
| 1 | Auto Imports Database | 205 | 26 | 15 Continuous, 1 Integer, 10 Nominal |
| 2 | Ionosphere database | 351 | 34 | 34 Continuous |
| 3 | King+Rook versus King+Pawn (kr-vs-kp) Data set | 3196 | 36 | 36 Discrete |
| 4 | Letter Image Recognition Data Set | 20000 | 16 | 16 Integer |
| 5 | Mushroom Database | 8124 | 22 | 22 Nominal |
| 6 | Vehicle Silhouettes Data Set | 846 | 18 | 18 Real |
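Because the data sets mix continuous, integer, and nominal attributes, a nearest-neighbor classifier needs a distance that can handle all of them. The sketch below shows one common convention for instance-based learners: numeric differences are scaled by the attribute's range, and nominal values contribute 0 on a match and 1 on a mismatch. The source does not state which distance was configured in the experiments, so the function `mixed_distance`, its `numeric_ranges` parameter, and the example rows are hypothetical.

```python
import math


def mixed_distance(a, b, numeric_ranges):
    """Distance between two rows that mix numeric and nominal values.

    numeric_ranges maps a column index to its (min, max) in the training
    data; columns missing from the mapping are treated as nominal.
    """
    total = 0.0
    for i, (x, y) in enumerate(zip(a, b)):
        if i in numeric_ranges:
            lo, hi = numeric_ranges[i]
            span = (hi - lo) or 1.0          # guard against constant columns
            diff = (x - y) / span            # numeric difference scaled by range
        else:
            diff = 0.0 if x == y else 1.0    # nominal: match or mismatch
        total += diff * diff
    return math.sqrt(total)


# Hypothetical rows (price, number of doors, fuel type); not taken from the paper.
ranges = {0: (5000.0, 45000.0), 1: (2, 4)}
car_a = (15000.0, 4, "gas")
car_b = (25000.0, 2, "diesel")
print(mixed_distance(car_a, car_b, ranges))
```

Scaling by range keeps an attribute with large raw values, such as a price, from dominating attributes measured on small scales.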

Experimental Works and Results

An experimental comparison of the k-NN classification technique on six different data sets was carried out in WEKA. The data sets contain different data types and varying numbers of attributes. The computed results of k-NN on the six data sets are listed in Table 2, the accuracy of k-NN in Table 3, and the error comparison in Table 4.

Table 2 – Results of k-NN

| S. No | Data Set | Correctly Classified (count) | Correctly Classified (%) | Incorrectly Classified (count) | Incorrectly Classified (%) | Kappa Statistic |
|---|---|---|---|---|---|---|
| 1 | Auto Imports Database | 54 | 52.9412 | 48 | 47.0588 | 0.3865 |
| 2 | Ionosphere database | 145 | 82.8571 | 30 | 17.1429 | 0.6044 |
| 3 | kr-vs-kp Data set | 1491 | 93.3041 | 107 | 6.6959 | 0.8652 |
| 4 | Letter Image Recognition Data Set | 9274 | 92.74 | 726 | 7.26 | 0.9245 |
| 5 | Mushroom Database | 4059 | 99.9261 | 3 | 0.0739 | 0.9985 |
| 6 | Vehicle Silhouettes Data Set | 277 | 65.4846 | 146 | 34.5154 | 0.5398 |
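The percentages in Table 2 follow directly from the counts: accuracy is the number of correctly classified instances divided by the size of the test set. The short Python sketch below recomputes them from the counts in Tables 1 and 2 and also shows what share of each data set was used for testing. The helper name `summarize` and the shortened data set labels are illustrative choices, not part of the original study.

```python
# (data set, total instances from Table 1, correctly classified, incorrectly classified)
results = [
    ("Auto Imports", 205, 54, 48),
    ("Ionosphere", 351, 145, 30),
    ("kr-vs-kp", 3196, 1491, 107),
    ("Letter Image Recognition", 20000, 9274, 726),
    ("Mushroom", 8124, 4059, 3),
    ("Vehicle Silhouettes", 846, 277, 146),
]


def summarize(total, correct, incorrect):
    """Return (test set size, accuracy in percent, share of data used for testing)."""
    tested = correct + incorrect
    return tested, 100.0 * correct / tested, tested / total


for name, total, correct, incorrect in results:
    tested, acc, share = summarize(total, correct, incorrect)
    print(f"{name:26s} tested={tested:5d} accuracy={acc:8.4f}% test share={share:.3f}")
```

For every data set the test share comes out at roughly one half, which is consistent with the 50% percentage split described in the methodology.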

Table 3 – Accuracy

| Data Set | Accuracy (%) |
|---|---|
| Auto Imports Database | 52.9 |
| Ionosphere database | 82.9 |
| kr-vs-kp Data set | 93.3 |
| Letter Image Recognition Data Set | 92.7 |
| Mushroom Database | 99.9 |
| Vehicle Silhouettes Data Set | 65.5 |

Table 4 – Error Comparison

| S. No | Data Set | Error | Error (%) |
|---|---|---|---|
| 1 | Auto Imports Database | 0.1525 | 68.42 |
| 2 | Ionosphere database | 0.1957 | 42.43 |
| 3 | kr-vs-kp Data set | 0.1771 | 35.49 |
| 4 | Letter Image Recognition Data Set | 0.0094 | 12.74 |
| 5 | Mushroom Database | 0.0007 | 0.13 |
| 6 | Vehicle Silhouettes Data Set | 0.1982 | 52.76 |

The performance of the k-NN classifier is evaluated using the TP (true positive) rate, FP (false positive) rate, TN (true negative) rate, and FN (false negative) rate. The TP rate is the proportion of positive cases that were correctly identified. The FP rate is the proportion of negative cases that were incorrectly classified as positive. The TN rate is the proportion of negative cases that were classified correctly. The FN rate is the proportion of positive cases that were incorrectly classified as negative. Based on these values, Figure 1 compares the correctly classified instances with the incorrectly classified instances. Figure 2 shows the accuracy of k-NN on the six data sets. Figure 3 shows the kappa statistic values, and Figure 4 shows the error comparison results.
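The four rates defined above, together with the kappa statistic reported in the results, can be computed from a confusion matrix. The Python sketch below uses the standard definition of Cohen's kappa (observed agreement corrected for chance agreement). The function names `binary_rates` and `cohen_kappa` and the example counts are hypothetical and are not taken from the paper's experiments.

```python
def binary_rates(tp, fp, tn, fn):
    """Return the TP, FN, TN and FP rates for a two-class confusion matrix."""
    positives = tp + fn   # all actual positive cases
    negatives = tn + fp   # all actual negative cases
    return {
        "tp_rate": tp / positives,
        "fn_rate": fn / positives,
        "tn_rate": tn / negatives,
        "fp_rate": fp / negatives,
    }


def cohen_kappa(matrix):
    """Cohen's kappa for a square confusion matrix (rows: actual, columns: predicted)."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(n)) / total
    # Chance agreement: product of row and column totals for each class.
    expected = sum(
        sum(matrix[i]) * sum(row[i] for row in matrix) for i in range(n)
    ) / (total * total)
    return (observed - expected) / (1 - expected)


# Hypothetical counts for illustration only.
tp, fp, tn, fn = 40, 10, 45, 5
print(binary_rates(tp, fp, tn, fn))
print(cohen_kappa([[tp, fn], [fp, tn]]))
```

A kappa near 1 indicates agreement well beyond chance, while a value near 0 indicates agreement no better than chance, which is why it complements raw accuracy on data sets with unbalanced classes.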

Figure 1. Correctly classified versus incorrectly classified instances.

Figure 2. Accuracy of k-NN on the six data sets.

Figure 3. Kappa statistic values.

Figure 4. Error comparison.

Conclusion

The results above show that the highest accuracy (99.9%) is achieved on the Mushroom data set, which contains only nominal attributes. The next highest accuracies are achieved on the data sets with discrete attributes (kr-vs-kp, 93.3%) and integer attributes (Letter Image Recognition, 92.7%). The error comparison also shows the lowest values for the Mushroom data set. In this experimental comparison, therefore, the k-NN algorithm performs best on nominal attributes.

References

  1. Vanitha P, Mayilvaganan M. Survey on Meteorological Weather Analysis Based on Naïve Bayes Classification Algorithm. Jan 2016.
  2. Kumar Yugal. Analysis of Bayes, Neural Network and Tree Classifier of Classification Technique in Data Mining using WEKA. 2012.
  3. Tiwari Mahendra. Performance Analysis of Data Mining Algorithms in Weka. 2012: 32-41.
  4. Comparing EM Clustering Algorithm with Density Based Clustering Algorithm Using WEKA Tool. Jul 2016: 1199-1201.
  5. Ahmed Kawsar, Jesmin Tasnuba. Comparative Analysis of Data Mining Classification Algorithms in Type-2 Diabetes Prediction Data Using WEKA Approach. Oct 2014.

Copyrights & License

International Journal of Engineering and Computer Science, 2018.

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Article Details


Issue: Vol 7 No 02 (2018)
Page No.: 23628-23631
Section: Articles

How to Cite

Beaulah, R. (2018). Performance Analysis of KNN on Different Types of Attributes. International Journal of Engineering and Computer Science, 7(02), 23628-23631. Retrieved from http://ijecs.in/index.php/ijecs/article/view/3967
