    A Simplified Comparative Study of Machine Learning Classifiers

    Abstract

Machine Learning is an emerging field that aims to make machines able to:

• make predictions about the future
• classify information so that people can make better decisions.

Several machine learning algorithms are available. An ML algorithm learns from past experience by analyzing historical data; once trained in this way, it is able to make predictions about future data [1]. Classification is a supervised machine learning approach in which given data points are assigned to classes that have already been defined. In other words, classification predicts the class of each given data point and labels it accordingly.

Several classification algorithms are available to classify given data, for example Logistic Regression, Naïve Bayes, K-Nearest Neighbors (KNN) and Decision Trees. In this paper, I have applied seven classification algorithms to the same dataset and compared the results using performance evaluation measures such as precision, accuracy, recall and the number of incorrect predictions.

    Introduction

Whenever a problem is to be solved using machine learning, it is not clear in advance which ML algorithm will perform best [2]. By observing the nature of the problem, one can easily identify the type of algorithm needed, whether regression or classification. Choosing the exact regression or classification algorithm that will outperform the others, however, is a difficult task. The only way to choose the best algorithm is to evaluate the performance of candidate algorithms in advance and then select the ones that perform well.

The main purpose of this paper is to simplify the task of choosing the best algorithm(s) for a given problem. Here seven classification ML algorithms are compared:

    1. Logistic Regression
    2. K-Nearest Neighbors
3. Support Vector Machine (SVM)
    4. Kernel SVM
    5. Naive Bayes
    6. Decision Trees
    7. Random Forest

The problem is a standard binary classification task based on the Loan Dataset for loan prediction. This dataset has 615 instances, 13 attributes and 2 classes.

    Related Work

Most of the research work dealing with classifier comparison falls into two main categories:

1. Relatively few classifiers are compared and validated in order to justify the need for a new approach (e.g. [3]-[7]).
2. Many classifiers are compared in a systematic way, both qualitatively and quantitatively.

For example, [8], [9] and [10] present qualitative analyses of many different classifiers, describing the advantages, limitations and disadvantages of each method, while [11] presents a quantitative analysis of classifiers.

Most researchers have conducted studies to find out which classifiers are more suitable for the problems under discussion (see e.g. [12]-[14]), but very few of these compare the performance of the classifiers in a quantitative way.

Moreover, these classifiers are mostly analyzed and compared on multiple datasets. This paper simplifies the analysis task by applying different classifiers to the same dataset, so that ML beginners can easily get an overview of how these algorithms work.

    Research Methodology

    Selected Dataset

I selected the loan classification dataset from Kaggle. This dataset has 615 instances, 13 attributes and 2 classes.

The dataset was originally provided by the Dream Housing Finance company, which deals in all kinds of home loans.

    Selected Classification Algorithms

I have used the following algorithms on the same loan dataset:

    1. Logistic Regression
    2. K-Nearest Neighbors
3. Support Vector Machine (SVM)
    4. Kernel SVM
    5. Naive Bayes
    6. Decision Trees
    7. Random Forest

Kaggle is a platform for analytics and predictive modeling that hosts data science competitions around the world, and it provides powerful tools and resources for testing projects. Python is used for the project code.

    Project Steps explained:

Step 1: Import the required libraries and load the dataset.
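
A minimal sketch of this step with pandas (the file name loan_data.csv is an assumption; the paper does not name the file):

```python
import pandas as pd

# Load the loan dataset; the file name is an assumption, not taken from the paper
df = pd.read_csv("loan_data.csv")

print(df.shape)   # expected: roughly (615, 13)
print(df.head())  # quick look at the first few records
```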

Step 2: Data preprocessing (filling missing values, converting non-numeric attributes to numeric, etc.).
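
One way this preprocessing might look; the imputation strategy (mode for categorical columns, median for numeric ones) is an assumption, since the paper does not state which method was used:

```python
from sklearn.preprocessing import LabelEncoder

# df is the DataFrame loaded in the previous sketch.
# Fill missing values: mode for categorical columns, median for numeric ones
# (the paper does not specify its imputation strategy; this is one common choice).
for col in df.columns:
    if df[col].dtype == "object":
        df[col] = df[col].fillna(df[col].mode()[0])
    else:
        df[col] = df[col].fillna(df[col].median())

# Convert non-numeric attributes to numeric codes with a label encoder
for col in df.select_dtypes(include="object").columns:
    df[col] = LabelEncoder().fit_transform(df[col])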

    Step 3: Visualizing the dataset
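
A possible sketch of this visualization step; the column names Loan_Status and LoanAmount follow the Kaggle loan dataset and are assumptions:

```python
import matplotlib.pyplot as plt

# df is the preprocessed DataFrame from the previous steps;
# the column names follow the Kaggle loan dataset and are assumptions
df["Loan_Status"].value_counts().plot(kind="bar", title="Class balance")
plt.show()

df["LoanAmount"].plot(kind="hist", bins=30, title="Loan amount distribution")
plt.show()
```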

Step 4: Splitting the dataset

This dataset has around 615 records. I used 80% of them to train the models and the remaining 20% to evaluate all of the chosen classifiers one by one. As the dataset has many columns, I trained the models on the most influential fields: the income fields, loan amount, loan duration and credit history.
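
A minimal sketch of the 80/20 split with scikit-learn; the exact column names and the random seed are assumptions based on the Kaggle loan dataset, not details given in the paper:

```python
from sklearn.model_selection import train_test_split

# Most influential fields mentioned in the paper; the exact column names are assumptions
features = ["ApplicantIncome", "CoapplicantIncome", "LoanAmount",
            "Loan_Amount_Term", "Credit_History"]
X = df[features]
y = df["Loan_Status"]

# 80% of the records for training, 20% held out for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
```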

Step 5: Applying the classifiers one by one. In this step I applied the classifiers listed above and then evaluated each model's performance using a confusion matrix.
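
A sketch of how the seven classifiers could be fitted and evaluated with scikit-learn; the hyperparameter settings are library defaults and therefore assumptions, since the paper does not report them:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, accuracy_score

# The seven classifiers compared in the paper; settings are scikit-learn
# defaults and therefore assumptions
classifiers = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "SVM (linear)": SVC(kernel="linear"),
    "Kernel SVM (RBF)": SVC(kernel="rbf"),
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
}

for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(name)
    print(confusion_matrix(y_test, y_pred))
    print("accuracy:", accuracy_score(y_test, y_pred))
```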

    Results

In classification problems, the predicted results can be compared with the actual results using a confusion matrix. In simple words, the confusion matrix gives the counts of correct and incorrect predictions.
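
As a small self-contained illustration of how the evaluation measures are read off a binary confusion matrix (the numbers below are a toy example, not results from the paper):

```python
from sklearn.metrics import (confusion_matrix, accuracy_score,
                             precision_score, recall_score)

# Toy example: actual vs. predicted loan decisions (1 = approved, 0 = rejected)
y_true = [1, 0, 1, 1, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 1]

# scikit-learn lays out the binary confusion matrix as [[TN, FP], [FN, TP]]
print(confusion_matrix(y_true, y_pred))   # [[2 1]
                                          #  [1 4]]
print(accuracy_score(y_true, y_pred))     # (TP + TN) / total = 6/8 = 0.75
print(precision_score(y_true, y_pred))    # TP / (TP + FP)   = 4/5 = 0.8
print(recall_score(y_true, y_pred))       # TP / (TP + FN)   = 4/5 = 0.8
```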

    Comparative Analysis

Bar charts of the comparison parameters from Table 2 are given below. Logistic Regression outperforms the other classifiers, as it has higher accuracy and a smaller number of incorrect predictions.

    Conclusion

Classification is the area of machine learning dealing with the categorization of given data into different classes.

A large number of classification algorithms exist, so it is a difficult and technical task (especially for ML beginners) to find out in advance which algorithm will solve a given machine learning problem effectively. I have applied seven different classifiers to a single dataset in order to get an idea of which classifier is best for this dataset.

Based on different quality measures such as accuracy, precision, recall and the number of incorrect predictions, I conclude that Logistic Regression fits this dataset best.

    References

1. Expert System Team (2020) What is Machine Learning? A definition. Machine Learning blog, 6 May 2020.
2. Aasim O. Machine Learning Project 17: Compare Classification Algorithms.
3. Yang J, Frangi AF, Yang JY, Zhang D, Jin Z (2005) KPCA plus LDA: a complete kernel Fisher discriminant framework for feature extraction and recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 27: 230–244.
4. Bezdek JC, Chuah SK, Leep D (1986) Generalized k-nearest neighbor rules. Fuzzy Sets and Systems 18: 237–256.
5. Seetha H, Narasimha MM, Saravanan R (2011) On improving the generalization of SVM classifier. Communications in Computer and Information Science 157: 11–20.
6. Fan L, Poh K-L (1999) Improving the Naïve Bayes classifier. Encyclopedia of Artificial Intelligence 879–883.
7. Tsang IW, Kwok JT, Cheung P-K (2005) Core vector machines: fast SVM training on very large data sets. Journal of Machine Learning Research 6: 363–392.
8. Jain AK, Duin RPW, Mao J (2000) Statistical pattern recognition: a review. IEEE Transactions on Pattern Analysis and Machine Intelligence 22: 4–37.
9. Wu X, Kumar V, Quinlan JR, Ghosh J (2007) Top 10 algorithms in data mining. Springer-Verlag.
10. Kotsiantis SB (2007) Supervised machine learning: a review of classification techniques. Informatica 31: 249–268.
11. Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7: 1–30.
12. Howell AJ, Buxton H (2002) RBF network methods for face detection and attentional frames. Neural Processing Letters 15: 197–211.
13. Darrell T, Indyk P, Shakhnarovich F (2006) Nearest-neighbor methods in learning and vision: theory and practice. MIT Press.
14. Huang L-C, Hsu S-E, Lin E (2009) A comparison of classification methods for predicting Chronic Fatigue Syndrome based on genetic data. Journal of Translational Medicine 7: 81.
