ML | Supervised Learning Classification Techniques

Chandima Jayamina
5 min read · Sep 3, 2024

In the ever-evolving world of artificial intelligence, machine learning has emerged as a game-changer across various industries. Among its many applications, classification stands out as a fundamental technique that powers everything from spam filters in our email inboxes to medical diagnosis tools in healthcare.

But what exactly is classification in machine learning? How does it work, and why is it so pivotal in turning raw data into actionable insights? Whether you’re a data science enthusiast or a curious learner, this article will take you on a journey through the basics of machine learning classification. We’ll explore key concepts, popular algorithms, and real-world applications that demonstrate the immense potential of this powerful tool.

1.0 Logistic Regression

Logistic regression is a linear model used for binary classification. It predicts the probability that a given input belongs to a class by passing a linear combination of the features through a sigmoid function, which maps the result to a value between 0 and 1.
The model assumes a linear relationship between the features and the log-odds of the outcome.

from sklearn.linear_model import LogisticRegression
lr_model = LogisticRegression()
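
As a concrete illustration, here is a minimal sketch of fitting the model and reading off the sigmoid-mapped probabilities. The synthetic dataset and train/test split are assumptions made purely for demonstration, not part of the original example.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary dataset, used here only for demonstration
X, y = make_classification(n_samples=500, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

lr_model = LogisticRegression()
lr_model.fit(X_train, y_train)

# predict_proba returns P(class 0) and P(class 1) per sample,
# i.e. the sigmoid-mapped output of the linear model
print(lr_model.predict_proba(X_test[:3]))
print(lr_model.score(X_test, y_test))  # mean accuracy on the held-out split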

When to use :

  • Works well when the classes are linearly separable.
  • The problem involves binary classification.

Cons :

  • Struggles with non-linear relationships.
  • Can be less accurate when dealing with complex datasets.

2.0 Decision Tree

Decision trees split the data into branches based on feature values, making decisions at each node to classify the data.
The tree structure is easy to visualize, with leaves representing class labels and branches representing feature conditions.

from sklearn import tree
DTree = tree.DecisionTreeClassifier()
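
For a concrete picture of the branches and leaves, the following hedged sketch trains a shallow tree on the Iris dataset and prints the learned structure. The dataset and the max_depth value are illustrative choices, not recommendations from the article.

from sklearn import tree
from sklearn.datasets import load_iris

iris = load_iris()

# Capping the depth is a common way to limit overfitting in a single tree
DTree = tree.DecisionTreeClassifier(max_depth=3, random_state=42)
DTree.fit(iris.data, iris.target)

# export_text prints each split condition (branches) and the class predicted at each leaf
print(tree.export_text(DTree, feature_names=list(iris.feature_names)))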

When to use :

  • When data contains a mix of feature types.
  • Handles both numerical and categorical data and does not require feature scaling.

Cons :

  • Prone to overfitting, especially with deep trees.
  • Can be unstable, as small changes in the data can result in a completely different tree.

3.0 Random Forest

Random Forest is an ensemble method that builds multiple decision trees (usually hundreds) and combines their predictions to improve accuracy and reduce overfitting.
Each tree in the forest is trained on a random subset of the data, and the final prediction is made by averaging or majority voting.

from sklearn.ensemble import RandomForestClassifier
randf_model = RandomForestClassifier()
# Default n_estimators (number of trees) is 100 in current scikit-learn
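
To see the voting ensemble end to end, here is a minimal sketch; the breast cancer dataset and the parameter values are assumptions chosen purely for illustration.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Each tree is trained on a bootstrap sample with random feature subsets;
# the final class is decided by majority vote across the trees
randf_model = RandomForestClassifier(n_estimators=100, random_state=42)
randf_model.fit(X_train, y_train)

print(randf_model.score(X_test, y_test))     # held-out accuracy
print(randf_model.feature_importances_[:5])  # relative importance of the first five features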

When to use :

  • When data is complex with many features.
  • Handles large, high-dimensional datasets well.
  • Reduces variance by averaging many trees, giving high accuracy and robustness against overfitting.

Cons :

  • Less interpretable than a single decision tree.
  • Computationally more intensive compared to individual models like logistic regression.

4.0 Support Vector Machines (SVM)

SVMs work by finding the optimal hyperplane that maximally separates the classes in the feature space.
For non-linear data, SVM uses the kernel trick to project the data into a higher-dimensional space where it can be linearly separated.

from sklearn.svm import SVC
svc_model = SVC()
svc_model.fit(X_train, y_train) # Defaults: C=1.0, kernel='rbf', gamma='scale'

Gamma :

Gamma affects the shape of the decision boundary by controlling the influence range of a single training point. It’s more about how much a single data point affects the model.

  • Low Gamma: Leads to a more generalized model with a smoother decision boundary, potentially underfitting the data.
  • High Gamma: Leads to a more complex model with a tighter decision boundary around data points, which can cause overfitting.

C (Regularization Parameter) :

C affects the smoothness of the decision boundary by controlling the trade-off between misclassifications on the training set and model complexity. It’s more about the overall tolerance to misclassification errors.

  • Low C: Encourages a simpler decision boundary, allowing more misclassifications. This can result in underfitting but with better generalization to new data.
  • High C: Prioritizes classifying all training examples correctly, leading to a more complex decision boundary that might overfit the data.
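
Since neither parameter has an obvious best value up front, a small grid search over C and gamma is a common way to tune an RBF-kernel SVM. The sketch below is illustrative only: the dataset, the scaling step, and the grid values are assumptions rather than settings from the article.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# SVMs are sensitive to feature scale, so scaling is applied before the classifier
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

param_grid = {
    "svc__C": [0.1, 1, 10, 100],         # low C = smoother boundary, high C = fewer training errors
    "svc__gamma": [0.001, 0.01, 0.1, 1], # low gamma = broad influence, high gamma = tight boundary
}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)

print(search.best_params_)              # best C/gamma combination found by cross-validation
print(search.score(X_test, y_test))     # accuracy of the best model on the held-out split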

When to use :

  • When the classes are not linearly separable.

Cons :

  • Computationally intensive, especially with large datasets.
  • Choosing the right kernel and tuning parameters can be tricky.

Overall

  • Use Logistic Regression: If your problem is binary classification, you want a simple and interpretable model, and your data is linearly separable.
  • Use SVM: If you have a complex dataset, possibly non-linear, and you’re looking for a robust method that can handle outliers and high-dimensional data.
  • Use Decision Tree: If interpretability is crucial and your problem involves mixed feature types or requires a model that’s easy to visualize.
  • Use Random Forest: If you need a powerful and accurate model for a complex problem, especially when dealing with large datasets and when overfitting is a concern.

Choosing the right model often involves considering the trade-offs between interpretability, accuracy, computational cost, and the nature of the data. In practice, it’s common to try multiple models and select the one that performs best based on your evaluation metrics.
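
One way to run that comparison is with cross-validation, as in the hedged sketch below; the dataset and scoring metric are illustrative assumptions, and in a real project you would substitute your own data and evaluation criteria.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Candidate models; scaling is added where the algorithm is sensitive to feature scale
models = {
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "SVM (RBF)": make_pipeline(StandardScaler(), SVC()),
}

# 5-fold cross-validated accuracy for each candidate
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")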
