
How to choose a Machine Learning model for your problem?


Greetings fellas,
Here we will discuss the above-mentioned question.
 
Machine learning is part art and part science. Among machine learning algorithms, there is no one solution or one approach that fits all problems. Several factors can affect your decision when choosing a machine learning algorithm.

Some problems are very specific and require a unique approach. For example, a recommender system is a very common machine learning application, and it solves a very specific kind of problem. Other problems are very open and need a trial-and-error approach. Supervised learning tasks such as classification and regression are very open: they could be used for anomaly detection, or they could be used to build more general sorts of predictive models.

Understand Your Data

The type and amount of data we have play a key role in deciding which algorithm to use. Some algorithms can work with small sample sets, while others require tons and tons of samples. Certain algorithms work only with certain types of data. So first, know your data (a short pandas sketch follows this list):
  1. Look at summary statistics
  • Percentiles can help identify the range for most of the data
  • Averages and medians can describe central tendency
  • Correlations can indicate strong relationships
  2. Visualize the data
  • Box plots can identify outliers
  • Density plots and histograms show the spread of data
  • Scatter plots can describe bivariate relationships
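
To make this concrete, here is a minimal exploratory sketch in Python using pandas and matplotlib. The file name data.csv and the column names feature_a and feature_b are placeholders for your own dataset.

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("data.csv")             # placeholder dataset

# Summary statistics: percentiles, averages, correlations
print(df.describe())                     # count, mean, std, min, 25/50/75%, max
print(df.median(numeric_only=True))      # medians for central tendency
print(df.corr(numeric_only=True))        # pairwise correlations

# Visualizations: outliers, spread, bivariate relationships
df.plot(kind="box")                      # box plots to spot outliers
df.hist(bins=30)                         # histograms for spread
df.plot(kind="scatter", x="feature_a", y="feature_b")
plt.show()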
Clean your data
  1. Deal with missing values. Missing data affects some models more than others. Even models that can handle missing data may be sensitive to it (missing data for certain variables can result in poor predictions). See the sketch after this list.
  2. Choose what to do with outliers
  • Outliers can be very common in multidimensional data.
  • Some models are less sensitive to outliers than others. Tree-based models are usually less sensitive to the presence of outliers, but regression models, or any model that relies on equations, can definitely be affected by them.
  • Outliers can be the result of bad data collection, or they can be legitimate extreme values.
  3. Decide whether the data needs to be aggregated.
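
A minimal cleaning sketch with pandas, continuing the df from the sketch above; the column name income is a placeholder, and the 1.5*IQR clipping rule is just one common heuristic, not the only option.

# Missing values: drop rows or impute, depending on the model
df = df.dropna(subset=["income"])                            # drop rows missing this column
# df["income"] = df["income"].fillna(df["income"].median())  # ...or impute with the median

# Outliers: clip values to the 1.5*IQR "whisker" range
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
df["income"] = df["income"].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)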

Augment your data
  1. Feature engineering is the process of going from raw data to data that is ready for modeling. It can serve multiple purposes (see the sketch after this list):
  • Make the models easier to interpret (e.g. binning)
  • Capture more complex relationships (e.g. neural networks)
  • Reduce data redundancy and dimensionality (e.g. PCA)
  • Re-scale variables (e.g. standardizing or normalizing)
  2. Different models may have different feature engineering requirements. Some have feature engineering built in.
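
A minimal feature-engineering sketch with pandas and scikit-learn, again with placeholder column names: binning for interpretability, standardizing for re-scaling, and PCA for dimensionality reduction.

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Binning: turn a continuous variable into interpretable categories
df["age_bin"] = pd.cut(df["age"], bins=[0, 18, 35, 60, 120],
                       labels=["child", "young", "adult", "senior"])

# Re-scaling: zero mean, unit variance
scaled = StandardScaler().fit_transform(df[["age", "income"]])

# Dimensionality reduction: keep enough components for 95% of the variance
reduced = PCA(n_components=0.95).fit_transform(scaled)
print(reduced.shape)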
  
Categorize the problem. This is a two-step process.
    1. Categorize by input. If you have labelled data, it’s a supervised learning problem. If you have unlabelled data and want to find structure, it’s an unsupervised learning problem. If you want to optimize an objective function by interacting with an environment, it’s a reinforcement learning problem.
    2. Categorize by output. If the output of your model is a number, it’s a regression problem. If the output of your model is a class, it’s a classification problem. If the output of your model is a set of input groups, it’s a clustering problem.
Find the available algorithms. Now that you have categorized the problem, you can identify the algorithms that are applicable and practical to implement using the tools at your disposal. 
 
Implement all of them. Set up a machine learning pipeline that compares the performance of each algorithm on the dataset using a set of carefully selected evaluation criteria. The best one is selected automatically. You can either do this once or have a service running that repeats it at intervals as new data is added. A sketch of such a pipeline follows.
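
Here is a minimal sketch of such a comparison pipeline with scikit-learn, using the built-in iris dataset as a stand-in for your own features and labels; the three models and the accuracy metric are illustrative choices, not recommendations.

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)       # stand-in for your prepared data

models = {
    "logistic": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(),
    "forest": RandomForestClassifier(),
}

results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    results[name] = scores.mean()
    print(f"{name}: {results[name]:.3f}")

best = max(results, key=results.get)    # the best one is selected automatically
print("best model:", best)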
 
Optimize hyperparameters (optional). Using cross-validation, you can tune each algorithm to optimize performance, if time permits. If not, manually selected hyperparameters will usually work well enough. A short tuning sketch follows.
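
A minimal tuning sketch with scikit-learn's GridSearchCV, continuing the X and y from the comparison sketch above; the grid values are illustrative, not recommendations.

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

param_grid = {"n_estimators": [100, 300], "max_depth": [None, 5, 10]}
search = GridSearchCV(RandomForestClassifier(), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)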
 
Now, a brief description of a few algorithms. I will describe them in more depth in another blog post, with examples of each algorithm. For now, let us see which algorithm can be used when.
  1. Linear regression and linear classifiers. Despite their apparent simplicity, they are very useful when you have a huge number of features, where more complex algorithms suffer from over-fitting.
  2. Logistic regression is the simplest non-linear classifier, combining a linear combination of parameters with a nonlinear function (the sigmoid) for binary classification (see the sketch after this list).
  3. Decision trees often mirror people's decision processes and are easy to interpret. But they are most often used in ensembles such as Random Forest or Gradient Boosting.
  4. K-means is a more basic but very easy-to-understand algorithm that can be a perfect baseline in a variety of problems.
  5. PCA is a great choice for reducing the dimensionality of your feature space with minimal loss of information.
  6. Neural networks are a new era of machine learning algorithms and can be applied to many tasks, but training them comes at a huge computational cost.
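
To make point 2 concrete, here is a minimal sketch of what logistic regression computes: a linear combination of the inputs squashed through the sigmoid. The weights, bias, and input here are illustrative numbers, not learned values.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))     # squashes any real number into (0, 1)

w = np.array([0.8, -1.2])               # illustrative weights
b = 0.5                                 # illustrative bias
x = np.array([1.0, 2.0])                # one input sample

p = sigmoid(w @ x + b)                  # probability of the positive class
print("P(y=1|x) =", p)                  # classify as 1 if p >= 0.5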
Finally, we have a cheat sheet from the Scikit-Learn documentation. They have developed a detailed answer to the question "How to decide when to use which algorithm, and on what dataset?"

(http://scikit-learn.org/stable/tutorial/machine_learning_map/index.html)
Go through the link above; there you will find this chart. It is an interactive chart, with each node linked to the algorithm's documentation.

Thank you for your time. Please comment your views.

 

