
How to choose a Machine Learning model for your problem?


Greetings fellas,
In this post we will discuss the question above.
 
Machine learning is part art and part science. When you look at machine learning algorithms, there is no one solution or one approach that fits all problems. Several factors can affect your choice of algorithm.

Some problems are very specific and require a unique approach. E.g. a recommender system is a very common type of machine learning application, but it solves a very specific kind of problem. Other problems are very open and need a trial-and-error approach. Supervised learning tasks such as classification and regression are very open: they can be used for anomaly detection, or to build more general sorts of predictive models.

Understand Your Data

The type and kind of data we have plays a key role in deciding which algorithm to use. Some algorithms can work with small sample sets while others require tons and tons of samples, and certain algorithms only work with certain types of data. So first, know your data:
  1. Look at summary statistics
  • Percentiles can help identify the range for most of the data
  • Averages and medians can describe central tendency
  • Correlations can indicate strong relationships
  2. Visualize the data
  • Box plots can identify outliers
  • Density plots and histograms show the spread of the data
  • Scatter plots can describe bivariate relationships
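The steps above can be sketched with pandas. This is a minimal example on a hypothetical toy dataset (the column names and values are made up for illustration):

```python
import pandas as pd

# Hypothetical toy dataset: house sizes (sq ft) and prices (in $1000s)
df = pd.DataFrame({
    "size":  [800, 950, 1100, 1300, 1500, 4000],   # 4000 looks like an outlier
    "price": [100, 120, 140, 165, 190, 210],
})

# Percentiles, averages, and medians in one call
print(df.describe())

# Correlations can indicate strong relationships
print(df.corr())
```

For the visual checks, `df.boxplot()`, `df.hist()`, and `df.plot.scatter(x="size", y="price")` produce box plots, histograms, and scatter plots from the same DataFrame.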
Clean your data
  1. Deal with missing values. Missing data affects some models more than others, and even models that handle missing data can be sensitive to it (missing data for certain variables can result in poor predictions).
  2. Choose what to do with outliers
  • Outliers can be very common in multidimensional data.
  • Some models are less sensitive to outliers than others. Tree models are usually less sensitive to their presence, but regression models, or any model that fits an equation, can definitely be affected by outliers.
  • Outliers can be the result of bad data collection, or they can be legitimate extreme values.
  3. Does the data need to be aggregated?
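A minimal sketch of the first two cleaning steps, again on made-up numbers: median imputation for a missing value, and the common 1.5 × IQR rule for flagging outliers (one of several possible rules).

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value and one extreme value
df = pd.DataFrame({"income": [35, 40, np.nan, 42, 38, 250]})

# 1. Missing values: impute with the median (robust to the extreme value)
df["income"] = df["income"].fillna(df["income"].median())

# 2. Outliers: flag points outside 1.5 * IQR of the quartiles
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
mask = (df["income"] < q1 - 1.5 * iqr) | (df["income"] > q3 + 1.5 * iqr)
print(df[mask])  # only the 250 row is flagged
```

Whether to drop, cap, or keep the flagged rows depends on whether they are bad data or legitimate extreme values, as noted above.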

Augment your data
  1. Feature engineering is the process of going from raw data to data that is ready for modeling. It can serve multiple purposes:
  • Make the models easier to interpret (e.g. binning)
  • Capture more complex relationships (e.g. neural networks)
  • Reduce data redundancy and dimensionality (e.g. PCA)
  • Re-scale variables (e.g. standardizing or normalizing)
  2. Different models may have different feature engineering requirements; some have feature engineering built in.
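Two of the purposes above, re-scaling and dimensionality reduction, can be sketched with scikit-learn. The data here is synthetic: two strongly correlated (redundant) features, so PCA should pack nearly all the variance into one component.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two strongly correlated features -> redundancy that PCA can remove
x = rng.normal(size=(200, 1))
X = np.hstack([x, 2 * x + 0.05 * rng.normal(size=(200, 1))])

X_scaled = StandardScaler().fit_transform(X)      # re-scale variables
pca = PCA(n_components=2).fit(X_scaled)
print(pca.explained_variance_ratio_)  # first component carries almost everything
```

Standardizing before PCA matters because PCA is driven by variance; an unscaled feature with large units would dominate the components.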
  
Categorize the problem. This is a two-step process.
    1. Categorize by input. If you have labelled data, it’s a supervised learning problem. If you have unlabelled data and want to find structure, it’s an unsupervised learning problem. If you want to optimize an objective function by interacting with an environment, it’s a reinforcement learning problem.
    2. Categorize by output. If the output of your model is a number, it’s a regression problem. If the output of your model is a class, it’s a classification problem. If the output of your model is a set of input groups, it’s a clustering problem.
Find the available algorithms. Now that you have categorized the problem, you can identify the algorithms that are applicable and practical to implement using the tools at your disposal. 
 
Implement all of them. Set up a machine learning pipeline that compares the performance of each algorithm on the dataset using a set of carefully selected evaluation criteria. The best one is automatically selected. You can either do this once or have a service running that does this in intervals when new data is added.
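A minimal sketch of such a comparison pipeline, using cross-validated accuracy on scikit-learn's built-in Iris dataset as the (illustrative) evaluation criterion — your own dataset and metric would go in its place:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Candidate algorithms identified in the previous step
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
}

# Score each candidate with 5-fold cross-validation
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in candidates.items()}
best = max(scores, key=scores.get)
print(scores, "-> best:", best)
```

Wrapping this loop in a scheduled job gives the "service running in intervals" variant mentioned above.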
 
Optimize hyperparameters (optional). Using cross-validation, you can tune each algorithm to optimize performance, if time permits. If not, manually selected hyperparameters will work well enough for the most part.
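Cross-validated tuning is exactly what scikit-learn's GridSearchCV does. A small sketch, again on the Iris dataset with an illustrative parameter grid:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Try several tree depths, scoring each with 5-fold cross-validation
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 3, 5, None]},
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```

For larger grids, `RandomizedSearchCV` samples the parameter space instead of exhaustively searching it, which is often a better use of limited time.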
 
Now, a short description of a few algorithms. I will describe them in more depth, with examples of each, in another blog post. For now, let's see which algorithm can be used when.
  1. Linear regression and linear classifiers. Despite their apparent simplicity, they are very useful on problems with a huge number of features, where more complex algorithms suffer from over-fitting.
  2. Logistic regression is the simplest non-linear classifier, combining a linear combination of parameters with a nonlinear function (the sigmoid) for binary classification.
  3. Decision trees often resemble people's decision processes and are easy to interpret. But they are most often used in ensembles such as Random Forest or Gradient Boosting.
  4. K-means is a simpler but very easy to understand algorithm that can be a perfect baseline in a variety of problems.
  5. PCA is a great choice to reduce the dimensionality of your feature space with minimum loss of information.
  6. Neural networks are a new era of machine learning algorithms and can be applied to many tasks, but training them requires huge computational resources.
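As a quick taste of item 4, here is K-means used as a baseline on synthetic data with three obvious groups (the data is generated, not real):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 3 well-separated groups
X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

# Fit K-means asking for 3 clusters
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)
```

In real problems the number of clusters is unknown, and picking it (e.g. with an elbow plot of inertia) is part of the work.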
Finally, we have a cheat sheet from the Scikit-learn documentation. They have developed a detailed answer to the question "How to decide when to use which algorithm, and on what dataset?"

(http://scikit-learn.org/stable/tutorial/machine_learning_map/index.html)
Go through the link above, where you will find this chart. It is a live chart, with each node linked to the corresponding algorithm's documentation.

Thank you for your time. Please comment your views.

 

