By Akshay
Greetings, folks. Here we will discuss the above-mentioned question: how do you choose a machine learning algorithm?
Machine learning is part art and part science. When you look at machine learning algorithms, there is no one solution or approach that fits all. Several factors can affect your decision when choosing a machine learning algorithm.
Some problems are very specific and require a unique approach. For example, a recommender system is a very common type of machine learning application, but it solves a very specific kind of problem. Other problems are very open and need a trial-and-error approach. Supervised learning tasks such as classification and regression are very open: they could be used for anomaly detection, or to build more general sorts of predictive models.
Understand Your Data
The kind of data we have plays a key role in deciding which algorithm to use. Some algorithms can work with smaller sample sets, while others require tons and tons of samples. Certain algorithms work with certain types of data. So first, know your data:
1. Look at summary statistics (see the sketch after this list)
   - Percentiles can help identify the range for most of the data
   - Averages and medians can describe central tendency
   - Correlations can indicate strong relationships
2. Visualize the data
   - Box plots can identify outliers
   - Density plots and histograms show the spread of the data
   - Scatter plots can describe bivariate relationships
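As a rough illustration of this first pass, here is a minimal sketch using pandas and matplotlib; the file name data.csv and the column names feature_a and feature_b are placeholders for your own data:

```python
import pandas as pd
import matplotlib.pyplot as plt

# "data.csv", "feature_a" and "feature_b" are placeholders for your own data
df = pd.read_csv("data.csv")

# Summary statistics: percentiles, averages, medians
print(df.describe())

# Correlations between numeric variables
print(df.corr(numeric_only=True))

# Box plots to spot outliers, histograms to show spread,
# and a scatter plot for a bivariate relationship
df.plot(kind="box", subplots=True)
df.hist(bins=30)
df.plot.scatter(x="feature_a", y="feature_b")
plt.show()
```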
Clean your data
1. Deal with missing values. Missing data affects some models more than others, and even models that handle missing data can be sensitive to it (missing data for certain variables can result in poor predictions).
2. Choose what to do with outliers (see the sketch after this list)
   - Outliers can be very common in multidimensional data.
   - Some models are less sensitive to outliers than others. Tree models are usually less sensitive to the presence of outliers, but regression models, or any model that relies on equations, can definitely be affected by them.
   - Outliers can be the result of bad data collection, or they can be legitimate extreme values.
3. Does the data need to be aggregated?
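Here is a rough sketch of these cleaning steps: median imputation for missing values, percentile clipping for outliers, and a group-by aggregation. The file and column names (data.csv, customer_id, amount) are hypothetical:

```python
import pandas as pd

df = pd.read_csv("data.csv")  # placeholder file name

# Missing values: inspect the damage, then impute numeric columns with the median
print(df.isna().sum())
num_cols = df.select_dtypes("number").columns
df[num_cols] = df[num_cols].fillna(df[num_cols].median())

# Outliers: one common option is to clip values to the 1st/99th percentiles
low = df[num_cols].quantile(0.01)
high = df[num_cols].quantile(0.99)
df[num_cols] = df[num_cols].clip(lower=low, upper=high, axis=1)

# Aggregation: e.g. roll row-level transactions up to one row per customer
# ("customer_id" and "amount" are hypothetical column names)
summary = df.groupby("customer_id")["amount"].agg(["mean", "sum", "count"])
```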
Augment your data
1. Feature engineering is the process of going from raw data to data that is ready for modeling. It can serve multiple purposes (see the sketch after this list):
   - Make the models easier to interpret (e.g. binning)
   - Capture more complex relationships (e.g. neural networks)
   - Reduce data redundancy and dimensionality (e.g. PCA)
   - Re-scale variables (e.g. standardizing or normalizing)
2. Different models may have different feature engineering requirements; some have feature engineering built in.
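A minimal sketch of three of these transformations (binning, re-scaling, PCA) with scikit-learn, using random numbers as a stand-in for real features:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer, StandardScaler
from sklearn.decomposition import PCA

X = np.random.rand(200, 10)  # stand-in for your raw numeric features

# Binning: discretize the first feature to make a model easier to interpret
binned = KBinsDiscretizer(n_bins=5, encode="ordinal",
                          strategy="quantile").fit_transform(X[:, :1])

# Re-scaling: standardize every feature to zero mean and unit variance
X_scaled = StandardScaler().fit_transform(X)

# Dimensionality reduction: keep enough components to explain 95% of variance
X_reduced = PCA(n_components=0.95).fit_transform(X_scaled)
print(X_reduced.shape)
```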
Categorize the problem. This is a two-step process.
- Categorize by input. If you have labelled data, it’s a supervised learning problem. If you have unlabelled data and want to find structure, it’s an unsupervised learning problem. If you want to optimize an objective function by interacting with an environment, it’s a reinforcement learning problem.
- Categorize by output. If the output of your model is a number, it’s a regression problem. If the output of your model is a class, it’s a classification problem. If the output of your model is a set of input groups, it’s a clustering problem.
Implement all of them. Set up a machine learning pipeline that compares the performance of each algorithm on the dataset using a set of carefully selected evaluation criteria, and automatically select the best one. You can either do this once or have a service running that repeats the comparison at intervals as new data is added.
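For example, here is a minimal sketch of such a comparison using scikit-learn's cross_val_score on its built-in breast cancer dataset; the candidate models and the default accuracy metric are illustrative choices, not a recommendation:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

candidates = {
    "logistic_regression": make_pipeline(StandardScaler(),
                                         LogisticRegression(max_iter=1000)),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(random_state=0),
}

# Score every candidate with 5-fold cross-validation and keep the best
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in candidates.items()}
best = max(scores, key=scores.get)
print(scores, "-> best:", best)
```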
Optimize hyperparameters (optional). Using cross-validation, you can tune each algorithm to optimize performance, if time permits. If not, manually selected hyperparameters will usually work well enough.
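A minimal tuning sketch with scikit-learn's GridSearchCV; the grid itself is illustrative, and a real one depends on your model and time budget:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)

# A small, illustrative grid; real grids depend on the model and the budget
param_grid = {"n_estimators": [100, 300], "max_depth": [None, 5, 10]}

search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```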
Now, a short description of a few algorithms. I will describe them in more depth in another blog post, with examples of each algorithm. But for now, we will see which algorithm can be used when.
- Linear regression and linear classifiers. Despite their apparent simplicity, they are very useful on problems with a huge number of features, where more complex algorithms suffer from over-fitting.
- Logistic regression is the simplest non-linear classifier, combining a linear combination of parameters with a nonlinear function (the sigmoid) for binary classification.
- Decision trees often resemble a person's decision process and are easy to interpret. But they are most often used in ensembles such as random forests or gradient boosting.
- K-means is a more basic, but very easy to understand, algorithm that can be a perfect baseline in a variety of problems (see the sketch after this list).
- PCA is a great choice for reducing the dimensionality of your feature space with minimal loss of information.
- Neural networks are a new era of machine learning algorithms and can be applied to many tasks, but training them requires huge computational resources.
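To make K-means and PCA concrete, here is a minimal sketch on scikit-learn's built-in iris dataset that chains PCA with K-means as a simple clustering baseline:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

# Reduce the 4 iris features to 2 principal components, then cluster
X_2d = PCA(n_components=2).fit_transform(X)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_2d)
print(labels[:10])
```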
(http://scikit-learn.org/stable/tutorial/machine_learning_map/index.html)
Thank you for your time. Please share your views in the comments.