A Definitive Guide to the Data Science Workflow

Aspire Thought Leadership! Ever wondered about the data science workflow? Find out what has changed with the data science workflow in the current age.

The data science workflow in machine learning covers a lot of ground, so to get started we are going to walk through the pieces that make up the workflow of a machine learning project. This post takes you through all of the steps needed to build a proper introductory machine learning project from scratch, whatever your end goals are.

Data science workflow

Along with this, we will cover other valuable aspects of the process, including feature engineering, feature exploration, data cleaning, and data pre-processing, and look at the impact each one has on the performance of a machine learning model. We will also look at some pre-modeling steps you can take to improve model performance.

Now, before we start, we need a few Python libraries in place to help with these tasks. For the work covered in this post, the Python libraries Matplotlib, scikit-learn, pandas, and NumPy are the best tools for the job.
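To set the stage, here is a minimal set of imports you might start a script or notebook with; it assumes the libraries are already installed (for example via pip install numpy pandas scikit-learn matplotlib):

```python
import numpy as np               # numerical arrays and math
import pandas as pd              # tabular data loading and cleaning
import matplotlib.pyplot as plt  # plotting for data exploration

# scikit-learn supplies the pre-processing tools and models used later on.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
```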

Understanding the Data science workflow of machine learning

We can define the machine learning workflow in five main stages:
  1. Gathering the data we want to use
  2. Pre-processing the data
  3. Researching the model that best fits the type of data we have
  4. Training and testing the model
  5. Evaluating the model
With those in mind, let's look at each stage and how it is used in machine learning. But first, we should look at what a machine learning model is in the first place. We have mentioned models a few times already in this guide, but what are they, and why would we want to work with them?

You will find that a machine learning model is simply a piece of code that a data scientist or engineer makes smart through training. So, if you are not careful with the training you provide, and you feed the model weak, garbage data, you will get garbage out: false or wrong predictions.

This means that when you train the model you want to use, you should make sure the data you present to it is as strong and valuable as possible. This gives you the best chance (at least better than other approaches) that your model will learn the right things and give you accurate results.

Data science workflow: Gathering data

Now that we have this information in mind, it is time to look at the stages of the machine learning workflow, and the first step is gathering the data we want to use. We cannot decide which machine learning algorithms to apply, or what kind of prediction we want to make, until we first stop and gather some data.

The process of gathering data depends on the type of project we are aiming for. For example, if we want a machine learning project that relies on real-time data, we can build an IoT system that collects data from different sensors. More generally, a data set can be collected from various sources: a file, a database, a sensor, and so on.

Before we jump in, though, remember that collected data cannot be used directly for analysis. That would be a bad idea, because important data could be missing, the set could include some extremely large values, and the data could be disorganized (if it is in text form) or noisy. As you can see, there is a lot of potential for problems, and feeding these into the algorithm makes it less reliable.

So, how do we fix this? There is a straightforward process called data preparation that helps, which we cover in the next section. It is also possible to use some of the free data sets available online. The UCI Machine Learning Repository and Kaggle are the two most common repositories for building machine learning models. Kaggle is one of the most visited, and it is a great place to practice with the different algorithms we talk about later on. It also runs competitions where you can participate and test your machine learning knowledge.
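As a quick illustration, here is a minimal sketch of loading a downloaded data set with pandas and taking a first look at it; the file name "customers.csv" is a hypothetical placeholder, not a data set referenced in this post:

```python
import pandas as pd

# Load a data set downloaded from Kaggle or the UCI repository.
# "customers.csv" is a made-up file name used for illustration.
df = pd.read_csv("customers.csv")

# A quick first look at what we collected.
print(df.shape)         # number of rows and columns
print(df.head())        # the first few records
print(df.isna().sum())  # missing values per column
```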

Data science workflow: Data Pre-processing

Once you have figured out the kind of data you would like to use in your model, it is time to go through data pre-processing. This is one of the most important steps in machine learning, and the one that matters most for building accurate models. In machine learning, we often cite the 80/20 rule: as data scientists, we spend 80 percent of our time pre-processing the data we want to use in the model, and 20 percent actually performing the analysis.

To help us out here, let's define the term. Data pre-processing is the process of cleaning up the raw data you bring in. Raw data simply refers to data collected out in the real world, from a variety of sources, before it has been converted into a clean data set. Whenever data is gathered from different sources, it arrives in a raw format, and this data, while useful, is not feasible to run your analysis on directly.

To make that raw data work well with our analysis, and to ensure it behaves the way we want, we need to convert it into a smaller, clean data set. This part of the process is known as data pre-processing.

As we just discussed, data pre-processing means taking our raw data and cleaning it up so it is easier to use in the machine learning model. We definitely need to take our time with this part of the process: it is what lets us actually use the data we have, and it ensures we get good results from the applied model in both deep learning and machine learning projects.

Because the data you collect comes from the real world, it tends to be messy, incomplete, and full of other issues. If we fed it straight into our machine learning model, the model would be inaccurate and riddled with problems. Some of the messy types of data to watch out for, and which data pre-processing helps us solve, include:
  • Missing data: found when data is not recorded continuously, or when technical issues occur in the application.
  • Noisy data: data that often goes by the name of outliers. It usually arises from human error, such as when the person gathering the data does so manually, or from a technical glitch in the device collecting the data.
  • Inconsistent data: data affected by human errors, such as mistakes in the values or names entered, or data that has been duplicated for some reason.
As you work through data pre-processing, you will find three different types of data to handle. The first is numeric, which includes options like income and age. The second is categorical, which includes things like nationality and gender. And the third is ordinal, which includes ranked values like low, medium, and high.

Now, the next question is how we perform data pre-processing in the first place. There are a few methods and options you can use; some of the basic ones for converting the raw data you collected include the following (a short code sketch follows the list):
  1. Conversion of data. Since machine learning models can handle only numeric features, we have to convert ordinal and categorical data into numeric features in some manner. Feeding categorical and ordinal values directly into the model will cause problems.
  2. Ignoring the missing values. Sometimes your data will have missing parts. In this method, we remove the affected column or row, depending on what makes the most sense for your needs. This method is efficient if you have only a few missing values, but do not use it if the data set is missing a lot of values.
  3. Filling in the missing values. Here, when we come across missing data in our set, we fill it in manually. The easiest way is to insert the mean, the median, or the most frequent value found in our data set.
  4. Machine learning. If our raw data is missing some parts, we can predict what should be present in the empty part, using the data already found in our data set.
  5. Outlier detection. Some erroneous data in our set may deviate drastically from the other observations. This method detects such values and often removes them, so they do not distort the rest of the information in our data set.
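To make these methods concrete, here is a minimal pandas sketch. It assumes a hypothetical DataFrame df with an ordinal risk column, a categorical country column, and a numeric income column; real column names and thresholds will vary by project:

```python
import pandas as pd

# 1. Conversion of data: map ordinal values to numbers, one-hot encode categories.
df["risk"] = df["risk"].map({"low": 0, "medium": 1, "high": 2})
df = pd.get_dummies(df, columns=["country"])

# 2. Ignoring missing values: drop rows that are missing too many fields.
df = df.dropna(thresh=len(df.columns) - 2)

# 3. Filling in missing values: impute the numeric column with its mean.
df["income"] = df["income"].fillna(df["income"].mean())
# 4. Model-based imputation (e.g. scikit-learn's KNNImputer) is an alternative.

# 5. Outlier detection: drop values more than 3 standard deviations from the mean.
z = (df["income"] - df["income"].mean()) / df["income"].std()
df = df[z.abs() <= 3]
```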
Data science workflow: Researching the best model for your data

Once we have worked through data pre-processing, it is time to do a bit of research on which machine learning model to use. Our main goal here is to train the best-performing model possible using the data we have pre-processed. There are a few different categories of learning to choose from, and they include the following.

Supervised learning. In supervised learning, the AI system is presented with labeled data, which means each data point is already tagged with the label it is supposed to have. Supervised learning can be categorized into two further parts, classification and regression, and each works in a slightly different manner.

First, let's look at classification. Classification problems arise when the target variable is categorical, meaning the output can be sorted into different classes. You can have as many classes as you like, but each data point will belong to one of them, such as Class A or Class B. A classification problem is one where the output variable is a category of some kind, such as spam versus not spam, or disease versus no disease.

There are many different algorithms you can use for classification. These include Logistic Regression, Support Vector Machines, Decision Trees and Random Forests, Naïve Bayes, and K-Nearest Neighbors.
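As a minimal sketch of one of these classifiers in action, here is Logistic Regression on scikit-learn's built-in iris data set; any of the algorithms above could be swapped in:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# A small labeled data set: flower measurements tagged with species classes.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Fit the classifier on the training data, then score it on unseen data.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print("Test accuracy:", clf.score(X_test, y_test))
```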

Then we have the regression problems in supervised machine learning. In a regression problem, the target variable is continuous. There are a few algorithms that handle regression, including Ensemble Methods, Gaussian Process Regression, Decision Trees and Random Forests, Support Vector Regression, and Linear Regression.
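And a matching regression sketch, using Linear Regression on scikit-learn's built-in diabetes data set, where the target is a continuous disease-progression score rather than a class:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# A continuous target rather than discrete classes.
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

reg = LinearRegression().fit(X_train, y_train)
print("R^2 on test data:", reg.score(X_test, y_test))
```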

From there, we move away from the supervised learning algorithms and over to the unsupervised machine learning algorithms. In unsupervised machine learning, the AI system is presented with unlabeled, uncategorized data, and the system's algorithm acts on the data without any prior training.

The output will depend on the coded algorithms that are used. Subjecting a system to unsupervised learning is one way of testing that the algorithm works the way you intend. Unsupervised machine learning breaks down into two further categories: association and clustering.

The first category of unsupervised machine learning we will look at is clustering. Here, we take our data and divide it into groups. Unlike what we find with classification, these groups are not known to us beforehand, which is why this counts as an unsupervised task. Some of the algorithms and methods you can use for clustering include Spectral Clustering, K-Means Clustering, Hierarchical Clustering, and Gaussian Mixtures.
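Here is a minimal K-Means sketch on synthetic unlabeled data; note that we never pass labels to the model, only a guess at how many groups to look for:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic unlabeled data: three blobs of points (the labels are discarded).
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# K-Means discovers the groups on its own; we only choose how many to find.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)
print(labels[:10])  # cluster assignment for the first ten points
```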
Data science workflow: Training and testing the model
Once we have chosen the model and algorithm that best fit our needs, it is time to do some training and testing and see how the model handles our data. For training, we split the data into three sections: training data, validation data, and testing data. Let's explore each of these so we understand how they work and why each one is important.

You train the classifier of your choice using the training data set, then tune its parameters using the validation set, and finally test the classifier's performance on the unseen test data set. One thing to note: while training the classifier, only the training and validation sets are available (and sometimes just one or the other). You should not use the test data set during training, because it can distort your results. The test set only comes into play when we are testing the classifier.
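One common way to produce the three sections is two successive splits. This sketch assumes a feature array X and label array y already exist, and the 60/20/20 proportions are just one reasonable choice:

```python
from sklearn.model_selection import train_test_split

# First carve off a held-out test set (20% of the data)...
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# ...then split the remainder into training (60% of the total)
# and validation (20% of the total).
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42)
```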

First, the training set. The training set is the material through which the computer learns how to process the information it receives. Machine learning can use a variety of algorithms to train on the data you provide; the training set is the data used for learning, that is, for fitting the parameters of the classifier.

Then we move on to the validation set. Cross-validation is primarily used in machine learning to estimate the skill of a model on data it has not seen before. The validation set is carved out of the training data and used to tune the parameters of your classifier.
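A minimal cross-validation sketch, assuming the X_train and y_train arrays from the split above; five folds is a common default:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# 5-fold cross-validation: the training data is split five ways, and each
# fold takes a turn as the validation set while the model trains on the rest.
scores = cross_val_score(LogisticRegression(max_iter=1000), X_train, y_train, cv=5)
print("Mean validation accuracy:", scores.mean())
```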

And finally we have the test set. This is a set of unseen data, used only at this stage to assess the performance of a fully specified classifier. Having this test set in place is one of the best ways to try out your data and your chosen model and algorithm without wasting time and money doing everything at once.

Once you have divided your data into the three segments above, you can go through the training process. The training set is used to build the model, and the validation (or test) set is used to validate the model that is built. Data points in the training set are excluded from the test and validation sets. Usually, the data set is divided into a training set and a validation set in each iteration, or into a training set, a validation set, and a test set in each iteration.

The model you choose, if it is the right one for your needs, will be one of the examples from step three. Once the model has been trained, we use the same trained model to predict on the testing data, the data it has not yet seen. Once we have done this, we can create something known as the confusion matrix, which tells us how well we did training the model.

A confusion matrix is easy to work with, and it comes with four parameters that help us evaluate our model and see what the data points are telling us: True Positives, True Negatives, False Positives, and False Negatives. We want as many values as possible to land in the true positives and true negatives, because that tells us the model is accurate. The size of the confusion matrix depends on how many classes you are working with: an n-class problem gives an n-by-n matrix.

But what do these parameters on the confusion matrix mean, and how do we know whether points fall into the right cell or not? The four parameters (computed in the sketch after this list) include:
  • True Positives: cases where we predicted TRUE, and the model's predicted output is correct.
  • True Negatives: cases where we predicted FALSE, and the model's predicted output is correct.
  • False Positives: cases where we predicted TRUE, but the actual output is FALSE.
  • False Negatives: cases where we predicted FALSE, but the actual output is TRUE.
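Here is how a confusion matrix might be computed with scikit-learn, assuming the fitted classifier clf and the test split from the classification sketch earlier:

```python
from sklearn.metrics import confusion_matrix

# Compare the true labels against the model's predictions on the test set.
y_pred = clf.predict(X_test)
print(confusion_matrix(y_test, y_pred))

# For a binary problem, scikit-learn lays the matrix out as:
# [[true negatives, false positives],
#  [false negatives, true positives]]
```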
As you can imagine, we want to get correct answers as often as possible. This assures us that the model we chose is working properly and spares us from having to fix a bunch of things. If a lot of False Positives and False Negatives show up, either the model we are using is wrong, or there is something wrong with our data. Either way, we need to make adjustments to improve it.

This is why we do some of this testing first and check the accuracy of our models before we fully commit. It is better to look at just some of the data and discover the model is wrong than to go through all of it and waste time, effort, and money on a model that is not right. At that point we can go back, pick another model, and repeat the steps if necessary.

Data science workflow: Evaluating the model

The fifth step we need to spend some time on is model evaluation. Just because it is the final step does not mean it is not an integral part of the model development process. It helps us find the model that best represents our data, and it tells us how well that model will serve us not only now but also in the future.

You have to evaluate the model you are using; do not just glance at it and assume everything is fine. Evaluate the kind of results you are getting out of it, and decide whether it will keep giving you accurate results or whether it will become outdated and unusable within a few years.

To improve the model, we might tune some of its hyperparameters. We also want to improve accuracy by checking the confusion matrix and trying to increase the number of true negatives and true positives. This tells us that the model we chose, whichever one it is, is accurate and fits our needs.
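As one way to do this tuning, here is a minimal sketch using scikit-learn's GridSearchCV; the model, parameter grid, and fold count are illustrative choices, not the only ones:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Try several hyperparameter values, cross-validating each candidate,
# and keep the setting that scores best on the validation folds.
param_grid = {"n_neighbors": [3, 5, 7, 9]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X_train, y_train)
print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy:", search.best_score_)
```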

And those are the steps of the machine learning workflow. It is not enough to pick a model or an algorithm at random and assume it will meet all of our needs and work every time. There are many great algorithms, including those used for reinforcement learning in machine learning, but not all of them will work for your needs or for the project you want to create.

When you take the time to go through the workflow steps we talked about above, you will find it much easier to create the right model and algorithm for your project, based on the information (and the amount of information) available to you. It may seem more effective and more fun to jump right into the algorithms, but using the five steps above helps you to
  • find the data you want
  • clean the data, so it actually works with the algorithms you want
  • pick out the algorithm you want to work with, test it to see if it is right, and evaluate it all in the end.
When you put all of this together, it may take a bit more time, but it will definitely help you get the results you want and the right predictions, no matter what kind of algorithm you are using.
