How do I predict the following, and which algorithm is the best fit?
Each employee has a work activity with Start_Date and End_Date columns.
The sheet has a few other columns, such as Work_Complexity (High or Low) and the number of sub-tasks for each activity.
How can I predict the work activity End_Date for a given Start_Date? Which ML algorithm should be used?
Can this be considered a realistic use case?
Thanks!
Yes, this is a realistic use case.
If you have labelled data, meaning a sheet where the start date and end date are known for existing tasks, and you now want to predict the end date for a new task, you can use Linear Regression with multiple variables.
For more information on Linear Regression with multiple variables, see this link:
https://www.investopedia.com/terms/m/mlr.asp
Anyway, don't get too confused by the theory. In simple terms, Linear Regression is an approach to modelling the relationship between variables (columns). Linear Regression with one variable means you are trying to predict the end date using only one variable (column), i.e. the start date in your case. If you want to predict the end date using more than one variable (column), i.e. start date, task complexity, number of sub-tasks, etc., you have to use Linear Regression with multiple variables. To illustrate both, I am using a house price prediction model.
Below is an implementation of Linear Regression with one variable in Python, where we predict the house price using only one variable:
import pandas as pd  # used for loading your dataset
import numpy as np  # for arrays
from sklearn import linear_model  # machine learning library, used for prediction
df = pd.read_csv('/content/MLPractical2 - Sheet1.csv')  # you need to upload your file first
df
Output: The file I uploaded contains the following data:
Area || Price
2600 || 555000
3000 || 565000
3200 || 610000
3600 || 680000
4000 || 725000
Let's predict the price of a house with an area of 3601:
reg = linear_model.LinearRegression()
reg.fit(df[['Area']], df.Price)
reg.predict([[3601]])
Output : array([669653.42465753])
We are predicting the price on the basis of only one variable (column), i.e. Area.
As you can see in the file I uploaded, the price of the house with area 3600 is 680000, and the price our algorithm predicts for area 3601 is 669653.42465753, which is close.
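To see where that number comes from, here is a minimal sketch that types the same five rows in directly (instead of reading the CSV, which is an assumption made only to keep the example self-contained) and reproduces the prediction by hand from the fitted slope and intercept:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Same five rows as the uploaded sheet, typed in directly
df = pd.DataFrame({
    "Area": [2600, 3000, 3200, 3600, 4000],
    "Price": [555000, 565000, 610000, 680000, 725000],
})

reg = LinearRegression().fit(df[["Area"]], df["Price"])

slope = reg.coef_[0]        # price change per unit of area
intercept = reg.intercept_  # predicted price at area 0

# A one-variable linear model predicts: price = intercept + slope * area
manual = intercept + slope * 3601
model = reg.predict(pd.DataFrame({"Area": [3601]}))[0]
print(manual, model)
```

Both values should agree, which shows that the model is nothing more than a straight line fitted through the five points.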
Now let's look at an implementation of Linear Regression with multiple variables in Python, where we use several variables to predict the house price:
import pandas as pd #same as above
import numpy as np
from sklearn import linear_model
df = pd.read_csv('/content/ML_Sheet_2.csv')
df
Output: The file I uploaded in this case contains the following data:
Area || Bedroooms || Age || Price
2600 || 3.0 || 20 || 550000
3000 || 4.0 || 15 || 565000
3200 || 3.0 || 18 || 610000
3600 || 3.0 || 30 || 595000
4000 || 5.0 || 8 || 760000
Let's predict the price of a house with an area of 3500, 3 bedrooms, and an age of 10 years:
reg = linear_model.LinearRegression()
reg.fit(df[['Area', 'Bedroooms', 'Age']], df.Price)
reg.predict([[3500, 3, 10]])
Output: array([717775])
We are predicting the house price on the basis of three variables, i.e. Area, Number of bedrooms and Age of the house.
As you can see in the file I uploaded, the price of the house with area 3200, 3 bedrooms and an age of 18 years is 610000, and the price our algorithm predicts for area 3500 (more than 3200), 3 bedrooms and an age of 10 years is 717775. That is plausible, because we are predicting for a house that has a larger area than 3200 and is younger than 18 years (a newer house costs more).
Similarly, you can prepare an Excel sheet of your existing data, save it in .csv format, and proceed as I did. I am using Google Colab to write my code, and I recommend you use the same:
https://colab.research.google.com/notebooks/intro.ipynb#recent=true
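Applied to the work-activity sheet from the question, the same idea could look like the sketch below. It rests on several assumptions that are not in the original post: the rows are invented, the sub-task column is called Sub_Tasks, and the model predicts the duration in days (End_Date minus Start_Date) rather than the end date itself, because a regression target has to be a number. The predicted duration is then added back to the new task's start date.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical rows standing in for the real sheet
df = pd.DataFrame({
    "Start_Date": pd.to_datetime(["2021-01-04", "2021-01-11", "2021-02-01",
                                  "2021-02-15", "2021-03-01", "2021-03-08"]),
    "End_Date": pd.to_datetime(["2021-01-08", "2021-01-24", "2021-02-07",
                                "2021-03-06", "2021-03-16", "2021-03-10"]),
    "Work_Complexity": ["Low", "High", "Low", "High", "High", "Low"],
    "Sub_Tasks": [2, 5, 3, 8, 6, 1],
})

# Numeric target: how many days each activity took
df["Duration"] = (df["End_Date"] - df["Start_Date"]).dt.days
# Numeric feature: 1 for High complexity, 0 for Low
df["Is_High"] = (df["Work_Complexity"] == "High").astype(int)

reg = LinearRegression().fit(df[["Is_High", "Sub_Tasks"]], df["Duration"])

# New task: High complexity, 4 sub-tasks, starting on 2021-04-05
new_task = pd.DataFrame({"Is_High": [1], "Sub_Tasks": [4]})
days = int(round(float(reg.predict(new_task)[0])))
predicted_end = pd.Timestamp("2021-04-05") + pd.Timedelta(days=days)
print(days, predicted_end.date())
```

With real data you would also hold out some rows to check how far the predicted durations are from the actual ones.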
Hope this helps you!
Dataset Sample
Can I use any algorithm to train on the above dataset?
Each row (Id) has a dependent variable (Status), but each "Id" also has multiple rows, one per set of features.
You can think of it as "each Id has multiple transactions, and all of its transactions share a common Status".
Will machine learning find patterns in these transactions?
Is there any other approach to solving this type of problem?
Just fill each empty ID cell with the value from the row above, and do the same for the Status column. This leads to:
df
ID Feature1 Feature2 Feature3 Status
8079 100 Asia High Approved
8079 200 Africa Low Approved
When you run a classification algorithm, you can use ID, Feature1, Feature2 and Feature3 as features and Status as the target. A classifier will learn from this, and everything works exactly as before.
The features are still independent. You only have dependent features if the variables somehow depend on each other; in your case, the ID 8079 does not imply Feature2: Africa. They are independent.
You can fill your cells with:
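import numpy as np
# treat empty ID/Status cells as missing, then copy the last seen value down
filled = df[['ID', 'Status']].replace("", np.nan).ffill()
df[['ID', 'Status']] = filled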
Based on your comments, the approach can be slightly different: you need to convert the rows of each ID into new features (see "Python pandas convert rows to columns where multiple columns exist").
The dataframe should then look like:
ID Feature1 Feature2 Feature3 Feature1a .... Feature3z Status
8079 100 Asia High 200 Approved
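One way to get that wide layout is sketched below. The column suffixes (_0, _1, ...) and the helper column name "row" are illustrative choices, not something the original answer prescribes; the two sample rows are the ones shown earlier.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [8079, 8079],
    "Feature1": [100, 200],
    "Feature2": ["Asia", "Africa"],
    "Feature3": ["High", "Low"],
    "Status": ["Approved", "Approved"],
})

# Number the rows within each ID: 0, 1, 2, ...
df["row"] = df.groupby("ID").cumcount()

# One row per ID, one column per (feature, row number) pair
wide = df.pivot(index="ID", columns="row",
                values=["Feature1", "Feature2", "Feature3"])
wide.columns = [f"{name}_{i}" for name, i in wide.columns]

# Status is shared by all rows of an ID, so the first value is enough
wide["Status"] = df.groupby("ID")["Status"].first()
wide = wide.reset_index()
print(wide)
```

If IDs have different numbers of rows, the missing positions come out as NaN and need to be filled or handled by the classifier.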
You can either assume that each row is independent and ignore the ID column or, if every ID has 3 rows, extend the dataset with more features.
Hi I was following the Machine Learning course by Andrew Ng.
I found that in regression problems, especially logistic regression, they used integer values for the features, which could be plotted on a graph. But there are many use cases where the feature values may not be integers.
Let's consider the following example:
I want to build a model to predict if any particular person will take a leave today or not. From my historical data I may find the following features helpful to build the training set.
Name of the person, Day of the week, Number of Leaves left for him till now (which maybe a continuous decreasing variable), etc.
So here are my questions based on the above:
How do I go about designing the training set for my logistic regression model?
In my training set, I find that some variables are continuously decreasing (e.g. number of leaves left). Would that create any problem? I ask because I understand that continuously increasing or decreasing variables are used in linear regression. Is that true?
Any help is really appreciated. Thanks!
Well, a lot of information is missing from your question; for example, it would be much clearer if you had listed all the features you have. But let me dare to make some assumptions!
ML modelling for classification always requires numerical inputs, and you can easily map each unique input value to an integer, especially the classes!
Now let me try to answer your questions:
How do I go about designing the training set for my logistic regression model.
As I see it, you have two options (not necessarily both practical; you should decide according to your dataset and problem). Either you predict, for a given day, the probability of each employee in the company being off according to your historical data (i.e. previous observations); in this case, each employee represents a class (an integer from 0 up to one less than the number of employees you want to include). Or you create a model for each employee; in this case the classes are either off (i.e. Leave) or on (i.e. Present).
Example 1
I created a dataset example of 70 cases and 4 employees which looks like this:
Here each name is associated with the day and month they took off and with how many annual leaves they had left at that point!
The implementation (using Scikit-Learn) would be something like this (N.B. the date contains only day and month):
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold
# read dataset example
df = pd.read_csv('leaves_dataset.csv')
# assign unique integer to every employee (i.e. a class label)
mapping = {'Jack': 0, 'Oliver': 1, 'Ruby': 2, 'Emily': 3}
df.replace(mapping, inplace=True)
y = np.array(df[['Name']]).reshape(-1)
X = np.array(df[['Leaves Left', 'Day', 'Month']])
# create the model
parameters = {'penalty': ['l1', 'l2'], 'C': [0.1, 0.5, 1.0, 10, 100, 1000]}
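# saga supports both the l1 and l2 penalties in the grid (the default lbfgs solver rejects l1)
lr = LogisticRegression(solver='saga', max_iter=10000, random_state=0)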
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=2, random_state=0)
clf = GridSearchCV(lr, parameters, cv=cv)
clf.fit(X, y)
#print(clf.best_estimator_)
#print(clf.best_score_)
# Example: probability of all employees who have 10 days left today
# warning: date must be same format
prob = clf.best_estimator_.predict_proba([[10, 9, 11]])
print({'Jack': prob[0,0], 'Oliver': prob[0,1], 'Ruby': prob[0,2], 'Emily': prob[0,3]})
Result
{'Ruby': 0.27545, 'Oliver': 0.15032,
'Emily': 0.28201, 'Jack': 0.29219}
N.B.
To make this work reasonably well you need a really big dataset!
Also, this option can be better than the second one if there are other informative features in the dataset (e.g. the health status of the employee on that day, etc.).
The second option is to create a model for each employee. Here the result would be more accurate and more reliable; however, it's almost a nightmare if you have too many employees!
For each employee, you collect all their leaves in the past years and concatenate them into one file. In this case you have to cover all days of the year: every day the employee did not take off should be labeled as on (numerically, 1), and every day off should be labeled as off (numerically, 0).
Obviously, in this case, the classes will be 0 and 1 (i.e. off and on) for each employee's model!
For example, consider this dataset example for the particular employee Jack:
Example 2
Then you can do for example:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold
# read dataset example
df = pd.read_csv('leaves_dataset2.csv')
# assign unique integer to every on and off (i.e. a class label)
mapping = {'off': 0, 'on': 1}
df.replace(mapping, inplace=True)
y = np.array(df[['Type']]).reshape(-1)
X = np.array(df[['Leaves Left', 'Day', 'Month']])
# create the model
parameters = {'penalty': ['l1', 'l2'], 'C': [0.1, 0.5, 1.0, 10, 100, 1000]}
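# liblinear supports both the l1 and l2 penalties in the grid (the default lbfgs solver rejects l1)
lr = LogisticRegression(solver='liblinear', random_state=0)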
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=2, random_state=0)
clf = GridSearchCV(lr, parameters, cv=cv)
clf.fit(X, y)
#print(clf.best_estimator_)
#print(clf.best_score_)
# Example: probability of the employee "Jack" who has 10 days left today
prob = clf.best_estimator_.predict_proba([[10, 9, 11]])
print({'Off': prob[0,0], 'On': prob[0,1]})
Result
{'On': 0.33348, 'Off': 0.66651}
N.B. In this case you have to create a dataset for each employee, train a separate model for each one, and fill in all the days they never took off in the past years as on!
In my training set, I find some variables are continuously decreasing (ex no of leaves left). Would that create any problem,
because I know continuously increasing or decreasing variables are
used in linear regression. Is that true ?
Well, there is nothing preventing you from using continuous values as features (e.g. number of leaves) in Logistic Regression; it makes no difference whether they are used in Linear or Logistic Regression. I believe you have confused the features with the response:
The point is that discrete values should be used as the response (a.k.a. dependent variable, or y) of Logistic Regression, and continuous values should be used as the response of Linear Regression.
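To make the distinction concrete, here is a minimal sketch with invented data that mixes categorical features (name, day of week) and a continuous one (leaves left), while the response stays discrete (0 = present, 1 = on leave). The column names and rows are assumptions for illustration; one-hot encoding through OneHotEncoder is one common way to turn the text columns into numbers, not the only one.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical history: one row per employee per day
df = pd.DataFrame({
    "Name": ["Jack", "Jack", "Ruby", "Ruby", "Emily", "Emily", "Oliver", "Oliver"],
    "Day": ["Mon", "Fri", "Mon", "Fri", "Tue", "Fri", "Wed", "Mon"],
    "Leaves Left": [12, 11, 3, 3, 20, 19, 7, 6],
    "On Leave": [0, 1, 0, 0, 0, 1, 0, 1],  # discrete response
})

preprocess = ColumnTransformer([
    # categorical columns become 0/1 indicator columns
    ("categorical", OneHotEncoder(handle_unknown="ignore"), ["Name", "Day"]),
    # the continuous column is only rescaled, not discretised
    ("numeric", StandardScaler(), ["Leaves Left"]),
])
model = make_pipeline(preprocess, LogisticRegression())
model.fit(df[["Name", "Day", "Leaves Left"]], df["On Leave"])

today = pd.DataFrame({"Name": ["Jack"], "Day": ["Fri"], "Leaves Left": [10]})
proba = model.predict_proba(today)  # columns follow model.classes_, i.e. [0, 1]
print(proba)
```

With only eight rows the probabilities are meaningless; the point is just that the continuous feature goes in unchanged while the response is a class label.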
I know of a couple of classification algorithms, such as decision trees, but I can't apply any of them to the problem I have at hand.
I have a dataset in which each row contains information about a purchase. Its columns are:
- customer id
- store id where the purchase took place
- date and time of the event
- amount of money spent
I'm trying to make a prediction that, given the information of who, where and when, predicts how much money is going to be spent.
What are some possible ways of doing this? Are there any well-known algorithms?
Also, I'm currently learning RapidMiner, and I'm experimenting with some of its features. Everything that I've tried there doesn't allow me to have a real number (amount spent) as a label. Maybe I'm doing something wrong?
You could use a Decision Tree Regressor for this. Using a toolkit like scikit-learn, you could use the DecisionTreeRegressor algo where your features would be store id, date and time, and customer id, and your target would be the amount spent.
You could turn this into a supervised learning problem. This is untested code, but it could probably get you started
# Load libraries
from sklearn import metrics
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

def fit_predict_model(data_import, sample):
    """Find and tune the optimal model, then predict the target for one sample."""
    # Get the features and labels from your data
    X, y = data_import.data, data_import.target
    # Set up a Decision Tree Regressor and the tree depths to try
    regressor = DecisionTreeRegressor()
    parameters = {'max_depth': (4, 5, 6, 7), 'random_state': [1]}
    # Lower mean absolute error is better, hence greater_is_better=False
    scoring_function = metrics.make_scorer(metrics.mean_absolute_error, greater_is_better=False)
    # Fit your data with 10-fold cross-validation
    reg = GridSearchCV(estimator=regressor, param_grid=parameters, scoring=scoring_function, cv=10, refit=True)
    reg.fit(X, y)
    print("Best Parameters:", reg.best_params_)
    # Use the tuned model to predict the output of a particular sample
    prediction = reg.predict([sample])
    print("Prediction:", prediction)
    return prediction

# Call it with your own data and one test sample (a list of feature values):
# fit_predict_model(your_data, your_sample)
I took this almost directly from a project I was working on to predict housing prices, so there may be some unnecessary parts. Without validation on held-out data you have no idea how accurate the model is, but this should get you started.
Check out this link:
http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html
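The answer treats "date and time" as a feature, but a tree needs numeric columns. A rough sketch of one way to prepare the purchase data is shown below; the column names, the invented rows, and the choice of hour and weekday as derived features are all assumptions for illustration.

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Hypothetical purchases: who, where, when, how much
purchases = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 3, 3],
    "store_id": [10, 11, 10, 10, 11, 11],
    "timestamp": pd.to_datetime([
        "2021-03-01 09:15", "2021-03-06 18:40", "2021-03-02 12:05",
        "2021-03-07 17:30", "2021-03-03 08:50", "2021-03-05 19:10",
    ]),
    "amount": [12.5, 80.0, 23.0, 64.9, 9.9, 71.2],
})

# Turn the timestamp into numeric features the tree can split on
purchases["hour"] = purchases["timestamp"].dt.hour
purchases["weekday"] = purchases["timestamp"].dt.dayofweek  # Monday = 0

features = ["customer_id", "store_id", "hour", "weekday"]
tree = DecisionTreeRegressor(max_depth=3, random_state=0)
tree.fit(purchases[features], purchases["amount"])

# Predict the amount for customer 2 at store 11 on a Saturday evening
new_visit = pd.DataFrame({"customer_id": [2], "store_id": [11],
                          "hour": [18], "weekday": [5]})
predicted_amount = tree.predict(new_visit)[0]
print(predicted_amount)
```

A tree splits on customer_id and store_id as if they were ordered numbers, which is crude; with many customers or stores, one-hot encoding or per-customer aggregates may work better.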
Yes, as comments have pointed out it's regression that you need. Linear regression does sound like a good starting point as you don't have a huge number of variables.
In RapidMiner, type "regression" into the Operators menu and you'll see several options under Modelling -> Functions: Linear Regression, Polynomial, Vector, etc. (There are more, but as a beginner let's start here.)
Right click any of these operators and press Show Operator Info and you'll see numerical labels are allowed.
Next scroll through the help documentation of the operator and you'll see a link to a tutorial process. It's really simple to use, but it's good to get you started with an example.
Let me know if you need any help.
I am new to statistics, Python, machine learning and Scikit-learn. However, I am trying this project where I have a CSV with 35 columns of student data. The first column is an ID which I think I can ignore. The last 3 columns are the grade 1, grade 2 and grade 3 scores. I have 400 rows. I want to see if I can learn some machine learning with it, and make some sense of the data I have. Now I understand Scikit works on Numpy arrays which do not handle categorical data like sex ('male', 'female') and so on. So I codified all the 30 categories with 1 for male, 2 for female and so on and so forth. I then did the following
X = my_data[:,1:33]
y = my_data[:,34]
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X,y)
expected = y
predicted = model.predict(X)
mse = np.mean((predicted-expected)**2)
print(mse)
print(model.score(X,y))
I got a MSE of 6.0839840461 and a model score of 0.709407474898.
I got some result. So far so good for a first attempt. However, I realized that since I assigned increasing code values like 1 for male, 2 for female, the Linear Regression would have treated them as weights. How do I replace the Gender column with [1,0] or [0,1], which I learn is the right way to represent categorical data? Would it be a dictionary type column or a list type column? If so how will it be part of the Numpy array?
These are called indicator or dummy variables, and Pandas makes it easy to encode such categorical values:
>>> import pandas as pd
>>> pd.get_dummies(['male', 'female'])
   female  male
0       0     1
1       1     0
Don't forget about multicollinearity, though: algorithms like linear regression rely on the variables being independent, while in your case female=0 always means male=1. In this case simply remove one dummy variable (e.g. use only the female variable and not the male one).
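On a whole DataFrame, dropping one dummy per category can be done in the same call. The sketch below uses a made-up two-column frame; `drop_first=True` removes the first category in sorted order (here "female"), and `dtype=int` is only there to get 0/1 instead of booleans.

```python
import pandas as pd

# Hypothetical slice of the student data
students = pd.DataFrame({
    "sex": ["male", "female", "female"],
    "age": [15, 16, 17],
})

# Encode 'sex' as dummies and drop one of them to avoid multicollinearity
encoded = pd.get_dummies(students, columns=["sex"], drop_first=True, dtype=int)
print(encoded)

# Plain NumPy array, ready to pass to scikit-learn as X
X = encoded.to_numpy()
```

The resulting array is purely numeric, so it slots into the existing `model.fit(X, y)` call without a dictionary or list column.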
There is also a LabelEncoder() in sklearn.preprocessing package:
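from sklearn import preprocessing
le1 = preprocessing.LabelEncoder()
y = le1.fit_transform(y)  # the encoder must be fitted before it can transform
You can also map the codes back to the original labels with le1.inverse_transform(y).
The encoding is assigned automatically, though (labels are numbered in sorted order), and you cannot change the order.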
What attributes are used when forecasting a time series with SVM? I have two values: the date and the value at that date. For the class, I already know that I can use -1 and 1 for when the price goes down or up, but I still don't know how to lay out the time series to calculate the hyperplane.
There are some papers that show some ways to do it:
Financial time series forecasting using support vector machine
Using Support Vector Machines in Financial Time Series Forecasting
Financial Forecasting Using Support Vector Machines
I really recommend that you go through the existing literature, but just for fun I will describe an easy way (probably not the best) to do it.
Let's say you have N pairs (x_i, y_i), where x_i is the particular date/time of the pair and y_i its corresponding value. The pairs are sorted by their x component.
Let's say you want to predict whether, given a new x, the corresponding unknown value y will go up or down (notice that you could also use regression and instead try to predict the value itself).
Then we could train a model with a training set like this:
Input                        Value
===========================  ===================================
y_t0, y_t1, ..., y_ti-1      1 if y_ti > y_ti-1, -1 otherwise
y_t1, y_t2, ..., y_ti        1 if y_ti+1 > y_ti, -1 otherwise
y_t2, y_t3, ..., y_ti+1      1 if y_ti+2 > y_ti+1, -1 otherwise
y_t3, y_t4, ..., y_ti+2      1 if y_ti+3 > y_ti+2, -1 otherwise
y_t4, y_t5, ..., y_ti+3      1 if y_ti+4 > y_ti+3, -1 otherwise
Basically, you will be training the algorithm to make an educated guess about the next "tick" in the future by giving it a glimpse of the past. Once your model is trained, to make a prediction you feed it the N values (where N is the number of values you used as input in your training phase) that precede the value you want to predict.
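A rough sketch of that sliding-window setup with scikit-learn's SVC follows. The price series, the window length of 3, and the helper name make_windows are all made up for illustration; a real series would need far more data and a proper train/test split.

```python
import numpy as np
from sklearn.svm import SVC

def make_windows(values, window):
    """Build (past window, up/down label) pairs from a 1-D series."""
    X, y = [], []
    for i in range(len(values) - window):
        X.append(values[i:i + window])
        # 1 if the next value is higher than the last one in the window, else -1
        y.append(1 if values[i + window] > values[i + window - 1] else -1)
    return np.array(X), np.array(y)

# Hypothetical values, already sorted by date
prices = np.array([10, 11, 12, 11, 10, 11, 13, 12, 14, 15, 14, 13], dtype=float)
window = 3

X, y = make_windows(prices, window)
clf = SVC(kernel="rbf").fit(X, y)

# Feed the most recent `window` values to guess the direction of the next tick
next_move = clf.predict(prices[-window:].reshape(1, -1))[0]
print(next_move)
```

Swapping SVC for SVR and using the next value itself as the target turns the same layout into the regression variant mentioned above.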