Can I use logistic regression algorithm to predict an ETA for a given task based on historical data? - machine-learning

Can I use a logistic regression algorithm to predict an ETA for a given task based on historical data? I have some tasks which take a variable amount of time depending on a few factors such as task type, weather, season, time of request, etc.
Today we capture the time taken for all tasks, by task type, in a MySQL store. Now we want to add a feature where, based on these factors and the task type, we predict an ETA for the task and show it to the customer.
We are planning to use Spark with the Logistic Regression and SVM algorithms. We are quite new to this domain and need your guidance on validating the approach, as well as any additional pointers.

You can achieve this with just a linear regression model because you're trying to predict a continuous outcome (ETA).
You would just train a regression model that predicts ETA from your input features (task type, weather, season, etc.). What the model learns is how long a task takes to complete given a certain set of inputs; the predicted outcome is what you would then show to customers.
Take a look at this: http://spark.apache.org/docs/latest/mllib-linear-methods.html#linear-least-squares-lasso-and-ridge-regression
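For instance, here is a minimal sketch of that regression approach using Spark's DataFrame-based pyspark.ml API (the JDBC settings and column names such as task_type, weather, season, hour_of_request and duration_minutes are placeholders standing in for your actual MySQL schema):

    # Minimal sketch: regress observed task duration on the listed factors.
    # JDBC settings and column names are placeholders, not your real schema.
    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
    from pyspark.ml.regression import LinearRegression

    spark = SparkSession.builder.appName("eta-regression").getOrCreate()
    df = spark.read.jdbc(url="jdbc:mysql://host/db", table="task_history",
                         properties={"user": "...", "password": "..."})

    # Encode the categorical factors and assemble everything into one feature vector.
    stages = []
    for col in ["task_type", "weather", "season"]:
        stages.append(StringIndexer(inputCol=col, outputCol=col + "_idx"))
        stages.append(OneHotEncoder(inputCol=col + "_idx", outputCol=col + "_vec"))
    stages.append(VectorAssembler(
        inputCols=["task_type_vec", "weather_vec", "season_vec", "hour_of_request"],
        outputCol="features"))
    stages.append(LinearRegression(featuresCol="features", labelCol="duration_minutes"))

    train, test = df.randomSplit([0.8, 0.2], seed=42)
    model = Pipeline(stages=stages).fit(train)
    predictions = model.transform(test)  # the "prediction" column is the estimated ETA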
Logistic regression and SVM are used for classifying discrete outcomes (i.e. categories/groups).
So another approach would be to stratify the ETA values in your MySQL database into something like short/medium/long time to complete, and then use those 3 categories as your labels instead of the actual numerical value. You could then use logistic regression to train a model that classifies into those 3 categories based on your listed input features. This would work, but you lose some resolution by condensing your ETA data into only 3 groups; that's a design decision you'd have to make.
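If you do go the classification route, a minimal sketch of that bucketing step, reusing the feature stages from the sketch above (the 30- and 120-minute thresholds are arbitrary placeholders):

    # Sketch of the classification variant: bucket recorded durations into three
    # coarse labels and train a logistic regression on them. Reuses the encoding
    # and assembly stages from the sketch above (everything except the final
    # LinearRegression); the 30/120-minute thresholds are arbitrary placeholders.
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import Bucketizer
    from pyspark.ml.classification import LogisticRegression

    bucketizer = Bucketizer(splits=[0.0, 30.0, 120.0, float("inf")],
                            inputCol="duration_minutes", outputCol="eta_bucket")
    lr = LogisticRegression(featuresCol="features", labelCol="eta_bucket")
    bucket_model = Pipeline(stages=stages[:-1] + [bucketizer, lr]).fit(train)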

Related

How to predict the stock price using the pattern of other stocks?

I have three months' worth of stock prices before and after certain events (bio-clinical success, dividends, M&A, etc.).
I want to analyze the trend after a specific event using these data and, based on this, analyze the trend of new stocks that are awaiting a similar event.
But I'm not sure which algorithm to use.
Which algorithm should I use: LSTM, ARIMA, or something else?
I would recommend starting with something simple like linear regression. Linear regression is used to find trends in data, and it is a very simple algorithm that requires little advanced math compared to other algorithms. In linear regression, the relationships are modeled using linear predictor functions whose unknown model parameters are estimated from the data; such models are called linear models. Most commonly, the conditional mean of the response given the values of the explanatory variables (or predictors) is assumed to be an affine function of those values; less commonly, the conditional median or some other quantile is used. Like all forms of regression analysis, linear regression focuses on the conditional probability distribution of the response given the values of the predictors, rather than on the joint probability distribution of all of these variables (which is the domain of multivariate analysis). That said, you can choose whichever algorithm you want to use.
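For instance, a minimal sketch of fitting a linear trend to the post-event window of a single stock (the prices array is made-up illustrative data, not real quotes):

    # Minimal sketch: fit a straight-line trend to daily closing prices observed
    # after an event. The `prices` values are made up purely for illustration.
    import numpy as np
    from sklearn.linear_model import LinearRegression

    prices = np.array([101.2, 102.5, 101.9, 103.4, 104.0, 105.1])  # placeholder closes
    days = np.arange(len(prices)).reshape(-1, 1)                   # day 0, 1, 2, ...

    trend = LinearRegression().fit(days, prices)
    print("slope per day:", trend.coef_[0])            # positive slope = upward drift
    print("day-10 extrapolation:", trend.predict([[10]])[0])

The slope gives you a simple, interpretable measure of the post-event trend that you can compare across events before reaching for LSTM or ARIMA.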

Why does my Random Forest Classifier perform better on test and validation data than on training data?

I'm currently training a random forest on some data I have and I'm finding that the model performs better on the validation set, and even better on the test set, than on the train set. Here are some details of what I'm doing - please let me know if I've missed any important information and I will add it in.
My question
Am I doing anything obviously wrong, and do you have any advice on how I should improve my approach? I just can't believe that I'm doing it right when my model predicts significantly better on unseen data than on training data!
Data
My underlying data consists of tables of features describing customer behaviour and a binary target (so this is a binary classification problem). Technically I have one such table per month and I tend to use several months of data to train and then a different month to predict (e.g. Train on Apr, May and Predict on Jun)
Generally this means I end up with a training dataset of about 100k rows and 20 features (I've previously looked into feature selection and found a set of 7 features which seem to perform best, so have been using these lately). My prediction set generally has around 50k rows.
My dataset is heavily unbalanced (approximately 2% incidence of target feature), so I'm using oversampling techniques - more on that below.
Method
I've searched around online quite a lot and this has led me to the following approach (a rough code sketch of the cross-validation loop appears after the list):
Take scaleable (continuous) features in the training data and standardise them (currently using sklearn StandardScaler)
Take categorical features and encode them into separate binary columns (one-hot) using Pandas get_dummies function
Remove 10% of the training data to form a validation set (I'm currently using a random seed in this process for comparability whilst I vary different things such as hyperparameters in the model)
Take the remaining 90% of training data and perform a grid search across a few parameters of the RandomForestClassifier() (currently min_samples_split, max_depth, n_estimators and max_features)
Within each hyperparameter combination from the grid I perform kfold validation with 5 folds and using a random state
Within each fold I oversample my minority class for training data only (sometimes using imbalanced-learn's RandomOverSampler() and sometimes using SMOTE() from the same package), train the model on the training data and then apply the model to the kth fold and record performance metrics (precision, recall, F1 and AUC)
Once I've been through 5 folds on each hyperparameter combination I find the best F1 score (and best precision if two combinations are tied on F1 score) and retrain a random forest on the entire 90% training data using those hyperparameters. During this step I use the same oversampling technique as I did in the kfold process
I then use this model to make predictions on the 10% of training data that I put aside earlier as a validation set, evaluating the same metrics as above
Finally I have a test set, which is actually based on data from another month, which I apply the already trained model to and evaluate the same metrics
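For reference, a rough sketch of the cross-validation loop described above (assuming scikit-learn and imbalanced-learn; X and y are NumPy arrays holding the 90% training portion after scaling and encoding, and the parameter grid is a placeholder):

    # Rough sketch of the grid search with oversampling applied inside each fold,
    # and only to the training folds. X, y and the parameter grid are placeholders.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import StratifiedKFold, ParameterGrid
    from sklearn.metrics import f1_score
    from imblearn.over_sampling import RandomOverSampler

    param_grid = ParameterGrid({"max_depth": [5, 10], "n_estimators": [200, 500]})
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

    results = []
    for params in param_grid:
        fold_scores = []
        for train_idx, val_idx in skf.split(X, y):
            # Oversample the minority class on the training folds only,
            # never on the held-out fold.
            X_res, y_res = RandomOverSampler(random_state=42).fit_resample(
                X[train_idx], y[train_idx])
            clf = RandomForestClassifier(random_state=42, **params).fit(X_res, y_res)
            fold_scores.append(f1_score(y[val_idx], clf.predict(X[val_idx])))
        results.append((np.mean(fold_scores), params))

    best_f1, best_params = max(results, key=lambda r: r[0])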
Outcome
At the moment I'm finding that my training set achieves an F1 score of around 30%, the validation set is consistently slightly higher than this at around 36% (mostly driven by a much better precision than the training data e.g. 60% vs. 30%) and then the testing set is getting an F1 score of between 45% and 50% which is again driven by a better precision (around 65%)
Notes
Please do ask about any details I haven't mentioned; I've been stuck on this for weeks and so have doubtless omitted some details
I've had a brief look (not a systematic analysis) at the stability of metrics between folds in the kfold validation, and it seems that they aren't varying very much, so I'm fairly happy with the stability of the model here
I'm actually performing the grid search manually rather than using a Python pipeline because, try as I might, I couldn't get imbalanced-learn's Pipeline function to work with the oversampling functions, so I run a loop over combinations of hyperparameters. I'm confident that this isn't adversely impacting the results I've talked about above
When I apply the final model to the prediction data (and get an F1 score around 45%) I also apply it back to the training data itself out of interest and get F1 scores around 90% - 100%. I suppose this is to be expected as the model is trained and predicts on almost exactly the same data (except the 10% holdout validation set)

How to evaluate ML image classifier with confidence

Suppose I have a model that classifies images into one of n categories. I know how to calculate the accuracy and sensitivity based just on the output label. However, I want to be more specific. How could I also incorporate the confidence percentage that is produced with each output?
You could use bootstrapping to obtain a confidence interval for your model's performance on the dataset. A full demonstration here. If you want it for an individual sample, you could define another list, like the stat list, and store the predicted probabilities for that individual there instead.
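For instance, a minimal bootstrap sketch along those lines (assuming scikit-learn; model, X_test and y_test are placeholders for your trained classifier and held-out data as NumPy arrays):

    # Minimal bootstrap sketch: resample the test set with replacement many times
    # and look at the spread of accuracy scores. `model`, `X_test`, `y_test` are
    # placeholders for your trained classifier and held-out data (NumPy arrays).
    import numpy as np
    from sklearn.metrics import accuracy_score

    rng = np.random.default_rng(0)
    n = len(y_test)
    stats = []
    for _ in range(1000):
        idx = rng.integers(0, n, size=n)       # sample test indices with replacement
        stats.append(accuracy_score(y_test[idx], model.predict(X_test[idx])))

    lower, upper = np.percentile(stats, [2.5, 97.5])   # 95% confidence interval
    print(f"accuracy 95% CI: [{lower:.3f}, {upper:.3f}]")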

How to define when auto training should stop?

In AutoML Natural Language is there a way I can define parameters/thresholds where training should stop, e.g., after training for 5 hours or when reaching 50% accuracy?
In the documentation there is no information about how long the model will be trained, nor about the training/eval progress, so I can't make an informed decision about when I should finish the training.
Currently there is no way to define exit criteria for training. AutoML NL will pick the model that best fits the user's problem and optimize for the best result. Model training will use all data with valid labels and splits, and will respect the split semantics.

Evaluating the accuracy of a decision tree/forest model

I'm relatively new to ML. I've created a decision tree model to predict the price of an item based on some criteria.
For example, let's say the model predicts the price of a car based on a few features such as engine size, number of doors, fuel type, mileage and age.
Analysis showed me that my data was not linear, so a decision tree was a better fit. The model does an OK job at predicting, but before I can give it to any users, I need to quantify its accuracy.
As it's non-linear, R-squared doesn't seem like a good method of assessing accuracy, but I'm unsure what I should use.
Appreciate any advice on this.
In these cases, what you can usually do is assess the performance of the model against a test or hold-out set (not used during the construction of the model), using an evaluation metric.
For regression problems (like the one you are describing) there are several evaluation metrics available. The most common ones are MAE (Mean Absolute Error) and RMSE (Root Mean Squared Error).
To fully understand how good your model's performance is, you can then compare it against other models, or against simple baselines (like always predicting the average price, or returning the price of the most similar car in the training set).
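As a concrete illustration (a sketch assuming scikit-learn; y_train, y_test and y_pred are placeholders for your training prices, hold-out prices and the tree's predictions):

    # Sketch: score the hold-out predictions with MAE and RMSE, and compare against
    # a trivial "always predict the mean training price" baseline.
    # y_train, y_test, y_pred are placeholders for your own data and predictions.
    import numpy as np
    from sklearn.metrics import mean_absolute_error, mean_squared_error

    mae = mean_absolute_error(y_test, y_pred)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))

    baseline = np.full_like(y_test, y_train.mean(), dtype=float)
    baseline_mae = mean_absolute_error(y_test, baseline)

    print(f"model MAE: {mae:.2f}, RMSE: {rmse:.2f}")
    print(f"mean-price baseline MAE: {baseline_mae:.2f}")  # the model should beat this

If your tree's MAE/RMSE is clearly lower than the baseline's, that gives you a unit-interpretable way to tell users how accurate the predictions are.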
