Can we set up a multivariate regression model for additional data - machine-learning

I am working on an education dataset and am trying to fit a multivariate regression model.
The response variable is the total number of students in various schools, and the available predictors are "Students in Pre Primary", "Students in Primary", "Students in Secondary" and "Students in Higher Secondary".
The thing here is that the "Total number of students" is just the sum of those 4 columns. I am able to fit a model to predict total enrollment, but I want to know whether doing so makes sense, or whether the fact that the response is a simple sum of the predictors makes the prediction model redundant.
Or, alternatively: can I lower the number of predictors to 2 and build a model that predicts the total number of students from only 2 of the 4 available columns?
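For what it's worth, here is a minimal sketch of both options in scikit-learn, using synthetic data and hypothetical column names (pre_primary, primary, etc.) in place of the real dataset. With all four predictors the regression only recovers the identity "total = sum of the columns"; with two predictors there is something genuine to predict.

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data with the same structure the question describes.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "pre_primary": rng.integers(50, 200, size=100),
    "primary": rng.integers(100, 400, size=100),
    "secondary": rng.integers(80, 300, size=100),
    "higher_secondary": rng.integers(30, 150, size=100),
})
levels = ["pre_primary", "primary", "secondary", "higher_secondary"]
df["total"] = df[levels].sum(axis=1)

# All four predictors: the fit is exact (R^2 = 1, every coefficient ~1),
# because the response is a deterministic sum of the inputs, so the model
# is redundant -- it just re-derives the addition.
full = LinearRegression().fit(df[levels], df["total"])
print(full.score(df[levels], df["total"]), full.coef_)

# Two of the four predictors: now there is genuine prediction error, and the
# model is only useful if the two chosen levels track total enrollment well.
subset = ["primary", "secondary"]
reduced = LinearRegression().fit(df[subset], df["total"])
print(reduced.score(df[subset], df["total"]))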

Related

Adding a float value alongside an image as input to a deep learning model, is it possible?

I have a dataset of biomedical images of plant leaves with disease, spread over 6 classes. I also have a disease rate for each leaf, and that rate affected the manual classification by the users who collected the data (an image of class 3 may have been classified as class 5 if the disease rate was different).
I want to create a deep learning model that takes the image and the disease rate into consideration together. Any ideas?
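One common way to do this (a minimal sketch, assuming a Keras/TensorFlow setup; the 224x224 input size, layer sizes and input names are arbitrary choices, not from the original dataset) is a two-input model: a CNN branch encodes the leaf image, the scalar disease rate comes in through a second input, and the two are concatenated before the 6-class softmax head.

import tensorflow as tf
from tensorflow.keras import layers, Model

image_in = tf.keras.Input(shape=(224, 224, 3), name="leaf_image")
rate_in = tf.keras.Input(shape=(1,), name="disease_rate")   # ideally normalized to a small range

# Image branch: a small CNN (or swap in a pretrained backbone).
x = layers.Conv2D(32, 3, activation="relu")(image_in)
x = layers.MaxPooling2D()(x)
x = layers.Conv2D(64, 3, activation="relu")(x)
x = layers.GlobalAveragePooling2D()(x)

# Merge the CNN features with the scalar disease rate, then classify into 6 classes.
merged = layers.Concatenate()([x, rate_in])
h = layers.Dense(64, activation="relu")(merged)
out = layers.Dense(6, activation="softmax", name="leaf_class")(h)

model = Model(inputs=[image_in, rate_in], outputs=out)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# model.fit({"leaf_image": images, "disease_rate": rates}, labels, epochs=10)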

How do I test a CNN model on fewer classes than the training set in Python

I have tropical storm images and am trying to build a model to categorize the storm category. I am trying to predict the last storm's stage using the previous storms' images, but the last storm covers only 5 categories while the training set has 7 (basically, I split the dataset so that the last storm is the test set and the earlier storms are the training set). So my question is: are there any methods for predicting fewer classes than there are training classes?
In my opinion, it does not matter if your test set contains fewer categories than the training categories, as long as the 5 categories that you care about can still be predicted by the model. When the model produces a prediction for a given test sample, you can sort the predicted class probabilities and take only the class with the highest probability (or the top 3, for example) and ignore the rest.
Otherwise, I would suggest training your model only on the 5 classes you care about, using only the training data for those classes.
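For example, a sketch of the masking idea with NumPy, using a random stand-in for the model's softmax output and hypothetical class indices:

import numpy as np

# Stand-in for the model's softmax output over all 7 training classes,
# e.g. probs = model.predict(test_images); random here just for the sketch.
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(7), size=10)      # shape (n_samples, 7)

allowed = np.array([0, 1, 2, 3, 4])             # hypothetical indices of the 5 relevant classes

restricted = probs[:, allowed]                  # drop the 2 classes that cannot occur in the test storm
pred = allowed[restricted.argmax(axis=1)]       # highest-probability allowed class per sample

# Top 3 among the allowed classes, most probable first:
top3 = allowed[np.argsort(-restricted, axis=1)[:, :3]]
print(pred)
print(top3)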

How to improve predictions for consumer expenditure survey?

I am trying to predict the total expenditure of a consumer from the consumer expenditure survey (data here). I chose Age, Income, Urban/Rural, Sex and Education as the variables for predicting the total expenditure of a household.
The correlation between Income and Expenditure is relatively low, and the predictions have an RMSE of ~3000 for data with a mean of ~10000. I used transformation, normalization, scaling and cross-validation when pre-processing the data. However, none of the models performs well in predicting the total expenditure. Is there any way to improve the predictions?
(I tried Linear regression, Lasso, KNN, Random forest and Gradient boosting algorithms.)
Here's the scatterplot for income vs. expenditure:
[scatterplot of income vs. expenditure]
I think the model is not performing well because of the low correlation. Any ideas on how to tackle such situations?
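One thing that sometimes helps with skewed expenditure data is to model log(1 + expenditure) rather than the raw value. A hedged sketch with scikit-learn, using synthetic stand-in data and assumed column names rather than the real survey:

import numpy as np
import pandas as pd
from sklearn.compose import TransformedTargetRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

# Stand-in for the survey (replace with the real data; column names are assumptions).
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "Age": rng.integers(18, 80, n),
    "Income": rng.lognormal(9, 0.8, n),
    "Urban": rng.integers(0, 2, n),
    "Sex": rng.integers(0, 2, n),
    "Education": rng.integers(1, 6, n),
})
df["TotalExpenditure"] = 0.6 * df["Income"] * rng.lognormal(0, 0.5, n)

X, y = df.drop(columns="TotalExpenditure"), df["TotalExpenditure"]

# Fit on log(1 + y) so the skewed right tail doesn't dominate the squared error,
# but report cross-validated RMSE back on the original scale.
model = TransformedTargetRegressor(
    regressor=GradientBoostingRegressor(random_state=0),
    func=np.log1p,
    inverse_func=np.expm1,
)
rmse = -cross_val_score(model, X, y, cv=5, scoring="neg_root_mean_squared_error")
print(rmse.mean())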

Which prediction model should we use to predict the list of colleges for a student

I have a training dataset containing college names, student rank, branch and college cutoff. Which prediction model should I use to predict the list of colleges a student could get admission to, based on his rank, the college cutoff and the branch?
I am new to machine learning.
I expect the output to be a list of colleges in which the student could be admitted, rather than a single yes/no on whether one college is allocated to the student.
Your problem can be treated as a multi-class classification problem where every college becomes a class. You can use a simple random forest model and predict the class probabilities for every student record. Since you are using probabilities, the model will return a list of colleges along with their probabilities. Set a probability threshold and take the colleges above that threshold as your result.
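A minimal sketch of that thresholding idea with scikit-learn's predict_proba, using synthetic stand-in data; the feature encoding, the college names and the 0.10 threshold are assumptions for illustration only:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Stand-in training data: one row per past admission as [rank, college_cutoff, branch_id],
# with the admitted college as the class label.
rng = np.random.default_rng(0)
X_train = rng.integers(1, 1000, size=(300, 3))
y_train = rng.choice(["College A", "College B", "College C", "College D"], size=300)

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# For a new student, get a probability per college and keep every college
# above the chosen threshold, most likely first.
new_student = np.array([[150, 620, 2]])     # hypothetical rank / cutoff / branch
proba = clf.predict_proba(new_student)[0]
threshold = 0.10                            # hypothetical cutoff
keep = proba >= threshold
shortlist = clf.classes_[keep][np.argsort(-proba[keep])]
print(list(shortlist))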
This is a multiclass classification problem. If you are new, I suggest using tree-based models such as a random forest classifier (https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html), or try XGBoost if you are not getting good enough results from random forest. They are easy to use and perform nicely on multi-class classification problems. They also give you feature importances easily, which will help you explain your model as well.

Customer behaviour prediction when there are no negative examples

Imagine you own a postal service and you want to optimize your business processes. You have a history of orders in the following form (sorted by date):
# date user_id from to weight-in-grams
Jan-2014 "Alice" "London" "New York" 50
Jan-2014 "Bob" "Madrid" "Beijing" 100
...
Oct-2017 "Zoya" "Moscow" "St.Petersburg" 30
Most of the records (about 95%) contain positive numbers in the "weight-in-grams" field, but a few have zero weight (perhaps these messages were cancelled or lost).
Is it possible to predict whether the users from the history file (Alice, Bob, etc.) will use the service in November 2017? What machine learning methods should I use?
I tried simple logistic regression and decision trees, but they evidently give a positive outcome for every user, as there are very few negative examples in the training set. I also tried to apply the Pareto/NBD model (BTYD library in R), but it seems to be extremely slow for large datasets, and my dataset contains more than 500,000 records.
I have another problem: if I impute negative examples (treating a user who didn't send a letter in a given month as a negative example for that month), the dataset grows from 30 MB to 10 GB.
The answer is yes, you can try to predict this.
You can approach it as a time series and run an RNN:
Train your RNN on your set pivoted so that each user is one sample.
You can also pivot your set so that each user is a row (observation) by aggregating each user's data, and then run a multivariate logistic regression. You will lose information this way, but it might be simpler. You can add time-related columns such as 'average delay between orders', 'average orders per year', etc.
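A hedged sketch of this "one row per user" approach with pandas and scikit-learn; the synthetic stand-in for the order history, the aggregated feature names, and the choice of October 2017 as the labelled training month are all assumptions:

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Stand-in for the order history (load the real file instead); "weight-in-grams"
# is renamed to weight_in_grams for convenience.
rng = np.random.default_rng(0)
n = 2000
orders = pd.DataFrame({
    "date": pd.to_datetime("2014-01-01") + pd.to_timedelta(rng.integers(0, 1400, n), unit="D"),
    "user_id": rng.choice([f"user_{i}" for i in range(200)], n),
    "weight_in_grams": rng.choice([0, 30, 50, 100], n, p=[0.05, 0.4, 0.35, 0.2]),
})

def user_features(df, cutoff):
    # One row per user, aggregated over all orders strictly before `cutoff`.
    hist = df[df["date"] < cutoff]
    return hist.groupby("user_id").agg(
        n_orders=("date", "size"),
        total_weight=("weight_in_grams", "sum"),
        days_since_last=("date", lambda d: (cutoff - d.max()).days),
        zero_weight_share=("weight_in_grams", lambda w: (w == 0).mean()),
    )

# Train on a month that is already observed: did each user order during Oct 2017?
X = user_features(orders, pd.Timestamp("2017-10-01"))
oct_users = orders.loc[(orders["date"] >= "2017-10-01") & (orders["date"] < "2017-11-01"), "user_id"]
y = X.index.isin(oct_users).astype(int)
clf = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X, y)

# Score November 2017 by recomputing the same features up to Nov 1st.
X_nov = user_features(orders, pd.Timestamp("2017-11-01"))
p_nov = clf.predict_proba(X_nov)[:, 1]      # P(user uses the service in Nov 2017)
print(p_nov[:5])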
Finally, you can use Bayesian methods to estimate the probability with which each user will return.
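One Python option for this (my suggestion, not something from the original post) is the lifetimes package, which implements the BG/NBD model, a close relative of the Pareto/NBD model in the R BTYD library that is typically much faster to fit. A sketch, reusing the orders frame from the previous snippet:

from lifetimes import BetaGeoFitter
from lifetimes.utils import summary_data_from_transaction_data

# Collapse the transaction log into per-user frequency / recency / age summaries.
summary = summary_data_from_transaction_data(
    orders, customer_id_col="user_id", datetime_col="date",
    observation_period_end="2017-10-31",
)

bgf = BetaGeoFitter(penalizer_coef=0.001)
bgf.fit(summary["frequency"], summary["recency"], summary["T"])

# Probability that each user is still "alive", i.e. will keep using the service.
p_alive = bgf.conditional_probability_alive(
    summary["frequency"], summary["recency"], summary["T"]
)
print(p_alive[:5])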
