Must train and test data have the same exact features? - machine-learning

Let's say we want to predict goals scored by both of the teams in a game of football. And let's say that our training data is going to contain: team1, team2, team1 goals, team2 goals, score, winner.
My understanding is that the test data has to have the same number of features as the training data, and that test data can only contain "input" and not output.
So does that mean we can't train our model with those features, since we do not know how many goals will be scored before the game has started?

Related

How do I test a CNN model on fewer classes than the training set in Python

I have images of tropical storms and am trying to build a model that predicts the storm category. I want to predict the stage of the last storm using images of previous storms, but the last storm covers only 5 categories while the training set has 7 (I split the dataset so that the last storm is the test set and the earlier storms are the training set). Are there any methods for predicting fewer classes than the model was trained on?
In my opinion, it does not matter if your test set contains fewer categories than the training set, as long as the 5 categories you care about can still be predicted by the model. When the model produces a prediction for a given test sample, you can sort the predicted classes by probability and take only the class with the highest probability (or the top 3, for example) and ignore the rest.
Otherwise, I would suggest training your model only on the classes you care about (5), using only the training samples that belong to those classes.
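A minimal sketch of the "sort and keep the top classes" idea, assuming the model's output for one image is available as a mapping from class label to probability. The function name `top_k_classes`, the `allowed` argument, and the probability values are hypothetical and only for illustration; restricting the ranking to the 5 test-set classes is an optional variation, not something the answer above requires.

```python
def top_k_classes(probs, k=1, allowed=None):
    """Return the k most probable (label, probability) pairs.

    probs:   dict mapping class label -> predicted probability.
    allowed: optional set of labels to restrict the ranking to
             (for example, the 5 categories present in the test set).
    """
    if allowed is None:
        candidates = probs
    else:
        candidates = {c: p for c, p in probs.items() if c in allowed}
    # Sort by probability, highest first, and keep the first k entries.
    ranked = sorted(candidates.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Made-up output over 7 storm categories for a single image.
probs = {0: 0.05, 1: 0.10, 2: 0.15, 3: 0.05, 4: 0.05, 5: 0.35, 6: 0.25}

print(top_k_classes(probs, k=3))                         # [(5, 0.35), (6, 0.25), (2, 0.15)]
print(top_k_classes(probs, k=1, allowed={0, 1, 2, 3, 4}))  # [(2, 0.15)]
```

With `allowed` set, a class that cannot occur in the test set is never returned, even if the model considers it the most likely one overall.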

Which prediction model should we use to predict the list of colleges for a student

I have a training data set containing college names, student rank, branch, and college cutoff. Which prediction model should I use to predict the list of colleges a student will get admission to, based on their rank, the college cutoff, and the branch?
I am new to machine learning.
I expect the output to be a list of colleges to which a student can be admitted, rather than whether a particular college is allocated to a student.
Your problem can be treated as a multiclass classification problem where every college becomes a class. You can use a simple random forest model and predict the class probabilities for every student record. Since you are using probabilities, the model will return the list of colleges along with the probability of each. Set a probability threshold and take the colleges above that threshold as your result.
This is a multiclass classification problem. If you are new, I suggest using tree-based models such as a random forest classifier (https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html), or trying XGBoost if you are not getting good enough results from random forest. They are easy to use and perform well on multiclass classification problems. They also give you feature importances easily, which will help you explain your model.
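The probability-threshold approach could look roughly like the sketch below, using scikit-learn's `RandomForestClassifier` and its `predict_proba` method. The toy data, the two-column feature layout (rank, branch id), the helper name `colleges_above_threshold`, and the threshold of 0.2 are all assumptions made for illustration; a real data set would also include the cutoff and far more rows.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy, made-up training data: [student_rank, branch_id] -> college admitted to.
X = np.array([[10, 0], [15, 0], [200, 0], [220, 1], [900, 1], [950, 0]])
y = np.array(["A", "A", "B", "B", "C", "C"])

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X, y)


def colleges_above_threshold(model, features, threshold=0.2):
    """Return (college, probability) pairs whose probability meets the threshold."""
    # predict_proba returns one row per sample; columns follow model.classes_.
    probs = model.predict_proba([features])[0]
    kept = [(str(c), float(p)) for c, p in zip(model.classes_, probs) if p >= threshold]
    # Most likely college first.
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


print(colleges_above_threshold(model, [12, 0]))
```

The threshold controls how long the returned list is: lowering it yields more candidate colleges per student, raising it yields fewer.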

Incorporating prior knowledge to machine learning models

Say I have a data set of students with features such as income level, gender, parents' education levels, school, etc., and the target variable is, say, passing or failing a national exam. We can train a machine learning model to predict, given these values, whether a student is likely to pass or fail (in sklearn, for example, predict_proba gives the probability of passing).
Now say I have a different set of information, which has nothing to do with the previous data set, that lists the schools and the percentage of students from each school who passed that national exam last year and in the years before, e.g. schoolA: 10%, schoolB: 15%, etc.
How can I use this additional knowledge to improve my model? This data is surely valuable (students from certain schools have a higher chance of passing the exam due to their educational facilities, qualified staff, etc.).
Do I somehow add this information as a new feature to the data set? If so, what is the recommended way? Or do I use this information after the model's prediction and somehow combine the two to get a final probability? Obviously an average or a weighted average doesn't work, because the second data set has probabilities below 20%, which drags the total probability very low. How do data scientists usually incorporate this kind of prior knowledge? Thank you
You can try different ways of adding this data and see whether your model is able to learn from it. Most likely you'll see right away that the additional data just confuses the model, mainly because you're already providing more precise data on each student of the school and the model has more freedom to use that information.
But artificial neural network training is all about continuous trial and error, so you should definitely try training with all the data you can think of to see whether it reaches a decent error in the end.
Using the average pass percentage of each student's school as a new feature is worth trying.
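Adding the school's historical pass rate as a feature amounts to a lookup by school name. Here is a minimal sketch in plain Python, assuming each student record is a dict with a "school" key; the field names, the helper `add_school_pass_rate`, and the fallback to the mean rate for schools missing from the table are assumptions, not part of the question.

```python
# Historical pass rates per school, as fractions (values from the question).
school_pass_rate = {"schoolA": 0.10, "schoolB": 0.15}

# Hypothetical student records; only "school" is needed for the lookup.
students = [
    {"school": "schoolA", "income_level": 2},
    {"school": "schoolB", "income_level": 1},
    {"school": "schoolC", "income_level": 3},  # school missing from the table
]


def add_school_pass_rate(rows, pass_rates):
    """Return copies of rows with a 'school_pass_rate' feature added."""
    # Assumption: schools without history fall back to the mean known rate.
    default = sum(pass_rates.values()) / len(pass_rates)
    return [
        {**row, "school_pass_rate": pass_rates.get(row["school"], default)}
        for row in rows
    ]


enriched = add_school_pass_rate(students, school_pass_rate)
for row in enriched:
    print(row["school"], row["school_pass_rate"])
```

The enriched rows can then be fed to the same model as before, and comparing validation error with and without the new column shows whether it actually helps.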

Advice on classification approach

I need to classify incoming car rentals, but my historic data that I could use for training is in "grouped" form and I can't see how I could train a classification model.
My incoming data is a list of car model, quantity and unit price:
Chevrolet Spark, 1, 196.91
Fiat 500, 1, 196.91
Toyota Prius Hybrid, 3, 213.73
This incoming data is currently classified manually and saved grouped by class, with the total price per group (Chevy and Fiat are Economy, Prius is Hybrid):
Economy, 393.82
Hybrid, 641.19
This problem should be solvable by machine learning but I can't figure out how to build a training set for a supervised classifier. Any guidance appreciated.
Thanks
A naive Bayes classifier should do what you are trying to do. You can use the price as the feature and learn from what is already tagged.
However, I don't see how you can get consistent data by using the TOTAL price to classify, since the number of items varies from one group to another. You would have to use the unit price.
There are lots of algorithms that provide multiclass classification, but could you explain more about what you're trying to predict? From what you've written, it sounds more like a scenario for an ETL process than a machine learning model.
If I understand your example correctly, an incoming record with a car model of "Chevy Spark" or "Fiat 500" would always be labeled "Economy", while an incoming record with a car model of "Toyota Prius Hybrid" would be labeled "Hybrid". A simple lookup table would do the job here - no need for fancy machine learning mathematics. :)
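The lookup-table approach can be sketched in a few lines. The class mapping and the prices below are taken from the question; the names `CAR_CLASS` and `group_totals` are made up for this example.

```python
from collections import defaultdict

# Car model -> rental class, as described in the question.
CAR_CLASS = {
    "Chevrolet Spark": "Economy",
    "Fiat 500": "Economy",
    "Toyota Prius Hybrid": "Hybrid",
}


def group_totals(rentals):
    """Sum quantity * unit price per class for (model, quantity, unit_price) rows."""
    totals = defaultdict(float)
    for model, quantity, unit_price in rentals:
        # Raises KeyError for a car model that is not in the table yet.
        totals[CAR_CLASS[model]] += quantity * unit_price
    return {cls: round(total, 2) for cls, total in totals.items()}


rentals = [
    ("Chevrolet Spark", 1, 196.91),
    ("Fiat 500", 1, 196.91),
    ("Toyota Prius Hybrid", 3, 213.73),
]

print(group_totals(rentals))  # {'Economy': 393.82, 'Hybrid': 641.19}
```

The totals match the grouped figures in the question, which suggests the historic data is just this mapping applied and summed. A classifier would only become useful for car models that are not yet in the table.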

Can I use logistic regression algorithm to predict an ETA for a given task based on historical data?

Can I use the logistic regression algorithm to predict an ETA for a given task based on historical data? I have some tasks that take a variable amount of time depending on a few factors such as task type, weather, season, time of request, etc.
Today we capture the time taken for all tasks, by task type, in a MySQL store. Now we want to add a feature where, based on these factors and the task type, we predict an ETA for the task and show it to the customer.
We are planning to use Spark with the logistic regression and SVM algorithms. We are new to this domain and need your guidance on validating the approach, plus any additional pointers.
You can achieve this with just a linear regression model because you're trying to predict a continuous outcome (ETA).
You would just train a regression model that predicts ETA from your input features (task type, weather, season, etc.). What this model learns is how long a task takes to complete given a certain set of inputs; the predicted outcome is what you would then show to customers.
Take a look at this: http://spark.apache.org/docs/latest/mllib-linear-methods.html#linear-least-squares-lasso-and-ridge-regression
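As a rough illustration of the regression approach, here is a sketch that uses scikit-learn rather than Spark, purely to keep the example small and self-contained. The categorical inputs are one-hot encoded and passed to `LinearRegression`; the feature values and durations (in minutes) are made up.

```python
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up history: [task_type, weather] -> minutes taken.
X = [
    ["delivery", "clear"],
    ["delivery", "rain"],
    ["repair", "clear"],
    ["repair", "rain"],
]
y = [30.0, 40.0, 60.0, 70.0]

# One-hot encode the categories, then fit an ordinary least-squares model.
eta_model = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),
    LinearRegression(),
)
eta_model.fit(X, y)

# Predicted ETA in minutes for a new delivery request on a rainy day.
predicted = eta_model.predict([["delivery", "rain"]])
print(round(float(predicted[0]), 1))
```

In this toy data rain adds a fixed 10 minutes and repairs take 30 minutes longer than deliveries, so a linear model can represent it exactly. Real task durations will be noisier, and numeric inputs such as time of request would need their own encoding.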
Logistic regression/SVM is used for classifying discrete outcomes (i.e. categories/groups).
So another approach might be to bin the ETA values in your MySQL database into something like short/medium/long time to complete, and then use those 3 categories as your labels instead of the actual numerical value. Then you can use logistic regression to train a model that classifies into those 3 categories, based on your listed input features. This would work, but you lose some resolution by condensing your ETA data into only 3 groups. That's a design decision you'd have to make.
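Turning the recorded durations into the three labels is a simple thresholding step. A minimal sketch follows; the function name `eta_bucket` and the cut-offs of 30 and 90 minutes are arbitrary assumptions that would need to be chosen from the real data.

```python
def eta_bucket(minutes, short_max=30, medium_max=90):
    """Map a duration in minutes to 'short', 'medium', or 'long'."""
    if minutes <= short_max:
        return "short"
    if minutes <= medium_max:
        return "medium"
    return "long"


# Made-up historical durations in minutes.
durations = [12, 30, 45, 90, 240]
labels = [eta_bucket(m) for m in durations]
print(labels)  # ['short', 'short', 'medium', 'medium', 'long']
```

These labels would then replace the numeric ETA as the target when training the classifier.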
