Handling a missing value in machine learning - machine-learning

I was analyzing a dataset with the following columns: [id, location, tweet, target_value]. I want to handle the missing values in the location column for some rows. My idea is to extract a location from the tweet column of that same row (if the tweet mentions one) and put that value in the location column.
Now I have some questions regarding the above approach.
Is this a good way to do it? Can we fill in missing values using the training data itself? Won't this be considered a redundant feature (because we are deriving its values from another feature)?

Can you please clarify your dataset a little bit more?
First, if we assume that the location column records where the tweet was posted from, then your method (filling the location column in rows where that information is missing with a place mentioned in the tweet text) is wrong, because the place a tweet talks about is not necessarily the place it was posted from.
Second, if we assume that the tweet text does correctly contain the location information, then you can fill the missing rows using the locations extracted from the tweets.
If the second assumption holds, this is a reasonable approach because you are feeding your dataset correct information. In other words, you are giving the model more detailed information so that it can predict more accurately at test time.
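For illustration, here is a minimal sketch of that kind of fill, assuming pandas and spaCy's English named-entity recognizer; the file name tweets.csv and the choice of spaCy are assumptions, not part of the question.

```python
# A sketch only: fill missing locations from place entities found in the tweet
# text, assuming pandas and spaCy (python -m spacy download en_core_web_sm).
import pandas as pd
import spacy

nlp = spacy.load("en_core_web_sm")

def location_from_tweet(tweet: str):
    """Return the first place-like entity (GPE/LOC) in the tweet, or None."""
    for ent in nlp(tweet).ents:
        if ent.label_ in ("GPE", "LOC"):
            return ent.text
    return None

df = pd.read_csv("tweets.csv")  # hypothetical file: id, location, tweet, target_value
missing = df["location"].isna()
df.loc[missing, "location"] = df.loc[missing, "tweet"].apply(location_from_tweet)
```

Rows whose tweets mention no place simply keep a missing location, which you would then handle with a regular imputation strategy.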
Regarding your question "Won't this be considered a redundant feature (because we are deriving its values from another feature)?":
You can remove the location column and train a model on the remaining three columns, then evaluate it with your chosen metrics (accuracy, etc.) and compare against the model trained on all four columns. If there is no meaningful difference without it (or performance even improves), you can consider the column redundant; a sketch of this comparison follows below. You can also use Principal Component Analysis (PCA) to detect correlated columns.
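A rough sketch of that comparison, assuming scikit-learn; the TF-IDF/logistic-regression pipeline and the file name are placeholder assumptions, not the asker's actual setup.

```python
# Sketch: compare cross-validated accuracy with and without the location column.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

df = pd.read_csv("tweets.csv")                      # hypothetical file
df["location"] = df["location"].fillna("unknown")   # after whatever filling you chose
y = df["target_value"]

def make_model(use_location: bool) -> Pipeline:
    transformers = [("tweet", TfidfVectorizer(), "tweet")]
    if use_location:
        transformers.append(("loc", OneHotEncoder(handle_unknown="ignore"), ["location"]))
    return Pipeline([
        ("features", ColumnTransformer(transformers)),  # unlisted columns are dropped
        ("clf", LogisticRegression(max_iter=1000)),
    ])

with_loc = cross_val_score(make_model(True), df, y, cv=5).mean()
without_loc = cross_val_score(make_model(False), df, y, cv=5).mean()
print(f"accuracy with location: {with_loc:.3f}, without: {without_loc:.3f}")
```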
Finally, please NEVER let training data leak into your test dataset. It will give an overly optimistic evaluation, and when you use the model in a real-world environment it will most probably fail.

Related

Evaluating LightFM model performance with lightfm.evaluation.precision_at_k

I am trying to use lightfm.evaluation.precision_at_k() to evaluate the performance of my model.
My questions are around the parameters that I need to pass to it:
Does the test_interactions parameter need to have exactly the same shape (with matching user indices) as the interactions matrix the model was trained on? In the examples I have seen with LightFM's Movielens data, the test and train sets have the same number of rows, so they index exactly the same users in the same order. This would make sense, since the model itself does not store any user ID -> matrix index mappings. However, I wonder whether I can use precision_at_k() at all if I just want to run the evaluation on a subset of users. If not, I guess I would have to iterate over my test users by hand, call .predict() on each one, and compute precision-at-k per user in my own code?
I trained the model with item features, but I'm confused about why I would need to pass them again as item_features to precision_at_k(). If I'm just trying to predict recommendations for a user who was part of my training set (but now has some new interaction data), and the item features haven't changed, is it safe to simply not pass item_features here? If I do have to pass them, I have to store them somewhere alongside the model, which is painful, and I'm not sure why it is needed. What is item_features used for in the precision_at_k() case?
I might end up manually evaluating predictions for each user and skipping precision_at_k() completely.
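For reference, a minimal sketch of how these parameters fit together, using LightFM's bundled Movielens sample rather than the asker's data; treat it as an illustration of the API shape, not a definitive answer.

```python
# Sketch: train and test share one (num_users x num_items) index space, and the
# same item_features used for fitting are passed again at evaluation time.
from lightfm import LightFM
from lightfm.datasets import fetch_movielens
from lightfm.evaluation import precision_at_k

data = fetch_movielens(min_rating=4.0)
train, test = data["train"], data["test"]      # same shape, same user/item ordering
item_features = data["item_features"]

model = LightFM(loss="warp")
model.fit(train, item_features=item_features, epochs=10)

# Returns one precision value per user who has at least one test interaction;
# users with empty test rows are dropped, so a "subset of users" can be
# evaluated by zeroing out the other rows while keeping the matrix shape.
per_user = precision_at_k(
    model,
    test,
    train_interactions=train,        # exclude already-seen items from the ranking
    item_features=item_features,     # the model stores embeddings, not the features
    k=5,
)
print("mean precision@5:", per_user.mean())
```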

How to preprocess categorical data (with a large number of unique values) before feeding it to a machine learning model?

Let's say I have a large dataset from an online gaming platform (like Steam) with the columns 'date, user_id, number_of_hours_played, no_of_games', and I have to build a model to predict how many hours a user will play on a given future date. Now, user_id has a large number of unique values (in the millions). I know that for categorical data we can use one-hot encoding, but I'm not sure what to do when there are millions of unique categories. Please also suggest any other methods for preprocessing this data.
Using the user id directly in the model is not a good idea: as you said, it would produce a huge number of features, and it would also encourage overfitting, since you would get one id per row (if I understood your data correctly). It would also make your model useless for any new user id, forcing you to retrain the model every time a new user appears.
What I would recommend in the first place is to drop this variable and try to build a model with only the other variables.
Another idea you could try is to cluster the users based on their other features and then pass the cluster label as a feature instead of the user id (a sketch follows below), though I don't know whether this is a good idea without knowing your data.
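A minimal sketch of that clustering idea, assuming pandas and scikit-learn; the per-user aggregation, the number of clusters, and the file name are illustrative assumptions.

```python
# Sketch: replace the raw user_id with a cluster label learned from user profiles.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("playtime.csv")  # hypothetical: date, user_id, number_of_hours_played, no_of_games

# Summarize each user so clustering runs on a few profile features,
# not on millions of one-hot columns.
profiles = df.groupby("user_id").agg(
    avg_hours=("number_of_hours_played", "mean"),
    max_games=("no_of_games", "max"),
).reset_index()

scaled = StandardScaler().fit_transform(profiles[["avg_hours", "max_games"]])
profiles["user_cluster"] = KMeans(n_clusters=20, random_state=0).fit_predict(scaled)

# Use the cluster label as the model feature instead of user_id.
df = df.merge(profiles[["user_id", "user_cluster"]], on="user_id").drop(columns="user_id")
```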
Also, you talk about making a prediction for a given date. The data you described doesn't quite suggest that, but if you have the number of hours across multiple dates per user, this is closer to a time-series prediction problem, which is different from a 'classic' regression problem.

How to fill null values in object attributes in feature engineering?

I have looked into the fill-null methods used on Kaggle for feature engineering.
Some competitors fill the NAs with another category value.
For example, a sex column contains 'Male', 'Female' and NA values. The method fills the NAs with a new category value such as 'Middle'. After that, the sex attribute is treated as having no nulls, and pandas will not report any.
I want to know whether this method really has a good impact on a machine learning model's performance, i.e. whether it is good feature engineering.
Besides that, is there any other good way to fill NAs when exploring the dataset has not revealed anything useful about why they are missing?
First, it depends on whether your model can handle NAs natively (like xgboost).
Second, are the missing values themselves explanatory of a behaviour (for example, a depressed person is more likely to skip a task)?
There is a whole literature on these questions. The main approaches are:
Just drop the rows
Fill the missing data with a replacement (the median, the most frequent value, ...)
Fill the missing data and add some noise to it
So here, you can either leave the NAs and use xgboost, drop the incomplete rows, or impute the most frequent value between male and female (a short sketch follows below).
A few recommendations if you want to go further:
Try to understand why the data are missing
Perform a sensitivity analysis of the solution you chose
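For concreteness, a minimal pandas sketch of the simple options above; the toy column mirrors the sex example from the question.

```python
# Sketch: three basic ways to deal with NAs in a categorical column.
import pandas as pd

df = pd.DataFrame({"sex": ["Male", "Female", None, "Female", None]})

# 1. Drop the incomplete rows.
dropped = df.dropna(subset=["sex"])

# 2. Fill with a replacement, e.g. the most frequent value.
most_frequent = df["sex"].mode()[0]
filled = df.assign(sex=df["sex"].fillna(most_frequent))

# 3. Keep NA as its own category (the 'Middle' trick from the question).
sentinel = df.assign(sex=df["sex"].fillna("Missing"))
```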
It largely depends on your data.
But there are still a few things you can try and check whether they work.
1. If there are few missing values compared to the number of rows, it is better to drop them.
2. If there are many missing values, add a feature "IsMissing" (1 for NULL, 0 otherwise). Sometimes this works great.
3. If you have a lot of data and you have figured out that the feature is really important, you can train a model to predict Male/Female using your training data, then use the rows with NULL values as test data and predict their value (Male/Female). A sketch of points 2 and 3 follows below.
It's all about creativity and logic. Not every hypothesis you make will work: as you can see, the last method described above assumes that the NULL values can only take two values (M/F), which in reality may not be the case.
So, play around with different tactics and see what works best for your data.
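A rough sketch of points 2 and 3 above, assuming pandas and scikit-learn; the predictor columns and file name are placeholders for whatever else is in the dataset.

```python
# Sketch: add a missingness indicator, then predict the missing values
# from the other columns.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv("data.csv")            # hypothetical file with sex, age, income columns
predictors = ["age", "income"]          # assumed predictor columns

# Point 2: indicator feature for missingness.
df["sex_is_missing"] = df["sex"].isna().astype(int)

# Point 3: train on the rows where sex is known, predict where it is NULL.
known = df[df["sex"].notna()]
unknown = df[df["sex"].isna()]

clf = RandomForestClassifier(random_state=0).fit(known[predictors], known["sex"])
if len(unknown) > 0:
    df.loc[df["sex"].isna(), "sex"] = clf.predict(unknown[predictors])
```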
Hope it helps!!

Which machine learning model should be used in this situation?

Recently I have been working on my course project: an Android app that automatically fills in a spending form based on the user's voice.
What I want is for the app to fill the form automatically from one spoken sentence. My form has several fields: time (yesterday), location (MacDonald), cost (10 dollars), type (food). Here the "type" field can be food, shopping, transport, etc.
I have used a word-segmentation library to split the sentence into parts and parse it, so I can already extract the time, location, and cost fields from the user's voice.
What I want to do now is deduce the "type" field with some kind of machine learning model. There would be some records entered manually by the user in advance to train the model. After training, when a new record comes in, I first extract the time, location, and cost fields, and then infer the type field with the model.
But I don't know how to represent the location field. Should I use a dictionary of many well-known locations and represent each location by its index? If so, which kind of machine learning method should I use to model this requirement?
I would start with the Naive Bayes classifier. The links below should be useful in understanding it:
http://en.wikipedia.org/wiki/Naive_Bayes_classifier
http://cs229.stanford.edu/notes/cs229-notes2.pdf
http://scikit-learn.org/stable/modules/naive_bayes.html
I wonder if time and cost are that discriminative/informative in comparison to location for your task.
In general, look at the following link on working with text data (it should be useful even if you don't know Python):
http://scikit-learn.org/dev/tutorial/text_analytics/working_with_text_data.html
The whole task should include three stages:
Feature Representation:
One way to represent the features is the bag-of-words representation, in which you fix an ordering of the vocabulary and represent each document by a vector of word frequencies. See https://en.wikipedia.org/wiki/Bag-of-words_model for details.
Data and Label Collection:
Basically, in this stage you should prepare some [feature]-[type] pairs to train your model, which can be tedious or expensive. If you have already published your app and collected a lot of [sentence]-[type] pairs (probably chosen by app users), you can extract the features and build a training set.
Model Learning:
Cdeepakroy has suggested a good model choice: Naive Bayes, which is very efficient for a classification task like this. At this stage, you can just find a suitable package, feed in your training data, and enjoy the classifier it returns (see the sketch below).
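A minimal sketch of the three stages glued together, assuming scikit-learn; the tiny training sentences and labels are invented for illustration, not the asker's data.

```python
# Sketch: bag-of-words features + Naive Bayes classifier for the "type" field.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Stage 2 (data and labels): [sentence] -> [type] pairs collected from users.
sentences = [
    "lunch at MacDonald yesterday 10 dollars",
    "bus ticket to downtown 3 dollars",
    "new shoes at the mall 60 dollars",
]
types = ["food", "transport", "shopping"]

# Stage 1 (bag-of-words) and Stage 3 (model learning) in one pipeline.
model = Pipeline([
    ("bow", CountVectorizer()),
    ("nb", MultinomialNB()),
])
model.fit(sentences, types)

print(model.predict(["coffee at the cafe 4 dollars"]))  # predicted type for a new sentence
```

With this setup the location does not need its own dictionary index: it simply contributes words to the bag-of-words vector.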

Popular Items suggestion - Time Sensitive Data - Data Mining

I am a newbie in the field of data mining, and I am working on a very interesting data mining problem. The data description is as follows:
The data is time-sensitive: item attributes depend on time as well as on the class label. I am grouping each week's data as one training or test record. Each week, some of the item attributes may change along with the item's popularity (i.e. the class label).
Some sample data is shown below:
IsBestPicture,MovieID,YearOfRelease,WeekYear,IsBestDirector,IsBestActor,IsBestActress,NumberOfNominations,NumberOfAwards,..,Label
-------------------------------------------------
0_1,60000161,2000,1,9-00,0,0,0,0,0,0,0
0_1,60004480,2001,22,19-02,1,0,0,11,3,0,0
0_1,60000161,2000,5,13-00,0,0,0,0,0,0,1
0_1,60000161,2000,6,14-00,0,0,0,0,0,0,0
0_1,60000161,2000,11,19-00,0,0,0,0,0,0,1
My research advisor suggested using the Naive Bayes algorithm, which can adapt to such dynamic data that changes over time.
I am using data from 2000-2004 for training and 2005 for testing. If I include the Week-Year attribute in my item dataset, it will cause zero probabilities in Naive Bayes, since the test weeks never appear in the training data. Is it OK to omit this attribute from my dataset after organizing the data in chronological order?
Moreover, how can I adapt my model as I read new test cases, since the new test cases might cause a change in the class label?
Can you provide a little more insight into your methods? For instance, are you using R, SPSS, Python, SQL Server 2008 R2, or RapidMiner 5.2? And if you can include a very small segment (3-4 rows) of your data, that would help people figure out how to tackle this.
One immediate approach to get an idea of what you are looking at would be to fit a Random Forest/Decision Tree and run K-Means clustering to determine common separation points in the data (a quick sketch follows below). Have you started with a quick look at the data's histograms, averages, and outliers?
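A quick-look sketch of that suggestion, assuming Python with pandas and scikit-learn, and assuming the sample rows have been saved to a file with the header shown above; the file name and preprocessing are placeholders.

```python
# Sketch: a fast first pass with a decision tree, k-means, and summary statistics.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("movies_weekly.csv")          # hypothetical file name
y = df["Label"]
X = pd.get_dummies(df.drop(columns=["Label", "WeekYear"]))  # drop the attribute causing zero probabilities

# Decision tree: which attributes give the cleanest splits?
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(dict(zip(X.columns, tree.feature_importances_)))

# K-means: do the weekly records fall into natural groups?
print(KMeans(n_clusters=3, random_state=0).fit_predict(X)[:10])

# First-glance summaries: averages, spreads, outliers.
print(df.describe())
```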
