How to fill null values in object attributes in feature engineering? - machine-learning

I have looked into the fill-null methods used on Kaggle for feature engineering.
Some competitors fill NA with another object value.
For example, say a sex column contains 'Male', 'Female' and NA values. The method fills the NAs with another object value, such as 'Middle'. After that, the sex attribute is treated as having no nulls, and pandas will not find any.
I want to know whether this method really has a good impact on a machine learning model's performance, i.e. whether it is good feature engineering.
Besides that, is there any other good way to fill NA when exploring the data set yields no useful insight?
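In pandas terms, the pattern being described is something like this (a minimal sketch; 'Middle' is the sentinel label from the example above):

    import pandas as pd

    df = pd.DataFrame({'sex': ['Male', 'Female', None, 'Male', None]})

    # Replace NA with an explicit third category, so "missing" becomes
    # its own level and pandas no longer reports any nulls.
    df['sex'] = df['sex'].fillna('Middle')
    print(df['sex'].isna().sum())  # 0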

First, it depends on whether your model can handle NA natively (xgboost can, for example).
Second, ask whether the missing values are themselves explanatory of some behaviour (for example, a depressed person is more likely to skip a task).
There is a whole literature on this question. The main approaches are:
Just drop the rows
Fill the missing data with a replacement (the median, the most frequent value, ...)
Fill the missing data and add some noise to it
So here, you can either leave the values as NA and use xgboost, drop the incomplete rows, or impute the most frequent value between male and female.
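As a rough pandas sketch of the 'drop' and 'impute' options (the toy columns are made up for illustration):

    import pandas as pd

    df = pd.DataFrame({'sex': ['Male', 'Female', None, 'Male', None],
                       'age': [25.0, 31.0, 40.0, None, 22.0]})

    # Option 1: drop the incomplete rows.
    df_dropped = df.dropna()

    # Option 2: impute -- the most frequent value for categoricals,
    # the median for numeric columns.
    df_imputed = df.copy()
    df_imputed['sex'] = df_imputed['sex'].fillna(df_imputed['sex'].mode()[0])
    df_imputed['age'] = df_imputed['age'].fillna(df_imputed['age'].median())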
A few recommendations if you want to go further:
Try to understand why the data are missing
Perform a sensitivity analysis of the solution you chose

It largely depends on your data.
But still, there are a few things you can try and check whether they work.
1. If there are few missing values compared to the number of rows, it is better to drop them.
2. If there are many missing values, make a feature "IsMissing" (1 for NULL, 0 for others). Sometimes it works great.
3. If you have a lot of data and you have somehow figured out that the feature is really important, you can train a model to predict Male/Female using your training data, then use the rows with NULL values as test data to predict their value (Male/Female); see the sketch after this list.
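A minimal sketch of points 2 and 3 (the extra columns and the choice of classifier are hypothetical):

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier

    df = pd.DataFrame({'sex': ['Male', 'Female', None, 'Male', None],
                       'age': [25, 31, 40, 28, 22],
                       'income': [30, 45, 52, 38, 27]})

    # Point 2: flag missingness as its own feature.
    df['IsMissing'] = df['sex'].isna().astype(int)

    # Point 3: train on the complete rows, predict the incomplete ones.
    features = ['age', 'income']
    known = df[df['sex'].notna()]
    clf = RandomForestClassifier(random_state=0).fit(known[features], known['sex'])
    df.loc[df['sex'].isna(), 'sex'] = clf.predict(df.loc[df['sex'].isna(), features])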
It's all about creativity and logic. Not every hypothesis you make will work well: as you can see, the last method described above assumes that the NULL values can only take two values (M/F), which in reality may not be the case.
So play around with different tactics and see what works best for your data.
Hope it helps!

Related

Handling a missing value in machine learning

I was analyzing a dataset whose columns are [id, location, tweet, target_value]. I want to handle the missing values in the location column for some rows. My idea is to extract the location from that row's tweet column (if the tweet contains a location) and put that value into the location column.
Now I have some questions regarding the above approach.
Is this a good way to do it? Can we fill missing values using the training data itself? Would this not be considered a redundant feature (because we are deriving the values of this feature from another feature)?
Can you please clarify your dataset a little bit more?
First, if we assume that the location column records where the tweet was posted from, then your method (filling the missing location cells from places mentioned in the tweet text) becomes wrong, because a tweet can mention a place other than the one it was posted from.
Secondly, if we assume that the tweet text correctly contains the location information, then you can fill in the missing rows from the tweets.
If the second assumption holds, it would be a good approach, because you are feeding your dataset with correct information. In other words, you are giving the model more detailed information so that it can predict more accurately at test time.
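As a rough illustration of the extraction idea (the list of known places is a hypothetical stand-in; a named-entity recognizer would be more robust in practice):

    import pandas as pd

    KNOWN_PLACES = {'london', 'paris', 'new york'}  # hypothetical gazetteer

    def extract_location(tweet):
        # Return the first known place mentioned in the tweet, else None.
        text = tweet.lower()
        for place in KNOWN_PLACES:
            if place in text:
                return place
        return None

    df = pd.DataFrame({'tweet': ['Flooding in London today', 'no idea where'],
                       'location': [None, 'Paris']})
    df['location'] = df['location'].fillna(df['tweet'].apply(extract_location))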
Regarding your question "Would this not be considered a redundant feature (because we are deriving the values of this feature from another feature)?":
You can remove the location column and train your model on the remaining three columns, then measure the new model's performance (accuracy etc.) and compare it with the model trained on all four columns. If there is no meaningful difference, you can conclude that the column is redundant; if the results get noticeably worse, it is not. You can also use Principal Component Analysis (PCA) to detect correlated columns.
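A sketch of that comparison on toy data (the feature names and model are placeholders, not the asker's actual pipeline):

    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    # Toy numeric features; 'location_code' stands in for the encoded location.
    rng = np.random.default_rng(0)
    X_full = pd.DataFrame({'tweet_length': rng.integers(10, 280, 200),
                           'location_code': rng.integers(0, 5, 200)})
    y = rng.integers(0, 2, 200)

    with_loc = cross_val_score(LogisticRegression(), X_full, y, cv=5).mean()
    without_loc = cross_val_score(LogisticRegression(),
                                  X_full.drop(columns='location_code'), y, cv=5).mean()
    print(with_loc, without_loc)  # similar scores hint that the column is redundant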
Finally, please NEVER let your training data leak into your test dataset. It leads to an over-optimistic evaluation, and when you use your model in a real-world environment it will most probably fail.

How can I one-hot encode data that has the same values spread across different properties?

I have data containing candidates who look for a job. The original data I got was a complete mess but I managed to enhance it. Now, I am facing an issue which I am not able to resolve.
One candidate record looks like
https://i.imgur.com/LAPAIbX.png
Since most ML algorithms cannot work directly with categorical data, I want to encode this. My goal is to have a candidate record looking like this:
https://i.imgur.com/zzsiDzy.png
What I need is to add a new column for each distinct value that appears in Knowledge1, Knowledge2, Knowledge3, Knowledge4, Tag1 and Tag2 of the original data, without repetition. The encoding I tried gives me far more attributes than I need, which results in an inaccurate model: it creates separate attributes Jscript_Knowledge1, Jscript_Knowledge2, Jscript_Knowledge3 and so on, one for each original column.
If the explanation is not clear enough please let me know so that I could explain it further.
Thanks and any help is highly appreciated.
Cheers!
I have some understanding of your problem based on your explanation, and I will try to elaborate how I would approach it. If that does not solve your problem, I may need more explanation. Let's get started.
For all the candidate data that you have, collect a master skill/knowledge list.
This list becomes your columns.
For each candidate, if he has a given skill, that column becomes 1 for his record; otherwise it stays 0.
This is the essence of one-hot encoding; however, since the same skill is scattered across multiple columns, you are struggling to encode it automatically.
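In pandas, that first approach can be written in a few lines (column names taken from the question, with toy values):

    import pandas as pd

    df = pd.DataFrame({'Knowledge1': ['Java', 'Python'],
                       'Knowledge2': ['SQL', 'Java'],
                       'Tag1': ['Python', 'SQL'],
                       'Tag2': ['CSS', 'HTML']})
    skill_cols = ['Knowledge1', 'Knowledge2', 'Tag1', 'Tag2']

    # Stack all skill columns into one series, one-hot encode it, then
    # collapse back to one row per candidate, so 'Java' becomes a single
    # column no matter which original column it came from.
    dummies = pd.get_dummies(df[skill_cols].stack()).groupby(level=0).max()
    result = df.drop(columns=skill_cols).join(dummies)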
An alternative approach could be:
For each candidate, collect all the knowledge skills as a list and assign it to one column, and collect the tags as another list in a second column, instead of the current 4 (Knowledge) + 2 (Tag) columns.
Sort the knowledge (and tag) list alphabetically within this column.
Automatic one-hot encoding after this may yield fewer columns than before.
Hope this helps!

How to preprocess categorical data (with a large number of unique values) before feeding it to a machine learning model?

Let's say I have a large dataset from an online gaming platform (like Steam) with the columns 'date, user_id, number_of_hours_played, no_of_games', and I have to write a model to predict how many hours a user will play on a given future date. Now, user_id has a large number of unique values (in the millions). I know that for categorical data we can use one-hot encoding, but I am not sure what to do with millions of unique classes. Please also suggest any other methods we could use to preprocess the data.
Using the user id directly in the model is not a good idea, since, as you said, it would result in a large number of features, but also in overfitting, because you would get one id per line (if I understood your data correctly). It would also make your model useless for any new user id, and you would have to retrain it every time a new user appears.
What I would recommend in the first place is to drop this variable and try to build a model with only the other variables.
Another idea you could try is to cluster the users based on the other features, and then pass the cluster as a feature instead of the user id (see the sketch below), though whether this works well depends on the kind of data you have.
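A rough sketch of the clustering idea (the aggregated features and the cluster count are placeholders):

    import numpy as np
    import pandas as pd
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler

    # Aggregate behaviour per user; the real features depend on your data.
    rng = np.random.default_rng(0)
    users = pd.DataFrame({'avg_hours_played': rng.uniform(0, 10, 1000),
                          'no_of_games': rng.integers(1, 50, 1000)})

    X = StandardScaler().fit_transform(users)
    users['cluster'] = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(X)
    # Pass 'cluster' to the model instead of the raw user_id.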
Also, you are talking about making a prediction for a given date. The data you described doesn't quite suggest that, but if you have the number of hours for multiple dates, this is closer to a time-series prediction problem, which is different from a 'classic' regression problem.

Detecting HTML table orientation based only on table data

Given an HTML table with none of its cells marked up as "< th >" or "header" cells, I want to automatically detect whether the table is a "vertical" table or a "horizontal" table.
For example:
This is a horizontal table: [example image omitted]
and this is a vertical table: [example image omitted]
Of course, keep in mind that the bold property, along with shading and any other styling properties, will not be available at classification time.
I was thinking of approaching this by statistical means: I could hand-write a couple of features like "if the first row has numbers but the first column doesn't, that's probably a vertical table", give a score to each feature, and combine the scores to decide the table's orientation class.
Is that how you would approach such a problem? I haven't used any statistics-based algorithm before and I am not sure what would be optimal here.
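A sketch of what such hand-written features could look like (the table is assumed to be a list of rows of cell strings; the features are only examples):

    def is_number(cell):
        try:
            float(cell)
            return True
        except ValueError:
            return False

    def orientation_features(table):
        # table: list of rows, each row a list of cell strings.
        first_row = table[0]
        first_col = [row[0] for row in table]
        return {
            'numeric_frac_first_row': sum(map(is_number, first_row)) / len(first_row),
            'numeric_frac_first_col': sum(map(is_number, first_col)) / len(first_col),
            'n_rows': len(table),
            'n_cols': len(first_row),
        }

    # These feature dicts, plus hand-labelled orientations, can be fed to
    # any standard classifier (logistic regression, random forest, ...).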
This is a somewhat confusing question. You are asking about an ML method, but it seems you have not created training/cross-validation/test sets yet. Without the data-preprocessing step, any discussion of ML methods is pointless.
If I'm right and you haven't created the datasets yet, give us more info on the data (if you look at one example, how do you know the table is vertical or horizontal? how much data do you have? are you always sure whether a table is vertical/horizontal? ...).
If you have already created training/cross-validation/test sets, give us more detail on what the training set looks like (what are the features? how many examples? do you need a white-box solution, where you can see why the ML model gives a particular result? ...).
How general is the domain of the tables? Some web-table schema-identification algorithms use types, properties, and instance data from a general knowledge base such as Freebase to try to identify the property associated with a column. You might try leveraging that knowledge in a classifier.
If you want to do this without any external information, you'll need a set of hand-labelled horizontal and vertical examples.
You say "of course" the font information isn't available, but I wouldn't be so quick to dismiss it, since it's potentially a source of very useful information. Are you sure you can't get your data from a little further back in the pipeline, so that you have access to this info?

Popular Items Suggestion - Time-Sensitive Data - Data Mining

I am a newbie in the field of data mining, working on a very interesting data mining problem. The data description is as follows:
The data is time-sensitive: item attributes depend on the time factor as well as on the class label. I am grouping weekly data as one instance of a training or test record. Each week, some of the item attributes may change, along with the item's popularity (i.e. its class label).
Some sample data as below:
IsBestPicture,MovieID,YearOfRelease,WeekYear,IsBestDirector,IsBestActor,IsBestActress,NumberOfNominations,NumberOfAwards,..,Label
-------------------------------------------------
0_1,60000161,2000,1,9-00,0,0,0,0,0,0,0
0_1,60004480,2001,22,19-02,1,0,0,11,3,0,0
0_1,60000161,2000,5,13-00,0,0,0,0,0,0,1
0_1,60000161,2000,6,14-00,0,0,0,0,0,0,0
0_1,60000161,2000,11,19-00,0,0,0,0,0,0,1
My research advisor suggested using the Naive Bayes algorithm, which can adapt to such dynamic data that changes with time.
I am using data from 2000-2004 for training and 2005 for testing. If I include the Week-Year attribute in my item data set, it will produce zero probabilities in Naive Bayes, because the test weeks never occur in the training data. Is it OK to omit this attribute from my data set after organizing the data in chronological order?
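(For reference, the standard remedy for zero probabilities in Naive Bayes is additive (Laplace) smoothing, which adds a pseudo-count alpha to every attribute value: P(x = v | c) = (count(v, c) + alpha) / (count(c) + alpha * K), where K is the number of possible values of the attribute. Unseen values then receive a small non-zero probability instead of zero, so dropping the attribute is not the only option.)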
Moreover, how do I adapt my model as I read new test cases, given that the new test cases might cause the class label to change?
Can you provide a little more insight into your methods? For instance, are you using R, SPSS, Python, SQL Server 2008R2, or RapidMiner 5.2? And if you can include a very small segment (3-4 rows) of your data, that would help people figure out how to tackle this.
One immediate approach to get an idea of what you are looking at would be a Random Forest/Decision Tree plus K-Means clustering, to determine common separation points in the data (see the sketch below). Have you started with a quick glance at the data's histograms, averages, and outliers?
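A minimal sketch of that exploratory step (the data here is randomly generated with the question's column names; the models are only illustrative):

    import numpy as np
    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier, export_text
    from sklearn.cluster import KMeans

    # Stand-in data with columns from the question.
    rng = np.random.default_rng(0)
    df = pd.DataFrame({'YearOfRelease': rng.integers(2000, 2005, 200),
                       'NumberOfNominations': rng.integers(0, 12, 200),
                       'NumberOfAwards': rng.integers(0, 4, 200),
                       'Label': rng.integers(0, 2, 200)})
    features = ['YearOfRelease', 'NumberOfNominations', 'NumberOfAwards']

    # A shallow decision tree shows which attributes separate the classes.
    tree = DecisionTreeClassifier(max_depth=3).fit(df[features], df['Label'])
    print(export_text(tree, feature_names=features))

    # K-Means reveals natural groupings, independent of the label.
    df['cluster'] = KMeans(n_clusters=4, n_init=10).fit_predict(df[features])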
