I was analyzing a dataset with the following columns: [id, location, tweet, target_value]. I want to handle the missing values in the location column for some rows. My idea was to extract a location from the tweet column of that same row (if the tweet mentions one) and put that value into the location column for that row.
Now I have some questions regarding the above approach.
Is this a good way to handle it? Can we fill missing values using the training data itself? And wouldn't this make location a redundant feature, since we are deriving its values from another feature?
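For the extraction step described above, a minimal sketch could look like the following. It assumes a hypothetical list of known place names (KNOWN_LOCATIONS) to match against the tweet text; in practice a gazetteer or a named-entity recognizer would be used instead.

import pandas as pd

# Hypothetical list of known place names; replace with a gazetteer or an NER model.
KNOWN_LOCATIONS = ["london", "new york", "paris", "mumbai"]

def extract_location(tweet):
    """Return the first known location mentioned in the tweet, or None."""
    text = tweet.lower()
    for place in KNOWN_LOCATIONS:
        if place in text:
            return place.title()
    return None  # no location found; leave the value missing

df = pd.DataFrame({
    "id": [1, 2],
    "location": [None, "Paris"],
    "tweet": ["Flooding reported in London today", "All quiet here"],
    "target_value": [1, 0],
})

# Only fill rows where location is missing.
mask = df["location"].isna()
df.loc[mask, "location"] = df.loc[mask, "tweet"].apply(extract_location)
print(df)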
Can you please clarify your dataset a little bit more?
First, if we assume that the location column records where the tweet was posted from, then your method (filling in the location column from the tweet text in rows where it is missing) would be wrong, because a place mentioned in a tweet is not necessarily the place it was posted from.
Secondly, if we assume that the tweet text correctly contains the location information, then you can fill in the missing rows using the locations found in the tweets.
If our second assumption is correct, then it would be a good approach because you are feeding your dataset with correct information. In other words, you are giving the model more detailed information so that it can predict more accurately at test time.
Regarding your question "Will not this be considered a redundant feature (because we are deriving the values of this feature from another feature)":
You can remove the location column and train your model with the remaining three columns. Then check the performance of the new model using different metrics (accuracy etc.) and compare it with the model trained on all four columns. If there is no meaningful difference, you could conclude that the column is redundant; if the results get noticeably worse without it, the column is carrying useful information. You can also use Principal Component Analysis (PCA) to detect correlated columns. A sketch of this comparison follows below.
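A minimal sketch of that comparison, assuming a pandas DataFrame df with the four columns from the question and a scikit-learn pipeline (the model and encoding choices here are illustrative, not prescriptive):

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

def score(feature_columns, df):
    """Cross-validated accuracy using only the given feature columns."""
    transformers = []
    if "tweet" in feature_columns:
        transformers.append(("tweet", TfidfVectorizer(), "tweet"))
    if "location" in feature_columns:
        transformers.append(("loc", OneHotEncoder(handle_unknown="ignore"), ["location"]))
    pipe = Pipeline([
        ("features", ColumnTransformer(transformers)),
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    return cross_val_score(pipe, df[feature_columns], df["target_value"], cv=5).mean()

# df["location"] = df["location"].fillna("missing")   # avoid NaN issues in the encoder
# with_location = score(["tweet", "location"], df)
# without_location = score(["tweet"], df)
# A negligible gap suggests the location column adds little beyond the tweet text.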
Finally, please NEVER let training data leak into your test dataset. It leads to overly optimistic evaluation, and when you use your model in a real-world environment it will most probably fail.
I want to get the details (unique id) of the incorrectly classified instances using the Weka GUI. I am following the answers to this question. There, they suggest using the StringToNominal filter in the Preprocess tab to convert the unique id, which is a string. However, following that, I am worried that the classifier also treats the unique id column as a feature during classification.
Please suggest the correct way of approaching this.
I am happy to provide examples if needed.
Let's suppose you want to (1) add an instance ID, (2) not use that instance ID in the model, and (3) see the individual predictions, with the instance ID and maybe some other attributes.
We’re going to show this with a smaller data set. Open iris.arff, for example.
Use the AddID filter in the Preprocess tab, in the Unsupervised Attribute filters. ID will be the first attribute.
Now we need to ignore it during modeling. Use the FilteredClassifier (under the meta classifiers) with the Remove filter set to attribute index 1 (the ID), and your chosen classifier as the base classifier.
And we need to output the predictions together with the ID attribute so we can see what happened (in the Classify tab this is under "More options..." → "Output predictions"). Here we output all the attributes, although we don't need them all.
We get this detail in the output window:
=== Predictions on test split ===
inst#,actual,predicted,error,prediction,ID,sepallength,sepalwidth,petallength,petalwidth
1,2:Iris-versicolor,2:Iris-versicolor,,0.968,53,6.9,3.1,4.9,1.5
2,3:Iris-virginica,3:Iris-virginica,,0.968,131,7.4,2.8,6.1,1.9
3,2:Iris-versicolor,2:Iris-versicolor,,0.968,59,6.6,2.9,4.6,1.3
4,1:Iris-setosa,1:Iris-setosa,,1,36,5,3.2,1.2,0.2
5,3:Iris-virginica,3:Iris-virginica,,0.968,101,6.3,3.3,6,2.5
6,2:Iris-versicolor,2:Iris-versicolor,,0.968,88,6.3,2.3,4.4,1.3
7,1:Iris-setosa,1:Iris-setosa,,1,42,4.5,2.3,1.3,0.3
8,1:Iris-setosa,1:Iris-setosa,,1,8,5,3.4,1.5,0.2
and so on.
I have looked into methods for filling null values used on Kaggle for feature engineering.
Some competitors fill the NA with a new category value.
For example, a sex column contains 'Male', 'Female' and NA values. The method fills NA with another category value, such as 'Middle'. After that, the sex column is treated as having no nulls, and pandas will not report any.
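A minimal sketch of that trick in pandas (using 'Missing' as the placeholder category; 'Middle' in the example above plays the same role):

import pandas as pd

df = pd.DataFrame({"sex": ["Male", "Female", None, "Male", None]})

# Replace missing values with an explicit new category so "missing" becomes its own level.
df["sex"] = df["sex"].fillna("Missing")

print(df["sex"].value_counts(dropna=False))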
I want to know whether this method really has a good impact on a machine learning model's performance, or whether it counts as good feature engineering.
Besides that, is there any other good way to fill NA when exploring the dataset has not revealed anything useful?
First, it depends on whether your model can handle NA natively (like xgboost).
Second, is the missingness itself explanatory of a behaviour (for example, a depressed person is more likely to skip a task)?
There is a whole literature about these questions. The main approaches are:
Just drop the rows
Fill the missing data with a replacement (the median, the most frequent value, ...)
Fill the missing data and add some noise to it
So here, you can either leave the NA and use xgboost, drop the incomplete rows, or impute the most frequent value between male and female (a sketch of these options follows below).
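A minimal sketch of those options in pandas, assuming a DataFrame df with a sex column (column names are illustrative):

import pandas as pd

df = pd.DataFrame({"sex": ["Male", "Female", None, "Female", None],
                   "age": [34, 28, 41, None, 52]})

# Option 1: drop rows where the value is missing.
dropped = df.dropna(subset=["sex"])

# Option 2: impute with a replacement value (mode for categories, median for numbers).
imputed = df.copy()
imputed["sex"] = imputed["sex"].fillna(imputed["sex"].mode()[0])
imputed["age"] = imputed["age"].fillna(imputed["age"].median())

# Option 3: leave NA as-is and use a model that handles it natively (e.g. xgboost).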
A few recommendations if you want to go further:
Try to understand why the data are missing
Perform a sensitivity analysis of the solution you chose
It largely depends on your data.
But there are still a few things you can do and check whether they work.
1. If there are few missing values compared to the number of rows, it is usually better to drop them.
2. If there are many missing values, create a feature "IsMissing" (1 for NULL, 0 otherwise). Sometimes this works great.
3. If you have a lot of data and you have figured out that the feature is really important, you can train a model to predict Male/Female using your training data, then use the rows with NULL values as test data and predict their value (Male/Female). A sketch of ideas 2 and 3 follows below.
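A minimal sketch of ideas 2 and 3 above, assuming a pandas DataFrame df with a sex column and some numeric feature columns (all names and the choice of model are illustrative):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

df = pd.DataFrame({
    "sex": ["Male", "Female", None, "Female", None, "Male"],
    "height": [180, 165, 172, 160, 190, 178],
    "weight": [80, 60, 70, 55, 95, 85],
})

# Idea 2: an indicator column marking where the value was missing.
df["sex_is_missing"] = df["sex"].isna().astype(int)

# Idea 3: train a model on rows where sex is known, predict it for the rest.
features = ["height", "weight"]
known = df[df["sex"].notna()]
unknown = df[df["sex"].isna()]

model = RandomForestClassifier(random_state=0)
model.fit(known[features], known["sex"])
df.loc[df["sex"].isna(), "sex"] = model.predict(unknown[features])

print(df)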
It is all about creativity and logic. Not every hypothesis you make will work well; for example, the last method described above assumes that the NULL values can only take two values (M/F), which in reality may not be the case.
So play around with different tactics and see what works best for your data.
Hope it helps!!
Hello, I am new to WEKA and am using Weka 3.6.10.
Sorry if the answer to this question is something obvious.
I have a dataset containing 10 attributes and one decision class. The decision class takes the values {1,2,3,4}. Is there a way to change the configuration so that the values are considered as {1} vs {2,3,4} (binary), rather than each value separately, without modifying the other attributes?
I had a look at the WEKA filters but did not find anything useful.
Thanks guys
Use an Unsupervised Attribute filter, e.g. the NumericToBinary filter. In the topmost field of the configuration dialog, enter the position of the "Decision class" attribute. If it is in the 8th column, enter 8.
The filter will create "dummy variable" columns for each unique value of this attribute. If there are 4 unique values, after applying this filter your dataset will have 4 additional columns. Remove 3 of them.
For a recommendation engine I am trying to convert my movie data to ARFF format, and even though the ARFF format is clear to me, I am unsure what the best way is to solve the following problem.
My dataset is going to be in the following (or similar) format, where rating is the classification variable to be predicted:
For each user a list of:
MovieID-Movie Title-year of release-Genre(s)-Actor(s)-Director-Writer(s)-Runtime-Rating
My problem here is that the features Genre, Actor and Writers can have one or multiple entries, and a Weka ARFF attribute only allows one value per instance. A solution I thought of is:
Have attributes such as genre0, genre1, genre2, and leave some empty if a movie has, for example, only one genre. The problem I see with this is that it would work well for genre, but does that mean that for the actors, for example, I'd have to include all actors in the attribute declaration?
@ATTRIBUTE actor1 {all actors}
@ATTRIBUTE actor2 {all actors}
@ATTRIBUTE actor3 {all actors}
Since they are all possible values for that specific feature. This approach makes the most sense to me, but since there are thousands of actors, directors and writers, the attribute declarations would become rather large.
Is there any better, more efficient, way to do this?
I don't know of a way around it, but some preprocessing may help reduce the size of the attribute declarations. For example, map each actor name to an integer ID:
{'cruise' : 1, 'smith' : 2}
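A minimal sketch of that preprocessing in Python, assuming the raw data has a list of actors per movie (the variable names and data layout are illustrative):

# Build an integer ID for every actor name so the ARFF declaration can use
# compact numeric attributes instead of listing thousands of names.
movies = [
    {"title": "Movie A", "actors": ["cruise", "smith"]},
    {"title": "Movie B", "actors": ["smith"]},
]

actor_ids = {}
for movie in movies:
    for name in movie["actors"]:
        actor_ids.setdefault(name, len(actor_ids) + 1)

print(actor_ids)  # {'cruise': 1, 'smith': 2}

# Each movie's actor slots can then be written as numeric values,
# e.g. actor1=1, actor2=2, with unused slots left as '?' in the ARFF file.
for movie in movies:
    slots = [actor_ids[name] for name in movie["actors"]]
    print(movie["title"], slots)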