Should I change my object variables to integers or create dummy variables? - machine-learning

I am trying to create a model to predict whether or not someone is at risk of a stroke. My data contains some "object" variables that could easily be coded to 0 and 1 (like sex). However, I have some object variables with 4+ categories (e.g. type of job).
I'm trying to encode these objects into integers so that my models can ingest them. I've come across two methods to do so:
Create dummy variables for each feature, which creates more columns and encodes them as 0 and 1
Convert the object into an integer using LabelEncoder, which assigns values to each category like 0, 1, 2, 3, and so on within the same column.
Is there a difference between these two methods? If so, what is the recommended best path forward?

Yes, these two are different. The first method creates more columns, which means more features for the model to fit. The second method creates only one feature for the model to fit. In machine learning, both approaches have their own pros and cons.
Which one to recommend depends on the ML algorithm you use, feature importance, and so on.

Go the dummy variable route.
Say you have a column that consists of 5 job types: construction worker, data scientist, retail associate, machine learning engineer, and bartender. If you use a label encoder (0-4) to keep your data narrow, your model is going to interpret the job title of "data scientist" as 1 greater than the job title of "construction worker". It would also interpret the job title of "bartender" as 4 greater than "construction worker".
The problem here is that these job types really have no relation to each other as they are purely categorical variables. If you dummy out the column, it does widen your data but you have a far more accurate representation of what the data actually represents.
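To make the difference concrete, here is a minimal plain-Python sketch of both encodings using the five job types from the example. The helper `one_hot` and the variable names are made up for illustration; in practice `pandas.get_dummies` or scikit-learn's `OneHotEncoder` would typically do the same job.

```python
# Job types taken from the example above.
jobs = [
    "construction worker",
    "data scientist",
    "retail associate",
    "machine learning engineer",
    "bartender",
]

# Label encoding: one integer per category, all in a single column.
# The integers imply an order and a distance that the jobs do not have.
label_map = {job: i for i, job in enumerate(jobs)}


def one_hot(job, categories):
    """Return one 0/1 indicator per category (hypothetical helper)."""
    return [1 if job == c else 0 for c in categories]


sample = ["bartender", "data scientist", "construction worker"]

label_encoded = [label_map[j] for j in sample]
dummy_encoded = [one_hot(j, jobs) for j in sample]

print(label_encoded)  # [4, 1, 0]
for row in dummy_encoded:
    print(row)        # exactly one 1 per row, no implied ordering
```

With the dummy encoding every pair of jobs is equally far apart, which is the "more accurate representation" described above, at the cost of five columns instead of one.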

Use dummy variables, thereby creating more columns/features for fitting your data. As your data will be scaled beforehand, it will not create problems in the future.
Overall, the accuracy of a model depends on the number of features involved, and additional informative features generally help it predict more accurately.

Related

How to do prediction for regression analysis with multiple target variable

I have a bike rental dataset. In this dataset our target variable is Count, i.e. the total count of bike rentals, which is the sum of two variables in our dataset: the casual user count and the registered user count.
So my question is: how should I perform modelling on this dataset?
Please suggest an approach. I'm thinking of dropping the casual and registered user variables and keeping only the count variable as our target variable, along with the other predictor variables.
The question is rather vague but I will attempt to answer it.
I am not too sure what it is that you want to predict, so I will assume it is the number of bikes that will be rented out at some future time.
If the distinction between casual and registered is important and has significant meaning to the purpose of your project, then you should probably treat them as separate variables and not combine them into one.
Conversely, if the distinction is not important and you only care about the number of bikes, then you should be fine combining them and using the total sum.
I think you should try to understand what you are trying to accomplish and what questions you wish to answer with your analysis.
I converted my two target variables into one by summing them, and then created a new model with only one target variable.
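A minimal sketch of that preparation step, assuming rows stored as dictionaries. The column names `temp`, `casual`, and `registered` and the helper `make_xy` are illustrative, not taken from the actual dataset. The two component counts are removed from the predictors because they add up to the target exactly.

```python
# Hypothetical rows; the real dataset has more predictor columns.
rows = [
    {"temp": 9.8, "casual": 3, "registered": 13},
    {"temp": 14.2, "casual": 8, "registered": 32},
    {"temp": 22.1, "casual": 40, "registered": 110},
]


def make_xy(rows, parts=("casual", "registered")):
    """Sum the component counts into one target and drop them from X."""
    X, y = [], []
    for row in rows:
        y.append(sum(row[p] for p in parts))
        X.append({k: v for k, v in row.items() if k not in parts})
    return X, y


X, y = make_xy(rows)
print(y)  # [16, 40, 150]
print(X)  # only the remaining predictors, here just "temp"
```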

How to pre process a class data (with a large number of unique values) before feeding it to machine learning model?

Let's say I have a large dataset from an online gaming platform (like Steam) which has 'date, user_id, number_of_hours_played, no_of_games', and I have to write a model to predict how many hours a user will play on a given future date. Now, user_id has a large number of unique values (in the millions). I know that for class data we can use one-hot encoding, but I'm not sure what to do when I have millions of unique classes. Also, please suggest any other method we could use to preprocess the data.
Using the user id directly in the model is not a good idea. As you said, it would result in a large number of features, and it would also lead to overfitting, since you would get one id per row (if I understood your data correctly). It would also make your model useless for a new user id, and you would have to retrain the model each time a new user appears.
What I would recommend first is to drop this variable and try to build a model with only the other variables.
Another idea you could try is to cluster the users based on other features and then pass the cluster as a feature instead of the user id, but I don't know whether this is a good idea since I don't know what kind of data you have.
Also, you mention making a prediction for a given date. The data you described doesn't suggest it, but if you have the number of hours for multiple dates, this is closer to a time series prediction problem, which is different from a 'classic' regression problem.
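As a rough sketch of replacing the id with a group label: the code below buckets users by their average past hours, which is a deliberately simple stand-in for a real clustering step such as k-means. The records, the thresholds, and the helper `user_group` are all assumptions made for illustration. The averages should be computed on the training period only, and unseen users fall back to an "unknown" group.

```python
from collections import defaultdict

# Hypothetical play log: (date, user_id, number_of_hours_played, no_of_games)
records = [
    ("2020-01-01", "u1", 0.5, 1),
    ("2020-01-02", "u1", 1.0, 2),
    ("2020-01-01", "u2", 6.0, 3),
    ("2020-01-02", "u2", 8.0, 4),
    ("2020-01-01", "u3", 2.5, 1),
]

# Accumulate [total hours, number of rows] per user.
totals = defaultdict(lambda: [0.0, 0])
for _date, user, hours, _games in records:
    totals[user][0] += hours
    totals[user][1] += 1

avg_hours = {user: total / n for user, (total, n) in totals.items()}


def user_group(avg, light=1.0, heavy=5.0):
    """Map an average to a coarse group; thresholds are arbitrary."""
    if avg < light:
        return "light"
    if avg < heavy:
        return "regular"
    return "heavy"


groups = {user: user_group(avg) for user, avg in avg_hours.items()}

# The group replaces user_id as a low-cardinality categorical feature.
print(groups.get("u1", "unknown"))   # light
print(groups.get("u99", "unknown"))  # unknown (user not seen in training)
```

The resulting handful of groups can then be one-hot encoded without creating millions of columns.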

Detect common features in multidimensional data

I am designing a system for anomaly detection.
There are multiple approaches for building such a system. I chose to implement one facet of it by detecting features shared by the majority of samples. I acknowledge the possible insufficiencies of such a method, but for my specific use case: (1) it suffices to know that a new sample contains (or lacks) features shared by the majority of past data to make a quick decision; (2) I'm interested in the insights such a method will offer into the data.
So, here is the problem:
Consider a large data set with M data points, where each data point may include any number of {key:value} features. I choose to model a training dataset by grouping all the features observed in the data (the set of all unique keys) and setting it as the model's feature space. I define each sample by setting its values for existing keys and None for values in features it does not include.
Given this training data set I want to determine which features reoccur in the data; and for such reoccurring features, do they mostly share a single value.
My question:
A simple solution would be to count everything - for each of the N features calculate the distribution of values. However as M and N are potentially large, I wonder if there is a more compact way to represent the data or more sophisticated method to make claims about features' frequencies.
Am I reinventing an existing wheel? If there's an online approach for accomplishing such a task, it would be even better.
If I understand your question correctly, you need to go over all the data anyway, so why not use hashing?
Actually two hash tables:
Inner hash table for the distribution of feature values.
Outer hash table for feature existence.
This way, the total of the counts in a feature's inner hash table indicates how common the feature is in your data, and the spread of those counts across its values shows whether the samples mostly share a single value. Another thing to notice is that you go over your data only once, and the time complexity of (almost) every operation on hash tables is O(1), provided you allocate enough space from the beginning.
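A minimal sketch of the two-level table, assuming each sample is a dict of hashable values. The function name `count_features` and the sample data are made up for illustration. The outer dict is keyed by feature name and each inner `Counter` holds the value distribution for that feature.

```python
from collections import Counter, defaultdict


def count_features(samples):
    """Single pass: outer table keyed by feature, inner Counter per value."""
    table = defaultdict(Counter)
    for sample in samples:
        for key, value in sample.items():
            table[key][value] += 1
    return table


# Hypothetical samples; missing keys are simply absent.
samples = [
    {"os": "linux", "port": 22},
    {"os": "linux", "port": 80},
    {"os": "linux"},
    {"os": "windows", "port": 22, "proxy": True},
]

table = count_features(samples)
m = len(samples)

for key, values in table.items():
    seen = sum(values.values())  # number of samples containing the feature
    top_value, top_count = values.most_common(1)[0]
    # Fraction of samples with the feature, and share of its dominant value.
    print(key, seen / m, top_value, top_count / seen)
```

Because the table is updated one sample at a time, the same function body works in an online setting: keep the table around and increment it as new samples arrive.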
Hope it helps

Regression when size of explanatory variables differ in length/size

What is generally considered the correct approach when you are performing a regression and your training data contains 'incidents' of some sort, but there may be a varying number of these items per training line?
To give you an example, suppose I wanted to predict the likelihood of accidents on a number of different roads. For each road, I may have a history of multiple accidents, and each accident will have its own attributes (date (how recent), number of casualties, etc). How does one encapsulate all this information on one line?
You could for example assume a maximum of (say) ten and include the details of each as a separate input (date1, NoC1, date2, NoC2, etc....) but the problem is we want each item to be treated similarly and the model will treat items in column 4 as fundamentally separate from those in column 2 above, which it should not.
Alternatively we could include one row for each incident, but then any other columns in each row which are not related to these 'incidents' (such as age of road, width, etc) will be included multiple times and hence produce bias in the results.
What is the standard method that is used to accomplish this?
Many thanks

Popular Items suggestion - Time Sensitive Data - Data Mining

I am a newbie in the field of data mining. I am working on a very interesting data mining problem. The data description is as follows:
Data is time sensitive. Item attributes are dependent on time factor as well as its class label. I am grouping weekly data as one instance of training or test record. Each week, some of the item attributes may change along with its Popularity(i.e. Class label).
Some sample data as below:
IsBestPicture,MovieID,YearOfRelease,WeekYear,IsBestDirector,IsBestActor,IsBestActress,NumberOfNominations,NumberOfAwards,..,Label
-------------------------------------------------
0_1,60000161,2000,1,9-00,0,0,0,0,0,0,0
0_1,60004480,2001,22,19-02,1,0,0,11,3,0,0
0_1,60000161,2000,5,13-00,0,0,0,0,0,0,1
0_1,60000161,2000,6,14-00,0,0,0,0,0,0,0
0_1,60000161,2000,11,19-00,0,0,0,0,0,0,1
My research advisor suggested using the Naive Bayes algorithm, which can adapt to such dynamic data that changes over time.
I am using data from 2000-2004 for training and 2005 for testing. If I include the Week-Year attribute in my items data set, it will cause a 0 probability in Naive Bayes. Is it OK to omit this attribute from my data set after organizing my data in chronological order?
Moreover, how do I adapt my model as I read new test cases, given that the new test cases might cause a change in the class label?
Can you provide a little more insight into your methods? For instance, are you using R, SPSS, Python, SQL Server 2008R2, or RapidMiner 5.2? And if you can include a very small (3-4 row segment) of some of your data, that would help people figure out how to tackle this.
One immediate approach to get an idea of what you are looking at would be to run a Random Forest/Decision Tree and K-Means clustering in order to determine common separation points in the data. Have you started with a quick look at the data's histograms, averages, and outliers?
