I have a bike rental dataset. The target variable is count, the total number of bike rentals, which is the sum of two other variables in the dataset: the casual user count and the registered user count.
So my question is: how should I perform modelling on this dataset?
Please suggest an approach. I'm thinking of dropping the casual and registered user variables and keeping only count as the target variable, along with the other predictor variables.
The question is rather vague, but I will attempt to answer it.
I am not sure what you want to predict, so I will assume it is the number of bikes that will be rented out at some future time.
If the distinction between casual and registered is important and has significant meaning to the purpose of your project, then you should probably treat them as separate features and not combine them into one.
Conversely, if the distinction is not important and you only care about the number of bikes, then you should be fine combining them and using the total.
I think you should try to understand what you are trying to accomplish and what questions you wish to answer with your analysis.
I converted my two target variables into one by summing them, then built a new model with that single target variable.
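For what it's worth, here is a minimal sketch of that setup in plain Python. The rows and the temp and humidity columns are made up for illustration; only casual, registered and count come from the question. The point it tries to show is that once count is the target, casual and registered should not stay among the predictors, because together they add up to the target exactly.

```python
# Hypothetical rows; only casual, registered and count come from the question.
rows = [
    {"temp": 14.0, "humidity": 81, "casual": 3, "registered": 13, "count": 16},
    {"temp": 22.5, "humidity": 60, "casual": 40, "registered": 120, "count": 160},
    {"temp": 9.0, "humidity": 75, "casual": 0, "registered": 1, "count": 1},
]

TARGET = "count"
# casual + registered == count, so keeping them as predictors would leak the target.
TARGET_COMPONENTS = {"casual", "registered"}


def split_features_target(rows):
    """Return (X, y) with the target and its two components removed from X."""
    X, y = [], []
    for row in rows:
        # Sanity check: the target really is the sum of the two components.
        assert row["casual"] + row["registered"] == row[TARGET]
        X.append({
            key: value
            for key, value in row.items()
            if key != TARGET and key not in TARGET_COMPONENTS
        })
        y.append(row[TARGET])
    return X, y


X, y = split_features_target(rows)
```

If the casual/registered split matters for your project, the same function could be adapted to return two target lists instead of one.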
I am trying to create a model to predict whether or not someone is at risk of a stroke. My data contains some "object" variables that could easily be coded to 0 and 1 (like sex). However, I have some object variables with 4+ categories (e.g. type of job).
I'm trying to encode these objects into integers so that my models can ingest them. I've come across two methods to do so:
Create dummy variables for each feature, which creates more columns and encodes them as 0 and 1
Convert the object into an integer using LabelEncoder, which assigns values to each category like 0, 1, 2, 3, and so on within the same column.
Is there a difference between these two methods? If so, what is the recommended best path forward?
Yes, these two are different. The first method creates more columns, which means more features for the model to fit. The second creates only one feature for the model to fit. In machine learning, both approaches have their own pros and cons.
Which one to recommend depends on the ML algorithm you use, feature importance, and so on.
Go the dummy variable route.
Say you have a column that consists of 5 job types: construction worker, data scientist, retail associate, machine learning engineer, and bartender. If you use a label encoder (0-4) to keep your data narrow, your model is going to interpret the job title of "data scientist" as 1 greater than the job title of "construction worker". It would also interpret the job title of "bartender" as 4 greater than "construction worker".
The problem here is that these job types really have no relation to each other as they are purely categorical variables. If you dummy out the column, it does widen your data but you have a far more accurate representation of what the data actually represents.
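To make the difference concrete, here is a small sketch of both encodings in plain Python, using the five job types above. The function names are my own and this is not how any particular library implements it; it only illustrates the shape of the output in each case.

```python
jobs = [
    "construction worker",
    "data scientist",
    "retail associate",
    "machine learning engineer",
    "bartender",
    "data scientist",
]

# Fixed category order, taken from first appearance in the data.
categories = list(dict.fromkeys(jobs))


def label_encode(values, categories):
    """Map each category to one integer code, all in a single column."""
    index = {category: i for i, category in enumerate(categories)}
    return [index[value] for value in values]


def one_hot_encode(values, categories):
    """Map each value to a 0/1 row with one column per category."""
    return [[1 if value == category else 0 for category in categories]
            for value in values]


labels = label_encode(jobs, categories)     # one column, implies an ordering
dummies = one_hot_encode(jobs, categories)  # five columns, no ordering implied
```

With label encoding, "bartender" ends up as 4 and "construction worker" as 0, which is the artificial ordering described above. With dummy variables, every row has exactly one 1 and no category is "greater" than another.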
Use dummy variables, thereby creating more columns/features for fitting your data. As your data will be scaled beforehand, this will not create problems later.
Overall, a model's accuracy depends in part on the number of features involved, and more informative features generally let us predict more accurately, though extra features that carry no signal can also hurt.
I have a dataset that contains a number of different Subject IDs. They are each part of a specific Chain and were assigned a Generation number in an experiment. I have the mean StructureScore (how much a language is structured) for each participant, but I also want to see what the mean StructureScore is for each generation.
For example, Generation 7 of Chain E exists 4 times, so I want to have the mean score for those 4 participants. I'm not sure how to make a new dataset of just those mean StructureScores. Any suggestion is appreciated.
Check this out!
This type of operation is exactly what aggregate was designed for:
Mean per group in a data.frame
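The answers above point to R's aggregate. For readers who are not using R, the same group-then-average idea can be sketched in plain Python as follows. The records are invented for illustration; only the field names (Chain, Generation, StructureScore) follow the question.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records; field names follow the question, values are made up.
records = [
    {"SubjectID": 1, "Chain": "E", "Generation": 7, "StructureScore": 0.50},
    {"SubjectID": 2, "Chain": "E", "Generation": 7, "StructureScore": 0.25},
    {"SubjectID": 3, "Chain": "E", "Generation": 7, "StructureScore": 0.75},
    {"SubjectID": 4, "Chain": "E", "Generation": 7, "StructureScore": 0.50},
    {"SubjectID": 5, "Chain": "E", "Generation": 8, "StructureScore": 1.00},
    {"SubjectID": 6, "Chain": "A", "Generation": 7, "StructureScore": 0.25},
    {"SubjectID": 7, "Chain": "A", "Generation": 7, "StructureScore": 0.75},
]


def mean_score_by_group(records):
    """Return {(Chain, Generation): mean StructureScore} for every group."""
    groups = defaultdict(list)
    for record in records:
        key = (record["Chain"], record["Generation"])
        groups[key].append(record["StructureScore"])
    return {key: mean(scores) for key, scores in groups.items()}


group_means = mean_score_by_group(records)
```

The resulting dictionary is the "new dataset" of means, with one entry per Chain and Generation combination.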
Let's say I have a large dataset from an online gaming platform (like Steam) with the columns 'date, user_id, number_of_hours_played, no_of_games', and I have to build a model to predict how many hours a user will play on a given future date. The user_id column has a very large number of unique values (in the millions). I know that for categorical data we can use one-hot encoding, but I'm not sure what to do when there are millions of unique categories. Also, please suggest any other method we could use to preprocess the data.
Using the user ID directly in the model is not a good idea. As you said, it would produce a large number of features, and it would also lead to overfitting, since you would get one ID per line (if I understood your data correctly). It would also make your model useless for a new user ID, so you would have to retrain your model each time a new user appears.
What I would recommend in the first place is to drop this variable and try to build a model with only the other variables.
Another idea you could try is to cluster your users based on other features and then pass the cluster as a feature instead of the user ID, but I don't know whether this is a good idea since I don't know the kind of data you have.
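As a rough illustration of the clustering idea, here is a toy sketch in plain Python. Everything in it is an assumption made for the example: the play log, the choice of mean hours per user as the only clustering feature, the tiny 1-D k-means, and the starting centroids. Real data would likely call for more features and a proper clustering library.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical play log: (user_id, hours_played).
log = [("u1", 1.0), ("u1", 2.0), ("u2", 1.0),
       ("u3", 9.0), ("u3", 11.0), ("u4", 12.0)]


def mean_hours_per_user(log):
    """Collapse the log to one number per user: their mean hours played."""
    hours = defaultdict(list)
    for user_id, played in log:
        hours[user_id].append(played)
    return {user_id: mean(values) for user_id, values in hours.items()}


def nearest(value, centroids):
    """Index of the centroid closest to value."""
    return min(range(len(centroids)), key=lambda i: abs(value - centroids[i]))


def kmeans_1d(values, centroids, iterations=10):
    """Tiny 1-D k-means with caller-supplied starting centroids."""
    centroids = list(centroids)
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for value in values:
            clusters[nearest(value, centroids)].append(value)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [mean(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids


profile = mean_hours_per_user(log)
centroids = kmeans_1d(list(profile.values()), centroids=[0.0, 10.0])
# The cluster index replaces user_id as a low-cardinality feature.
user_cluster = {user_id: nearest(value, centroids)
                for user_id, value in profile.items()}
```

A new user can be assigned to the nearest existing centroid once some history is available, so the model does not need retraining for every new ID.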
Also, you are talking about making a prediction on a given date. The data you described doesn't suggest that but if you have the number of hours per multiple dates, this is closer to a time series prediction problem, which is different from a 'classic' regression problem.
I am designing a system for anomaly detection.
There are multiple approaches for building such a system. I chose to implement one facet of it by detecting features shared by the majority of samples. I acknowledge the possible shortcomings of this method, but for my specific use case: (1) it suffices to know that a new sample contains (or lacks) features shared by the majority of past data to make a quick decision; (2) I'm interested in the insights this method will offer about the data.
So, here is the problem:
Consider a large data set with M data points, where each data point may include any number of {key:value} features. I choose to model a training dataset by grouping all the features observed in the data (the set of all unique keys) and setting it as the model's feature space. I define each sample by setting its values for existing keys and None for values in features it does not include.
Given this training data set, I want to determine which features reoccur in the data and, for such reoccurring features, whether they mostly share a single value.
My question:
A simple solution would be to count everything - for each of the N features calculate the distribution of values. However as M and N are potentially large, I wonder if there is a more compact way to represent the data or more sophisticated method to make claims about features' frequencies.
Am I reinventing an existing wheel? If there's an online approach for accomplishing such task it would be even better.
If I understand your question correctly, you need to go over all the data anyway, so why not use a hash table?
Actually two hash tables:
Inner hash table for the distribution of feature values.
Outer hash table for feature existence.
In this way, the total of the counts in a feature's inner hash table indicates how common the feature is in your data, and the distribution over the actual values indicates how much they differ from one another. Another thing to notice is that you go over your data only once, and the time complexity of (almost) every operation on hash tables (if you allocate enough space from the beginning) is O(1).
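A minimal sketch of the two-level hash table in Python, assuming feature values are hashable. The function names, the sample data and the 0.5 threshold are placeholders of mine, not part of the question.

```python
from collections import Counter, defaultdict


def summarize(samples):
    """One pass over the data.

    Outer dict: feature key -> inner Counter.
    Inner Counter: feature value -> number of samples with that value.
    """
    value_counts = defaultdict(Counter)
    total = 0
    for sample in samples:
        total += 1
        for key, value in sample.items():
            value_counts[key][value] += 1
    return value_counts, total


def common_features(value_counts, total, min_share=0.5):
    """For features present in more than min_share of samples, report the
    dominant value and the share of occurrences that carry it."""
    report = {}
    for key, counts in value_counts.items():
        present = sum(counts.values())
        if present / total > min_share:
            value, n = counts.most_common(1)[0]
            report[key] = (value, n / present)
    return report


# Hypothetical samples; missing keys simply do not appear in a sample.
samples = [
    {"os": "linux", "port": 22},
    {"os": "linux", "port": 80},
    {"os": "linux"},
    {"os": "mac", "proto": "tcp"},
]

value_counts, total = summarize(samples)
report = common_features(value_counts, total)
```

Because the counters are only ever incremented, the same structure can be updated one sample at a time as new data arrives, which covers the online case. Absent features never need to be stored as None.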
Hope it helps
I have a question:
I have a domain: LoanAccount. We have different loan products, but they differ only in how the interest is calculated.
For example:
1. Regular Loan calculates interest using the Annuity Interest Rate formula
2. Vehicle Loan calculates interest using the Flat Interest Rate formula
3. Temporary Loan calculates interest with another formula (I have no idea what it is).
We could also change the rules every year, in which case we would use a different formula as well.
My Question:
Should I put all the formula logic in services?
Should I make every loan a different domain class?
Or should I make one domain class with different interest calculation methods?
Any example would be good :)
Thank you in advance !
My suggestion is to separate interest calculating logic from the domain objects.
Hard-wiring the domain object to its interest calculation is likely to lead to trouble:
With hard-wiring, it would be more complicated to change the type of interest calculation for an existing account type (which is a business request you can expect).
With the logic separated, when a new account type is created you can easily reuse all the calculation methods you have already implemented.
It's likely that interest-calculating algorithm will grow in complexity in the future and it may need properties that should not be part of Account domain object, like some business constants, list of transactions etc.
Grails (because of Spring) naturally supports keeping business logic in services (declarative transactions etc.) rather than in the domain objects. You will always have less pain going along with the framework than working against it.
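Since an example was requested, here is a language-neutral sketch of the separation, written in Python rather than Groovy/Grails. The class and function names, the product keys and the exact formulas (total interest over the term, monthly payments for the annuity case) are assumptions for illustration; your real formulas may differ.

```python
from dataclasses import dataclass


def flat_interest(principal, annual_rate, years):
    """Total interest at a flat rate: charged on the original principal."""
    return principal * annual_rate * years


def annuity_interest(principal, annual_rate, years):
    """Total interest for an annuity loan with equal monthly payments."""
    months = years * 12
    monthly_rate = annual_rate / 12
    if monthly_rate == 0:
        return 0.0
    payment = principal * monthly_rate / (1 - (1 + monthly_rate) ** -months)
    return payment * months - principal


@dataclass
class LoanAccount:
    """Domain object: holds data only, knows nothing about formulas."""
    product: str
    principal: float
    annual_rate: float
    years: int


class InterestService:
    """Service: picks the calculation for a product from a registry."""

    def __init__(self, calculators):
        self._calculators = dict(calculators)

    def total_interest(self, account):
        calculate = self._calculators[account.product]
        return calculate(account.principal, account.annual_rate, account.years)


# Hypothetical product keys; changing a rule means changing one entry here.
service = InterestService({"regular": annuity_interest,
                           "vehicle": flat_interest})

regular = LoanAccount("regular", 1000.0, 0.10, 2)
vehicle = LoanAccount("vehicle", 1000.0, 0.10, 2)
regular_total = service.total_interest(regular)
vehicle_total = service.total_interest(vehicle)
```

When the rules change for a product, or a new product such as the Temporary Loan is added, only the registry passed to the service changes; LoanAccount stays untouched.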