Can someone explain whether, when doing multiple regression modelling, it is necessary for both the dependent variable and all the independent variables to follow a normal distribution, or is it enough for just the dependent variable (y) to be normally distributed?
In the regression problem I'm working with, there are five independent columns and one dependent column. I cannot share the data set details directly due to privacy, but one of the independent variables is an ID field which is unique for each example.
I feel like I should not be using the ID field to estimate the dependent variable, but this is just a gut feeling; I have no strong reason for it.
What should I do? Is there any way to decide which variables to use and which to ignore?
Well, I agree with @desertnaut. The ID attribute does not seem relevant when creating a model and provides no help in prediction.
The term you are looking for is feature selection. Since it is a broad topic, I will just mention the methods most commonly used by data scientists.
For regression problems, you can try a correlation heatmap to find the features that are highly correlated with the target.
import seaborn as sns   # assuming df is your pandas DataFrame
sns.heatmap(df.corr())
There are several other approaches too, such as PCA or the built-in feature-selection methods of tree-based models, to find the right features for your model.
You can also try James Phillips' method. This approach is limited, since training time grows linearly with the number of features, but in your case, where you have only four features to compare, you can try it out. You compare the regression model trained with all four features against models trained with only three features, dropping one of the four features in turn. This means training a model for each dropped feature and comparing them with the full model, as sketched below.
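If it helps, a minimal sketch of that comparison with scikit-learn's cross_val_score might look like the following; the DataFrame df and the column names ("id", "target") are placeholders for your own data.

# Leave-one-feature-out comparison (sketch; df, "id" and "target" are placeholders)
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X = df.drop(columns=["id", "target"])   # drop the ID column up front
y = df["target"]

# Baseline: model trained on all remaining features
baseline = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2").mean()
print(f"all features: R^2 = {baseline:.3f}")

# Drop each feature in turn and compare against the baseline
for col in X.columns:
    score = cross_val_score(LinearRegression(), X.drop(columns=[col]), y,
                            cv=5, scoring="r2").mean()
    print(f"without {col}: R^2 = {score:.3f}")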
As you say, the ID variable is unique for each example, so the model won't be able to learn anything from it: every example comes with a new ID, and since each ID occurs only once there are no general patterns to learn.
Regarding feature elimination, it depends. If you have domain knowledge, you can engineer or remove features based on that alone. If you don't know much about the domain, you can try basic techniques like backward selection or forward selection via cross-validation, and keep the model with the best value of the metric you're working with (see the sketch below).
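As a rough sketch (not the only way to do it), scikit-learn's SequentialFeatureSelector implements forward/backward selection with cross-validation; X (a pandas DataFrame of features) and y (the target) below stand for your own data.

from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Backward selection with 5-fold cross-validation (n_features_to_select is your call)
selector = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=3,
    direction="backward",   # or "forward"
    cv=5,
)
selector.fit(X, y)
print(X.columns[selector.get_support()])  # the features that were kept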
While going through a kernel on Kaggle about regression, it was mentioned that the data should look like a normal distribution, but I don't understand why.
I know this question might be very basic, but please help me understand this concept.
Thanks in Advance!!
Regression models make a number of assumptions, one of which is normality. When this assumption is violated, your p-values and confidence intervals around the coefficient estimates could be wrong, leading to incorrect conclusions about the statistical significance of your predictors.
However, a common misconception is that the data (i.e. the variables/predictors) need to be normally distributed, but this is not true. These models don't make any assumptions about the distribution of the predictors.
For example, imagine a case where you have a binary predictor in a regression (Male/Female, Slow/Fast, etc.): it would be impossible for this variable to be normally distributed, yet it is still a valid predictor to use in a regression model. The normality assumption actually refers to the distribution of the residuals, not the predictors themselves.
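To illustrate, here is one common way to check the residuals (a sketch only; X and y stand for your design matrix and response, and the exact workflow will depend on your setup):

import scipy.stats as stats
import statsmodels.api as sm
import matplotlib.pyplot as plt

# Fit an ordinary least squares model and inspect its residuals
model = sm.OLS(y, sm.add_constant(X)).fit()
residuals = model.resid

sm.qqplot(residuals, line="s")   # points close to the line suggest roughly normal residuals
plt.show()

print(stats.shapiro(residuals))  # formal normality test (very sensitive for large samples)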
I have just started my data science journey, so please pardon me if the question is silly. If a dataset has two columns that are dependent on each other, like 'fare' and 'type_of_seat', should we include both features in the training set, or will including only one of them do the job?
How strong is the dependency? Put another way, do you gain any information by including both features? If one can be directly computed from the other (as in your example, I suspect fare can be directly inferred from type_of_seat) then you don't gain anything and can leave one of the features out, since they are redundant.
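A quick, hedged way to check this in pandas (column names taken from the question; df is assumed to be your DataFrame):

# If every seat type maps to exactly one fare, fare adds no new information
print(df.groupby("type_of_seat")["fare"].nunique())

# Otherwise, see how strongly the two move together (after encoding the seat type)
print(df["fare"].corr(df["type_of_seat"].astype("category").cat.codes))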
I am quite new to machine learning, with only a little experience from a few projects.
Now I have a project related to insurance. I have several databases about clients, which I will merge to get all the available information about them, and one database for the claims. I need to build a model that identifies how risky a client is, based on ranks.
My question: I need to build my target variable, which ranks the clients by how risky they are, based on the claims. I could use different strategies to do that, but I am confused about the following:
- Shall I do a specific type of analysis before building the ranks, such as clustering, or do I need a strong theoretical assumption that matches the project provider's vision?
- If I use some variables from the claims database to build the ranks, how should I deal with them later? In other words, should I remove them from the final training data set to avoid correlation with the target variable, or can I treat them differently and keep them?
- If I keep them, is there a special treatment for them depending on whether they are categorical or continuous variables?
Every machine learning project starts with EDA. First create some features, such as how often a client files a bad claim or how many claims they have filed, then do some EDA to find which features are most useful. Secondly, the problem looks like classification; clustering is usually harder to evaluate.
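For example, a rough sketch of building claim-based features with pandas might look like this (the clients and claims DataFrames and their column names are assumptions about your data):

import pandas as pd

# Aggregate the claims table per client
claim_features = (
    claims.groupby("client_id")
          .agg(n_claims=("claim_amount", "size"),
               total_claimed=("claim_amount", "sum"),
               avg_claim=("claim_amount", "mean"))
          .reset_index()
)

# Merge onto the client table; clients with no claims get zeros
data = clients.merge(claim_features, on="client_id", how="left")
data[["n_claims", "total_claimed", "avg_claim"]] = data[
    ["n_claims", "total_claimed", "avg_claim"]].fillna(0)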
In data science, when you build a business model, exploratory data analysis (EDA) plays a major role; it includes data cleaning, feature engineering, and filtering the data. As for how to build the target variable, it all depends on the attributes you have and the model you want to apply, say linear regression, logistic regression, or a decision tree. Most importantly, you need to find the impactful variables, i.e. the core relation between the output and the given inputs, and prioritise accordingly. Attributes that add no value should be removed, as they would contribute to overfitting.
You can do clustering too. Interestingly, any unsupervised learning problem can be converted into a form of supervised learning. You can also try logistic regression or linear regression, etc., and find out which model fits your project best.
I don't know how I should approach this problem:
I have a data set. A user may or may not be part of a funded scheme.
I want to use machine learning to deduce that users who are not part of the scheme were susceptible to certain conditions, e.g. 1, 2, 3 and 4, while those in the scheme were susceptible to 1, 2 and 4. It can therefore be deduced that if you are part of the scheme you won't be susceptible to condition 3.
I have a second, related problem as well. Within the funded scheme a user can have one of two plans (costing different amounts). I would like to see whether those on the cheaper plan were susceptible to more conditions than those on the more expensive plan.
Can anyone help me as to whether this a recommendation or a classification problem and what specific algorithms I should look at?
Thanks.
Neither. It's a statistics problem. Your dataset is complete and you don't mention any need to predict attributes of future subjects or schemes, so training a classifier or a recommender wouldn't serve their usual goals.
You could use a person's conditions as features and their scheme status as the target, classify them with an SVM, and then use the classification performance/accuracy as a measure of the separability of the classes. You could also consider clustering. However, a t-test would do the same thing and is a much more accepted tool for justifying the validity of claims like this.
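For instance, a minimal sketch of that t-test with SciPy could look like this (the columns "in_scheme" and "n_conditions" are assumed names for scheme membership and the number of conditions per user):

from scipy import stats

in_scheme = df.loc[df["in_scheme"] == 1, "n_conditions"]
not_in_scheme = df.loc[df["in_scheme"] == 0, "n_conditions"]

# Welch's t-test: are the mean condition counts different between the groups?
t, p = stats.ttest_ind(in_scheme, not_in_scheme, equal_var=False)
print(f"t = {t:.2f}, p = {p:.4f}")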
It looks like you are trying to build a system that would classify a user as funded or not funded, and if not funded, reason why they were not funded.
If this is the case, what you need is a machine learning classifier that is interpretable, i.e., one where the reasoning behind a decision can be conveyed to users. You may want to look at decision trees and (to a lesser extent) random forests and gradient-boosted trees.
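As a small, hedged sketch of that idea, you could fit a shallow decision tree and print its rules; X (the condition indicators) and y (the funded/not-funded label) are placeholders for your own data.

from sklearn.tree import DecisionTreeClassifier, export_text

# Keep the tree shallow so the rules stay readable
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))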