How to select the right features for my regression model? - machine-learning

I'm trying to increase the accuracy of my model for the BlogFeedback dataset problem
from the UCI Machine Learning Repository, and I need help with feature selection.
I'm new to machine learning. I have selected features using p-values and the adjusted R-squared method, but the accuracy of my model is still very low. What should I do now?
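
As a starting point beyond p-values and adjusted R-squared, a wrapper method such as recursive feature elimination is worth trying. Here is a minimal sketch using scikit-learn; the file path is a placeholder for wherever the BlogFeedback CSV lives, and the Lasso alpha is an assumption you would tune:

```python
# Sketch: recursive feature elimination with cross-validation.
# Assumes the BlogFeedback CSV is all-numeric with the target in the
# last column; the path and alpha below are placeholders.
import numpy as np
from sklearn.feature_selection import RFECV
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold

data = np.loadtxt("blogData_train.csv", delimiter=",")  # placeholder path
X, y = data[:, :-1], data[:, -1]

# Lasso also performs implicit feature selection via L1 shrinkage.
selector = RFECV(
    estimator=Lasso(alpha=0.1),      # assumed alpha; tune via CV
    step=10,                         # drop 10 features per iteration
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    scoring="neg_mean_squared_error",
)
selector.fit(X, y)
print("Features kept:", selector.support_.sum())
```

If a linear model with the selected features still scores poorly, the problem may be model capacity rather than feature choice, so a non-linear model (e.g. gradient-boosted trees) is worth comparing.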

Related

How does Azure ML give an output for a value that was not used when training the model?

I am trying to predict the price of a house, so I added no-of-rooms as one variable for the prediction. The values of that variable were (3, 2, 1) when I was training the model. Now I am setting no-of-rooms to "6" to get an output (a value which was not used before to get a prediction). How will it give an output for a new value? Does it only consider the variables other than no-of-rooms? I used Boosted Decision Tree Regression as the model.
The short answer is that when you train your model on a set of features and then run predictions on a test set, yes, it will be able to handle feature values the model hasn't previously seen during training. If your test set contains large outliers that differ significantly from what the model saw during training, accuracy will suffer, but it will still attempt a prediction.
This is less of an Azure Machine Learning question and more a question of machine learning basics (or really just the basics of how regression works). I would do some research on both "linear regression" and the concept of "overfitting in machine learning". These are two very basic conceptual topics that will help with your understanding. Understanding regression will help you see why a model can use a value it hasn't previously seen to create a prediction.
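
To make this concrete, here is a small sketch using scikit-learn's GradientBoostingRegressor as a stand-in for Azure ML's Boosted Decision Tree Regression (the toy prices are invented). Note that tree ensembles in particular cannot extrapolate beyond the target range they saw in training:

```python
# Sketch: predicting on a feature value never seen in training.
# GradientBoostingRegressor stands in for Azure ML's boosted decision
# tree regression; the toy prices are invented for illustration.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

X_train = np.array([[1], [2], [3], [1], [2], [3]])   # no-of-rooms seen: 1-3
y_train = np.array([100, 150, 200, 110, 160, 210])   # toy prices

model = GradientBoostingRegressor(n_estimators=50).fit(X_train, y_train)

# The model still returns a prediction for 6 rooms: the value 6 falls into
# the rightmost leaf of every tree, so the output matches what the model
# predicts for the largest value it actually saw (3 rooms).
print(model.predict(np.array([[6]])))
```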

In stacking for machine learning, which order should you train the models in?

I am currently learning how to do stacking for a machine learning problem. I am going to take the outputs of the first model and use them as features for the second model.
My question is: does the order matter? I am using a lasso regression model and a boosted tree. In my problem the regression model outperforms the boosted tree, so I am thinking I should use the regression model second and the boosted tree first.
What factors do I need to think about when making this decision?
Why don't you try feature engineering to create more features?
Don't try to use predictions from one model as features for another model.
You can try using k-means to cluster similar training samples.
For stacking, just use different models and then average the results (assuming that you have a continuous y variable).
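
A minimal sketch of that averaging (blending) approach, with synthetic data standing in for the real problem:

```python
# Sketch: averaging (blending) the predictions of two regressors.
# make_regression supplies placeholder data; training order is irrelevant.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=20, noise=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

lasso = Lasso(alpha=1.0).fit(X_tr, y_tr)
gbt = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)

# Simple unweighted average of the two models' predictions.
blended = (lasso.predict(X_te) + gbt.predict(X_te)) / 2
print("Blended MSE:", mean_squared_error(y_te, blended))
```

Since both models are trained independently, the question of which one to train first does not arise with plain averaging.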

Machine Learning - Feature selection and modeling (connection between samples)

I am new to machine learning, and I have several questions about my data.
Let's say I have X samples and Y features, and I also have the connection between x1 and x2 (e.g. the interaction count).
Most machine learning tutorials start with labels attached to the samples themselves, so I would like to ask how I should build the model. I want a model that, given two specific samples, can predict how high the interaction count would be.
Giving me a direction/keywords to learn would be good enough, thanks!
I have received a suggestion for the approach:
Formulate the problem as z = f(x1, x2), i.e. the label depends on a tuple of samples. If a dataset of pairs ((x1, x2) => z) is prepared, it can then be used to train regression models, decision trees, or networks.
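
A sketch of that formulation: build one training row per known pair by concatenating the two samples' feature vectors, with the interaction count as the regression target. All data and names below are illustrative:

```python
# Sketch: z = f(x1, x2) as regression on concatenated pair features.
# `samples`, `pairs`, and `counts` are invented placeholders.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
samples = rng.normal(size=(100, 5))                # 100 samples, 5 features

pairs = [(0, 1), (0, 2), (1, 3), (2, 4), (3, 4)]   # pairs with known counts
counts = [12, 3, 7, 0, 5]                          # interaction counts (z)

# One row per pair: features of x1 concatenated with features of x2.
X = np.array([np.concatenate([samples[i], samples[j]]) for i, j in pairs])
y = np.array(counts)

model = RandomForestRegressor(random_state=0).fit(X, y)

# Predict the interaction count for a previously unseen pair (1, 4).
pair_features = np.concatenate([samples[1], samples[4]]).reshape(1, -1)
print(model.predict(pair_features))
```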

Creating a supervised model in machine learning

I have recently learned how supervised learning works: it learns from a labeled dataset and predicts unlabeled data.
But I have a question: is it fine to retrain the model on the predicted data and then predict unlabeled data again, repeating the process?
For example, model M is created from a labeled dataset D of 10 examples, and model M then predicts datum A. Datum A is added to dataset D and model M is created again. The process is repeated for the remaining unpredicted data.
What you are describing here is a well-known technique known as (among other names) "self-training" or "self semi-supervised training". See for example these slides: https://www.cs.utah.edu/~piyush/teaching/8-11-print.pdf. There are hundreds of modifications around this idea. Unfortunately, in general it is hard to prove that it should help, so while it will help on some datasets it will hurt on others. The main criterion here is the quality of the very first model, since self-training rests on the assumption that your original model is really good, so you can trust it enough to label new examples. It might help with slow concept drift given a strong model, but it will fail miserably with weak models.
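
A minimal self-training loop might look like the sketch below. The base model and the 0.9 confidence threshold are assumptions; only predictions the model is confident about are folded back into the labeled set (scikit-learn also ships a ready-made SelfTrainingClassifier in sklearn.semi_supervised):

```python
# Sketch: self-training. Only predictions above a confidence threshold
# (assumed 0.9 here) are added to the labeled set before retraining.
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(X_l, y_l, X_u, threshold=0.9, max_rounds=10):
    for _ in range(max_rounds):
        model = LogisticRegression(max_iter=1000).fit(X_l, y_l)
        if len(X_u) == 0:
            break
        preds = model.predict(X_u)
        confidence = model.predict_proba(X_u).max(axis=1)
        confident = confidence >= threshold
        if not confident.any():
            break  # nothing the model trusts enough; stop early
        # Move confidently labeled points into the training set, retrain.
        X_l = np.vstack([X_l, X_u[confident]])
        y_l = np.concatenate([y_l, preds[confident]])
        X_u = X_u[~confident]
    return model
```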
What you describe is called online machine learning, incremental supervised learning, or updateable classifiers. There are a bunch of algorithms that accomplish this behavior; see for example the Weka toolbox's Updateable Classifiers.
I suggest looking at the following ones (a rough scikit-learn analogue is sketched after the list):
HoeffdingTree
IBk
NaiveBayesUpdateable
SGD
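
Those are Weka (Java) classifiers; in scikit-learn the analogous capability is estimators with a partial_fit method, which update in place one batch at a time. A sketch with synthetic data:

```python
# Sketch: incremental (updateable) learning via partial_fit, a rough
# scikit-learn analogue of Weka's Updateable Classifiers. Data is synthetic.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
model = SGDClassifier(random_state=0)

# First batch: all possible classes must be declared up front.
X1, y1 = rng.normal(size=(100, 4)), rng.integers(0, 2, 100)
model.partial_fit(X1, y1, classes=np.array([0, 1]))

# A later batch updates the same model; no retraining on X1 is needed.
X2, y2 = rng.normal(size=(50, 4)), rng.integers(0, 2, 50)
model.partial_fit(X2, y2)
```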

In an online machine learning algorithm (linear regression with stochastic gradient descent), when new training data arrives, do we have to mix it with previous data?

Suppose I have 1 billion data points with which we have already trained our machine learning model and obtained our parameters/weights. Now I receive another 100 data points; how do I train on this new data? Moving beyond linear regression, how do we train on new examples of spam/not-spam in spam filtering if we have already trained on, say, 2 billion emails?
It seems to me that you should use a different algorithm (i.e. an online algorithm).
I've never tried this in practice, but here's a paper from NIPS (a well-respected ML conference) that you may find useful: Online Linear Regression and Its Application to Model-Based Reinforcement Learning. (This same algorithm was suggested in an answer to a similar question on Cross Validated.)
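
For plain linear regression trained with SGD, the weights can simply take further gradient steps on the new points, without revisiting the old data. A minimal sketch with scikit-learn's SGDRegressor and synthetic data:

```python
# Sketch: updating an already-trained linear model on 100 new points.
# partial_fit takes gradient steps only on the batch it is given, so the
# original billion points never need to be touched again. Data is synthetic.
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)
w_true = np.array([2.0, -1.0, 0.5])

# Stand-in for the large dataset the model was originally trained on.
X_old = rng.normal(size=(10_000, 3))
y_old = X_old @ w_true + rng.normal(scale=0.1, size=10_000)
model = SGDRegressor(random_state=0)
model.partial_fit(X_old, y_old)

# 100 new points arrive: update the existing weights in place.
X_new = rng.normal(size=(100, 3))
y_new = X_new @ w_true + rng.normal(scale=0.1, size=100)
model.partial_fit(X_new, y_new)
print(model.coef_)
```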
