Dependent variables role in Kmeans Clustering - machine-learning

I have two variables in my data that depend on each other. I need to run k-means clustering on this data set. Do I need to discard one of the variables before clustering, or can both be fed to the algorithm as input? Any help would be highly appreciated.

If the relationship is very strong, it should make next to no difference.
Why don't you just try, and compare the results? Does it make a difference?
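One way to "just try" it, as a minimal sketch: cluster the data once with both variables and once with only one of them, then measure how often the two partitions agree. Everything below is an illustrative assumption rather than part of the original question: the synthetic data, the tiny Lloyd's-algorithm `kmeans` helper, the farthest-point initialisation, and the use of the Rand index as the agreement score.

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

def kmeans(points, k, iters=100):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    # Deterministic farthest-point initialisation (an illustrative choice).
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist2(p, c) for c in centers)))
    labels = []
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centre.
        labels = [min(range(k), key=lambda j: dist2(p, centers[j])) for p in points]
        # Update step: each centre moves to the mean of its members.
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                updated.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                updated.append(centers[j])
        if updated == centers:
            break
        centers = updated
    return labels

def rand_index(a, b):
    """Fraction of point pairs on which two labelings agree (same/different cluster)."""
    n = len(a)
    agree = sum((a[i] == a[j]) == (b[i] == b[j])
                for i in range(n) for j in range(i + 1, n))
    return agree / (n * (n - 1) / 2)

# Synthetic data: two well-separated groups in x, and y strongly dependent on x.
random.seed(0)
x = [random.gauss(0, 0.5) for _ in range(30)] + [random.gauss(10, 0.5) for _ in range(30)]
y = [2 * v + random.gauss(0, 0.1) for v in x]

labels_both = kmeans(list(zip(x, y)), k=2)    # both dependent variables
labels_one = kmeans([(v,) for v in x], k=2)   # one variable discarded

print("agreement (Rand index):", rand_index(labels_both, labels_one))
```

A Rand index close to 1 means dropping the variable did not change the clustering. On real data with additional, unrelated variables, keeping both dependent variables gives that shared direction more weight in the distance, so it is worth running the same comparison on your own data.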

Related

Random Forest using one variable

I am trying to determine the optimal group of variables for a classification task. Sometimes instead of a group of variables, only a single variable should be selected (but the data was pretty weak looking at each variable alone).
I used several classifiers (Random Forest, Logistic regression, SVM) and I have a small problem in understanding the results (the best results were achieved by using RF).
Can someone with a deeper conceptual understanding of random forests than me please explain what a random forest using one variable is doing? Since there is only one variable, it is hard for me to see how the random forest can achieve a better sensitivity/specificity than that single variable can ever achieve alone (which it does). Is the RF in this case just a decision tree? I suspected so, and after testing I observed that all the scores (accuracy, F1, precision, recall) were the same for the two of them.
Thanks for the help.

What's an approach to ML problem with multiple data sets?

What's your approach to solving a machine learning problem with multiple data sets with different parameters, columns and lengths/widths? Only one of them has a dependent variable. Rest of the files contain supporting data.
Your query is too generic, and to some extent beside the point. The concern about differing column counts and lengths is not really an issue when building an ML model. Given that only one of the datasets has a dependent variable, you will need to merge the datasets on keys that are common across them. Typically, the process followed before modelling is:
Step 0: Identify the dependent variable and decide whether to do regression or classification (assuming you are predicting the variable's value).
Step 1: Clean up the provided data by handling duplicates and spelling mistakes.
Step 2: Scan through the categorical variables and resolve any discrepancies.
Step 3: Merge the datasets into a single dataset that has all the independent variables and the dependent variable to be predicted.
Step 4: Do exploratory data analysis to understand how the dependent variable behaves with respect to the independent variables.
Step 5: Create a model and refine it based on VIF (variance inflation factor) and p-values.
Step 6: Iterate, reducing the variables until you get a model in which all variables are significant and the R^2 value is stable. Finalize the model.
Step 7: Apply the trained model to the test dataset and compare the predicted values against the actual values in the test dataset.
Following these steps at a high level will help you build models.
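The merge on a common key can be sketched as follows. This is only an illustration under assumptions: the two tables, the column names (`customer_id`, `region`, `n_orders`, `churned`) and the `left_join` helper are all hypothetical, and the helper assumes the key is unique in the supporting table.

```python
# Hypothetical tables: only `orders` carries the dependent variable `churned`;
# `customers` is supporting data that shares the key `customer_id`.
orders = [
    {"customer_id": 1, "n_orders": 5, "churned": 0},
    {"customer_id": 2, "n_orders": 1, "churned": 1},
    {"customer_id": 4, "n_orders": 2, "churned": 0},
]
customers = [
    {"customer_id": 1, "region": "north"},
    {"customer_id": 2, "region": "south"},
    {"customer_id": 3, "region": "north"},
]

def left_join(left, right, key):
    """Keep every row of `left`; add the columns of the matching `right` row.

    Rows of `left` without a match get None in the added columns.
    Assumes `key` is unique in `right`.
    """
    lookup = {row[key]: row for row in right}
    extra_cols = [c for c in right[0] if c != key]
    joined = []
    for row in left:
        match = lookup.get(row[key])
        merged = dict(row)
        for col in extra_cols:
            merged[col] = match[col] if match else None
        joined.append(merged)
    return joined

# The table holding the dependent variable goes on the left, so no labelled row is lost.
dataset = left_join(orders, customers, "customer_id")
for row in dataset:
    print(row)
```

Putting the table with the dependent variable on the left keeps every labelled row; unmatched rows show up with missing values that can then be handled during the clean-up steps.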

Use k-means test results for training set SPSS

I am a student working with SPSS (statistics) for the first time. I used 1,000 rows of test data to run the k-means cluster tool and obtained the results. I now want to take those results and run them against a test set (another 1,000 rows) to see how my model did.
I am not sure how to do this; any help is greatly appreciated!
Thanks
For a clustering model (or any unsupervised model), there really is no right or wrong result. There is no target variable to compare the cluster model's result (the cluster allocation) against, so the idea of splitting the data set into a training and a testing partition does not apply to these types of models.
The best you can do is to review the output of the model and explore the cluster allocations and determine whether these appear to be useful for the intended purpose.
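Exploring the cluster allocations can be as simple as looking at each cluster's size and per-variable means. The snippet below is a sketch outside SPSS with made-up numbers; `rows`, `labels` and `profile_clusters` are hypothetical names, and `labels` stands in for the cluster membership column that a k-means run produces.

```python
from collections import defaultdict
from statistics import mean

# Made-up data: each row is (variable_1, variable_2); labels are cluster memberships.
rows = [(1.0, 20.0), (3.0, 22.0), (8.0, 5.0), (9.0, 7.0), (7.0, 6.0)]
labels = [0, 0, 1, 1, 1]

def profile_clusters(rows, labels):
    """Return the size and per-variable means of each cluster."""
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[label].append(row)
    return {
        label: {
            "size": len(members),
            "means": tuple(mean(col) for col in zip(*members)),
        }
        for label, members in groups.items()
    }

profile = profile_clusters(rows, labels)
for label, summary in sorted(profile.items()):
    print(label, summary)
```

Whether clusters with these sizes and means are useful is a judgment about the intended purpose, not something a held-out test set can score.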

Query about variable selection in Random Forest

I have a small question about variable selection in random forests. I am aware that it chooses m random variables out of M variables for splitting and keeps the value of m constant throughout.
My question is why these m variables are not the same at each node. What is the reason behind this? Can someone help with this?
Thanks,
The fact that it uses a different, randomly chosen set of m features at each split is actually an advantage of RF. It makes the individual trees less correlated with one another, so the final model is more robust and accurate. It also helps in identifying which features contribute most and have the best predictive power.
Btw, that's why it is called Random Forest after all...
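To make the mechanism concrete: in a random forest, a fresh subset of m candidate features is drawn every time a node is split, while m itself stays fixed. The sketch below shows only that sampling step, not tree building; `candidate_features`, the seed and the values of M and m are illustrative assumptions.

```python
import random

def candidate_features(n_features, m, rng):
    """Draw m distinct feature indices out of n_features for one split."""
    return sorted(rng.sample(range(n_features), m))

M, m = 10, 3            # M features in total, m candidates per split (m stays constant)
rng = random.Random(0)  # fixed seed only so the run is repeatable

# A new random subset is drawn for every node that gets split.
subsets = [candidate_features(M, m, rng) for _ in range(5)]
for node, subset in enumerate(subsets):
    print(f"node {node}: best split is searched only among features {subset}")
```

If the same m features were reused at every node, a few strong predictors would dominate every tree and the trees would make very similar errors, which averaging could not cancel out.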

The Role of the Training & Tests Sets in Building a Decision Tree and Using it to Classify

I've been working with Weka for a couple of months now.
Currently, I'm working on my machine learning course here at Ostfold University College.
I need a better way to construct a decision tree based on separate training and test sets.
If anybody can come up with a good idea, it would be a great relief.
Thanks in advance.
-Neo
You might be asking for something more specific, but in general:
You build the decision tree with the training set, and you evaluate the performance of that tree using the test set. In other words, on the test data, you call a function usually named something like classify, passing in the newly-built tree and a data point (within your test set) you wish to classify.
This function returns the leaf (terminal) node from your tree to which that data point belongs--and assuming that the contents of that leaf is homogeneous (populated with data from a single class, not a mixture) then you have in essence assigned a class label to that data point. When you compare that class label assigned by the tree to the data point's actual class label, and repeat for all instances in your test set, you have a metric to evaluate the performance of your tree.
A rule of thumb: shuffle your data, then assign 90% to the training set and the other 10% to a test set.
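The shuffle, split and evaluate loop described above can be sketched like this. It is not Weka code: `train_test_split`, `classify`, `accuracy` and the single-threshold "tree" are hypothetical stand-ins, and the toy data is generated so that the example runs on its own.

```python
import random

def train_test_split(rows, test_fraction=0.1, seed=0):
    """Shuffle the rows, then hold out `test_fraction` of them as the test set."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = max(1, round(len(rows) * test_fraction))
    return rows[n_test:], rows[:n_test]

def classify(tree, features):
    """Stand-in for a real tree: a single threshold test (a decision stump)."""
    return "high" if features[tree["feature"]] > tree["threshold"] else "low"

def accuracy(classify_fn, tree, test_rows):
    """Share of test rows whose predicted label matches the actual label."""
    correct = sum(classify_fn(tree, features) == label for features, label in test_rows)
    return correct / len(test_rows)

# Toy data: one feature, label "high" when the value exceeds 50.
data = [((v,), "high" if v > 50 else "low") for v in range(100)]
train, test = train_test_split(data)

# Pretend this tree was learned from `train`; only `test` is used for scoring.
tree = {"feature": 0, "threshold": 50}
print(len(train), len(test), accuracy(classify, tree, test))
```

The point of the split is that the rows used for scoring were never seen while the tree was built, so the accuracy reflects how the tree handles new data.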
Actually, I was looking for something like this - http://weka.wikispaces.com/Saving+and+loading+models
to save a model, load it and use it in the training set.
This is exactly what I was searching for. I hope it is useful for anyone who has a similar problem.
Cheers
-Neo182
