I always have trouble understanding the significance of the chi-squared test and how to use it for feature selection. I tried reading the wiki page, but I didn't get a practical understanding. Can anyone explain?
The chi-squared test helps you determine the most significant features from a list of available features by measuring the dependence between each feature variable and the target variable.
Example below is taken from https://chrisalbon.com/machine-learning/chi-squared_for_feature_selection.html
The test below selects the two best features (since we assign 2 to the "k" parameter) out of the four features available initially.
# Load libraries
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
# Load iris data
iris = load_iris()
# Create features and target
X = iris.data
y = iris.target
# Convert to categorical data by converting data to integers
X = X.astype(int)
# Select two features with highest chi-squared statistics
chi2_selector = SelectKBest(chi2, k=2)
X_kbest = chi2_selector.fit_transform(X, y)
type(X_kbest)
# Show results
print('Original number of features:', X.shape[1])
print('Reduced number of features:', X_kbest.shape[1])
Chi-squared feature selection is a univariate feature selection technique for categorical variables. It can also be used for continuous variables, but they need to be discretized into categories first.
How does it work?
It tests the null hypothesis that the outcome class is independent of the categorical variable by computing a chi-squared statistic from the contingency table. For more details on contingency tables and the chi-squared test, check the video: https://www.youtube.com/watch?v=misMgRRV3jQ
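As a minimal sketch of that calculation (the contingency table below is made up purely for illustration, not taken from any real data), scipy's chi2_contingency returns the statistic, p-value and expected counts directly:
# Hypothetical example: test whether a binary feature is independent of the class.
import numpy as np
from scipy.stats import chi2_contingency
# Rows: feature = 0 / 1, columns: class = A / B (made-up counts)
contingency_table = np.array([[30, 10],
                              [15, 45]])
chi2_stat, p_value, dof, expected = chi2_contingency(contingency_table)
print("chi-squared statistic:", chi2_stat)
print("p-value:", p_value)
print("expected counts under independence:\n", expected)
A small p-value means we reject independence, i.e. the feature carries information about the class.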
To categorize continuous data, a range of techniques is available, from simple frequency-based binning to more advanced approaches such as Minimum Description Length and entropy-based binning.
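For example, a simple equal-frequency binning step before the chi-squared selection could look like the following sketch (using sklearn's KBinsDiscretizer and the same iris data as above):
# Sketch: discretize continuous features, then run chi-squared selection on the bins.
from sklearn.datasets import load_iris
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.feature_selection import SelectKBest, chi2
X, y = load_iris(return_X_y=True)
# Equal-frequency ("quantile") binning into 4 bins per feature
binner = KBinsDiscretizer(n_bins=4, encode='ordinal', strategy='quantile')
X_binned = binner.fit_transform(X)
# Chi-squared selection now operates on the binned (categorical) values
X_best = SelectKBest(chi2, k=2).fit_transform(X_binned, y)
print(X_best.shape)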
An advantage of using the chi-squared test on a (binned) continuous variable is that it can capture non-linear relationships with the outcome variable.
I want to check whether clustering would be helpful or not on my coordinates.
I'm dealing with trajectories and want to check whether all of them start in the same area (the trajectories themselves are different). The aim here is to characterise the most frequent departure points.
However, sometimes there is no need for clustering at all. I'm using k-means here. I had thought of using the silhouette score, but I don't see how it is mathematically defined for the case where there is only one cluster. DBSCAN would not be a good choice, as the densities are not similar across the clusters I want to build.
Would you have an idea for a check between k=1 and k=3, to decide which would be the best split for my data? I'm dealing here with coordinate data (latitude/longitude) where the starting point is not 100% fixed but can vary within 2 km around a kind of barycentre.
A simple extract with k=2:
from pyspark.ml.feature import VectorAssembler
vecAssembler = VectorAssembler(inputCols=["lat", "lon"], outputCol="features")
df1 = vecAssembler.transform(df)
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator
# Train a k-means model on the assembled features.
kmeans = KMeans().setK(2).setSeed(1)
model = kmeans.fit(df1.select('features'))
# Make predictions
transformed = model.transform(df1)
evaluator = ClusteringEvaluator(predictionCol='prediction', featuresCol='features', \
metricName='silhouette', distanceMeasure='squaredEuclidean')
evaluator.evaluate(transformed)
Is there a way in PySpark to compute the case k=1, in order to derive elbow or gap statistics?
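One possible route (a sketch, not a tested answer) is to compare the within-cluster sum of squared errors rather than the silhouette, since it is defined for k=1 as well; in recent PySpark versions (2.4+) the fitted model exposes it as summary.trainingCost:
# Sketch: k-means training cost (WSSSE) for several k, including k=1, to look for an elbow.
# Assumes df1 with a 'features' column as assembled in the question.
from pyspark.ml.clustering import KMeans

costs = {}
for k in range(1, 4):
    kmeans = KMeans().setK(k).setSeed(1)
    model = kmeans.fit(df1.select('features'))
    costs[k] = model.summary.trainingCost  # within-cluster sum of squared distances

for k, cost in costs.items():
    print(k, cost)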
sklearn contains implementations of different feature selection methods (filter/wrapper/embedded).
All those methods are designed for static data.
Does sklearn support feature selection on dynamic data (data which varies with time)?
With dynamic data, feature selection needs to be more efficient in order to remain effective.
I found some methods in IEEE papers (incremental approaches for feature selection),
so is there any implementation in sklearn or another open-source library?
Couldn't you just re-run your process on a scheduled basis and load your data dynamically? I wouldn't expect the dependent variable to change at all, but I suppose the independent variables could change somewhat.
# 1) load your dataframe
# 2) copy your target variable into a new series
y = df['SeriousDlqin2yrs']
# 3) drop your target variable from the features
x = df[df.columns[df.columns != 'SeriousDlqin2yrs']]
Finally, run this.
from sklearn.ensemble import RandomForestClassifier
import numpy as np
import matplotlib.pyplot as plt

# feature names, used later to label the plot
features = np.array(x.columns)
clf = RandomForestClassifier()
clf.fit(x, y)
# from the calculated importances, order them from least to most important
# and make a barplot so we can visualize what is/isn't important
importances = clf.feature_importances_
sorted_idx = np.argsort(importances)
padding = np.arange(len(features)) + 0.5
plt.barh(padding, importances[sorted_idx], align='center')
plt.yticks(padding, features[sorted_idx])
plt.xlabel("Relative Importance")
plt.title("Variable Importance")
plt.show()
I just tried that and got a variable-importance bar plot as the result.
If you have non-numeric features, you will need to use one-hot encoding to handle them.
import pandas as pd
pd.get_dummies(df)
http://queirozf.com/entries/one-hot-encoding-a-feature-on-a-pandas-dataframe-an-example
I have a set of inputs with 5000ish features, with values ranging from 0.005 to 9000000. Within each feature the values are on a similar scale (a feature with values around 10 will not also have values around 0.1).
I am trying to apply linear regression to this data set; however, the wide range of input values is inhibiting effective gradient descent.
What is the best way to handle this variance? If normalization is best, please include details on the best way to implement this normalization.
Thanks!
Simply perform normalization as a pre-processing step. You can do it as follows:
1) Calculate the mean of each feature in the training set and store it. Be careful not to mix up the feature mean with the sample mean; you should end up with a vector of size [number_of_features] (5000ish).
2) Calculate the standard deviation of each feature in the training set and store it. It has size [number_of_features] as well.
3) Update each training and testing entry as:
updated = (original_vector - mean_vector) / std_vector
That's it!
The code will look like:
# train_data shape: [train_length, 5000]
# test_data shape:  [test_length, 5000]
import numpy as np

# per-feature statistics, computed on the training set only (axis=0)
mean = np.mean(train_data, axis=0)
std = np.std(train_data, axis=0)
normalized_train_data = (train_data - mean) / std
normalized_test_data = (test_data - mean) / std
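Equivalently, sklearn's StandardScaler wraps the same fit-on-train, transform-both pattern; a minimal sketch assuming the same train_data and test_data arrays:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
normalized_train_data = scaler.fit_transform(train_data)  # learns per-feature mean/std on the training set
normalized_test_data = scaler.transform(test_data)        # reuses the training statistics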
I think I understand that until recently people used the coef_ attribute to extract the most informative features from linear models in Python's machine learning library sklearn. Now users get pointed to SelectFromModel instead. SelectFromModel lets you reduce the features based on a threshold, so something like the following code reduces the features down to those with an importance > 0.5. My question now: is there any way to determine whether a feature is positively or negatively discriminating for a class?
I have my data in a pandas dataframe called data; the first column is a list of filenames of text files, and the second column is the label.
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel
import numpy as np

count_vect = CountVectorizer(input="filename", analyzer="word")
X_train_counts = count_vect.fit_transform(data["filenames"])
print(X_train_counts.shape)
tf_transformer = TfidfTransformer(use_idf=True)
traindata = tf_transformer.fit_transform(X_train_counts)
print(traindata.shape) #report size of the training data
clf = LogisticRegression()
model = SelectFromModel(clf, threshold=0.5)
X_transform = model.fit_transform(traindata, data["labels"])
print("reduced features: ", X_transform.shape)
#get the names of all features
words = np.array(count_vect.get_feature_names())
#get the names of the important features using the boolean index from model
print(words[model.get_support()])
To my knowledge you need to go back to the .coef_ attribute and see which coefficients are negative or positive. A negative coefficient decreases the odds of that class (a negative relationship), while a positive coefficient increases the odds of that class (a positive relationship).
However, this method will not give you the importance, only the direction; you will still need SelectFromModel to extract that.
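As a sketch of how the two can be combined, using the words array and the fitted SelectFromModel from the question's code, you could look at the sign of the coefficient for each selected feature (for multi-class problems coef_ has one row per class; the first row is used here):
# Sketch: direction (sign of the coefficient) for the features kept by SelectFromModel.
fitted_clf = model.estimator_             # the LogisticRegression fitted inside SelectFromModel
mask = model.get_support()                # boolean mask of the selected features
selected_words = words[mask]
selected_coefs = fitted_clf.coef_[0][mask]
for word, coef in zip(selected_words, selected_coefs):
    direction = "positive" if coef > 0 else "negative"
    print(word, coef, direction)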
I know of a couple of classification algorithms, such as decision trees, but I can't apply any of them to the problem I have at hand.
I have a dataset in which each row contains information about a purchase. Its columns are:
- customer id
- store id where the purchase took place
- date and time of the event
- amount of money spent
I'm trying to build a model that, given the information of who, where and when, predicts how much money is going to be spent.
What are some possible ways of doing this? Are there any well-known algorithms?
Also, I'm currently learning RapidMiner, and I'm experimenting with some of its features. Nothing I've tried there allows me to use a real number (the amount spent) as a label. Maybe I'm doing something wrong?
You could use a Decision Tree Regressor for this. Using a toolkit like scikit-learn, you could use the DecisionTreeRegressor algo where your features would be store id, date and time, and customer id, and your target would be the amount spent.
You could turn this into a supervised learning problem. This is untested code, but it could probably get you started
# Load libraries
import numpy as np
import pylab as pl
from sklearn import datasets
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV
from sklearn import metrics

def fit_predict_model(data_import):
    """Find and tune the optimal model. Make a prediction on housing data."""
    # Get the features and labels from your data
    X, y = data_import.data, data_import.target
    # Set up a Decision Tree Regressor
    regressor = DecisionTreeRegressor()
    parameters = {'max_depth': (4, 5, 6, 7), 'random_state': [1]}
    scoring_function = metrics.make_scorer(metrics.mean_absolute_error, greater_is_better=False)
    ## fit your data to it ##
    reg = GridSearchCV(estimator=regressor, param_grid=parameters, scoring=scoring_function, cv=10, refit=True)
    fitted_data = reg.fit(X, y)
    print("Best Parameters: ")
    print(fitted_data.best_params_)
    # Use the model to predict the output of a particular sample
    x = [## input a test sample in this list ##]
    y = reg.predict([x])  # predict expects a 2-D array of samples
    print("Prediction: " + str(y))

fit_predict_model(##your data in here)
I took this almost directly from a project I was working on to predict housing prices, so there are probably some unnecessary libraries, and without doing validation you have no idea how accurate this would be, but it should get you started.
Check out this link:
http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html
Yes, as the comments have pointed out, it's regression that you need. Linear regression does sound like a good starting point, as you don't have a huge number of variables.
In RapidMiner, type "regression" into the Operators menu and you'll see several options under Modelling -> Functions: Linear Regression, Polynomial, Vector, etc. (There are more, but as a beginner let's start here.)
Right-click any of these operators, press Show Operator Info, and you'll see that numerical labels are allowed.
Next scroll through the help documentation of the operator and you'll see a link to a tutorial process. It's really simple to use, but it's good to get you started with an example.
Let me know if you need any help.