How to handle alphanumeric values in machine learning

I am trying to find the best algorithm for my claims data. The claims data include some diagnosis codes which are alphanumeric, like 'EA43454'. When I run the code below to evaluate the models:
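from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
# X (features, including the diagnosis codes) and y (target) are loaded elsewhere
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))
# evaluate each model in turn
results = []
names = []
scoring = 'accuracy'
for name, model in models:
    kfold = model_selection.KFold(n_splits=10, random_state=None)
    cv_results = model_selection.cross_val_score(model, X, y, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)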
I get the error:
ValueError: could not convert string to float: 'U0003'
How should I handle these alphanumeric values?

You need to convert your strings to indicator (dummy) variables: each distinct value of the string column becomes its own 0/1 column, so that the models can train on numeric data.
Scikit-learn has several preprocessors to help with this, such as OneHotEncoder. You can also use pandas.get_dummies, but sklearn's own classes are more composable; for example, you can use them as part of a pipeline.
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder
rng = np.random.default_rng()
animals = pd.DataFrame({"animal": rng.choice(["cat", "dog"], size=10),
                        "age": rng.integers(1, 20, size=10)})
animals_ohe = OneHotEncoder().fit_transform(animals.drop(columns=["age"]))
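To illustrate the point about composability, here is a minimal sketch of the encoder used inside a pipeline together with cross-validation. The column names (diagnosis_code, amount), the synthetic data, and the choice of LogisticRegression are assumptions made for illustration only; they are not taken from the question's dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
# Hypothetical claims data: one alphanumeric code column and one numeric column
claims = pd.DataFrame({
    "diagnosis_code": rng.choice(["U0003", "EA43454", "B1200"], size=60),
    "amount": rng.integers(100, 1000, size=60),
})
# Hypothetical binary target derived from the code, just to have something to predict
y = (claims["diagnosis_code"] == "U0003").astype(int).to_numpy()

# One-hot encode the string column and scale the numeric one
preprocess = ColumnTransformer([
    ("codes", OneHotEncoder(handle_unknown="ignore"), ["diagnosis_code"]),
    ("num", StandardScaler(), ["amount"]),
])
model = Pipeline([("prep", preprocess), ("clf", LogisticRegression(max_iter=1000))])

# The encoder is re-fitted on each training fold, so no information leaks from the test fold
scores = cross_val_score(model, claims, y, cv=KFold(n_splits=5), scoring="accuracy")
print(scores.mean())
```

Setting handle_unknown="ignore" means a code that appears only in a test fold is encoded as all zeros instead of raising an error, which tends to matter for high-cardinality columns such as diagnosis codes.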

Related

How to fine tune a masked language model?

I'm trying to follow the Hugging Face tutorial on fine-tuning a masked language model (masking a set of words randomly and predicting them). But they assume that the dataset is in their system (it can be loaded with from datasets import load_dataset; load_dataset("dataset_name")). However, my input dataset is a long string:
text = "This is an attempt of a great example. "
dataset = text * 3000
I followed their approach and tokenized it:
from transformers import AutoTokenizer
from transformers import AutoModelForMaskedLM
import torch
from transformers import DataCollatorForLanguageModeling
model_checkpoint = "distilbert-base-uncased"
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
def tokenize_long_text(tokenizer, long_text):
    # Split into sentences, tokenize each one, then flatten into a single list of ids
    individual_sentences = long_text.split('.')
    tokenized_sentences_list = tokenizer(individual_sentences)['input_ids']
    tokenized_sequence = [x for xs in tokenized_sentences_list for x in xs]
    return tokenized_sequence
tokenized_sequence = tokenize_long_text(tokenizer, dataset)
Followed by chunking it into equal-length segments:
def chunk_long_tokenized_text(tokenizer_text, chunk_size):
    # Compute length of long tokenized texts
    total_length = len(tokenizer_text)
    # We drop the last chunk if it's smaller than chunk_size
    total_length = (total_length // chunk_size) * chunk_size
    return [tokenizer_text[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
chunked_sequence = chunk_long_tokenized_text(tokenized_sequence, 30)
Created a data collator for random masking:
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15) # expects a list of dicts, where each dict represents a single chunk of contiguous text
Example of how it works:
d = {}
d['input_ids'] = chunked_sequence[0]
d
>>>{'input_ids': [101,
2023,
2003,
1037,
2307,
103,...
for chunk in data_collator([ d ])["input_ids"]:
    print(f"\n'>>> {tokenizer.decode(chunk)}'")
>>>'>>> [CLS] this is a great [MASK] [SEP] [CLS] this is a great [MASK] [SEP] [CLS] this is a great [MASK] [SEP] [CLS] this is a great [MASK] [SEP] [CLS] this'
However, the remaining steps (which I believe are just the training component) seem to work only with their Trainer class, which appears to accept only their dataset format.
How can this work with a dataset in the form of a string?
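One possible direction, sketched under assumptions: the data collator shown above expects a list of dicts with an input_ids key, so the chunks can be wrapped into that shape first. The helper names below (chunk_token_ids, to_examples) are hypothetical, and a plain list of integers stands in for the real token ids so the snippet runs without downloading a model.

```python
def chunk_token_ids(token_ids, chunk_size):
    """Split a flat list of token ids into equal-length chunks, dropping the remainder."""
    usable_length = (len(token_ids) // chunk_size) * chunk_size
    return [token_ids[i:i + chunk_size] for i in range(0, usable_length, chunk_size)]


def to_examples(chunks):
    """Wrap each chunk in the dict format that the data collator expects."""
    return [{"input_ids": chunk} for chunk in chunks]


# Stand-in for the tokenized long string (real ids would come from the tokenizer)
token_ids = list(range(100))
chunks = chunk_token_ids(token_ids, 30)
examples = to_examples(chunks)
print(len(examples), len(examples[0]["input_ids"]))  # 3 30
```

From here, the datasets library offers Dataset.from_dict({"input_ids": chunks}) to build an in-memory dataset from Python lists, and the result should be usable as train_dataset for Trainer together with the data_collator defined earlier. That last step is untested here and is best treated as a starting point.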

Where can I get the pretrained word embeddings for BERT?

I know that BERT has a total vocabulary size of 30522, which contains some words and subwords. I want to get the initial input embeddings of BERT. So, my requirement is a table of size [30522, 768] that I can index by token id to get its embedding. Where can I get this table?
BERT models in transformers have a get_input_embeddings() method:
import torch
from transformers import BertModel, BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert = BertModel.from_pretrained('bert-base-uncased')
token_embedding = {token: bert.get_input_embeddings()(torch.tensor(id)) for token, id in tokenizer.get_vocab().items()}
print(len(token_embedding))
print(token_embedding['[CLS]'])
Output:
30522
tensor([ 1.3630e-02, -2.6490e-02, -2.3503e-02, -7.7876e-03, 8.5892e-03,
-7.6645e-03, -9.8808e-03, 6.0184e-03, 4.6921e-03, -3.0984e-02,
1.8883e-02, -6.0093e-03, -1.6652e-02, 1.1684e-02, -3.6245e-02,
8.3482e-03, -1.2112e-03, 1.0322e-02, 1.6692e-02, -3.0354e-02,
-1.2372e-02, -2.5173e-02, -8.9602e-03, 8.1994e-03, -2.0011e-02,
-1.5901e-02, -3.8394e-03, 1.4241e-03, 7.0500e-03, 1.6092e-03,
-2.7764e-03, 9.4931e-03, -2.2768e-02, 1.9317e-02, -1.3442e-02,
-2.3763e-02, -1.4617e-02, 9.7735e-03, -2.2428e-03, 3.0642e-02,
6.7829e-03, -2.6471e-03, -1.8553e-02, -1.2363e-02, 7.6489e-03,
-2.5461e-03, -3.1498e-01, 6.3761e-03, 4.8914e-02, -7.7636e-03,
6.0919e-02, 2.1346e-02, -3.9741e-02, 2.2853e-01, 2.6502e-02,
-1.0144e-03, -7.8480e-03, -1.9995e-03, 1.7057e-02, -3.3270e-02,
4.5421e-03, 6.1751e-03, -1.0077e-01, -2.0973e-02, -1.4512e-04,
-9.6657e-03, 1.0871e-02, -1.4786e-02, 2.6437e-04, 2.1166e-02,
1.6492e-02, -5.1928e-03, -1.1857e-02, -9.9159e-03, -1.4363e-02,
-1.2405e-02, -1.2973e-02, 2.6778e-02, -1.0986e-02, 1.0572e-02,
-2.5566e-02, 5.2494e-03, 1.5890e-02, -5.1504e-03, -7.5859e-03,
2.0259e-02, -7.0155e-03, 1.6359e-02, 1.7487e-02, 5.4297e-03,
-8.6403e-03, 2.8821e-02, -7.8964e-03, 1.9259e-02, 2.3868e-02,
-4.3472e-03, 5.5662e-02, -2.1940e-02, 4.1779e-03, -5.7216e-03,
2.6712e-02, -5.0371e-03, 2.4923e-02, -1.3429e-02, -8.4337e-03,
9.8188e-02, -1.2940e-03, 1.2865e-02, -1.5930e-03, 3.6437e-03,
1.5569e-02, 1.8620e-02, -9.0643e-03, -1.9740e-02, 1.0530e-02,
-2.7359e-03, -7.5283e-03, 1.1492e-03, 2.6162e-03, -6.2757e-03,
-8.6096e-03, 6.6221e-01, -3.2235e-03, -4.1309e-02, 3.3047e-03,
-2.5040e-03, 1.2838e-04, -6.8073e-03, 6.0291e-03, -9.8468e-03,
8.0641e-03, -1.9815e-03, 2.5801e-02, 5.7429e-03, -1.0712e-02,
2.9176e-02, 5.9414e-03, 2.4795e-02, -1.7887e-02, 7.3183e-01,
1.0964e-02, 5.9942e-03, -4.6157e-02, 4.0131e-02, -9.7481e-03,
-8.9496e-01, 1.6385e-02, -1.9816e-03, 1.4691e-02, -1.9837e-02,
-1.7611e-02, -4.5263e-04, -1.8605e-02, -1.5660e-02, -1.0709e-02,
1.8016e-02, -3.4149e-03, -1.2632e-02, 4.2877e-03, -3.9169e-01,
1.0016e-02, -1.0955e-02, 4.5133e-03, -5.1150e-03, 4.9968e-03,
1.7852e-02, 1.1313e-02, 2.6519e-03, 3.3658e-01, -1.8168e-02,
1.3170e-02, 7.3927e-03, 5.2521e-03, -9.6230e-03, 1.2844e-02,
4.1554e-01, -9.7247e-03, -4.2439e-03, 5.5287e-04, 1.8271e-02,
-1.3889e-03, -2.0502e-03, -8.1946e-03, -6.5979e-06, -7.2764e-04,
-1.4625e-03, -6.9872e-03, -6.9633e-03, -8.0701e-03, 1.9936e-02,
4.8370e-03, 8.6883e-03, -4.9246e-02, -2.0028e-02, 1.4124e-03,
1.0444e-02, -1.1236e-02, -4.4654e-03, -2.0491e-02, -2.7654e-02,
-3.7079e-02, 1.3215e-02, 6.9498e-02, -3.1109e-02, 7.0562e-03,
1.0887e-02, -7.8090e-03, -1.0501e-02, -4.8735e-03, -6.8399e-04,
1.4717e-02, 4.4342e-03, 1.6012e-02, -1.0427e-02, -2.5767e-02,
-2.2699e-01, 8.6569e-02, 2.3453e-02, 4.6362e-02, 3.5609e-03,
2.1353e-02, 2.3703e-02, -2.0252e-02, 2.1580e-02, 7.2652e-03,
2.0933e-01, 1.2108e-02, 1.0869e-02, 7.0568e-03, -3.1132e-02,
2.0505e-02, 3.2248e-03, -2.2724e-03, 5.5342e-03, 3.0563e-03,
1.9542e-02, 1.2827e-03, 1.5952e-02, -1.5458e-02, -3.8455e-03,
-4.9417e-03, -1.0446e-02, 7.0516e-03, 2.2467e-03, -9.3643e-03,
1.9163e-02, 1.4239e-02, -1.5816e-02, 8.7413e-03, 2.4737e-02,
-7.3777e-03, -4.0975e-02, 9.4948e-03, 1.4700e-02, 2.6819e-02,
1.0706e-02, 1.0621e-02, -7.1816e-03, -8.5402e-03, 1.2261e-02,
-4.8679e-03, -9.6136e-03, 7.8765e-04, 3.8504e-02, -7.7485e-03,
-6.5018e-03, 3.4352e-03, 2.2931e-04, 5.7456e-03, -4.8441e-03,
-9.0898e-03, 8.6298e-03, 5.4740e-03, 2.2274e-02, -2.1218e-02,
-2.6795e-02, -3.5337e-03, 1.0785e-02, 1.2475e-02, -6.1160e-03,
1.0729e-02, -9.7955e-03, 1.8543e-02, -6.0488e-03, -4.5744e-03,
2.7089e-03, 1.5632e-02, -1.2928e-02, -3.0778e-03, -1.0325e-02,
-7.9550e-03, -6.3065e-02, 2.1062e-02, -6.6717e-03, 8.4616e-03,
1.4475e-02, 1.1477e-01, -2.2838e-02, -3.7491e-02, -3.6218e-02,
-3.1994e-02, -8.9252e-03, 3.1720e-02, -1.1260e-02, -1.2980e-01,
-1.0315e-03, -4.7242e-03, -2.0092e-02, -9.4521e-01, -2.2178e-02,
-4.4297e-04, 1.9711e-02, 3.3402e-02, -1.0513e-02, 1.4492e-02,
-1.9697e-02, -9.8452e-03, -1.7347e-02, 2.3472e-02, 7.6570e-02,
1.9504e-02, 9.3617e-03, 8.2672e-03, -1.0471e-02, -1.9932e-03,
2.0000e-02, 2.0485e-02, 1.0977e-02, 1.7720e-02, 1.3532e-02,
7.3682e-03, 3.4906e-04, 1.8772e-03, 1.9976e-02, -3.2041e-02,
-8.9169e-03, 1.2900e-02, -1.3331e-02, 6.6207e-03, -5.7063e-03,
-1.1482e-02, 8.3907e-03, -6.4162e-03, 1.5816e-02, 7.8921e-03,
4.4177e-03, 2.2568e-02, 1.0239e-02, -3.0194e-04, 1.3294e-02,
-2.1606e-02, 3.8832e-03, 2.4475e-02, 4.3808e-02, -2.1031e-03,
-1.2163e-02, -4.0786e-02, 1.5565e-02, 1.4750e-02, 1.6645e-02,
2.8083e-02, 1.8920e-03, -1.4733e-04, -2.6208e-02, 2.3780e-02,
1.8657e-04, -2.2931e-03, 3.0334e-03, -1.7294e-02, -2.3001e-02,
8.6004e-03, -3.3497e-02, 2.5660e-02, -1.9225e-02, -2.7186e-02,
-2.1020e-02, -3.5213e-02, -1.8228e-03, -8.2840e-03, 1.1212e-02,
1.0387e-02, -3.4194e-01, -1.9705e-03, 1.1558e-02, 5.1976e-03,
7.4498e-03, 5.7142e-03, 2.8401e-02, -7.7551e-03, 1.0682e-02,
-1.2657e-02, -1.8065e-02, 2.6681e-03, 3.3947e-03, -4.5565e-02,
-2.1170e-02, -1.7830e-02, 3.4679e-03, -2.2051e-02, -5.4176e-03,
-1.1517e-02, -3.4155e-02, -3.0335e-03, -1.3915e-02, 6.2173e-03,
-1.1101e-02, -1.5308e-02, 9.2188e-03, -7.5665e-03, 6.5685e-03,
8.0935e-03, 3.1139e-03, -5.5047e-03, -3.1347e-02, 2.2140e-02,
1.0865e-02, -2.7849e-02, -4.9580e-03, 1.8804e-03, 1.0007e-01,
-1.8013e-03, -4.8792e-03, 1.5534e-02, -2.0179e-02, -1.2351e-02,
-1.3871e-02, 1.1439e-02, -9.0208e-03, 1.2580e-02, -2.5973e-02,
-2.0398e-02, -1.9464e-03, 4.3189e-03, 2.0707e-02, 5.0029e-03,
-1.0679e-02, 1.2298e-02, 1.0269e-02, 2.2228e-02, 2.9754e-02,
-2.6392e-03, 1.9286e-02, -1.5137e-02, 2.1914e-01, 1.3030e-02,
-7.4460e-03, -9.6818e-04, 2.9736e-02, 9.8722e-03, -5.6688e-03,
4.2518e-03, 1.8941e-02, -6.3909e-03, 8.0590e-03, -6.7893e-03,
6.0878e-03, -5.3970e-03, 7.5776e-04, 1.1374e-03, -5.0035e-03,
-1.6159e-03, 1.6764e-02, 9.1251e-03, 1.3020e-02, -1.0368e-02,
2.2141e-02, -2.5411e-03, -1.5227e-02, 2.3444e-02, 8.4076e-04,
-1.1465e-01, 2.7017e-03, -4.4961e-03, 2.9762e-04, -3.9612e-03,
8.9038e-05, 2.8683e-02, 5.0068e-03, 1.6509e-02, 7.8983e-04,
5.7728e-03, 3.2685e-02, -1.0457e-01, 1.2989e-02, 1.1278e-02,
1.1943e-02, 1.5258e-02, -6.2411e-04, 1.0682e-04, 1.2087e-02,
7.2984e-03, 2.7758e-02, 1.7572e-02, -6.0345e-03, 1.7211e-02,
1.4121e-02, 6.4663e-02, 9.1813e-03, 3.2555e-03, -3.2667e-02,
2.9132e-02, -1.7770e-02, 1.5302e-03, -2.9944e-02, -2.0706e-02,
-3.6528e-03, -1.5497e-02, 1.5223e-02, -1.4751e-02, -2.2381e-02,
6.9636e-03, -8.0838e-03, -2.4583e-03, -2.0677e-02, 8.8132e-03,
-6.9554e-04, 1.6965e-02, 1.8535e-01, 3.5843e-04, 1.0812e-02,
-4.2391e-03, 8.1779e-03, 3.4144e-02, -1.8996e-03, 2.9939e-03,
3.6898e-04, -1.0144e-02, -5.7416e-03, -5.7676e-03, 1.7565e-01,
-1.5793e-03, -2.6617e-02, -1.2572e-02, 3.0421e-04, -1.2132e-02,
-1.4168e-02, 1.2154e-02, 8.4700e-03, -1.6284e-02, 2.6983e-03,
-6.8554e-03, 2.7829e-01, 2.4060e-02, 1.1130e-02, 7.6095e-04,
3.1341e-01, 2.1668e-02, 1.0277e-02, -3.0065e-02, -8.3565e-03,
5.2488e-03, -1.1287e-02, -1.8266e-02, 1.1814e-02, 1.2662e-02,
2.9036e-04, 7.0254e-04, -1.4084e-02, 1.2925e-02, 3.9504e-03,
-7.9568e-03, 3.2794e-02, 7.3839e-03, 2.4609e-02, 9.6109e-03,
-8.7206e-03, 9.2571e-03, -3.5850e-03, -8.9996e-03, 2.3120e-03,
-1.8475e-02, -1.9610e-02, 1.1994e-02, 6.7156e-03, 1.9903e-02,
3.0703e-02, -4.9538e-03, -6.1673e-02, -6.4986e-03, -2.1317e-02,
-3.3650e-03, 2.3200e-03, -6.2224e-03, 3.7458e-03, 1.1542e-02,
-1.0181e-02, -8.4711e-03, 1.1603e-02, -5.6247e-03, -1.0220e-02,
-8.6501e-04, -1.2285e-02, -8.7487e-03, -1.1265e-02, 1.6322e-02,
1.5160e-02, 1.8882e-02, 5.1557e-03, -8.8616e-03, 4.2153e-03,
-1.9450e-02, -8.7365e-03, -9.7867e-03, 1.1667e-02, 5.0613e-03,
2.8221e-03, -7.1795e-03, 9.3306e-03, -4.9663e-02, 1.7708e-02,
-2.0959e-02, -3.3989e-02, 2.2581e-03, 5.1748e-03, -1.0133e-01,
2.1052e-03, 5.5644e-03, 1.3607e-03, 8.8388e-03, 1.0244e-02,
-3.8072e-03, 5.9209e-03, 6.7993e-03, 1.1594e-02, -1.1802e-02,
-2.4233e-03, -5.1504e-03, -1.1903e-02, 1.4075e-02, -4.0701e-03,
-2.9465e-02, -1.7579e-03, 4.3654e-03, 1.0429e-02, 3.7096e-02,
8.6493e-03, 1.5871e-02, 1.8034e-02, -3.2165e-03, -2.1941e-02,
2.6274e-02, -7.6941e-03, -5.9618e-03, -1.4179e-02, 8.0281e-03,
1.1293e-02, -6.6936e-05, 1.2899e-02, 1.0056e-02, -6.3919e-04,
2.0299e-02, 3.1528e-03, -4.8988e-03, 3.2754e-03, -1.1003e-01,
1.8414e-02, 2.2272e-03, -2.2185e-02, -4.8672e-03, 1.9643e-03,
3.0928e-02, -8.9599e-03, -1.1446e-02, -1.3794e-02, 7.1943e-03,
-5.8965e-03, 2.2605e-03, -2.6114e-02, -5.6616e-03, 6.5073e-03,
9.2219e-02, -6.7243e-03, 4.4427e-04, 7.2846e-03, -1.1021e-02,
7.8802e-04, -3.8878e-03, 1.0489e-02, 9.2883e-03, 1.8895e-02,
2.1808e-02, 6.2590e-04, -2.6519e-02, 7.0343e-04, -2.9067e-02,
-9.1515e-03, 1.0418e-03, 8.3222e-03, -8.7548e-03, -2.0637e-03,
-1.1450e-02, -8.8985e-04, -4.4062e-03, 2.3629e-02, -2.7221e-02,
3.2008e-02, 6.6325e-03, -1.1302e-02, -1.0138e-03, -1.6902e-01,
-8.4473e-03, 2.8536e-02, 1.4117e-03, -1.2136e-02, -1.4781e-02,
4.9960e-03, 3.3916e-02, 5.2710e-03, 1.7382e-02, -4.6315e-03,
1.1680e-02, -9.1395e-03, 1.8310e-02, 1.2321e-02, -2.4871e-02,
1.1535e-02, 5.0308e-03, 5.5028e-03, -7.2184e-03, -5.5210e-03,
1.7085e-02, 5.7236e-03, 1.7463e-03, 1.9969e-03, 6.1670e-03,
2.9347e-03, 1.3946e-02, -1.9984e-03, 1.0091e-02, 1.0388e-03,
-6.1902e-03, 3.0905e-02, 6.6038e-03, -9.1223e-02, -1.8411e-02,
5.4185e-03, 2.4396e-02, 1.5696e-02, -1.2742e-02, 1.8126e-02,
-2.6138e-02, 1.1170e-02, -1.3058e-02, -1.9386e-02, -5.9828e-03,
1.9176e-02, 1.9962e-03, -2.1538e-03, 3.3003e-02, 1.8407e-02,
-5.9498e-03, -3.2533e-03, -1.8917e-02, -1.5897e-02, -4.7057e-03,
5.4162e-03, -3.0037e-02, 8.6773e-03, -1.7942e-03, 6.6826e-03,
-1.1929e-02, -1.4076e-02, 1.6709e-02, 1.6860e-03, -3.3842e-03,
8.6805e-03, 7.1340e-03, 1.5147e-02], grad_fn=<EmbeddingBackward>)
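If the goal is the whole table rather than a dict of per-token tensors, the weight of the embedding layer is already that table. The sketch below uses a small torch.nn.Embedding as a stand-in so it runs without downloading a model; the sizes 10 and 4 are arbitrary. With the real model the same idea would read bert.get_input_embeddings().weight, which for bert-base-uncased should have shape [30522, 768].

```python
import torch

# Small stand-in for BERT's input embedding layer
embedding_layer = torch.nn.Embedding(num_embeddings=10, embedding_dim=4)

# The lookup table is the layer's weight; detach() drops the autograd graph
table = embedding_layer.weight.detach()
print(table.shape)  # torch.Size([10, 4])

# Indexing the table by token id gives the same vector as calling the layer
token_id = 3
same = torch.equal(table[token_id], embedding_layer(torch.tensor(token_id)).detach())
print(same)  # True
```

Detaching also removes the grad_fn that appears in the printed output above.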
To get a context-sensitive word embedding for a given input sentence or text, you can use the following code:
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel
def get_word_idx(sent: str, word: str):
    return sent.split(" ").index(word)
def get_hidden_states(encoded, token_ids_word, model, layers):
    """Push input IDs through model. Stack and sum `layers` (last four by default).
    Select only those subword token outputs that belong to our word of interest
    and average them."""
    with torch.no_grad():
        output = model(**encoded)
    # Get all hidden states
    states = output.hidden_states
    # Stack and sum all requested layers
    output = torch.stack([states[i] for i in layers]).sum(0).squeeze()
    # Only select the tokens that constitute the requested word
    word_tokens_output = output[token_ids_word]
    return word_tokens_output.mean(dim=0)
def get_word_vector(sent, idx, tokenizer, model, layers):
    """Get a word vector by first tokenizing the input sentence, getting all token idxs
    that make up the word of interest, and then `get_hidden_states`."""
    encoded = tokenizer.encode_plus(sent, return_tensors="pt")
    # get all token idxs that belong to the word of interest
    token_ids_word = np.where(np.array(encoded.word_ids()) == idx)
    return get_hidden_states(encoded, token_ids_word, model, layers)
def main(layers=None):
    # Use last four layers by default
    layers = [-4, -3, -2, -1] if layers is None else layers
    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
    model = AutoModel.from_pretrained("bert-base-cased", output_hidden_states=True)
    sent = "I like cookies ."
    idx = get_word_idx(sent, "cookies")
    word_embedding = get_word_vector(sent, idx, tokenizer, model, layers)
    return word_embedding
if __name__ == '__main__':
    main()
More details can be found here.

Getting the column names chosen after a feature selection method

Given the simple feature selection code below, I want to know which columns were selected (the dataset includes a header V1 ... V20):
import pandas as pd
from sklearn.feature_selection import SelectFromModel, SelectKBest, f_regression
def feature_selection(data):
    y = data['Class']
    X = data.drop(['Class'], axis=1)
    fs = SelectKBest(score_func=f_regression, k=10)
    # Applying feature selection
    X_selected = fs.fit_transform(X, y)
    # TODO: determine the columns being selected
    return X_selected
data = pd.read_csv("../dataset.csv")
new_data = feature_selection(data)
I appreciate any help.
I have used the iris dataset for my example but you can probably easily modify your code to match your use case.
The SelectKBest method has the scores_ attribute I used to sort the features.
Feel free to ask for any clarifications.
import pandas as pd
import numpy as np
from sklearn.feature_selection import SelectFromModel, SelectKBest, f_regression
from sklearn.datasets import load_iris
def feature_selection(data):
    y = data[1]
    X = data[0]
    column_names = ["A", "B", "C", "D"] # Here you should use your dataframe's column names
    k = 2
    fs = SelectKBest(score_func=f_regression, k=k)
    # Applying feature selection
    X_selected = fs.fit_transform(X, y)
    # Find top features
    # I create a list like [[ColumnName1, Score1] , [ColumnName2, Score2], ...]
    # Then I sort in descending order on the score
    top_features = sorted(zip(column_names, fs.scores_), key=lambda x: x[1], reverse=True)
    print(top_features[:k])
    return X_selected
data = load_iris(return_X_y=True)
new_data = feature_selection(data)
I don't know of a built-in method, but it can be coded easily.
# X_selected is the output of fs.fit_transform(X, y); its second dimension is the number of kept columns
n_columns_selected = X_selected.shape[1]
# Keep only the names of the n highest-scoring columns
new_columns = [name for _, name in sorted(zip(fs.scores_, X.columns))[-n_columns_selected:]]
# The order of new_columns is perturbed, so we restore it using the column order of X as a reference
new_columns = sorted(new_columns, key=lambda name: list(X.columns).index(name))
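For completeness, fitted scikit-learn selectors also expose get_support(), which returns a boolean mask over the input columns, so the names can be read off directly. The sketch below uses synthetic data; the column names V1 to V5 and the way the target is built are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
# Hypothetical data: five features, of which only V2 and V4 drive the target
X = pd.DataFrame(rng.normal(size=(200, 5)), columns=[f"V{i}" for i in range(1, 6)])
y = 3.0 * X["V2"] - 2.0 * X["V4"] + rng.normal(scale=0.1, size=200)

fs = SelectKBest(score_func=f_regression, k=2).fit(X, y)

# get_support() returns a boolean mask aligned with X.columns
selected_columns = X.columns[fs.get_support()].tolist()
print(selected_columns)

# Rebuild a DataFrame of the selected features with their original names
X_selected = pd.DataFrame(fs.transform(X), columns=selected_columns, index=X.index)
```

Because the mask follows the original column order, no re-sorting step is needed.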

How to use the best parameter as parameter of a classifier in GridSearchCV?

I have a function called svc_param_selection(X, y, n) which returns best_params_.
Now I want to use the returned best_params_ as the parameters of a classifier, like this:
parameters = svc_param_selection(X, y, 2)
from sklearn.model_selection import ParameterGrid
from sklearn.svm import SVC
param_grid = ParameterGrid(parameters)
for params in param_grid:
    svc_clf = SVC(**params)
    print(svc_clf)
classifier2=SVC(**svc_clf)
It seems parameters is not a grid here..
You can use GridSearchCV to do this. Here is an example:
# Applying GridSearch to find best parameters
from sklearn.model_selection import GridSearchCV
parameters = [
    {'criterion': ['gini'], 'splitter': ['best', 'random'],
     'min_samples_split': [0.1, 0.2, 0.3, 0.4, 0.5],
     'min_samples_leaf': [1, 2, 3, 4, 5]},
    {'criterion': ['entropy'], 'splitter': ['best', 'random'],
     'min_samples_split': [0.1, 0.2, 0.3, 0.4, 0.5],
     'min_samples_leaf': [1, 2, 3, 4, 5]},
]
gridsearch = GridSearchCV(estimator=classifier, param_grid=parameters, refit=False, scoring='accuracy', cv=10)
gridsearch = gridsearch.fit(x, y)
optimal_accuracy = gridsearch.best_score_
optimal_parameters = gridsearch.best_params_
For the param_grid of GridSearchCV, you should pass a dictionary (or a list of dictionaries) that maps each parameter name of your classifier to the list of values to try. The classifier itself can be defined like this:
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(random_state=0, criterion='entropy')
classifier = classifier.fit(x_train, y_train)
Then, after finding the best parameters with GridSearchCV, you apply them to your model.
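Applying the best parameters to a fresh classifier can be sketched as follows. The iris data, the small parameter grid, and the variable names are assumptions chosen only to keep the example self-contained.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Hypothetical grid: each key is an SVC parameter name, each value a list of candidates
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy").fit(X, y)

# best_params_ is a plain dict such as {"C": ..., "kernel": ...}, so it can be unpacked directly
best_params = search.best_params_
clf = SVC(**best_params).fit(X, y)

# With the default refit=True, GridSearchCV also keeps an already fitted best model
refit_clf = search.best_estimator_
print(best_params)
```

Note that best_params_ holds a single value per parameter, whereas ParameterGrid expects lists of candidate values, so unpacking the dict into the constructor is usually the simpler route.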
@Ben At the start of the grid search, you specify the classifier either outside the param_grid (if you have only one classification method to check) or inside the param_grid. I have checked the 'inside' case only.
First, I set the 'classifier' key in the param_grid. That is the key which you need to ask for in the end.
param_grid = [
    {'classifier': [LogisticRegression()],
     ...
    },
    {'classifier': [RandomForestClassifier()],
    }
]
As an example, the result of gridsearch.best_params_ is:
{'classifier': RandomForestClassifier(criterion='entropy', max_depth=2, n_estimators=2),
'classifier__criterion': 'entropy',
'classifier__max_depth': 2,
'classifier__min_samples_leaf': 1,
'classifier__n_estimators': 2}
Then ask this dictionary gridsearch.best_params_ for the key that you called the 'classifier'.
clfBest = clfGridSearchBest.best_params_['classifier']
clfBest:
RandomForestClassifier(criterion='entropy', max_depth=2, n_estimators=2)
Now just fit clfBest.

Polynomial regression in spark/ or external packages for spark

After a good amount of searching the net on this topic, I have ended up here hoping to get some pointers. Please read further.
After analyzing Spark 2.0, I concluded that polynomial regression is not possible with Spark alone. Is there some extension to Spark that can be used for polynomial regression?
- With Rspark it could be done (but I am looking for a better alternative)
- RFormula in Spark does prediction, but the coefficients are not available (which is my main requirement, as I am primarily interested in the coefficient values)
Polynomial regression is just another case of linear regression (see Polynomial regression is linear regression and Polynomial regression). As Spark has a method for linear regression, you can call that method after changing the inputs to the ones suited to polynomial regression. For instance, if you have only one independent variable x and you want to do quadratic regression, you change your input matrix to [x, x^2].
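The idea can be checked on a small example. The sketch below uses NumPy rather than Spark, purely to show that an ordinary least-squares fit on the expanded inputs [1, x, x^2] recovers the polynomial coefficients; the coefficients 1.0, 2.0 and -0.5 are made up for the illustration.

```python
import numpy as np

# Noise-free quadratic data: y = 1.0 + 2.0*x - 0.5*x^2 (made-up coefficients)
x = np.linspace(-3.0, 3.0, 50)
y = 1.0 + 2.0 * x - 0.5 * x**2

# Expand the single input x into the columns [1, x, x^2]
design = np.column_stack([np.ones_like(x), x, x**2])

# Plain linear least squares on the expanded inputs
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
print(np.round(coef, 3))
```

The fitted vector holds the intercept followed by the coefficients of x and x^2, which is the kind of output the question is after.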
I would like to add some information to @Mehdi Lamrani's answer:
If you want to do a polynomial linear regression in SparkML, you may use the class PolynomialExpansion.
For more information, check the class in the SparkML Doc
or in the Spark API Doc
Here is an implementation example:
Let's assume we have train and test datasets, stored in two CSV files, with headers containing the names of the columns (features, label).
Each dataset contains three features named f1, f2, f3, each of type Double (this is the X matrix), as well as a label column (the Y vector) named mylabel.
For this code I used Spark+Scala:
Scala version: 2.12.8
Spark version: 2.4.0
We assume that the SparkML library has already been added as a dependency in build.sbt.
First of all, import the libraries:
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.feature.{PolynomialExpansion, VectorAssembler}
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.mllib.evaluation.RegressionMetrics
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.udf
import org.apache.spark.{SparkConf, SparkContext}
Create Spark Session and Spark Context :
val ss = org.apache.spark.sql
  .SparkSession.builder()
  .master("local")
  .appName("Read CSV")
  .enableHiveSupport()
  .getOrCreate()
// Reuse the session's context: only one SparkContext can be active per JVM
val sc: SparkContext = ss.sparkContext
Instantiate the variables you are going to use :
val f_train:String = "path/to/your/train_file.csv"
val f_test:String = "path/to/your/test_file.csv"
val degree:Int = 3 // Set the degree of your choice
val maxIter:Int = 10 // Set the max number of iterations
val lambda:Double = 0.0 // Set the regularization parameter (passed to setRegParam)
val alpha:Double = 0.3 // Set the elastic-net mixing parameter (passed to setElasticNetParam)
First, let's create a few UDFs, which will be used for reading and pre-processing the data.
The type parameters of the udf toFeatures are the return type Vector followed by the types of the feature arguments: (Double, Double, Double)
val toFeatures = udf[Vector, Double, Double, Double] {
  (a,b,c) => Vectors.dense(a,b,c)
}
val encodeIntToDouble = udf[Double, Int](_.toDouble)
Now let's create a function which extracts data from a CSV file and creates new features from the existing ones, using PolynomialExpansion:
def getDataPolynomial(
    currentfile:String,
    sc:SparkSession,
    sco:SparkContext,
    degree:Int
):DataFrame =
{
  val df_rough:DataFrame = sc.read
    .format("csv")
    .option("header", "true") //first line in file has headers
    .option("mode", "DROPMALFORMED")
    .option("inferSchema", value=true)
    .load(currentfile)
    .toDF("f1", "f2", "f3", "myLabel")
  // you may add or not the last line
  // Pack the three features into one vector column and cast the label to Double
  val df:DataFrame = df_rough
    .withColumn("featNormTemp", toFeatures(df_rough("f1"), df_rough("f2"), df_rough("f3")))
    .withColumn("label", encodeIntToDouble(df_rough("myLabel")))
  // Expand the feature vector into polynomial terms up to the given degree
  val polyExpansion = new PolynomialExpansion()
    .setInputCol("featNormTemp")
    .setOutputCol("polyFeatures")
    .setDegree(degree)
  val polyDF:DataFrame=polyExpansion.transform(df.select("featNormTemp"))
  val datafixedWithFeatures:DataFrame = polyDF.withColumn("features", polyDF("polyFeatures"))
  // Join the labels back onto the expanded features
  val datafixedWithFeaturesLabel = datafixedWithFeatures
    .join(df,df("featNormTemp") === datafixedWithFeatures("featNormTemp"))
    .select("label", "polyFeatures")
  datafixedWithFeaturesLabel
}
Now, run the function both for the train and test datasets, using the chosen degree for the Polynomial expansion.
val X:DataFrame = getDataPolynomial(f_train,ss,sc,degree)
val X_test:DataFrame = getDataPolynomial(f_test,ss,sc,degree)
Run the algorithm in order to get a model of linear regression, using a pipeline :
val assembler = new VectorAssembler()
  .setInputCols(Array("polyFeatures"))
  .setOutputCol("features2")
val lr = new LinearRegression()
  .setMaxIter(maxIter)
  .setRegParam(lambda)
  .setElasticNetParam(alpha)
  .setFeaturesCol("features2")
  .setLabelCol("label")
// Fit the model:
val pipeline:Pipeline = new Pipeline().setStages(Array(assembler,lr))
val lrModel:PipelineModel = pipeline.fit(X)
// Get prediction on the test set :
val result:DataFrame = lrModel.transform(X_test)
Finally, evaluate the result using the root mean squared error:
def leastSquaresError(result:DataFrame):Double = {
  val rm:RegressionMetrics = new RegressionMetrics(
    result
      .select("label","prediction")
      .rdd
      .map(x => (x(0).asInstanceOf[Double], x(1).asInstanceOf[Double])))
  // Square root of the mean squared error, i.e. the RMSE
  Math.sqrt(rm.meanSquaredError)
}
val error:Double = leastSquaresError(result)
println("Error : "+error)
I hope this might be useful !
