How to pass user input list to the prediction model - machine-learning

I am trying to implement a heart disease prediction problem. After asking the user to enter the values, I am having trouble passing them to the prediction model to get an output.
age = input('Enter age: ')
sex = input('Enter sex: ')
cp = input('Enter chest pain type: ')
trestbps = input('Enter resting systolic blood pressure: ')
How do I pass these user-entered values to the already-trained logistic regression (LR) model to get a prediction?
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
lr = LogisticRegression()
lr.fit(x_train.T, y_train.T)
y_pred = lr.predict(x_test.T)
print(accuracy_score(y_test.T, y_pred) * 100)

You will first have to convert these values to floating-point numbers, since the input function returns its input as a string. This can be done as follows.
age = float(input('Enter age: '))
sex = float(input('Enter sex: '))
cp = float(input('Enter chest pain type: '))
trestbps = float(input('Enter resting systolic blood pressure: '))
The lr.predict function expects a 2D input: a list of lists with one inner list per sample. If you have just one sample to test, put the values into a Python list and wrap it in another list.
x_test = [[age, sex, cp, trestbps]]  # the outer list can hold several inner lists, one per sample
The order of these variables must match the column order used when the model was trained, so it may differ from the order shown here. The model also expects the same number of features it was trained on.
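To make the ordering point concrete, here is a minimal sketch of turning the raw input strings into a single-sample 2D list in a fixed column order. FEATURE_ORDER and to_feature_row are hypothetical names, and the column order shown is an assumption; use the order of the columns your model was actually trained on.

```python
from typing import Dict, List, Sequence

# Hypothetical column order; replace with the order of the columns in x_train.
FEATURE_ORDER = ["age", "sex", "cp", "trestbps"]


def to_feature_row(raw: Dict[str, str], feature_order: Sequence[str]) -> List[List[float]]:
    """Convert raw string inputs into a single-sample 2D list."""
    missing = [name for name in feature_order if name not in raw]
    if missing:
        raise KeyError(f"missing inputs: {missing}")
    # float() raises ValueError on non-numeric text such as "abc".
    return [[float(raw[name]) for name in feature_order]]


# Stand-in for values collected with input(); keys may arrive in any order.
raw_inputs = {"sex": "1", "age": "63", "trestbps": "145", "cp": "3"}
x_new = to_feature_row(raw_inputs, FEATURE_ORDER)
print(x_new)
# With a fitted model: prediction = lr.predict(x_new)
```

Keeping the order in one place means the training code and the prediction code cannot silently disagree about which value goes in which column.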

Related

Predict type="response" to type="terms" conversion in R

Can someone please help me with the math? I need to convert the output of my GLM from response to terms to understand the math.
Let's say I am using gender (female(1), male(0)) to predict the college admission rate (0 to 1).
model <- glm(admission_rate ~ gender, data = data,family = quasipoisson(link="log"))
Model coefficients are
intercept 0.24918
genderFemale -0.23229
Now when I run
predict.glm(model, newdata = data, type = "response")
the linear predictor is y = 0.24918 + (-0.23229) * 1 for female and y = 0.24918 for male. Since this is a log-link GLM, exponentiating each linear predictor gives the fitted values produced by type = "response":
female = 1.017
male = 1.283
I have tried so many things to convert it to fitted values produced by type=terms, but did not get it to match.
The fitted values produced by terms should be
female = 0.152984
male = -0.07
constant = 0.096198
If you can explain the math behind this, I would really appreciate it!
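One possible way to see the arithmetic, as a hedged sketch rather than a definitive answer: to my understanding, type = "terms" works on the link scale and centers each term at the mean of its model-matrix column, so term = beta * (x - mean(x)) and the "constant" attribute is intercept + beta * mean(x). The Python below only reproduces that arithmetic from the numbers quoted in the question. The mean of the gender dummy is not given, so it is inferred from the reported constant, which is an assumption.

```python
import math

intercept = 0.24918      # from the question
beta_female = -0.23229   # from the question
constant = 0.096198      # reported "constant" for type = "terms"

# Assumption: constant = intercept + beta * mean(x), so solve for mean(x).
mean_x = (constant - intercept) / beta_female


def term(x: float) -> float:
    """Centered contribution of the gender term on the link (log) scale."""
    return beta_female * (x - mean_x)


for x in (0, 1):
    link = constant + term(x)    # equals intercept + beta * x
    response = math.exp(link)    # what type = "response" reports
    print(x, round(term(x), 6), round(link, 5), round(response, 3))
```

Under this assumption the term is about 0.152982 for x = 0 and about -0.079308 for x = 1, and constant + term recovers the linear predictors 0.24918 and 0.01689. The magnitudes agree with the values quoted in the question, which suggests that the female/male labels there may be swapped or that the dummy is coded the other way round.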

LSTM sequence prediction overfits on one specific value only

Hello, I am new to machine learning. I am implementing federated learning with an LSTM to predict the next label in a sequence. My sequences look like [2,3,5,1,4,2,5,7]; here, for example, the goal is to predict the final 7. I tried a simple federated learning setup with Keras. The same approach worked for another (non-LSTM) model, but here the model always overfits on 2: it predicts 2 for any input. I balanced the input data so that each label appears almost equally often in the last position (the 7 here). I tested this data with a simple deep learning model and it works well, so it seems that either the data is not suitable for an LSTM or there is some other issue. This is the code for my federated learning setup. Please let me know if more information is needed. Thanks.
def get_lstm(units):
    """LSTM (Long Short-Term Memory)
    Build LSTM Model.
    # Arguments
        units: List(int), number of input, output and hidden units.
    # Returns
        model: Model, nn model.
    """
    inp = layers.Input((units[0], 1))
    x = layers.LSTM(units[1], return_sequences=True)(inp)
    x = layers.LSTM(units[2])(x)
    x = layers.Dropout(0.2)(x)
    out = layers.Dense(units[3], activation='softmax')(x)
    model = Model(inp, out)
    return model
optimizer = keras.optimizers.Adam(learning_rate=0.01)
seqLen = 8 - 1
global_model = Mymodel.get_lstm([seqLen, 64, 64, 15])  # we have 14 categories; indices start at 0, but class 0 is never predicted
global_model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer, metrics=[tf.keras.metrics.SparseTopKCategoricalAccuracy(k=1)])
def main(argv):
    for comm_round in range(comms_round):
        print("round_%d" % (comm_round))
        scaled_local_weight_list = list()
        global_weights = global_model.get_weights()
        np.random.shuffle(train)
        temp_data = train[:]
        # data divided among ten users and shuffled
        for user in range(10):
            user_data = temp_data[user * userDataSize: (user + 1) * userDataSize]
            X_train = user_data[:, 0:seqLen]
            X_train = np.asarray(X_train).astype(np.float32)
            Y_train = user_data[:, seqLen]
            Y_train = np.asarray(Y_train).astype(np.float32)
            local_model = Mymodel.get_lstm([seqLen, 64, 64, 15])
            X_train = np.reshape(X_train, (X_train.shape[0], X_train.shape[1], 1))
            local_model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer, metrics=[tf.keras.metrics.SparseTopKCategoricalAccuracy(k=1)])
            local_model.set_weights(global_weights)
            local_model.fit(X_train, Y_train)
            scaling_factor = 1 / 10  # 10 is number of users
            scaled_weights = scale_model_weights(local_model.get_weights(), scaling_factor)
            scaled_local_weight_list.append(scaled_weights)
            K.clear_session()
        average_weights = sum_scaled_weights(scaled_local_weight_list)
        global_model.set_weights(average_weights)
    predictions = global_model.predict(X_test)
    for i in range(len(X_test)):
        print('%d,%d' % ((np.argmax(predictions[i])), Y_test[i]), file=f2)
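The helpers scale_model_weights and sum_scaled_weights are called above but not shown. The following is only a guess at what they might do (plain federated averaging), written as a minimal sketch. Each layer is represented here as a flat list of floats purely for illustration; real Keras weights are NumPy arrays of various shapes, so this is not the author's implementation.

```python
from typing import List

Weights = List[List[float]]  # one flat list of floats per layer (illustrative only)


def scale_model_weights(weights: Weights, scalar: float) -> Weights:
    """Multiply every weight of every layer by the same scalar."""
    return [[scalar * w for w in layer] for layer in weights]


def sum_scaled_weights(scaled_weight_list: List[Weights]) -> Weights:
    """Add the already-scaled weights of all clients, layer by layer."""
    summed = []
    for layers in zip(*scaled_weight_list):  # the same layer across all clients
        summed.append([sum(values) for values in zip(*layers)])
    return summed


# Two toy clients, each with two "layers".
clients = [
    [[1.0, 2.0], [10.0]],
    [[3.0, 4.0], [20.0]],
]
scaling_factor = 1 / len(clients)
scaled = [scale_model_weights(w, scaling_factor) for w in clients]
global_weights = sum_scaled_weights(scaled)
print(global_weights)
```

With equal scaling factors this is a plain average of the client weights. If clients hold different amounts of data, the scaling factor is usually made proportional to each client's sample count instead.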
I found some reasons for my problem, so I thought I would share them:
1- The proportions of the different items in the sequences were not balanced. For example, I had 1000 occurrences of "2" and 100 of each other number, so after a few rounds the model fitted on 2 because there was much more data for that specific number.
2- I changed my sequences so that no sequence contains two items with the same value. This removed some repetitive data from the sequences and made them more balanced. It may not represent the activities completely, but it makes sense in my case.
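Both points can be checked with a few lines of standard-library Python. This is a minimal sketch on made-up sequences; the reading of point 2 as "keep only the first occurrence of each value" is an assumption, and removing items can also change which value ends up last in a sequence.

```python
from collections import Counter

# Toy sequences; the last element of each one is the label to predict.
sequences = [
    [2, 3, 5, 1, 4, 2, 5, 7],
    [1, 1, 4, 4, 2, 2, 2, 2],
    [3, 3, 3, 5, 1, 2, 2, 2],
]

# Point 1: how often does each value occur as a label and as an input item?
label_counts = Counter(seq[-1] for seq in sequences)
item_counts = Counter(item for seq in sequences for item in seq)
print(label_counts)
print(item_counts.most_common(3))


def drop_repeated_values(seq):
    """Keep only the first occurrence of each value, preserving order."""
    return list(dict.fromkeys(seq))


# Point 2: remove repeated values inside each sequence.
deduped = [drop_repeated_values(seq) for seq in sequences]
print(deduped)
```

If the item counts are heavily skewed toward one value, as described in point 1, a model that always predicts that value can look deceptively accurate.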

How to get feature_names after encoding text avg_word to vec?

I am analyzing the donor_choose dataset. I created a GloVe word-vector file for the essays and encoded the essays using average word2vec.
Now I want to get the feature names. How can I do that?
I also encoded the essays with BOW and extracted the feature names there through the fitted vectorizer and get_feature_names(). But how do I do the same for average word2vec, where there is no fitted model, only the vectors of the words?
"""Encoding Essay- Bow"""
vectorizer = CountVectorizer()
vectorizer.fit(essay_train)
clean_essay_bow_X_train = vectorizer.transform(essay_train)
clean_essay_bow_X_test = vectorizer.transform(essay_test)
# newer scikit-learn releases replace get_feature_names() with get_feature_names_out()
feature_names_bow.extend(vectorizer.get_feature_names())
"""Encoding Essay- avgw2v"""
import pickle
import numpy as np
from tqdm import tqdm

with open('glove_vectors', 'rb') as f:
    model = pickle.load(f)
    glove_words = set(model.keys())

def avgvectorizer(data):
    avw2v_data = []
    for sentance in tqdm(data.values):
        vector = np.zeros(300)  # running sum of the 300-d word vectors
        cnt_words = 0  # number of words found in the GloVe vocabulary
        for word in sentance.split():
            if word in glove_words:
                vector += model[word]
                cnt_words += 1
        if cnt_words != 0:
            vector /= cnt_words
        avw2v_data.append(vector)
    return avw2v_data
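Unlike BOW, an averaged word vector has no vocabulary-based feature names: each of its dimensions is a latent coordinate, not a word. One common workaround, shown here only as a sketch, is to generate placeholder names, one per dimension. The toy 3-dimensional vectors and the name prefix essay_avgw2v_ are made up for illustration; with the GloVe file from the question the dimension would be 300.

```python
from typing import Dict, List

# Toy 3-dimensional "embeddings" standing in for the 300-d GloVe vectors.
toy_vectors: Dict[str, List[float]] = {
    "school": [1.0, 0.0, 2.0],
    "books": [3.0, 2.0, 0.0],
}
dim = 3


def average_vector(sentence: str) -> List[float]:
    """Average the vectors of the known words in a sentence."""
    total = [0.0] * dim
    count = 0
    for word in sentence.split():
        if word in toy_vectors:
            total = [t + v for t, v in zip(total, toy_vectors[word])]
            count += 1
    return [t / count for t in total] if count else total


# One placeholder name per embedding dimension.
feature_names_avgw2v = [f"essay_avgw2v_{i}" for i in range(dim)]
row = average_vector("books for school")
print(dict(zip(feature_names_avgw2v, row)))
```

These names only keep the columns identifiable when the averaged vectors are stacked next to other features; they carry no word-level meaning.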

How to "remember" categorical encodings for actual predictions after training?

Suppose I want to train a machine learning algorithm on a dataset that includes some categorical parameters. (I am new to machine learning, but my thinking is...) Even if I convert all the categorical data to one-hot encoded vectors, how will this encoding map be "remembered" after training?
E.g., when converting the initial dataset to one-hot encoding before training, say the
universe of categories for some column c is {"good","bad","ok"}, so rows are converted as
[1, 2, "good"] ---> [1, 2, [1, 0, 0]],
[3, 4, "bad"] ---> [3, 4, [0, 1, 0]],
...
After training the model, all future prediction inputs need to use the same encoding scheme for column c.
How, then, will the inputs for future predictions know that mapping (where "good" maps to index 0, etc.), specifically when planning to use a Keras RNN or LSTM model? Do I need to save it somewhere (e.g. with Python pickle), and if so, how do I get the explicit mapping? Or is there a way to have the model handle categorical inputs internally, so that I can just pass the original label data during training and future use?
If anything in this question shows serious confusion on my part, please let me know (again, I am very new to ML).
** I wasn't sure whether this belongs on https://stats.stackexchange.com/, but I posted here because I specifically want to know how to handle the code implementation of this problem.
What I've been doing is the following:
After calling StringIndexer.fit(), you can save the resulting model's metadata, which includes the actual encoder mapping (such as "good" mapping to the first index).
This is the code I use (in Java, but it can be adapted to Python):
StringIndexerModel sim = new StringIndexer()
    .setInputCol(field)
    .setOutputCol(field + "_INDEX")
    .setHandleInvalid("skip")
    .fit(dataset);
sim.write().overwrite().save("IndexMappingModels/" + field + "_INDEX");
Later, when making predictions on a new dataset, you can load the stored metadata:
StringIndexerModel sim = StringIndexerModel.load("IndexMappingModels/" + field + "_INDEX");
dataset = sim.transform(dataset);
I imagine you have already solved this issue, since it was posted in 2018, but I have not found this solution anywhere else, so I believe it's worth sharing.
My thought would be to do something like this on the training/testing dataset D (using a mix of Python and plain pseudo-code):
Do something like
# Before: D.schema == {num_col_1: int, cat_col_1: str, cat_col_2: str, ...}
# assign a unique index to each distinct label of the categorical column and store it in a new column
# http://spark.apache.org/docs/latest/ml-features.html#stringindexer
label_indexer = StringIndexer(inputCol="cat_col_i", outputCol="cat_col_i_index").fit(D)
D = label_indexer.transform(D)
# After: D.schema == {num_col_1: int, cat_col_1: str, cat_col_2: str, ..., cat_col_1_index: int, cat_col_2_index: int, ...}
for all the categorical columns.
Then, for all of these categorical name and index columns in D, make a map of the form
map = {}
for colname in categorical_colnames:  # placeholder: all categorical column names in D
    map[colname] = []
    # create the mapping for all categorical values of this column
    # see https://spark.apache.org/docs/latest/sql-programming-guide.html#untyped-dataset-operations-aka-dataframe-operations
    for r in D.select(colname, '%s_index' % colname).drop_duplicates().collect():
        enc_from = r[colname]
        enc_to = int(r['%s_index' % colname])  # StringIndexer emits doubles
        map[colname].append((enc_from, enc_to))
    # for categories that may appear later but have not been seen yet
    # (IDK if this is best practice, may be another way, see https://medium.com/@vaibhavshukla182/how-to-solve-mismatch-in-train-and-test-set-after-categorical-encoding-8320ed03552f)
    map[colname].append(('NOVEL_CAT', len(map[colname])))
    # sort by index encoding
    map[colname].sort(key=lambda pair: pair[1])
to end up with something like
{
    'cat_col_1': [('orig_label_11', 0), ('orig_label_12', 1), ...],
    'cat_col_2': [(), (), ...],
    ...
    'cat_col_n': [('orig_label_n1', 0), ...]
}
which can then be used to generate one-hot encoded vectors for each categorical column in any later data sample row ds. E.g.
for colname in categorical_colnames:  # placeholder: all categorical column names in ds
    enc_from = ds[colname]
    # make a zero vector for the one-hot encoding of this category
    col_onehot = np.zeros(len(map[colname]))
    for label, index in map[colname]:
        if label == enc_from:
            col_onehot[int(index)] = 1
            # make a new column in the sample for the one-hot vector
            ds['%s_onehot' % colname] = col_onehot
            break
This structure can then be saved as a pickle, pickle.dump(map, open("cats_map.pkl", "wb")), and used to encode categorical column values when making actual predictions later.
** There may be a better way, but I think I would need to better understand this article (https://medium.com/@satnalikamayank12/on-learning-embeddings-for-categorical-data-using-keras-165ff2773fc9). I will update the answer if I learn anything.
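The same idea can be shown without Spark. The following is a minimal, standard-library-only sketch of fitting a label-to-index map on training data, saving it, and reusing it at prediction time. The helper names fit_category_map and one_hot, the file name, and the extra "unseen" slot are all assumptions made for illustration.

```python
import json
import os
import tempfile


def fit_category_map(values):
    """Assign each distinct label an index in first-seen order."""
    mapping = {}
    for value in values:
        mapping.setdefault(value, len(mapping))
    return mapping


def one_hot(value, mapping):
    """Encode one label; the extra last slot marks labels unseen in training."""
    vector = [0] * (len(mapping) + 1)
    vector[mapping.get(value, len(mapping))] = 1
    return vector


# Training time: learn the mapping and persist it next to the model.
train_column = ["good", "bad", "ok", "good"]
mapping = fit_category_map(train_column)
path = os.path.join(tempfile.mkdtemp(), "cats_map.json")
with open(path, "w") as f:
    json.dump(mapping, f)

# Prediction time: load the same mapping and encode new inputs with it.
with open(path) as f:
    loaded = json.load(f)
print(one_hot("bad", loaded))
print(one_hot("great", loaded))
```

The key point is that the mapping is fitted once on the training data and then only loaded, never refitted, when encoding inputs for prediction.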

different clustering labels

I am trying to cluster new data that was not seen during training and appears only in the test data. The training file has five classes, whereas the test data has seven (5 + 2), where the two are new classes. I want to run k-means to find the proper cluster for each newly added class, or to create a new cluster for it if it is not close to any existing cluster.
This is a part of my code:
print("Reading training data...")
#mydata = pd.read_csv(r'.\KDDTrain.csv', header=0)
mydata = pd.read_csv(r'.\PTraining.csv', header=0)
# select all but the last column as data
X_train = mydata.iloc[1:, :-1]  # .ix was removed from pandas; .iloc selects by position
X_train = np.array(X_train)
n_samples, n_features = np.shape(X_train)
# print np.shape(X_train)
# select last column as target/class
y_train = mydata.iloc[1:, n_features]
y_train = np.array(y_train)
# encode target labels with numeric values from 0 to no of classes
# print "Encoding class labels..."
from sklearn import preprocessing
label_encoder = preprocessing.LabelEncoder()
label_encoder.fit(y_train)
# print list(label_encoder.classes_)
# print 'total no of classes in dataset=' + str(len(label_encoder.classes_))
y_train = label_encoder.transform(y_train)
# n_samples, n_features = data.shape
n_digits = len(np.unique(y_train))
print("Training data statistics")
print("n_attack_catagories: %d, \t n_samples %d, \t n_features %d"
% (n_digits, n_samples, n_features))
sample_size = 300
# Read test data
mytestdata = pd.read_csv(r'.\KDDTest+.csv', header=0)
print("Reading test data...")
# select all but the last column as data
X_test = mytestdata.iloc[1:, :-1]
X_test = np.array(X_test)
# print np.shape(X_test)
# select last column as target/class
y_test = mytestdata.iloc[1:, n_features]
# print "actual labels"
# print y_test
y_test = label_encoder.transform(y_test)
# print "Encoded labels"
# print y_test
y_test = np.array(y_test)
n_samples_test, n_features_test = np.shape(X_test)
n_digits_test = len(np.unique(y_test))
print("Test data statistics")
print("n_attack_catagories: %d, \t n_samples %d, \t n_features %d"
% (n_digits_test, n_samples_test, n_features_test))
print(79 * '_')
and giving this error
File "C:/Users/aalsham4/PycharmProjects/clusteringtask/clustering.py", line 87, in <module>
y_test = label_encoder.transform(y_test)
File "C:\Users\aalsham4\AppData\Local\Continuum\Miniconda3\lib\site-packages\sklearn\preprocessing\label.py", line 153, in transform
raise ValueError("y contains new labels: %s" % str(diff))
ValueError: y contains new labels: ['calss6' 'class7' ]
Now, I'm not sure whether this is the correct way to cluster labeled classes.
Any suggestions?
As @Anony-Mousse already said, this is not a k-means problem. k-means finds the "natural" groupings, given the number of classes you want. Once you assign those labels, further updates are no longer a k-means problem.
You can use a variety of statistical analysis heuristics to decide whether a new class is "sufficiently close" to an existing class. This usually uses measures of mean and deviation (which you already have for the k-means classes), density, and anything else you find pertinent to your problem.
I suggest that you research spectral clustering algorithms and try them on the entire data set; those are better suited to finding gaps, reacting to density, etc. (depending on the algorithm you choose for this application).
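As one illustration of the mean-and-deviation heuristic mentioned above, here is a minimal sketch: a new point joins the nearest existing cluster only if it lies within that cluster's mean member-to-centroid distance plus k standard deviations. The toy 2-D data, the helper names, and the choice k = 3 are all assumptions; a suitable threshold depends on the data.

```python
import math
from statistics import mean, stdev

# Toy clusters with 2-D points; real data would come from the fitted model.
clusters = {
    "class1": [(0.0, 0.0), (0.2, 0.1), (-0.1, 0.3), (0.1, -0.2)],
    "class2": [(5.0, 5.0), (5.2, 4.9), (4.8, 5.1), (5.1, 5.3)],
}


def centroid(points):
    """Coordinate-wise mean of the points."""
    return tuple(mean(coord) for coord in zip(*points))


def radius(points, center, k=3.0):
    """Mean member-to-centroid distance plus k standard deviations."""
    distances = [math.dist(p, center) for p in points]
    return mean(distances) + k * stdev(distances)


def assign(point, clusters):
    """Return the nearest cluster whose radius contains the point, else None."""
    best_name, best_dist = None, math.inf
    for name, points in clusters.items():
        center = centroid(points)
        d = math.dist(point, center)
        if d <= radius(points, center) and d < best_dist:
            best_name, best_dist = name, d
    return best_name  # None means "not close to any existing cluster"


print(assign((0.1, 0.1), clusters))
print(assign((10.0, -3.0), clusters))
```

A None result is the signal to open a new cluster for that point. Euclidean distance to the centroid ignores cluster shape and density, so this is only a starting point.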
