How can I save and reuse one-hot encoding in Keras? - machine-learning

I'm working on a project related to NLP. I use one-hot encoding for text representation in Google Colab, then I fit it into an LSTM.
This is my code:
from tensorflow.keras.preprocessing.text import one_hot
voc_size=13000
onehot_repr = [one_hot(words, voc_size) for words in X1]
The model seems good, but when I want to save it for making predictions on new text, I save it using pickle:
import pickle
with open("one_hot", "wb") as f:
pickle.dump(one_hot, f)
But when I restart Colab and load the saved one_hot again, the number that represents a word is different.
Is there any possible way that I can save the one-hot encoding and get the same result in Colab?
Because I cannot save the one-hot encoder for use another time, I save the one-hot representation as a list and access it by index later:
## load saved model
from tensorflow.keras.models import load_model
my_model = load_model("model9419.h5")

## load one-hot representation
import json
with open('/content/drive/MyDrive/last_model/on_hot.json', 'rb') as f:
    oneHot = json.load(f)
In order to predict a word, I use simple array element access to find the one-hot representation of that word.
Is this a correct way to make a prediction? Is there any better way than that?
And if I can save the one_hot function, how can I use it in a Flask server?
Also, can anyone recommend a word representation that is easy to use, can be saved for use in Flask, and works better?

First, create a one-hot dict, then convert it to a pandas DataFrame and save that DataFrame as a .csv. For example:
import pandas as pd
from tensorflow.keras.preprocessing.text import one_hot
onehot_dict = {}
voc_size = 3
for words in ['this', 'that', 'then']:
    onehot_dict[words] = one_hot(words, voc_size)
onehot_df = pd.DataFrame(onehot_dict)
onehot_df.to_csv('./onehot.csv', index=False)
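Later, a minimal sketch of reading that CSV back into a dict (column names are the words, the values are their integer encodings; the actual numbers depend on how one_hot hashed each word):
import pandas as pd

# Load the saved mapping back; each column is a word and holds its encoding.
onehot_df = pd.read_csv('./onehot.csv')
onehot_dict = {word: onehot_df[word].tolist() for word in onehot_df.columns}
print(onehot_dict)  # e.g. {'this': [2], 'that': [1], 'then': [1]} -- values will vary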

Related

I created TF-IDF code to analyze an annual report and I want to know the importance of specific keywords

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

with open(r'C:\Users\maxim\PycharmProjects\THESIS\data\santander2020_1.txt', 'r') as file:
    data = file.read()
dataset = [data]
tfIdfVectorizer = TfidfVectorizer(use_idf=True, stop_words="english",
                                  lowercase=True, max_features=100, ngram_range=(1, 3))
tfIdf = tfIdfVectorizer.fit_transform(dataset)
# on newer scikit-learn versions use get_feature_names_out() instead
df = pd.DataFrame(tfIdf[0].T.todense(), index=tfIdfVectorizer.get_feature_names(), columns=["TF-IDF"])
df = df.sort_values('TF-IDF', ascending=False)
print(df.head(25))
The above code is what I've created to do a TF-IDF analysis on an annual report; currently it gives me the values of the most important words within the report. However, I only need the TF-IDF values for the keywords
["digital", "hardware", "innovation", "software", "analytics", "data", "digitalisation", "technology"]. Is there a way I can specify to only look at the TF-IDF values of these terms?
I'm very new to programming with little experience, I'm doing this for my thesis.
Any help is greatly appreciated.
You have defined tfIdf as tfIdf = tfIdfVectorizer.fit_transform(dataset).
So tfIdf.toarray() would be a 2-D array, where each row refers to a document and each element in the row refers to the TF-IDF score of the corresponding word. To know what word each element is representing, you could use the .get_feature_names() function which would print a list of words. Then you can use this information to create a mapping (dict) from words to scores, like this:
wordScores = dict(zip(tfIdfVectorizer.get_feature_names(), tfIdf.toarray()[0]))
Now suppose your document contains the word "digital" and you want to know its TF-IDF score, you could simply print the value of wordScores["digital"].
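Building on the wordScores dict above, a short sketch of how you could pull out just the keywords from the question (keywords that never appear in the report simply won't be in the dict):
keywords = ["digital", "hardware", "innovation", "software", "analytics",
            "data", "digitalisation", "technology"]
# Keep only the keywords that actually occur in the fitted vocabulary.
keyword_scores = {w: wordScores[w] for w in keywords if w in wordScores}
print(keyword_scores)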

EEG data preprocessing with mne python

I have the physiological EEG emotion dataset named "DEAP". I want to analyze and visualize the data through MNE, but it has its own format.
How can I load my own data for preprocessing when the data format is (.dat)?
import pickle
with open('s01.dat', 'rb') as f:
    y = pickle.load(f, encoding='latin1')
This one works for me.
Of course, the ".dat" file is in the same directory as this code.
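If you then want the data inside MNE, here is a rough sketch assuming the usual preprocessed DEAP layout (a dict whose 'data' array has shape (trials, channels, samples) sampled at 128 Hz; check your own copy, and note the last channels are peripheral signals rather than EEG):
import pickle
import mne

with open('s01.dat', 'rb') as f:
    y = pickle.load(f, encoding='latin1')

trial = y['data'][0]                           # first trial: (channels, samples)
sfreq = 128.0                                  # assumed DEAP preprocessed rate
ch_names = ['ch%d' % i for i in range(trial.shape[0])]
info = mne.create_info(ch_names=ch_names, sfreq=sfreq, ch_types='eeg')
raw = mne.io.RawArray(trial, info)             # wrap the trial as an MNE Raw object
raw.plot(duration=5, n_channels=10)            # regular MNE tools now apply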

LSTM machine learning panda

I am trying to use TensorFlow and its LSTM.
For that, I have data in a text file (10 MB).
When I try to copy the data into NumPy, I get a memory-full error.
Any suggestions on how to get the data ready so that I can use it in an LSTM?
I read the data from the file before processing it with TensorFlow using this function:
import numpy as np

def read_data(fname):
    with open(fname, encoding="utf8") as f:
        content = f.readlines()
    content = [x.strip() for x in content]
    content = [word for i in range(len(content)) for word in content[i].split()]
    content = np.array(content)
    return content
At np.array(content), it gives a memory-full error. How can I get around this so that I can use this data in an LSTM in TensorFlow?
Please also suggest if there is any LSTM setup that can read large amounts of data.
The memory error indeed means that you cannot fit the NumPy array into your memory, because of the overhead of indexing string lists in NumPy. The problem is that you are not creating a single matrix of words. Each word list in content has a different length, so calling np.array will create an array for each line and then add them into one large NumPy array. This is not what NumPy is for. NumPy is efficient when dealing with numerical tensors, not lists of lists of strings.
Here is a related question.
If you plan to use TensorFlow, you can use the tf.data.Dataset API. It can load the file line by line, and you can then apply all the processing you need within TensorFlow, e.g., applying tf.strings.split (via the map method) and padding + batching the data.
You will end up with something like this:
tf.data.TextLineDataset(fname).map(lambda s: tf.strings.split([s])[0])
Note that before batching and passing it into LSTM you need to convert the strings to vocabulary indices and call embedding lookup on the indices.
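A minimal sketch of such a pipeline (the file name and bucket count are placeholders, and the vocabulary lookup is replaced by simple hashing so the example stays self-contained):
import tensorflow as tf

# Stream the text file line by line instead of materialising it in NumPy.
dataset = tf.data.TextLineDataset("corpus.txt")
dataset = dataset.map(lambda s: tf.strings.split([s])[0])          # line -> words

# Turn words into integer ids; a real model would use a vocabulary table,
# hashing is just the simplest stand-in here.
num_buckets = 10000
dataset = dataset.map(lambda w: tf.strings.to_hash_bucket_fast(w, num_buckets))

# Pad each batch to the longest line in that batch before feeding an LSTM.
dataset = dataset.padded_batch(32, padded_shapes=[None])

for batch in dataset.take(1):
    print(batch.shape)   # (batch_size, longest_line_in_this_batch)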

error while passing data-frame through k-means

Although my data frame has float values everywhere, when passing it through k-means it says it couldn't convert a string to float.
How can I convert NaN values, if any, to float values in the entire data frame?
This should do the job: it converts the columns that are in string format to categorical codes (alternatively, you could one-hot encode the variables in those columns).
import numpy as np
from sklearn.cluster import KMeans
import pandas
df = pandas.read_csv('zipIncome.csv')
print(df)
# col_name is a placeholder: repeat for each column that holds strings
df[col_name] = df[col_name].astype('category')
df[col_name] = df[col_name].cat.codes
kmeans = KMeans(n_clusters=4,init='k-means++', max_iter=600, algorithm = 'auto').fit(df)
print (kmeans.labels_)
print(kmeans.cluster_centers_)
Based on your code, it would seem that you only instantiated the KMeans but haven't used it.
You'll need input data that is clean (i.e. no strings etc.); let's call it X.
kmeans = KMeans(n_clusters=4,init='k-means++', max_iter=600, algorithm = 'auto')
clusters = kmeans.fit_predict(X)
now clusters has the cluster number for each sample in X.
(alternatively, you can do the fit(X) and then later predict(X) separately, but ultimately it is the predict that will output the cluster labels that you will need)
If you want to get clusters on new data later, you should use kmeans.predict(new_data) rather than fit_predict(), so that KMeans uses what it learned from X and applies it to your new_data (or, depending on your needs, you might want to retrain it).
Hope this helps.
Finally, you can add another column to your pandas DataFrame by doing:
df['cluster'] = clusters
where 'cluster' is the name of your new column; you can of course call it whatever you want.
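Putting the pieces together, a small self-contained sketch (the numbers are made up) of fitting once and then reusing the fitted model on new rows:
import numpy as np
from sklearn.cluster import KMeans

# Clean numeric input only -- no strings, no NaNs.
X = np.array([[1.0, 2.0], [1.1, 1.9], [8.0, 9.0], [8.2, 8.8]])

kmeans = KMeans(n_clusters=2, init='k-means++', max_iter=600)
clusters = kmeans.fit_predict(X)            # cluster label for each row of X

new_data = np.array([[1.05, 2.05], [8.1, 9.1]])
print(kmeans.predict(new_data))             # reuse the fitted model on new rows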

Is there a way to save the preprocessing objects in scikit-learn? [duplicate]

I am building a neural net with the purpose of making predictions on new data in the future. I first preprocess the training data using sklearn.preprocessing, then train the model, then make some predictions, then close the program. In the future, when new data comes in, I have to use the same preprocessing scales to transform the new data before putting it into the model. Currently, I have to load all of the old data, fit the preprocessor, then transform the new data with those preprocessors. Is there a way for me to save the preprocessing objects (like sklearn.preprocessing.StandardScaler) so that I can just load the old objects rather than having to remake them?
I think besides pickle, you can also use joblib to do this. As stated in Scikit-learn's manual, 3.4. Model persistence:
In the specific case of scikit-learn, it may be better to use joblib’s replacement of pickle (dump & load), which is more efficient on objects that carry large numpy arrays internally as is often the case for fitted scikit-learn estimators, but can only pickle to the disk and not to a string:
from joblib import dump, load
dump(clf, 'filename.joblib')
Later you can load back the pickled model (possibly in another Python process) with:
clf = load('filename.joblib')
Refer to other posts for more information, Saving StandardScaler() model for use on new datasets, Save MinMaxScaler model in sklearn.
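Applied to the preprocessing case from the question, here is a minimal sketch (the file name is arbitrary): fit a StandardScaler once, dump it, and load it later to transform new data without touching the old training data.
import numpy as np
from joblib import dump, load
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])
scaler = StandardScaler().fit(X_train)
dump(scaler, 'scaler.joblib')               # persist the fitted scaler

# ...later, possibly in another Python process:
scaler = load('scaler.joblib')
X_new = np.array([[1.5, 250.0]])
print(scaler.transform(X_new))              # uses the means/stds learned from X_train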
As mentioned by lejlot, you can use the pickle library to save the trained network as a file on your hard drive; then you just need to load it to start making predictions.
Here is an example on how to use pickle to save and load python objects:
import pickle
import numpy as np
npTest_obj = np.asarray([[1,2,3],[6,5,4],[8,7,9]])
strTest_obj = "pickle example XXXX"
if __name__ == "__main__":
    # store object information
    pickle.dump(npTest_obj, open("npObject.p", "wb"))
    pickle.dump(strTest_obj, open("strObject.p", "wb"))

    # read information from file
    str_readObj = pickle.load(open("strObject.p", "rb"))
    np_readObj = pickle.load(open("npObject.p", "rb"))

    print(str_readObj)
    print(np_readObj)
