I'm dabbling with ML and was able to take a tutorial and get it to work for my needs. It's a simple recommender system using TfidfVectorizer and linear_kernel. I'm running into a problem with how to deploy it through SageMaker with an endpoint.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
import json
import csv
# Load the source data into a DataFrame
with open('data/big_data.json') as json_file:
    data = json.load(json_file)
ds = pd.DataFrame(data)

# Vectorize the text and compute pairwise cosine similarities
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 3), min_df=0, stop_words='english')
tfidf_matrix = tf.fit_transform(ds['content'])
cosine_similarities = linear_kernel(tfidf_matrix, tfidf_matrix)

# For each item, keep the most similar items in descending order
# (the first entry is the item itself and gets dropped below)
results = {}
for idx, row in ds.iterrows():
    similar_indices = cosine_similarities[idx].argsort()[:-100:-1]
    similar_items = [(cosine_similarities[idx][i], ds['id'][i]) for i in similar_indices]
    results[row['id']] = similar_items[1:]

def item(id):
    return ds.loc[ds['id'] == id]['id'].tolist()[0]

def recommend(item_id, num):
    print("Recommending " + str(num) + " products similar to " + item(item_id) + "...")
    print("-------")
    recs = results[item_id][:num]
    for rec in recs:
        print("Recommended: " + item(rec[1]) + " (score:" + str(rec[0]) + ")")

recommend(item_id='129035', num=5)
As a starting point, I'm not sure whether the "model" here is the output of tf.fit_transform(ds['content']) or the output of linear_kernel(tfidf_matrix, tfidf_matrix).
I came to the conclusion that I didn't need to deploy this through SageMaker. Since the final output was a dictionary keyed by item ID, I could do quick lookups to find correlated items.
I have it working on AWS with API Gateway/Lambda, DynamoDB, and an EC2 server to collect and process the data and load it into DynamoDB for fast lookups. No expensive SageMaker endpoint needed.
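For reference, a minimal sketch of what the lookup side of such a setup could look like (the table name, key schema, and stored item shape here are assumptions for illustration, not the exact implementation): an API Gateway-backed Lambda handler that fetches the precomputed recommendation list for an item ID from DynamoDB.

# Hypothetical Lambda handler: fetch precomputed recommendations from DynamoDB.
# Table name, key name, and item shape are assumptions for illustration only.
import json
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("recommendations")  # assumed table name

def lambda_handler(event, context):
    item_id = event["queryStringParameters"]["item_id"]
    response = table.get_item(Key={"item_id": item_id})  # assumed partition key
    recs = response.get("Item", {}).get("similar_items", [])
    return {
        "statusCode": 200,
        "body": json.dumps({"item_id": item_id, "recommendations": recs}),
    }

The processing job on EC2 would then only need to write each results[item_id] list into the table under its item ID.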
Related
I have a problem where I have to predict a buyer using machine learning (I created a dummy dataset). I need to transform the data before I can use it for machine learning. I am aggregating information at the id, visit level, which gives me a list of the food and cloths bought. This list needs to be one-hot encoded before applying a classifier model.
import pandas as pd
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
def preprocess(df):
    # Only keep rows till buyer=1
    df = df.groupby(["id1", "visit"], group_keys=False).apply(
        lambda g: g.loc[: g["Buyer"].idxmax()]
    )
    # Form lists on each id1,visit level
    df1 = df.groupby(["id1", "visit"], as_index=False).agg(
        is_Pax=("Buyer", "max"),
        fruits=("fruits", lambda x: x.dropna().unique().tolist()),
        cloths=("cloths", lambda x: x.dropna().unique().tolist()),
    )
    col = ["fruits", "cloths"]
    df_transformed = onehot(df1, col)
    return df_transformed

def onehot(df, col):
    """
    This function does one hot encoding of a list column.
    """
    onehot_list_encoder = MultiLabelBinarizer()
    for cl in col:
        print("One hot encoding ", cl)
        newd = pd.DataFrame(
            onehot_list_encoder.fit_transform(df[cl]),
            columns=onehot_list_encoder.classes_,
        ).add_prefix(cl + "_")
        df = df.join(newd)
    return df

df = pd.DataFrame(np.array([['a', 'a', 'b', 'b', 'a', 'a'], [1, 2, 2, 2, 1, 1],
                            ['Apple', 'Apple', 'Banana', None, 'Orange', 'Pear'], [1, 2, 1, 3, 4, 5],
                            [0, 0, 1, 0, 1, 0]]).T,
                  columns=['id1', 'visit', 'fruits', 'cloths', 'Buyer'])
df['Buyer'] = df['Buyer'].astype('int')
How can I create a simple ML model that applies this preprocessing to the data at both fit and predict time? On test data I want the same transformation (i.e. 0 for all columns not present in the test rows). Can a Pipeline solve this? I am not good at writing pipelines and am getting errors.
droplist = ['id1', 'visit', 'fruits', 'cloths']
pipe = Pipeline(steps=[
    ("preprocess", preprocess(df)),
    ("coltrans", ColumnTransformer([("drop", 'drop', droplist)])),
    ("model", GradientBoostingClassifier(n_estimators=100, learning_rate=0.1)),
])
Can someone help?
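One possible direction, sketched under clear assumptions (this custom transformer class is illustrative, not a verified fix for the pipeline above): wrap the list-column one-hot encoding in a scikit-learn transformer, so the binarizers are fitted only on the training data and reused at predict time, which leaves 0 for labels that never occur in the test rows.

# Hypothetical sketch: a transformer that fits one MultiLabelBinarizer per
# list column during fit() and reuses it in transform(), so test data gets
# the same output columns (unseen labels are ignored, missing labels stay 0).
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import MultiLabelBinarizer

class ListOneHotEncoder(BaseEstimator, TransformerMixin):
    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        # One binarizer per list column, fitted on the training data only
        self.encoders_ = {
            col: MultiLabelBinarizer().fit(X[col]) for col in self.columns
        }
        return self

    def transform(self, X):
        out = X.drop(columns=self.columns)
        for col, enc in self.encoders_.items():
            encoded = pd.DataFrame(
                enc.transform(X[col]),
                columns=[col + "_" + str(c) for c in enc.classes_],
                index=X.index,
            )
            out = out.join(encoded)
        return out

Such a transformer could then replace the "preprocess" step (the id1/visit aggregation itself would still have to run before the pipeline or in a separate step), followed by the column-dropping ColumnTransformer and the GradientBoostingClassifier.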
I computed embeddings with fastText and I have clusters from KMeans.
I would like to calculate similarities inside each cluster to check whether the sentences inside are well clustered. I want to keep sentences with good similarities in each cluster. If the similarity is not good, I want to remove the sentences that do not really belong to their cluster, and then group those leftover similar sentences together.
How can I do this in a good manner? I thought of using cosine similarity, but I don't know how to compare all the sentences inside a cluster.
Maybe something like this...
# clustering words into similar groups:
import numpy as np
from sklearn.cluster import AffinityPropagation
import distance
words = 'XYZ,LDPELDKSL,DFKLKSLFD,ABC,DLFKFKDLD,XYZ,LDPELDKSL,DFKLKSLFD,ABC,DLFKFKDLD,XYZ,LDPELDKSL,XYZ,LDPELDKSL,DFKLKSLFD,ABC,DLFKFKDLD,XYZ,LDPELDKSL,DFKLKSLFD,ABC,DLFKFKDLD,XYZ,LDPELDKSL'.split(',') #Replace this line
words = np.asarray(words) #So that indexing with a list will work
lev_similarity = -1*np.array([[distance.levenshtein(w1,w2) for w1 in words] for w2 in words])
affprop = AffinityPropagation(affinity="precomputed", damping=0.5)
affprop.fit(lev_similarity)
for cluster_id in np.unique(affprop.labels_):
    exemplar = words[affprop.cluster_centers_indices_[cluster_id]]
    cluster = np.unique(words[np.nonzero(affprop.labels_ == cluster_id)])
    cluster_str = ", ".join(cluster)
    print(" - *%s:* %s" % (exemplar, cluster_str))
Result:
- *LDPELDKSL:* LDPELDKSL
- *DFKLKSLFD:* DFKLKSLFD
- *XYZ:* ABC, XYZ
- *DLFKFKDLD:* DLFKFKDLD
See these links for additional guidance on how to cluster text.
https://towardsdatascience.com/applying-machine-learning-to-classify-an-unsupervised-text-document-e7bb6265f52
https://scikit-learn.org/stable/auto_examples/text/plot_document_clustering.html
https://pythonprogramminglanguage.com/kmeans-text-clustering/
http://brandonrose.org/clustering
Here are a couple of examples using cosine similarity.
d1 = "plot: two teen couples go to a church party, drink and then drive."
d2 = "films adapted from comic books have had plenty of success , whether they're about superheroes ( batman , superman , spawn ) , or geared toward kids ( casper ) or the arthouse crowd ( ghost world ) , but there's never really been a comic book like from hell before . "
d3 = "every now and then a movie comes along from a suspect studio , with every indication that it will be a stinker , and to everybody's surprise ( perhaps even the studio ) the film becomes a critical darling . "
d4 = "damn that y2k bug . "
documents = [d1, d2, d3, d4]
import nltk, string, numpy
nltk.download('punkt') # first-time use only
stemmer = nltk.stem.porter.PorterStemmer()
def StemTokens(tokens):
    return [stemmer.stem(token) for token in tokens]
remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)
def StemNormalize(text):
    return StemTokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict)))

nltk.download('wordnet') # first-time use only
lemmer = nltk.stem.WordNetLemmatizer()
def LemTokens(tokens):
    return [lemmer.lemmatize(token) for token in tokens]
remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)
def LemNormalize(text):
    return LemTokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict)))
from sklearn.feature_extraction.text import CountVectorizer
LemVectorizer = CountVectorizer(tokenizer=LemNormalize, stop_words='english')
LemVectorizer.fit_transform(documents)
print(LemVectorizer.vocabulary_)
tf_matrix = LemVectorizer.transform(documents).toarray()
print(tf_matrix)
tf_matrix.shape
from sklearn.feature_extraction.text import TfidfTransformer
tfidfTran = TfidfTransformer(norm="l2")
tfidfTran.fit(tf_matrix)
print(tfidfTran.idf_)
import math
def idf(n, df):
    result = math.log((n + 1.0) / (df + 1.0)) + 1
    return result
print("The idf for terms that appear in one document: " + str(idf(4,1)))
print("The idf for terms that appear in two documents: " + str(idf(4,2)))
tfidf_matrix = tfidfTran.transform(tf_matrix)
print(tfidf_matrix.toarray())
cos_similarity_matrix = (tfidf_matrix * tfidf_matrix.T).toarray()
print(cos_similarity_matrix)
from sklearn.feature_extraction.text import TfidfVectorizer
TfidfVec = TfidfVectorizer(tokenizer=LemNormalize, stop_words='english')
def cos_similarity(textlist):
    tfidf = TfidfVec.fit_transform(textlist)
    return (tfidf * tfidf.T).toarray()
cos_similarity(documents)
https://sites.temple.edu/tudsc/2017/03/30/measuring-similarity-between-texts-in-python/
# Define the documents
doc_trump = "Mr. Trump became president after winning the political election. Though he lost the support of some republican friends, Trump is friends with President Putin"
doc_election = "President Trump says Putin had no political interference is the election outcome. He says it was a witchhunt by political parties. He claimed President Putin is a friend who had nothing to do with the election"
doc_putin = "Post elections, Vladimir Putin became President of Russia. President Putin had served as the Prime Minister earlier in his political career"
documents = [doc_trump, doc_election, doc_putin]
# Scikit Learn
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
# Create the Document Term Matrix
count_vectorizer = CountVectorizer(stop_words='english')
count_vectorizer = CountVectorizer()  # this overrides the line above and keeps stop words
sparse_matrix = count_vectorizer.fit_transform(documents)
# OPTIONAL: Convert Sparse Matrix to Pandas Dataframe if you want to see the word frequencies.
doc_term_matrix = sparse_matrix.todense()
df = pd.DataFrame(doc_term_matrix,
                  columns=count_vectorizer.get_feature_names(),  # get_feature_names_out() on newer scikit-learn
                  index=['doc_trump', 'doc_election', 'doc_putin'])
df
# Compute Cosine Similarity
from sklearn.metrics.pairwise import cosine_similarity
print(cosine_similarity(df, df))
https://www.machinelearningplus.com/nlp/cosine-similarity/
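Coming back to the original fastText + KMeans question, a minimal sketch of one way to score cluster quality (assuming an embeddings array of shape (n_sentences, dim) and a fitted KMeans object already exist; the 0.5 threshold is arbitrary) is to compare each sentence to its cluster centroid with cosine similarity:

# Hypothetical sketch: flag sentences whose cosine similarity to their own
# KMeans centroid is low, so they can be pulled out and re-clustered later.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def split_by_cluster_fit(embeddings, kmeans, threshold=0.5):
    labels = kmeans.labels_
    keep, leftovers = [], []
    for cluster_id in np.unique(labels):
        idx = np.where(labels == cluster_id)[0]
        centroid = kmeans.cluster_centers_[cluster_id].reshape(1, -1)
        sims = cosine_similarity(embeddings[idx], centroid).ravel()
        keep.extend(idx[sims >= threshold].tolist())       # well-clustered sentences
        leftovers.extend(idx[sims < threshold].tolist())   # candidates for re-clustering
    return keep, leftovers

The leftover indices could then be clustered again on their own, which matches the "group similar sentences not belonging to clusters" step in the question.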
I want to train an XGBoost classifier with Coiled and Dask.
The problem is that my training data is really big and is stored in an HDF5 file (h5py) on my computer. Is there a way to upload the h5py file directly to the workers?
To show my problem I created an example. For this example I create some random data and store it in an h5py file, so you can see what my data looks like. In my real use case, the data has 7245346 features and 2157 samples.
import coiled
import h5py
import numpy as np
import dask.array as da
from dask.distributed import Client
import xgboost as xgb
input_path = "test.h5"
# create some random data
n_features = 500
n_samples = 200
X = np.random.randint(0,3,size=[n_samples, n_features])
y = np.random.randint(0,5,size=[n_samples])
with h5py.File(input_path, mode='w') as file:
    file.create_dataset('X', data=X)
    file.create_dataset('y', data=y)

rows_per_chunk = 100

coiled.create_software_environment(
    name="xgboost-on-coiled",
    pip=["coiled", "h5py", "dask", "xgboost"])
with coiled.Cluster(
        name="xgboost-cluster",
        n_workers=2,
        worker_cpu=8,
        worker_memory="16GiB",
        software="xgboost-on-coiled") as cluster:
    with Client(cluster) as client:
        file = h5py.File(input_path, mode='r')
        n_features = file["X"].shape[1]
        X = da.from_array(file["X"], chunks=(rows_per_chunk, n_features))
        X = X.rechunk(chunks=(rows_per_chunk, n_features))
        X = X.astype("int8")
        X = X.persist()
        y = da.from_array(file["y"], chunks=rows_per_chunk)
        n_class = np.unique(y.compute()).size
        y = y.astype("int8")
        y = y.persist()
        dtrain = xgb.dask.DaskDMatrix(
            client,
            X,
            y,
            feature_names=['%i' % i for i in range(n_features)])
        model_params = {
            'objective': 'multi:softprob',
            'eval_metric': 'mlogloss',
            'num_class': n_class}
        # train model
        output = xgb.dask.train(
            client,
            params=model_params,
            dtrain=dtrain)
        booster = output["booster"]
The error message:
FileNotFoundError: [Errno 2] Unable to open file (unable to open file: name = 'test.h5', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0)
For smaller amounts of data, I can load the data into RAM directly. But for more data, this does not work anymore. Just so you know what I am talking about:
input_path = "test.h5"
n_features = 500
n_samples = 200
X = np.random.randint(0,3,size=[n_samples, n_features])
y = np.random.randint(0,5,size=[n_samples])
with h5py.File(input_path, mode='w') as file:
    file.create_dataset('X', data=X)
    file.create_dataset('y', data=y)

rows_per_chunk = 100

coiled.create_software_environment(
    name="xgboost-on-coiled",
    pip=["coiled", "h5py", "dask", "xgboost"])

with coiled.Cluster(
        name="xgboost-cluster",
        n_workers=2,
        worker_cpu=8,
        worker_memory="16GiB",
        software="xgboost-on-coiled") as cluster:
    with Client(cluster) as client:
        file = h5py.File(input_path, mode='r')
        n_features = file["X"].shape[1]
        X = file["X"][:]
        X = da.from_array(X, chunks=(rows_per_chunk, n_features))
        y = file["y"][:]
        n_class = np.unique(y).size
        y = da.from_array(y, chunks=rows_per_chunk)
        dtrain = xgb.dask.DaskDMatrix(
            client,
            X,
            y,
            feature_names=['%i' % i for i in range(n_features)])
        model_params = {
            'objective': 'multi:softprob',
            'eval_metric': 'mlogloss',
            'num_class': n_class}
        # train model
        output = xgb.dask.train(
            client,
            params=model_params,
            dtrain=dtrain)
        booster = output["booster"]
If this code is used with large amounts of data, no error message is displayed; simply nothing happens. I do not see the data being uploaded.
I have tried so many things and nothing has worked. I would be very grateful if you have some advice for me on how to do this.
(Just in case you are wondering why I am trying to train a model on 7 million features: I want to get the feature importance for feature selection)
Is there a way to upload the h5py file directly to the workers?
When using Coiled, the recommended way is to upload the data to an AWS S3 bucket (or similar) and read it directly from there. This is because Coiled provisions Dask clusters on the cloud, and there is a cost to moving data (e.g., from your local machine to the cloud). It's more efficient to have your data on the cloud and, if possible, in the same AWS region. Also, see the Coiled documentation: How do I access my data from Coiled?.
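As a rough illustration of that workflow (the bucket and key names here are made up, and it assumes the HDF5 arrays are converted to Zarr first, with s3fs and zarr added to the software environment and AWS credentials available):

# Hypothetical sketch: convert the local HDF5 data to Zarr on S3 once,
# then let the Coiled workers read it straight from the bucket.
import h5py
import dask.array as da

rows_per_chunk = 100

# One-off conversion/upload from the local machine
with h5py.File("test.h5", mode="r") as f:
    X = da.from_array(f["X"], chunks=(rows_per_chunk, f["X"].shape[1]))
    y = da.from_array(f["y"], chunks=rows_per_chunk)
    da.to_zarr(X, "s3://my-bucket/xgb-data/X")  # made-up bucket/prefix
    da.to_zarr(y, "s3://my-bucket/xgb-data/y")

# Inside the cluster/client context, read directly from S3
X = da.from_zarr("s3://my-bucket/xgb-data/X")
y = da.from_zarr("s3://my-bucket/xgb-data/y")

From there, the DaskDMatrix and xgb.dask.train calls from the question should work unchanged.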
I'm working with datashader and dask, but I'm having problems when trying to plot with a cluster running. To make it more concrete, I have the following example (embedded in a Bokeh plot):
import holoviews as hv
import pandas as pd
import dask.dataframe as dd
import numpy as np
from holoviews.operation.datashader import datashade
import datashader.transfer_functions as tf
from dask.distributed import Client, LocalCluster

# initialize the client/cluster
cluster = LocalCluster(n_workers=4, threads_per_worker=1)
dask_client = Client(cluster)

def datashade_plot():
    hv.extension('bokeh')
    # create some random data (in the actual code this is a parquet file
    # with millions of rows, this is just an example)
    delta = 1/1000
    x = np.arange(0, 1, delta)
    y = np.cumsum(np.sqrt(delta)*np.random.normal(size=len(x)))
    df = pd.DataFrame({'X': x, 'Y': y})
    # create dask dataframe
    points_dd = dd.from_pandas(df, npartitions=3)
    # create plot
    points = hv.Curve(points_dd)
    return datashade(points)

dask_client.submit(datashade_plot).result()
This raises a:
TypeError: can't pickle weakref objects
My theory is that this happens because the datashade operations can't be distributed across the cluster. Sorry if this is a noob question; I'd be very grateful for any advice you could give me.
I think you want to go the other way. That is, pass datashader a dask dataframe instead of a pandas dataframe:
>>> from dask import dataframe as dd
>>> import multiprocessing as mp
>>> dask_df = dd.from_pandas(df, npartitions=mp.cpu_count())
>>> dask_df.persist()
...
>>> cvs = datashader.Canvas(...)
>>> agg = cvs.points(dask_df, ...)
XREF: https://datashader.org/user_guide/Performance.html
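Applied to the example from the question, a minimal sketch (reusing the question's names; not a verified fix) would build the plot in the main process and let datashade aggregate a persisted dask dataframe, instead of submitting the whole plotting function to a worker:

# Hypothetical adaptation of the question's example: keep the plotting in the
# main process and hand datashade a persisted dask dataframe.
import numpy as np
import pandas as pd
import dask.dataframe as dd
import holoviews as hv
from dask.distributed import Client, LocalCluster
from holoviews.operation.datashader import datashade

hv.extension('bokeh')
cluster = LocalCluster(n_workers=4, threads_per_worker=1)
dask_client = Client(cluster)

delta = 1 / 1000
x = np.arange(0, 1, delta)
y = np.cumsum(np.sqrt(delta) * np.random.normal(size=len(x)))
points_dd = dd.from_pandas(pd.DataFrame({'X': x, 'Y': y}), npartitions=4).persist()

# datashade pulls from the distributed dataframe; no client.submit needed
shaded = datashade(hv.Curve(points_dd))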
I am trying to combine Dask with Luigi, and while the business logic works fine by itself, the code starts throwing errors when I run it as a Luigi task:
raise ValueError('url type not understood: %s' % urlpath)
ValueError: url type not understood: <_io.TextIOWrapper name='../data/2017_04_11_oldsource_geocoded.csv-luigi-tmp-1647603946' mode='wb' encoding='UTF-8'>
The code is here (I dropped the business logic part to make it shorter):
import pandas as pd
import geopandas as gp
from geopandas.tools import sjoin
from dask import dataframe as dd
from shapely.geometry import Point
from os import path
import luigi
class geocode_tweets(luigi.Task):
    boundaries = _load_geoboundaries()
    nyc = boundaries[0].unary_union

    def requires(self):
        return []

    def output(self):
        self.path = '../data/2017_04_11_oldsource_geocoded.csv'
        return luigi.LocalTarget(self.path)

    def run(self):
        df = dd.read_csv(path.join(data_dir, '2017_03_22_oldsource.csv'))
        df['geometry'] = df.apply(_get_point, axis=1)
        meta = _form_meta(df)
        S = df.map_partitions(
            distributed_sjoin, boundaries=self.boundaries,
            nyc_border=self.nyc, meta=meta).drop('geometry', axis=1)
        f = self.output().open('w')
        S.to_csv(f)
        f.close()
The problem, it looks like, is in the output part.
As far as I understand, the problem is that Dask does not accept a Luigi file object as a substitute for a filename string.
Dask defines DataFrame.to_csv(filename, **kwargs) and you are sending it a file instead of a filename. Replace those last three lines with:
S.to_csv(self.output().path)
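For context, a sketch of the run method with that change applied (the helper functions are still the business logic omitted from the question):

# Adjusted run(): hand Dask the target's path string instead of an open
# Luigi file object. Helpers like _get_point and distributed_sjoin are the
# business logic the question left out.
def run(self):
    df = dd.read_csv(path.join(data_dir, '2017_03_22_oldsource.csv'))
    df['geometry'] = df.apply(_get_point, axis=1)
    meta = _form_meta(df)
    S = df.map_partitions(
        distributed_sjoin, boundaries=self.boundaries,
        nyc_border=self.nyc, meta=meta).drop('geometry', axis=1)
    S.to_csv(self.output().path)

Dask then handles opening and closing the file itself, so the explicit open/close of the Luigi target is no longer needed.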