Preprocess and data transformation in machine learning - machine-learning

I have a problem where I have to predict a buyer using machine learning (I created a dummy dataset). I need to transform the data before I can use it for training. I aggregate the information at the id1, visit level, which gives me a list of the fruits and cloths bought in each visit. These lists need to be one-hot encoded before a classifier model can be applied.
import pandas as pd
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
def preprocess(df):
    # Only keep rows till buyer=1
    df = df.groupby(["id1", "visit"], group_keys=False).apply(
        lambda g: g.loc[: g["Buyer"].idxmax()]
    )
    # Form lists on each id1,visit level
    df1 = df.groupby(["id1", "visit"], as_index=False).agg(
        is_Pax=("Buyer", "max"),
        fruits=("fruits", lambda x: x.dropna().unique().tolist()),
        cloths=("cloths", lambda x: x.dropna().unique().tolist()),
    )
    col = ["fruits", "cloths"]
    df_transformed = onehot(df1, col)
    return df_transformed


def onehot(df, col):
    """
    This function does one hot encoding of a list column.
    """
    onehot_list_encoder = MultiLabelBinarizer()
    for cl in col:
        print("One hot encoding ", cl)
        newd = pd.DataFrame(
            onehot_list_encoder.fit_transform(df[cl]),
            columns=onehot_list_encoder.classes_,
        ).add_prefix(cl + "_")
        df = df.join(newd)
    return df


df = pd.DataFrame(np.array([['a', 'a', 'b', 'b','a','a'], [1, 2, 2, 2,1,1],
['Apple', 'Apple', 'Banana', None,'Orange','Pear'],[1,2,1,3,4,5],
[0, 0, 1, 0,1,0]]).T,
columns=['id1', 'visit', 'fruits','cloths','Buyer'])
df['Buyer'] = df['Buyer'].astype('int')
How can I create a simple ML model that applies this preprocessing to the data at both fit and predict time? On the test data I want the same transformation (i.e. 0 for all columns not present in the test rows). Can a pipeline solve this? I am not very experienced with writing pipelines and I am getting errors.
droplist=['id1', 'visit', 'fruits','cloths']
pipe=Pipeline(steps=[
("preprocess",preprocess(df)),
("coltrans",ColumnTransformer([("drop",'drop',droplist)])),
("model",GradientBoostingClassifier(n_estimators=100, learning_rate=0.1)),
])
Can someone help?
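One way to get the same encoding at fit and predict time is to move the list encoding into a transformer object that learns its vocabulary in fit and reuses it in transform. The sketch below is only an illustration under assumptions: ListOneHotEncoder is a hypothetical helper name, the toy frames are made up, and the per-visit aggregation (the groupby step in preprocess) is assumed to have been applied already, since that step needs the Buyer column. Pipeline steps have to be estimator objects, so passing the result of preprocess(df) as a step cannot work.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer


class ListOneHotEncoder(BaseEstimator, TransformerMixin):
    """Hypothetical helper: one-hot encode columns whose cells are lists."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        # Learn one vocabulary per list column, from the training data only.
        self.encoders_ = {c: MultiLabelBinarizer().fit(X[c]) for c in self.columns}
        return self

    def transform(self, X):
        parts = []
        for c in self.columns:
            enc = self.encoders_[c]
            # Labels unseen during fit are ignored (scikit-learn emits a warning),
            # so the output always has the training columns.
            parts.append(pd.DataFrame(
                enc.transform(X[c]),
                columns=[f"{c}_{k}" for k in enc.classes_],
                index=X.index,
            ))
        return pd.concat(parts, axis=1)


# Made-up, already aggregated data: one row per (id1, visit).
train = pd.DataFrame({
    "fruits": [["Apple"], ["Banana"], ["Apple", "Orange"], []],
    "cloths": [["1", "2"], ["1"], ["3"], ["2"]],
})
y_train = [0, 1, 1, 0]

pipe = Pipeline([
    ("onehot", ListOneHotEncoder(["fruits", "cloths"])),
    ("model", GradientBoostingClassifier(n_estimators=10, random_state=0)),
])
pipe.fit(train, y_train)

# "Pear" was never seen in training, so every fruits_* column is 0 for this row.
test = pd.DataFrame({"fruits": [["Pear"]], "cloths": [["1"]]})
encoded_test = pipe.named_steps["onehot"].transform(test)
pred = pipe.predict(test)
```

Because the vocabulary lives on the fitted transformer, the test frame is encoded with the training columns even when a category is missing or new.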

Related

Dask dataframe parallel task

I want to create features(additional columns) from a dataframe and I have the following structure for many functions.
Following this documentation https://docs.dask.org/en/stable/delayed-best-practices.html I have come up with the code below.
However, I get the error concurrent.futures._base.CancelledError, and I often get the warning: distributed.utils_perf - WARNING - full garbage collections took 10% CPU time recently (threshold: 10%)
I understand that the object I am appending to delay is very large (it works fine when I use the commented-out df), which is why the program crashes, but is there a better way of doing it?
import pandas as pd
from dask.distributed import Client, LocalCluster
import dask.dataframe as dd
import numpy as np
import dask
def main():
    #df = pd.DataFrame({"col1": np.random.randint(1, 100, 100000), "col2": np.random.randint(101, 200, 100000), "col3": np.random.uniform(0, 4, 100000)})
    df = pd.DataFrame({"col1": np.random.randint(1, 100, 100000000), "col2": np.random.randint(101, 200, 100000000), "col3": np.random.uniform(0, 4, 100000000)})
    ddf = dd.from_pandas(df, npartitions=100)
    ddf = ddf.set_index("col1")
    delay = []
    def create_col_sth():
        group = ddf.groupby("col1")["col3"]
        @dask.delayed
        def small_fun(lag):
            return f"col_{lag}", group.transform(lambda x: x.shift(lag), meta=('x', 'float64')).apply(lambda x: np.log(x), meta=('x', 'float64'))
        for lag in range(5):
            x = small_fun(lag)
            delay.append(x)
    create_col_sth()
    delayed = dask.compute(*delay)
    for data in delayed:
        ddf[data[0]] = data[1]
    ddf.to_parquet("test", engine="fastparquet")
if __name__ == "__main__":
    cluster = LocalCluster(n_workers=6,
                           threads_per_worker=2,
                           memory_limit='8GB')
    client = Client(cluster)
    main()
Not sure if this will resolve all of your issues, but generally you don't need to (and shouldn't) mix delayed and dask.dataframe operations like this. Additionally, you shouldn't pass large data objects into delayed functions through closures, as with group in your example. Instead, include them as explicit arguments or, in this case, don't use delayed at all and use native dask.dataframe operations, or in-memory operations with dask.dataframe.map_partitions.
Implementing these, I would rewrite your main function as follows:
df = pd.DataFrame({
"col1": np.random.randint(1, 100, 100000000),
"col2": np.random.randint(101, 200, 100000000),
"col3": np.random.uniform(0, 4, 100000000),
})
ddf = dd.from_pandas(df, npartitions=100)
ddf = ddf.set_index("col1")
group = ddf.groupby("col1")["col3"]
# directly assign the dataframe operations as columns
for lag in range(5):
    ddf[f"col_{lag}"] = (
        group
        .transform(lambda x: x.shift(lag), meta=('x', 'float64'))
        .apply(lambda x: np.log(x), meta=('x', 'float64'))
    )
# this triggers the operation implicitly - no need to call compute
ddf.to_parquet("test", engine="fastparquet")
After long periods of frustration with Dask, I think I have cracked the holy grail of refactoring pandas transformations wrapped with Dask.
Learning points:
Index intelligently. If you are grouping by or merging on certain columns, consider setting the index to those columns.
Partition and repartition intelligently. A dataframe of 10k rows and another of 1m rows should naturally have different numbers of partitions.
Don't use Dask dataframe transformation methods, with a few exceptions such as merge. The others should be written in pandas code wrapped in map_partitions.
Don't accumulate overly large task graphs; consider saving to disk after, for example, indexing or a complex transformation.
If possible, filter the dataframe and work with a smaller subset; you can always merge it back into the bigger dataset.
If you are working on your local machine, set the memory limits within the boundaries of the system specifications. This point is very important. In the example below I create one million rows with three 8-byte columns (two int64 and one float64), which is 24 bytes per row and 24 million bytes in total.
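The size estimate in the last point can be checked with plain pandas before any Dask cluster is involved. This is a small sketch under the assumption that the columns are fixed-width numeric types; the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

size = 1_000_000
df = pd.DataFrame({
    "col1": np.random.randint(1, 10000, size, dtype=np.int64),
    "col2": np.random.randint(101, 20000, size, dtype=np.int64),
    "col3": np.random.uniform(0, 100, size),
})

# Back-of-the-envelope estimate: rows * sum of the per-value widths.
bytes_per_row = sum(dtype.itemsize for dtype in df.dtypes)
estimated_bytes = size * bytes_per_row

# What pandas reports for the column data (index excluded).
measured_bytes = int(df.memory_usage(index=False).sum())

print(bytes_per_row, estimated_bytes, measured_bytes)
```

The figure covers only the raw column data. Intermediate copies made during groupby or merge come on top of it, so it is best read as a lower bound when choosing memory_limit.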
import pandas as pd
from dask.distributed import Client, LocalCluster
import dask.dataframe as dd
import numpy as np
import dask
import gc
# https://stackoverflow.com/questions/52642966/repartition-dask-dataframe-to-get-even-partitions
def _rebalance_ddf(ddf):
    """Repartition dask dataframe to ensure that partitions are roughly equal size.
    Assumes `ddf.index` is already sorted.
    """
    if not ddf.known_divisions:  # e.g. for read_parquet(..., infer_divisions=False)
        ddf = ddf.reset_index().set_index(ddf.index.name, sorted=True)
    index_counts = ddf.map_partitions(lambda _df: _df.index.value_counts().sort_index()).compute()
    index = np.repeat(index_counts.index, index_counts.values)
    divisions, _ = dd.io.io.sorted_division_locations(index, npartitions=ddf.npartitions)
    return ddf.repartition(divisions=divisions)
def main(client):
    size = 1000000
    df = pd.DataFrame({"col1": np.random.randint(1, 10000, size), "col2": np.random.randint(101, 20000, size), "col3": np.random.uniform(0, 100, size)})
    # Select appropriate partitions
    ddf = dd.from_pandas(df, npartitions=500)
    del df
    gc.collect()
    # This is correct if you want to group by a certain column it is always best if that column is an indexed one
    ddf = ddf.set_index("col1")
    ddf = _rebalance_ddf(ddf)
    print(ddf.memory_usage_per_partition(index=True, deep=False).compute())
    print(ddf.memory_usage(deep=True).sum().compute())
    # Always persist your data to prevent big task graphs actually if you omit this step processing will fail
    ddf.to_parquet("test", engine="fastparquet")
    ddf = dd.read_parquet("test")
    # Dummy code to create a dataframe to be merged based on col1
    ddf2 = ddf[["col2", "col3"]]
    ddf2["col2/col3"] = ddf["col2"] / ddf["col3"]
    ddf2 = ddf2.drop(columns=["col2", "col3"])
    # Repartition the data
    ddf2 = _rebalance_ddf(ddf2)
    print(ddf2.memory_usage_per_partition(index=True, deep=False).compute())
    print(ddf2.memory_usage(deep=True).sum().compute())
    def mapped_fun(data):
        for lag in range(5):
            data[f"col_{lag}"] = data.groupby("col1")["col3"].transform(lambda x: x.shift(lag)).apply(lambda x: np.log(x))
        return data
    # Process the group by transformation in pandas but wrapped with Dask if you use the Dask functions to do this you will
    # have a variety of issues.
    ddf = ddf.map_partitions(mapped_fun)
    # Additional... you can merge ddf with ddf2 but on an indexed column otherwise you run into a variety of issues
    ddf = ddf.merge(ddf2, on=['col1'], how="left")
    ddf.to_parquet("final", engine="fastparquet")
if __name__ == "__main__":
    cluster = LocalCluster(n_workers=6,
                           threads_per_worker=2,
                           memory_limit='8GB')
    client = Client(cluster)
    main(client)

How to do a single value prediction in NLP

My dataset consists of restaurant reviews with two columns, review and liked.
The liked column shows, based on the review, whether the customer liked the restaurant or not.
As a first step I cleaned the text; as a second step I built a bag-of-words model as below.
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 1500)
X = cv.fit_transform(corpus).toarray()
y = dataset.iloc[:, 1].values
This gave X with 1000 rows and 1500 columns of 0s and 1s, according to my dataset.
I predicted as below:
y_pred = classifier.predict(X_test)
So now I have a single review, "Food was good". How do I predict whether they liked it or not? It is a single value to predict.
Please help me out, and let me know if additional information is required.
Thanks
All you need is to apply cv.transform (not fit_transform) to the new text first, like so:
>>> test = ['Food was good']
>>> test_vec = cv.transform(test)
>>> classifier.predict(test_vec)
# returns predicted class
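To see why transform is the right call here, the following self-contained sketch uses a made-up four-review corpus and MultinomialNB as a stand-in classifier (both are assumptions, since the question does not show the real data or model). The point is that the single review is encoded with the vocabulary learned during training, so it has the same columns as X.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up stand-in for the cleaned review corpus and the "liked" column.
corpus = ["food was great", "loved the good service",
          "terrible food", "bad and slow service"]
liked = [1, 1, 0, 0]

cv = CountVectorizer(max_features=1500)
X = cv.fit_transform(corpus)              # learns the vocabulary
classifier = MultinomialNB().fit(X, liked)

# transform (not fit_transform) reuses the training vocabulary,
# so the new review gets exactly the same columns as X.
test_vec = cv.transform(["Food was good"])
prediction = classifier.predict(test_vec)

print(X.shape, test_vec.shape, prediction)
```

Calling fit_transform on the new sentence instead would build a fresh three-word vocabulary, and the classifier would reject the resulting vector because its width no longer matches the training data.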
For training and testing, here is a simple example.
Training:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
text = ["This is good place","Hyatt is awesome hotel"]
count_vect = CountVectorizer()
tfidf_transformer = TfidfTransformer()
X_train_counts = count_vect.fit_transform(text)
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
# get_feature_names_out() was called get_feature_names() before scikit-learn 1.0
pd.DataFrame(X_train_tfidf.todense(), columns = count_vect.get_feature_names_out())
# Now apply any classifier you want on top of this dataset
Now testing.
Note: use the same transformations that were fitted during training:
new = ["I like the ambiance of this hotel "]
pd.DataFrame(tfidf_transformer.transform(count_vect.transform(new)).todense(),
             columns = count_vect.get_feature_names_out())
Apply model.predict on top of this.
You can also use an sklearn Pipeline:
from sklearn.pipeline import Pipeline
model_pipeline = Pipeline([('vect', CountVectorizer()),('tfidf', TfidfTransformer()), ('model', classifier())]) # replace classifier() with the model you want to use
model_pipeline.fit(x, y) # here x is your text data, and y is your target
model_pipeline.predict(['Food was good']) # predict your new sentence

How to display categorical values on export tree image of decision tree classifier?

I am trying to export the decision tree as an image with the original labels of all categorical fields.
The current data I have is like so:
I transformed the categorical features into numerical:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('data.csv')
X = dataset.iloc[:, 0:4]
y = dataset.iloc[:, 4]
from sklearn.preprocessing import LabelEncoder
lb = LabelEncoder()
X['Outlook'] = lb.fit_transform(X['Outlook'])
X['Temp'] = lb.fit_transform(X['Temp'])
X['Humidity'] = lb.fit_transform(X['Humidity'])
X['Windy'] = lb.fit_transform(X['Windy'])
y = lb.fit_transform(y)
Afterwards, I applied the DecisionTreeClassifier:
from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier(criterion="entropy")
dtc.fit(X, y)
At the end, I needed to check the tree generated from the model using the following:
# Import tools needed for visualization
from sklearn.tree import export_graphviz
import pydot
# Pull out one tree from the forest
# Export the image to a dot file
export_graphviz(dtc, out_file = 'tree.dot', feature_names = X.columns, rounded = True, precision = 1)
# Use dot file to create a graph
(graph, ) = pydot.graph_from_dot_file('tree.dot')
# Write graph to a png file
graph.write_png('tree.png')
The tree.png:
But what I really need is to see the original labels of each feature inside the nodes or on each branch, instead of true/false or a numeric representation.
I tried the following:
y=lb.inverse_transform(y)
And the same for X features, but the tree is being generated the same as above.
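One possible direction, sketched here with a swapped-in technique rather than the approach from the question: one-hot encode the categorical columns so that each category becomes part of a column name, and fit on the string target so the leaves carry the original class names. The rows below are hypothetical toy data, and export_text is used instead of export_graphviz only so the result can be inspected without Graphviz.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy rows in the spirit of the question's data.
data = pd.DataFrame({
    "Outlook": ["sunny", "sunny", "overcast", "rainy",
                "rainy", "overcast", "sunny", "rainy"],
    "Windy":   ["no", "yes", "no", "no", "yes", "yes", "no", "yes"],
    "Play":    ["no", "no", "yes", "no", "no", "yes", "no", "no"],
})

# One-hot encoding keeps the category inside the column name,
# e.g. "Outlook_overcast".
X = pd.get_dummies(data[["Outlook", "Windy"]], dtype=int)
y = data["Play"]  # string labels are kept, so leaves show "yes" / "no"

dtc = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
report = export_text(dtc, feature_names=list(X.columns))
print(report)
```

A split such as "Outlook_overcast <= 0.50" then reads as "Outlook is not overcast". The same feature_names, together with class_names=list(dtc.classes_), can be passed to export_graphviz to get the labels into the image.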

How do you make a KMeans prediction more accurate?

I'm learning about clustering and KMeans and such, so my knowledge of the topic is very basic. What I have below is a bit of a self-study on how it works. Basically, if 'a' shows up in any of the columns, 'Binary' will equal 1. Essentially I am trying to teach it a pattern. I learned the following from a tutorial using the Titanic dataset, but I've adapted it to my own data.
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
import seaborn as sns
import matplotlib.pyplot as plt
My constructed data:
dataset = [
[0,'x','f','g'],[1,'a','c','b'],[1,'d','k','a'],[0,'y','v','w'],
[0,'q','w','e'],[1,'c','a','l'],[0,'t','x','j'],[1,'w','o','a'],
[0,'z','m','n'],[1,'z','x','a'],[0,'f','g','h'],[1,'h','a','c'],
[1,'a','r','e'],[0,'g','c','c']
]
df = pd.DataFrame(dataset, columns=['Binary','Col1','Col2','Col3'])
df.head()
df:
Binary Col1 Col2 Col3
------------------------
1 a b c
0 x t v
0 s q w
1 n m a
1 u a r
Encode non binary to binary:
labelEncoder = LabelEncoder()
labelEncoder.fit(df['Col1'])
df['Col1'] = labelEncoder.transform(df['Col1'])
labelEncoder.fit(df['Col2'])
df['Col2'] = labelEncoder.transform(df['Col2'])
labelEncoder.fit(df['Col3'])
df['Col3'] = labelEncoder.transform(df['Col3'])
Set clusters to two, because its either 1 or 0?
X = np.array(df.drop(['Binary'], axis=1).astype(float))
y = np.array(df['Binary'])
kmeans = KMeans(n_clusters=2)
kmeans.fit(X)
Test it:
correct = 0
for i in range(len(X)):
    predict_me = np.array(X[i].astype(float))
    predict_me = predict_me.reshape(-1, len(predict_me))
    prediction = kmeans.predict(predict_me)
    if prediction[0] == y[i]:
        correct += 1
The result:
print(f'{round(correct/len(X) * 100)}% Accuracy')
>>> 71%
How can I get it more accurate to the point where it 99.99% knows that 'a' means binary column is 1? More data?
K-means does not even try to predict this value, because it is an unsupervised method. It is not a prediction algorithm; it performs structure discovery. Don't mistake clustering for classification.
The cluster numbers have no meaning. They are 0 and 1 only because these are the first two integers. K-means is also randomized: run it a few times and you will sometimes score just 29% (the same clusters with the ids swapped).
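The arbitrariness of the cluster ids can be made concrete with a small sketch. The points below are made up, and the comparison against y is only there to illustrate the argument; it is not something k-means itself uses.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up data: two well-separated groups of 20 points each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])
y = np.array([0] * 20 + [1] * 20)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

hits = int((labels == y).sum())                 # cluster ids taken at face value
hits_swapped = int(((1 - labels) == y).sum())   # same clusters, ids swapped
print(hits, hits_swapped)
```

The two counts always add up to the number of rows, so an "accuracy" computed this way is either a or 1 - a depending on which cluster happened to be called 0.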
Also, k-means is designed for continuous input. You can apply it on binary encoded data, but the results will be pretty poor.
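For comparison, here is a minimal supervised sketch on the data from the question. DecisionTreeClassifier and one-hot encoding are choices made for this illustration, not something the answer prescribes, and the new_row values are hypothetical. The score is measured on the training rows only, so it says nothing about generalisation from 14 samples.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

dataset = [
    [0, 'x', 'f', 'g'], [1, 'a', 'c', 'b'], [1, 'd', 'k', 'a'], [0, 'y', 'v', 'w'],
    [0, 'q', 'w', 'e'], [1, 'c', 'a', 'l'], [0, 't', 'x', 'j'], [1, 'w', 'o', 'a'],
    [0, 'z', 'm', 'n'], [1, 'z', 'x', 'a'], [0, 'f', 'g', 'h'], [1, 'h', 'a', 'c'],
    [1, 'a', 'r', 'e'], [0, 'g', 'c', 'c'],
]
df = pd.DataFrame(dataset, columns=['Binary', 'Col1', 'Col2', 'Col3'])

# One-hot encode so that "Col1 is 'a'" becomes its own 0/1 feature,
# instead of an arbitrary integer code per letter.
X = pd.get_dummies(df[['Col1', 'Col2', 'Col3']], dtype=int)
y = df['Binary']

clf = DecisionTreeClassifier(random_state=0).fit(X, y)
train_accuracy = clf.score(X, y)

# A new row must be encoded with the training columns; unseen letters become all zeros.
new_row = pd.DataFrame([['p', 'a', 'q']], columns=['Col1', 'Col2', 'Col3'])
new_X = pd.get_dummies(new_row, dtype=int).reindex(columns=X.columns, fill_value=0)
new_prediction = clf.predict(new_X)

print(train_accuracy, new_prediction)
```

Unlike the k-means run, the classifier is given the Binary column during fit, which is what makes its output comparable to the labels at all.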

Can I create a dask array with a delayed shape

Is it possible to create a dask array from a delayed value by specifying its shape with another delayed value?
My algorithm won't give me the shape of the array until pretty late in the computation.
Eventually, I will be creating some blocks with shapes specified by the intermediate results of my computation, and then calling da.concatenate on all the results (or da.block, if it were more flexible).
I don't think it is too detrimental if I can't, but it would be cool if I could.
Sample code
from dask import delayed
from dask import array as da
import numpy as np
n_shape = (3, 3)
shape = delayed(n_shape, nout=2)
d_shape = (delayed(n_shape[0]), delayed(n_shape[1]))
n = delayed(np.zeros)(n_shape, dtype=float)
# this doesn't work
# da.from_delayed(n, shape=shape, dtype=float)
# this doesn't work either, but I think goes a little deeper
# into the function call
da.from_delayed(n, shape=d_shape, dtype=float)
You cannot provide a delayed shape, but you can state that the shape is unknown by using np.nan for any dimension you don't know.
Example
import random
import numpy as np
import dask
import dask.array as da
@dask.delayed
def f():
    return np.ones((5, random.randint(10, 20))) # a 5 x ? array
values = [f() for _ in range(5)]
arrays = [da.from_delayed(v, shape=(5, np.nan), dtype=float) for v in values]
x = da.concatenate(arrays, axis=1)
>>> x
dask.array<concatenate, shape=(5, nan), dtype=float64, chunksize=(5, nan)>
>>> x.shape
(5, np.nan)
>>> x.compute().shape
(5, 88)
Docs
See http://dask.pydata.org/en/latest/array-chunks.html#unknown-chunks

Resources