I am trying to test the dask commands found on this page...
https://extrapolations.dev/blog/2015/07/reproduceit-reddit-word-count-dask/
I got an error at this line:
words = bodies.map(nltk.word_tokenize).concat()
I guess the dask API has changed since the article was published. How do I rewrite it using this file...
aws s3 cp s3://reddit-comments/2007/RC_2007-10 .
I have managed to run this code so far:
import re
import json
import time
import nltk
import dask
import dask.bag as db
from nltk.corpus import stopwords

# Load the month of Reddit comments as a bag of JSON records
data = db.read_text("RC_2007-10").map(json.loads)

no_stopwords = lambda x: x not in stopwords.words('english')
is_word = lambda x: re.search("^[0-9a-zA-Z]+$", x) is not None

subreddit = data.filter(lambda x: x['subreddit'] == 'movies')
bodies = subreddit.pluck('body')
I think that you're looking for the flatten method:
In [1]: import dask.bag as db
In [2]: b = db.from_sequence([[1, 2, 3], [4, 5, 6]])
In [3]: b.flatten().compute()
Out[3]: [1, 2, 3, 4, 5, 6]
https://docs.dask.org/en/latest/bag-api.html
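Applied to your code, flatten replaces the old concat call. The lines below are a sketch of one way to finish the word count using the helpers you already defined, not a verbatim copy of the article:

# nltk.download('punkt') and nltk.download('stopwords') may be needed first
words = bodies.map(nltk.word_tokenize).flatten()
filtered = words.map(lambda w: w.lower()).filter(no_stopwords).filter(is_word)
# frequencies() yields (word, count) pairs; topk picks the most common ones
top_words = filtered.frequencies().topk(50, key=lambda kv: kv[1])
print(top_words.compute())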
Related
I have a problem where I have to predict a buyer using machine learning (I created a dummy dataset). I need to transform the data before I can use it for machine learning: I aggregate information at the id, visit level, which gives me a list of the fruits and cloths bought. This list needs to be one-hot encoded before applying a classifier model.
import pandas as pd
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
def preprocess(df):
    # Only keep rows till buyer=1
    df = df.groupby(["id1", "visit"], group_keys=False).apply(
        lambda g: g.loc[: g["Buyer"].idxmax()]
    )
    # Form lists on each id1,visit level
    df1 = df.groupby(["id1", "visit"], as_index=False).agg(
        is_Pax=("Buyer", "max"),
        fruits=("fruits", lambda x: x.dropna().unique().tolist()),
        cloths=("cloths", lambda x: x.dropna().unique().tolist()),
    )
    col = ["fruits", "cloths"]
    df_transformed = onehot(df1, col)
    return df_transformed
def onehot(df, col):
    """
    This function does one hot encoding of a list column.
    """
    onehot_list_encoder = MultiLabelBinarizer()
    for cl in col:
        print("One hot encoding ", cl)
        newd = pd.DataFrame(
            onehot_list_encoder.fit_transform(df[cl]),
            columns=onehot_list_encoder.classes_,
        ).add_prefix(cl + "_")
        df = df.join(newd)
    return df
df = pd.DataFrame(
    np.array([['a', 'a', 'b', 'b', 'a', 'a'],
              [1, 2, 2, 2, 1, 1],
              ['Apple', 'Apple', 'Banana', None, 'Orange', 'Pear'],
              [1, 2, 1, 3, 4, 5],
              [0, 0, 1, 0, 1, 0]]).T,
    columns=['id1', 'visit', 'fruits', 'cloths', 'Buyer'])
df['Buyer'] = df['Buyer'].astype('int')
How can I create a simple ML model that applies this preprocessing to the data (for both fit and predict)? On test data I want the same transformation, i.e. 0 for every encoded column not present in the test rows. Can a Pipeline solve this? I am not very good at writing pipelines and keep getting errors; my attempt is below.
droplist = ['id1', 'visit', 'fruits', 'cloths']
pipe = Pipeline(steps=[
    ("preprocess", preprocess(df)),
    ("coltrans", ColumnTransformer([("drop", 'drop', droplist)])),
    ("model", GradientBoostingClassifier(n_estimators=100, learning_rate=0.1)),
])
Can someone help?
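One possible direction is to wrap the list encoding in a custom transformer so that the classes learned at fit time are reused at predict time. Here is a minimal sketch: the ListOneHotEncoder name and the toy data are made up for illustration, and the row-level aggregation in preprocess changes the number of rows, so it stays outside the pipeline here.

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import GradientBoostingClassifier

class ListOneHotEncoder(BaseEstimator, TransformerMixin):
    """One-hot encode list-valued columns. Classes learned during fit are
    reused during transform, so unseen test rows get 0 for every known class
    (MultiLabelBinarizer warns about and ignores unknown labels)."""
    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        self.encoders_ = {c: MultiLabelBinarizer().fit(X[c]) for c in self.columns}
        return self

    def transform(self, X):
        parts = [X.drop(columns=self.columns).reset_index(drop=True)]
        for c, enc in self.encoders_.items():
            encoded = pd.DataFrame(enc.transform(X[c]),
                                   columns=enc.classes_).add_prefix(c + "_")
            parts.append(encoded)
        return pd.concat(parts, axis=1)

# Toy usage: X holds the already-aggregated list columns, y the Buyer flag.
X = pd.DataFrame({"fruits": [["Apple"], ["Banana", "Orange"], ["Pear"], ["Apple", "Pear"]],
                  "cloths": [[1, 2], [3], [4], [5]]})
y = [0, 1, 0, 1]
pipe = Pipeline(steps=[
    ("onehot", ListOneHotEncoder(columns=["fruits", "cloths"])),
    ("model", GradientBoostingClassifier(n_estimators=10)),
])
pipe.fit(X, y)
print(pipe.predict(X))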
I'm working with datashader and dask but I'm having problems when trying to plot with a cluster running. To make it more concrete, I have the following example (embedded in a bokeh plot):
import holoviews as hv
import pandas as pd
import dask.dataframe as dd
import numpy as np
from holoviews.operation.datashader import datashade
import datashader.transfer_functions as tf
from dask.distributed import Client, LocalCluster

# initialize the client/cluster
cluster = LocalCluster(n_workers=4, threads_per_worker=1)
dask_client = Client(cluster)
def datashade_plot():
    hv.extension('bokeh')
    # create some random data (in the actual code this is a parquet file
    # with millions of rows, this is just an example)
    delta = 1/1000
    x = np.arange(0, 1, delta)
    y = np.cumsum(np.sqrt(delta)*np.random.normal(size=len(x)))
    df = pd.DataFrame({'X': x, 'Y': y})
    # create dask dataframe
    points_dd = dd.from_pandas(df, npartitions=3)
    # create plot
    points = hv.Curve(points_dd)
    return datashade(points)
dask_client.submit(datashade_plot,).result()
This raises a:
TypeError: can't pickle weakref objects
My theory is that this happens because the datashade operations can't be distributed across the cluster. Sorry if this is a noob question; I'd be very grateful for any advice you could give me.
I think you want to go the other way. That is, pass datashader a dask dataframe instead of a pandas dataframe:
>>> from dask import dataframe as dd
>>> import multiprocessing as mp
>>> dask_df = dd.from_pandas(df, npartitions=mp.cpu_count())
>>> dask_df = dask_df.persist()  # persist returns a new, distributed-backed collection
...
>>> cvs = datashader.Canvas(...)
>>> agg = cvs.points(dask_df, ...)
XREF: https://datashader.org/user_guide/Performance.html
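Applied to the example in the question, that means building the dask dataframe (or reading the parquet file with dask) and handing it to datashade directly in the local process, instead of submitting the whole plotting function to the cluster. A minimal sketch under that assumption, reusing the question's toy data:

import numpy as np
import pandas as pd
import dask.dataframe as dd
import holoviews as hv
from holoviews.operation.datashader import datashade
from dask.distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=4, threads_per_worker=1)
client = Client(cluster)
hv.extension('bokeh')

# Toy data standing in for the real parquet file
delta = 1 / 1000
x = np.arange(0, 1, delta)
y = np.cumsum(np.sqrt(delta) * np.random.normal(size=len(x)))
points_dd = dd.from_pandas(pd.DataFrame({'X': x, 'Y': y}), npartitions=3)
points_dd = points_dd.persist()  # keep the partitions in cluster memory

# Build the shaded plot locally; datashader/holoviews read from the dask
# dataframe themselves, so nothing needs to be pickled and submitted.
shaded = datashade(hv.Curve(points_dd))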
I am trying to implement a rolling average which resets whenever a '1' is encountered in a column labeled 'A'.
For example, the following functionality works in Pandas.
import pandas as pd
x = pd.DataFrame([[0,2,3], [0,5,6], [0,8,9], [1,8,9],[0,8,9],[0,8,9], [0,3,5], [1,8,9],[0,8,9],[0,8,9], [0,3,5]])
x.columns = ['A', 'B', 'C']
x['avg'] = x.groupby(x['A'].cumsum())['B'].rolling(2).mean().values
If I try the analogous code in Dask, I get the following:
import pandas as pd
import dask.dataframe
x = pd.DataFrame([[0,2,3], [0,5,6], [0,8,9], [1,8,9],[0,8,9],[0,8,9], [0,3,5], [1,8,9],[0,8,9],[0,8,9], [0,3,5]])
x.columns = ['A', 'B', 'C']
x = dask.dataframe.from_pandas(x, npartitions=3)
x['avg'] = x.groupby(x['A'].cumsum())['B'].rolling(2).mean().values
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-189-b6cd808da8b1> in <module>()
7 x = dask.dataframe.from_pandas(x, npartitions=3)
8
----> 9 x['avg'] = x.groupby(x['A'].cumsum())['B'].rolling(2).mean().values
10 x
AttributeError: 'SeriesGroupBy' object has no attribute 'rolling'
After searching through the Dask API documentation I have not been able to find an implementation of what I am looking for.
Can anyone suggest an implementation of this algorithm in a Dask compatible way?
Thank you :)
Since then I found the following code snippet:
df1 = ddf.groupby('cumsum')['x'].apply(lambda x: x.rolling(2).mean(), meta=('x', 'f8')).compute()
at Dask rolling function by group syntax.
Here is an adapted toy example:
import pandas as pd
import dask.dataframe as dd
x = pd.DataFrame([[1,2,3], [2,3,4], [4,5,6], [2,3,4], [4,5,6], [4,5,6], [2,3,4]])
x['bool'] = [0,0,0,1,0,1,0]
x.columns = ['a', 'b', 'x', 'bool']
ddf = dd.from_pandas(x, npartitions=4)
ddf['cumsum'] = ddf['bool'].cumsum()
df1 = ddf.groupby('cumsum')['x'].apply(lambda x: x.rolling(2).mean(), meta=('x', 'f8')).compute()
df1
This has the correct functionality, but the order of the indices is now incorrect. Alternatively, if one knows how to preserve the order of the index, that would be a suitable solution.
You might want to construct your own rolling operation using map_overlap or the _cum_agg method (_cum_agg is unfortunately not well documented).
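As an illustration of map_overlap, here is a minimal sketch of a plain rolling mean with window 2, so each partition needs one row of context from the previous partition. It preserves the original index order, but note that it does not by itself implement the reset at each 1 in column A; that still needs the groupby-on-cumsum trick above.

import pandas as pd
import dask.dataframe as dd

x = pd.DataFrame([[0, 2, 3], [0, 5, 6], [0, 8, 9], [1, 8, 9], [0, 8, 9],
                  [0, 8, 9], [0, 3, 5], [1, 8, 9], [0, 8, 9], [0, 8, 9], [0, 3, 5]],
                 columns=['A', 'B', 'C'])
ddf = dd.from_pandas(x, npartitions=3)

# Each partition sees 1 extra row from the previous partition (before=1),
# so the window of 2 is complete at partition boundaries.
rolled = ddf['B'].map_overlap(lambda s: s.rolling(2).mean(),
                              before=1, after=0, meta=('B', 'f8'))
print(rolled.compute())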
I'm trying to tune my voting classifier and wanted to use randomized search in sklearn. How can I set up the parameter lists for my voting classifier, given that I currently use two algorithms (different tree-based algorithms)?
Do I have to separately run randomized search and combine them together in voting classifier later?
Could someone help? Code examples would be highly appreciated :)
Thanks!
You can perfectly well combine the VotingClassifier with RandomizedSearchCV; there is no need to run them separately. See the documentation: http://scikit-learn.org/stable/modules/ensemble.html#using-the-votingclassifier-with-gridsearch
The trick is to prefix your params list with your estimator name. For example, if you have created a RandomForest estimator and you created it as ('rf',clf2) then you can set up its parameters in the form <name__param>. Specific example: rf__n_estimators: [20,200], so you refer to a specific estimator and set values to test for a specific param.
Here is a ready-to-run code example ;)
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RandomizedSearchCV  # sklearn.grid_search was removed in 0.20

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
y = np.array([1, 1, 1, 2, 2, 2])

clf1 = DecisionTreeClassifier()
clf2 = RandomForestClassifier(random_state=1)

# Prefix each parameter with the estimator name used in VotingClassifier
params = {'dt__max_depth': [5, 10], 'rf__n_estimators': [20, 200]}

eclf = VotingClassifier(estimators=[('dt', clf1), ('rf', clf2)], voting='hard')
# cv=3 so the tiny toy dataset (3 samples per class) can be stratified
random_search = RandomizedSearchCV(eclf, param_distributions=params, n_iter=4, cv=3)
random_search.fit(X, y)
print(random_search.cv_results_)  # grid_scores_ no longer exists; cv_results_ replaces it
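After fitting you can also inspect the winning combination via the standard search attributes:

print(random_search.best_params_)  # e.g. {'rf__n_estimators': 200, 'dt__max_depth': 5} (illustrative)
print(random_search.best_score_)
best_model = random_search.best_estimator_  # the refitted VotingClassifier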
I am trying to use dask instead of pandas since I have a 2.6 GB csv file.
I load it and I want to drop a column, but it seems that neither the drop method
df.drop('column') nor slicing df[:, :-1]
is implemented yet. Is this the case, or am I just missing something?
We implemented the drop method in this PR. This is available as of dask 0.7.0.
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'x': [1, 2, 3], 'y': [3, 2, 1]})
In [3]: import dask.dataframe as dd
In [4]: ddf = dd.from_pandas(df, npartitions=2)
In [5]: ddf.drop('y', axis=1).compute()
Out[5]:
x
0 1
1 2
2 3
Before drop was implemented, you could also slice with column names, though of course this is less attractive if you have many columns to keep.
In [6]: ddf[['x']].compute()
Out[6]:
x
0 1
1 2
2 3
This should work, where columns is a list of the column names you want to drop:
print(ddf.shape)
ddf = ddf.drop(columns, axis=1)  # e.g. columns = ['y']
print(ddf.shape)