Python Dask Apply Function and Store Result in Same Column - dask

Hello, I am a bit new to Dask and I am trying to do the following.
I have a CSV file, and reading it works fine:
import pandas
import os
import json
import math
import numpy as np
import dask
from dask.distributed import Client
import dask.dataframe as dd  # aliased as dd so the DataFrame below does not shadow the module
import dask.multiprocessing
client = Client(n_workers=3, threads_per_worker=4, processes=False, memory_limit='2GB')
df = dd.read_csv("netflix_titles.csv")
Now I have a function:
def toupper(x):
    return x.upper()
I would like to apply this to a column and save the result in the same column, but it seems I cannot do that.
df["title"].map(toupper).compute()
The line above works, but what I actually want is the following, which fails with the error shown below it:
df["title"] = df["title"].map(toupper).compute()
ValueError: Not all divisions are known, can't align partitions. Please use set_index to set the index.

Maybe try this after read_csv. Assign the mapped column without calling compute(), so that it stays a lazy Dask series:
df.title = df.title.map(toupper)
df.to_csv("netflix_titles.csv", index=False, single_file=True)
to_csv has an optional compute argument that defaults to True, so you don't need to call compute() explicitly.

Related

Troubles using dask distributed with datashader: 'can't pickle weakref objects'

I'm working with datashader and dask but I'm having problems when trying to plot with a cluster running. To make it more concrete, I have the following example (embedded in a bokeh plot):
import holoviews as hv
import pandas as pd
import dask.dataframe as dd
import numpy as np
from dask.distributed import Client, LocalCluster
from holoviews.operation.datashader import datashade
import datashader.transfer_functions as tf
# initialize the client/cluster
cluster = LocalCluster(n_workers=4, threads_per_worker=1)
dask_client = Client(cluster)
def datashade_plot():
    hv.extension('bokeh')
    # create some random data (in the actual code this is a parquet file with millions of rows, this is just an example)
    delta = 1/1000
    x = np.arange(0, 1, delta)
    y = np.cumsum(np.sqrt(delta)*np.random.normal(size=len(x)))
    df = pd.DataFrame({'X':x, 'Y':y})
    # create dask dataframe
    points_dd = dd.from_pandas(df, npartitions=3)
    # create plot
    points = hv.Curve(points_dd)
    return datashade(points)
dask_client.submit(datashade_plot,).result()
This raises a:
TypeError: can't pickle weakref objects
I have the theory that this happens because you can't distribute the datashade operations in the cluster. Sorry if this is a noob question, I'd be very grateful for any advice you could give me.
I think you want to go the other way. That is, pass datashader a dask dataframe instead of a pandas dataframe:
>>> from dask import dataframe as dd
>>> import multiprocessing as mp
>>> dask_df = dd.from_pandas(df, npartitions=mp.cpu_count())
>>> dask_df.persist()
...
>>> cvs = datashader.Canvas(...)
>>> agg = cvs.points(dask_df, ...)
XREF: https://datashader.org/user_guide/Performance.html

dask.ml.xgboost raises UnboundLocalError: local variable 'result' referenced before assignment

I am using dask_xgboost and I don't understand the error stated in the subject. I have successfully trained a model and saved it with joblib.dump.
Later on, during the prediction step I use it like this:
import dask
import dask.dataframe as dd
import dask.distributed as ddst
from dask_jobqueue import PBSCluster
from dask.distributed import Client
import dask_xgboost as dxgb
import geopandas as gp
from sklearn.externals import joblib
from typing import List
def predict(zs_files: List[str], model_name: str, client) -> None:
    delayed_dfs = [dask.delayed(gp.read_file)(zsf) for zsf in zs_files]
    model = joblib.load(model_name)
    delayed_predictions = [
        dxgb.predict(client, model, df).to_parquet(f"{fn}_predicted.parquet")
        for df, fn in zip(delayed_dfs, zs_files)
    ]
    delayed_predictions.compute()
I read a set of GeoJSON files with geopandas and then just feed the model with them. I am using a client on a PBS cluster.
Any help would be appreciated.
Thanks.
I found the issue. I was missing a from_delayed call to convert the delayed geopandas dataframe into a Dask one:
dxgb.predict(client, model, dd.from_delayed(df))

python pandas read_csv() from google spreadsheet url

I want to load the data of this link (a Google Spreadsheet data), in my Jupyter notebook, using python.
I tried different methods and pandas.read_csv() seems to be the easiest. But, I cannot load the data in a proper format. Here is the code that I am using:
import pandas as pd
url = 'https://docs.google.com/spreadsheets/d/1itaohdPiAeniCXNlntNztZ_oRvjh0HsGuJXUJWET008/edit#gid=0'
df = pd.read_csv(url, error_bad_lines=False)
df
The output does not look like the spreadsheet, probably because of the data types, which I don't know how to fix. I have tried different approaches from other posts, but they didn't help. Here is one of them:
import pandas as pd
import requests
import io
url = requests.get('https://docs.google.com/spreadsheets/d/1itaohdPiAeniCXNlntNztZ_oRvjh0HsGuJXUJWET008/edit#gid=0').text
buffer = io.StringIO(url)
columns = ['ID', 'age', 'sex', 'city', 'province', 'country',
           'wuhan(0)_not_wuhan(1)', 'latitude', 'longitude',
           'geo_resolution', 'date_onset_symptoms', 'date_admission_hospital',
           'date_confirmation', 'symptoms', 'lives_in_Wuhan', 'travel_history_dates',
           'travel_history_location', 'reported_market_exposure', 'additional_information']
df = pd.read_csv(filepath_or_buffer=buffer, header=1, usecols=columns)
df
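One thing that might be worth trying: the /edit URL serves the spreadsheet's HTML page rather than CSV data, which would explain the garbled output. Google Sheets also exposes a CSV export URL of the form .../export?format=csv&gid=.... Below is a rough sketch of building that URL from the edit link. The helper name sheet_csv_url is made up for this example, and it assumes the sheet is shared so that anyone with the link can view it.

```python
import re


def sheet_csv_url(edit_url):
    """Build a CSV export URL from a Google Sheets 'edit' URL (hypothetical helper)."""
    match = re.search(r"/spreadsheets/d/([A-Za-z0-9_-]+)", edit_url)
    if match is None:
        raise ValueError("not a Google Sheets URL: " + edit_url)
    sheet_id = match.group(1)
    # The gid selects the tab; fall back to the first tab if it is absent.
    gid_match = re.search(r"[#?&]gid=(\d+)", edit_url)
    gid = gid_match.group(1) if gid_match else "0"
    return ("https://docs.google.com/spreadsheets/d/"
            + sheet_id + "/export?format=csv&gid=" + gid)


url = 'https://docs.google.com/spreadsheets/d/1itaohdPiAeniCXNlntNztZ_oRvjh0HsGuJXUJWET008/edit#gid=0'
csv_url = sheet_csv_url(url)
print(csv_url)

# With network access and a publicly readable sheet, this could then be loaded as:
# import pandas as pd
# df = pd.read_csv(csv_url)
```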

Tensorflow, object detection API

Is there a way to view the images that the TensorFlow Object Detection API trains on after all preprocessing/augmentation?
I'd like to verify that things look correct. I was able to verify the resizing by looking at the graph post resize in inference, but I obviously can't do that for augmentation options.
TIA
I answered a similar question here.
You can use the test script provided by the API and make some changes to fit your needs.
I wrote a little test script called augmentation_test.py. It borrows some code from input_test.py:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import functools
import os
from absl.testing import parameterized
import numpy as np
import tensorflow as tf
from scipy.misc import imsave, imread
from object_detection import inputs
from object_detection.core import preprocessor
from object_detection.core import standard_fields as fields
from object_detection.utils import config_util
from object_detection.utils import test_case
FLAGS = tf.flags.FLAGS
class DataAugmentationFnTest(test_case.TestCase):
    def test_apply_image_and_box_augmentation(self):
        data_augmentation_options = [
            (preprocessor.random_horizontal_flip, {
            })
        ]
        data_augmentation_fn = functools.partial(
            inputs.augment_input_data,
            data_augmentation_options=data_augmentation_options)
        tensor_dict = {
            fields.InputDataFields.image:
                tf.constant(imread('lena.jpeg').astype(np.float32)),
            fields.InputDataFields.groundtruth_boxes:
                tf.constant(np.array([[.5, .5, 1., 1.]], np.float32))
        }
        augmented_tensor_dict = data_augmentation_fn(tensor_dict=tensor_dict)
        with self.test_session() as sess:
            augmented_tensor_dict_out = sess.run(augmented_tensor_dict)
        imsave('lena_out.jpeg', augmented_tensor_dict_out[fields.InputDataFields.image])
if __name__ == '__main__':
    tf.test.main()
You can put this script under models/research/object_detection/ and simply run it with python augmentation_test.py (of course, you need to install the API first). To run it successfully you need to provide an image named 'lena.jpeg'; the output image after augmentation is saved as 'lena_out.jpeg'.
I ran it with the 'lena' image and compared the result before augmentation and after augmentation.
Note that I used preprocessor.random_horizontal_flip in the script, and the result showed exactly what the input image looks like after random_horizontal_flip. To test other augmentation options, you can replace random_horizontal_flip with other methods (which are all defined in preprocessor.py), or you can append other options to the data_augmentation_options list, for example:
data_augmentation_options = [
    (preprocessor.resize_image, {
        'new_height': 20,
        'new_width': 20,
        'method': tf.image.ResizeMethod.NEAREST_NEIGHBOR
    }),
    (preprocessor.random_horizontal_flip, {
    })
]
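When checking the saved image it can help to know what a horizontal flip should produce. The sketch below is a plain NumPy reference, not the API's own implementation. It assumes boxes are normalized [ymin, xmin, ymax, xmax] rows, as in the groundtruth_boxes used above, and the function name flip_horizontal is made up for this example.

```python
import numpy as np


def flip_horizontal(image, boxes):
    """Mirror an HxWxC image left-to-right, with its normalized boxes (illustrative only)."""
    flipped_image = image[:, ::-1, :]
    ymin, xmin, ymax, xmax = boxes.T
    # The x-coordinates are mirrored around 0.5; the y-coordinates stay put.
    flipped_boxes = np.stack([ymin, 1.0 - xmax, ymax, 1.0 - xmin], axis=1)
    return flipped_image, flipped_boxes


# Tiny 2x3 single-channel "image" so the flip is easy to see.
image = np.arange(6, dtype=np.float32).reshape(2, 3, 1)
boxes = np.array([[.5, .5, 1., 1.]], np.float32)

flipped_image, flipped_boxes = flip_horizontal(image, boxes)
print(flipped_image[:, :, 0])
print(flipped_boxes)
```

So for the box in the test script, a flip should move it from the bottom-right quadrant to the bottom-left one.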

Dask dataframe get second highest value and column name

This code gives me the highest value and column name.
import numpy as np
import pandas as pd
import dask.dataframe as dd
cols=[0,1,2,3,4]
df = pd.DataFrame(np.random.randn(1000, len(cols)), columns=cols)
ddf = dd.from_pandas(df, npartitions=4)
ddf['max_col'] = ddf[cols].idxmax(axis=1)
ddf['max_val'] = ddf[cols].max(axis=1)
I want to get the second highest as well. Something like:
ddf['max2_col'] = ddf[cols].idxmax2(axis=1)
ddf['max2_val'] = ddf[cols].max2(axis=1)
Are there functions like idxmax2 or max2? Or any other optimized way for doing this?
You should normally try to figure out how to do what you want with pandas first. If you cannot, asking that question instead, with the pandas tag, will get you a faster answer.
The following appears to work for pandas, although it may not be elegant:
import numpy as np
import pandas as pd
import dask.dataframe as dd
cols=[0,1,2,3,4]
df = pd.DataFrame(np.random.randn(1000, len(cols)), columns=cols)
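def make_cols(df):
    # argsort returns column positions, which equal the column names here because cols is 0..4
    df['max2_col'] = df[cols].values.argsort(axis=1)[:, -2]
    df2 = df[cols].values.copy()
    df2.sort(axis=1)  # sort each row ascending; the second-to-last entry is the second highest
    df['max2_val'] = df2[:, -2]
    return df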
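To apply it to the Dask variant, assign the result of map_partitions back to ddf, because it returns a new dataframe rather than modifying ddf in place:
ddf = dd.from_pandas(df, npartitions=4)
ddf = ddf.map_partitions(make_cols)
ddf.head()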
