ValueError on Dask DataFrames - dask

I am using Dask to read a CSV file. However, I couldn't apply or compute any operation on it because of this error:
Do you have any ideas what this error is about and how to fix it?

When reading a CSV file in Dask, this error comes up when Dask does not infer the correct dtype for one or more columns.
For example, we read a CSV file using Dask as follows:
import dask.dataframe as dd
# use a raw string so the backslashes in the Windows-style path are not treated as escape sequences
df = dd.read_csv(r'\data\file.txt', sep='\t', header='infer')
This prompts the error mentioned above.
To solve this problem, as suggested by @mrocklin in this comment, https://github.com/dask/dask/issues/1166, we need to determine the dtypes of the columns. We can do this by reading the CSV file in pandas, identifying the data types, and passing them as an argument when reading the CSV with Dask.
import pandas as pd
# read the file once with pandas to discover the column dtypes
df_pd = pd.read_csv(r'\data\file.txt', sep='\t', header='infer')
dt = df_pd.dtypes.to_dict()
# pass the discovered dtypes explicitly so dask does not have to infer them
df = dd.read_csv(r'\data\file.txt', sep='\t', header='infer', dtype=dt)
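If the dtype mismatch only affects integer columns that contain missing values, dd.read_csv also accepts assume_missing=True, which reads unspecified integer columns as floats; a minimal sketch (same hypothetical path as above):
import dask.dataframe as dd
# assume_missing=True makes dask read integer columns not listed in dtype as floats,
# so partitions that contain NaN no longer break dtype inference
df = dd.read_csv(r'\data\file.txt', sep='\t', assume_missing=True)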

Related

How can I create a Dask array from zipped .npy files?

I have a large dataset stored as zipped npy files. How can I stack a given subset of these into a Dask array?
I'm aware of dask.array.from_npy_stack but I don't know how to use it for this.
Here's a crude first attempt that uses up all my memory:
import numpy as np
import dask.array as da
data = np.load('data.npz')
def load(files):
    list_ = [da.from_array(data[file]) for file in files]
    return da.stack(list_)
x = load(['foo', 'bar'])
Well, you can't load a large npz file into memory like that, because then you're already out of memory. I would read each one in a delayed fashion, and then call da.from_array and da.stack roughly as you do in your example.
Here are some docs that may help if you haven't seen them before: https://docs.dask.org/en/latest/array-creation.html#using-dask-delayed
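For concreteness, a minimal sketch along those lines using dask.delayed together with da.from_delayed (rather than da.from_array); the member names, shape and dtype are placeholders to replace with your own:
import numpy as np
import dask
import dask.array as da
@dask.delayed
def load_member(path, name):
    # indexing a single member of the lazily opened .npz decompresses only that array
    with np.load(path) as archive:
        return archive[name]
def load(path, names, shape, dtype):
    # wrap each delayed load with its known shape/dtype, then stack without reading anything yet
    arrays = [da.from_delayed(load_member(path, name), shape=shape, dtype=dtype)
              for name in names]
    return da.stack(arrays)
# hypothetical shape and dtype; nothing is loaded until x is computed
x = load('data.npz', ['foo', 'bar'], shape=(1000, 1000), dtype='float64')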

How can I sort a big text file with Dask?

I have a text file which is way bigger than my memory. I want to sort the lines of that file lexicographically. I know how to do it manually:
Split into chunks which fit into memory
Sort the chunks
Merge the chunks
I wanted to do it with Dask; I thought dealing with large amounts of data would be one of Dask's use cases. How can I sort the whole file with Dask?
My attempt
You can execute generate_numbers.py -n 550_000_000, which takes about 30 minutes and generates a 20 GB file.
import dask.dataframe as dd
filename = "numbers-large.txt"
print("Create ddf")
ddf = dd.read_csv(filename, sep=',', header=None).set_index(0)
print("Compute ddf and sort")
df = ddf.compute().sort_values(0)
print("Write")
with open("numbers-large-sorted-dask.txt", "w") as fp:
    for number in df.index.to_list():
        fp.write(f"{number}\n")
When I execute this, I get:
Create ddf
Compute ddf and sort
[2] 2437 killed python dask-sort.py
I guess the process is killed because it consumes too much memory?
Try the following code:
import dask
import dask.dataframe as dd
inpFn = "numbers-large.txt"
outFn = "numbers-large-sorted-dask.txt"
blkSize = 500  # for a test on a small file - increase it
print("Create ddf")
ddf = dd.read_csv(inpFn, header=None, blocksize=blkSize)
print("Sort")
ddf_sorted = ddf.set_index(0)
print("Write")
fut = ddf_sorted.to_csv(outFn, compute=False, single_file=True, header=None)
dask.compute(fut)
print("Stop")
Note that I set the blkSize parameter so low just for test purposes. In the target version, either increase its value or drop blocksize=blkSize entirely to accept the default.
Since set_index performs the sort, there is no need to call sort_values(); another detail is that Dask does not support this method anyway.
As far as writing is concerned, I noticed that you want to generate a single output file instead of a sequence of files (one per partition), so I passed single_file=True.
I also added header=None to suppress writing the column name, which in this case is the not very meaningful 0.
The last detail to mention is compute=False, so that Dask generates a sequence of future objects without executing (computing) them for now.
All operations so far have only constructed the computation tree, without executing it.
Only now does dask.compute(...) run the whole computation tree.
Edit
Your code probably failed due to:
df = ddf.compute().sort_values(0)
Note that you:
first call compute(), generating the whole result as an in-memory pandas DataFrame,
and only after that, at the pandas level, attempt to sort it.
The problem is most likely that the memory in your computer is not big enough to hold the whole result of compute().
So your code probably failed at exactly this moment, without ever getting the chance to sort the DataFrame.

Read HDF5 dataset of multiple data types

I have an HDF5 dataset which contains different data types (int and float).
When reading it into a NumPy array, it is detected as an array of type np.void.
import numpy as np
import h5py
f = h5py.File('Sample.h5', 'r')
array = np.array(f['/Group1/Dataset'])
print(array.dtype)
(Image of the data types printed by print(array.dtype).)
How can I read this dataset into arrays where each column keeps the same data type as in the input? Thanks in advance for the reply.
Here are 2 simple examples showing both ways to slice a subset of the dataset using the HDF5 Field/Column names.
The first method extracts a subset of the data to a record array by slicing when accessing the dataset.
The second method follows your current method. It extracts the entire dataset to a record array, then slices a new view to access a subset of the data.
Print statements are used liberally so you can see what's going on.
Method 1
real_array = np.array(f['/Group1/Dataset'][:, 'XR', 'YR', 'ZR'])
print(real_array.dtype)
print(real_array.shape)
Method 2
cmplx_array = np.array(f['/Group1/Dataset'])
print(cmplx_array.dtype)
print(cmplx_array.shape)
disp_real = cmplx_array[['XR','YR','ZR']]
print(disp_real.dtype)
print(disp_real.shape)
Review this SO topic for additional insights into copying values from a recarray to a ndarray, and back.
copy-numpy-recarray-to-ndarray
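For example, a minimal sketch (assuming XR, YR and ZR are all floats) of turning the sliced record array from Method 2 into a plain 2-D ndarray:
import numpy as np
# stack the named fields column-wise into an ordinary homogeneous array
plain = np.stack([disp_real[name] for name in ('XR', 'YR', 'ZR')], axis=-1)
print(plain.dtype, plain.shape)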

Converting Avro to Dask dataframe

I am pretty new to Dask and have most of my files in Avro (migrated from PySpark). I tried using intake_avro (sequence); it manages to read the data, and take(1) shows the right records.
As I have quite a number of columns with None, whenever I convert to a dataframe it gives me
ValueError: Cannot convert non-finite values (NA or inf) to integer
I am not sure whether I am doing this right?
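A minimal sketch of the kind of workaround that often helps with this error: read the Avro files with dask.bag.read_avro (an assumption, since the post above used intake_avro) and pass an explicit meta to to_dataframe so that integer columns that may contain None get a nullable or float dtype instead of a plain int; the column names below are hypothetical:
import dask.bag as db
import pandas as pd
# read the Avro records into a bag of dicts
bag = db.read_avro('data/*.avro')
# declare the target schema up front; 'Int64' is pandas' nullable integer dtype,
# so None values never have to be cast to a plain int
meta = pd.DataFrame({
    'id': pd.Series(dtype='Int64'),
    'value': pd.Series(dtype='float64'),
})
df = bag.to_dataframe(meta=meta)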

Performance and data manipulation on Dask

I have imported a parquet file of approx. 800 MB with ~50 million rows into a Dask dataframe.
There are 5 columns: DATE, TICKER, COUNTRY, RETURN, GICS
Questions:
How can I specify data types in read_parquet, or do I have to do it with astype?
Can I parse dates within read_parquet?
I simply tried to do the following:
import dask.dataframe as dd
df = dd.read_parquet(r'.\abc.gzip')
df['INDUSTRY'] = df.GICS.str[0:4]
n = df.INDUSTRY.unique().compute()
and it takes forever to return. Am I doing anything wrong here? Partitions are automatically set to 1.
When I try something like df[df.INDUSTRY == '4010'].compute(), it also takes forever to return or crashes.
To answer your questions:
A parquet file stores the column types, as noted in the Apache docs here, so you won't be able to change the data type when you read the file in; you'll have to use astype.
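For instance, a brief sketch of the astype approach; the target dtypes here are only illustrative guesses for this dataset:
import dask.dataframe as dd
df = dd.read_parquet(r'.\abc.gzip')
# cast after the read; these target dtypes are illustrative only
df = df.astype({'TICKER': 'category', 'COUNTRY': 'category', 'RETURN': 'float32'})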
You can't convert a string to a date within the read, but if you use the map_partitions function, documented here, you can convert the column to a date, as in this example:
import dask.dataframe as dd
import pandas as pd
df = dd.read_parquet(your_file)
meta = ('date', 'datetime64[ns]')
# you can add your own date format, or just let pandas guess
to_date_time = lambda x: pd.to_datetime(x, format='%Y-%m-%d')
df['date_clean'] = df.date.map_partitions(to_date_time, meta=meta)
The map_partitions function will convert the dates on each chunk of the parquet when the file is computed, making it functionally the same as converting the date when the file is read in.
Here again I think you would benefit from the map_partitions function, so you might try something like this:
import dask.dataframe as dd
df = dd.read_parquet(r'.\abc.gzip')
df['INDUSTRY'] = df.GICS.map_partitions(lambda x: x.str[0:4], meta=('INDUSTRY', 'str'))
df[df.INDUSTRY == '4010']
Note that if you run compute, the object is converted to pandas. If the file is too large, Dask won't be able to compute it, and thus nothing will be returned. Without seeing the data it's hard to say more, but do check out these tools to profile your computations and see whether you are leveraging all your CPUs.
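As an illustration, a minimal sketch using Dask's local diagnostics (one possible set of profiling tools) to watch progress and resource usage while the filter runs:
from dask.diagnostics import ProgressBar, Profiler, ResourceProfiler, visualize
# run the filter under the local profilers; compute() pulls the result into pandas
with ProgressBar(), Profiler() as prof, ResourceProfiler() as rprof:
    result = df[df.INDUSTRY == '4010'].compute()
# renders a timeline of tasks and CPU/memory usage (requires bokeh)
visualize([prof, rprof])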
