import dask.dataframe as dd
import numpy as np
import pandas as pd
from dask import delayed
df1 = pd.DataFrame({'a': np.arange(10), 'b': np.random.rand()})
df1 = df1.astype({'a':np.float64})
df2 = pd.DataFrame({'a': np.random.rand(5), 'c': 1})
df1.to_csv('df1.csv')
df2.to_csv('df2.csv')
dd.read_csv('*.csv').compute()
This gives an inner-join-style result:
Unnamed: 0 a b
0 0 0.000000 0.218319
1 1 1.000000 0.218319
2 2 2.000000 0.218319
...
And:
df1_delayed = delayed(lambda: df1)()
df2_delayed = delayed(lambda: df2)()
dd.from_delayed([df1_delayed, df2_delayed]).compute()
This gives an outer-join-style result:
a b c
0 0.000000 0.218319 NaN
1 1.000000 0.218319 NaN
2 2.000000 0.218319 NaN
...
How can I make read_csv behave in the same way?
EDIT:
Even passing a dtype schema down to pandas doesn't work:
dd.read_csv('*.csv', dtype={'a':np.float64, 'b': np.float64, 'c': np.float64}).compute()
Generally dask.dataframe assumes that all of the Pandas dataframes that form the dask.dataframe have the same columns and dtypes. Behavior is ill-defined if this is not the case.
If your CSVs have different columns and dtypes, then I recommend using dask.delayed as you've done in your second example and explicitly adding the new empty columns before calling dask.dataframe.from_delayed, as in the sketch below.
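For example, a minimal sketch of that approach, reusing df1 and df2 from the question (the fixed column list ['a', 'b', 'c'] and the NaN fill are my assumptions):
import dask.dataframe as dd
from dask import delayed
# Give every delayed partition the same columns (missing ones become NaN) before from_delayed
df1_delayed = delayed(lambda: df1.reindex(columns=['a', 'b', 'c']))()  # df1 lacks 'c'
df2_delayed = delayed(lambda: df2.reindex(columns=['a', 'b', 'c']))()  # df2 lacks 'b'
dd.from_delayed([df1_delayed, df2_delayed]).compute()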
Is there a way to do this but with the threshold as a delayed number?
import dask
import pandas as pd
import dask.dataframe as dd
threshold = 3
df = pd.DataFrame({'something': [1,2,3,4]})
ddf = dd.from_pandas(df, npartitions=2)
ddf[ddf['something'] >= threshold]
What if threshold is:
threshold = dask.delayed(3)
At the moment it gives me:
TypeError('Truth of Delayed objects is not supported')
I want to keep ddf as a dask dataframe and not turn it into a pandas dataframe. I'm wondering if there are combinator forms that also take delayed values.
Dask has no way to know that the concrete value in that Delayed object is an integer, so there's no way to know what to do with it in the operation (align, broadcast, etc.).
If you use something like a zero-dimensional dask array instead (da below is dask.array, np is numpy), things seem OK:
In [32]: df = dd.from_pandas(pd.DataFrame({"A": [1, 2, 3, 4]}), 2)
In [33]: threshold = da.from_array(np.array([3]))[0]
In [34]: df.A > threshold
Out[34]:
Dask Series Structure:
npartitions=2
0 bool
2 ...
3 ...
Name: A, dtype: bool
Dask Name: gt, 8 tasks
In [35]: df[df.A > threshold].compute()
Out[35]:
A
3 4
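If the value really does start out as a Delayed, one workaround in the same spirit (a sketch, assuming the standard dask.array API, not an official recipe) is to wrap it in a zero-dimensional dask array with da.from_delayed so that dask knows its shape and dtype up front:
import dask
import numpy as np
import pandas as pd
import dask.array as da
import dask.dataframe as dd
ddf = dd.from_pandas(pd.DataFrame({'something': [1, 2, 3, 4]}), npartitions=2)
threshold = dask.delayed(3)
# declare shape () and an integer dtype so the comparison can align/broadcast lazily
threshold_arr = da.from_delayed(threshold, shape=(), dtype=np.int64)
ddf[ddf['something'] >= threshold_arr].compute()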
I'm learning about clustering, KMeans, and such, so my knowledge of the topic is very basic. What I have below is a bit of a self-study on how it works. Basically, if 'a' shows up in any of the columns, 'Binary' will equal 1. Essentially I am trying to teach it a pattern. I learned the following from a tutorial using the Titanic dataset, but I've adapted it to my own data.
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
import seaborn as sns
import matplotlib.pyplot as plt
My constructed data:
dataset = [
[0,'x','f','g'],[1,'a','c','b'],[1,'d','k','a'],[0,'y','v','w'],
[0,'q','w','e'],[1,'c','a','l'],[0,'t','x','j'],[1,'w','o','a'],
[0,'z','m','n'],[1,'z','x','a'],[0,'f','g','h'],[1,'h','a','c'],
[1,'a','r','e'],[0,'g','c','c']
]
df = pd.DataFrame(dataset, columns=['Binary','Col1','Col2','Col3'])
df.head()
df:
Binary Col1 Col2 Col3
------------------------
1 a b c
0 x t v
0 s q w
1 n m a
1 u a r
Encode the non-numeric columns to integers:
labelEncoder = LabelEncoder()
labelEncoder.fit(df['Col1'])
df['Col1'] = labelEncoder.transform(df['Col1'])
labelEncoder.fit(df['Col2'])
df['Col2'] = labelEncoder.transform(df['Col2'])
labelEncoder.fit(df['Col3'])
df['Col3'] = labelEncoder.transform(df['Col3'])
Set clusters to two, because it's either 1 or 0?
X = np.array(df.drop(['Binary'], axis=1).astype(float))
y = np.array(df['Binary'])
kmeans = KMeans(n_clusters=2)
kmeans.fit(X)
Test it:
correct = 0
for i in range(len(X)):
predict_me = np.array(X[i].astype(float))
predict_me = predict_me.reshape(-1, len(predict_me))
prediction = kmeans.predict(predict_me)
if prediction[0] == y[i]:
correct += 1
The result:
print(f'{round(correct/len(X) * 100)}% Accuracy')
>>> 71%
How can I make it more accurate, to the point where it knows with 99.99% certainty that 'a' means the Binary column is 1? More data?
K-means does not even try to predict this value, because it is an unsupervised method: it is not a prediction algorithm but a structure-discovery task. Don't mistake clustering for classification.
The cluster numbers have no meaning. They are 0 and 1 because these are the first two integers. K-means is randomized. Run it a few times and you will also score just 29% sometimes.
Also, k-means is designed for continuous input. You can apply it to label-encoded data like this, but the results will be pretty poor. If you actually want to predict the Binary column, use a supervised classifier instead, as in the sketch below.
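A rough sketch of a supervised approach (assuming df still holds the raw letter values, i.e. run this before the LabelEncoder step; the choice of DecisionTreeClassifier is arbitrary):
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
# One-hot encode the letter columns instead of label-encoding them
X = pd.get_dummies(df[['Col1', 'Col2', 'Col3']])
y = df['Binary']
clf = DecisionTreeClassifier().fit(X, y)
print(f'{round(clf.score(X, y) * 100)}% training accuracy')  # accuracy on the training data only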
I wonder why dd.from_bcolz() starts to do some processing immediately when called (processing that grows a lot when the number of columns goes up and there are string-type columns).
dd.read_hdf(), on the other hand, doesn't do much processing when called, but only when the dask.dataframe is used - then read_hdf() reads and processes the HDF5 file chunk by chunk...
I like how read_hdf works now; the only problems are that an HDF5 table cannot have more than ~1200 columns and that the dataframe does not support columns of arrays. And the HDF5 format is not column-based after all...
In [1]: import dask.dataframe as dd
In [2]: import pandas as pd
In [3]: import bcolz, random
In [4]: import numpy as np
In [5]: N = int(1e7)
In [6]: int_col = np.linspace(0, 1, N)
In [7]: ct_disk = bcolz.fromiter(((i,i) for i in range(N)), dtype="i8,i8",\
...: count=N, rootdir=r'/mnt/nfs/ct_.bcolz')
In [8]: for i in range(10): ct_disk.addcol(int_col)
In [9]: import dask.dataframe as dd
In [10]: %time dd.from_bcolz(r'/mnt/nfs/ct_.bcolz', chunksize=1000000, lock=False)
CPU times: user 8 ms, sys: 16 ms, total: 24 ms
Wall time: 32.6 ms
Out[10]: dd.DataFrame<from_bc..., npartitions=10, divisions=(0, 1000000, 2000000, ..., 9000000, 9999999)>
In [11]: str_col= [''.join(random.choice('ABCD1234') for _ in range(5)) for i in range(int(N/10))]*10
In [12]: ct_disk.addcol(str_col, dtype='S5')
In [13]: %time dd.from_bcolz(r'/mnt/nfs/ct_.bcolz', chunksize=1000000, lock=False)
CPU times: user 2.36 s, sys: 56 ms, total: 2.42 s
Wall time: 2.44 s
Out[13]: dd.DataFrame<from_bc..., npartitions=10, divisions=(0, 1000000, 2000000, ..., 9000000, 9999999)>
In [14]: for i in range(10): ct_disk.addcol(str_col, dtype='S5')
In [15]: %time dd.from_bcolz(r'/mnt/nfs/ct_.bcolz', chunksize=1000000, lock=False)
CPU times: user 25.3 s, sys: 511 ms, total: 25.8 s
Wall time: 25.9 s
Out[15]: dd.DataFrame<from_bc..., npartitions=10, divisions=(0, 1000000, 2000000, ..., 9000000, 9999999)>
And it gets even worse as N (the number of rows) grows.
It looks like, as written today, from_bcolz automatically categorizes object-dtype columns, so it does a full read of all object-dtype columns and calls unique on them. You can turn this off by setting categorize=False, as shown below.
Please raise a GitHub issue if you think that this behavior should be changed.
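For reference, a sketch of the same call with categorization turned off (same path and chunksize as in the question; I haven't timed it):
dd.from_bcolz(r'/mnt/nfs/ct_.bcolz', chunksize=1000000, lock=False, categorize=False)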
In dask what is the difference between
df.col.unique()
and
df.col.drop_duplicates()
Both return a Series containing the unique elements of df.col.
There is a difference in the index: the unique result is indexed by 1..N, while the drop_duplicates result is indexed by an arbitrary-looking sequence of numbers.
What is the significance of the index returned by drop_duplicates?
Is there any reason to use one over the other if the index is not important?
Dask.dataframe has both because Pandas has both, and dask.dataframe mostly copies the Pandas API. Unique is a holdover from Pandas' history with Numpy.
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'x': [1, 2, 1], 'y': [1., 2., 3.]}, index=pd.Index(['a', 'b', 'A'], name='I'))
In [3]: df.x.drop_duplicates()
Out[3]:
I
a 1
b 2
Name: x, dtype: int64
In [4]: df.x.unique()
Out[4]: array([1, 2])
In dask.dataframe we deviate slightly and choose to return a dask.dataframe.Series rather than a dask.array.Array, because one can't precompute the length of the array and so can't act lazily.
In practice there is little reason to use unique over drop_duplicates.
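A quick sketch of the same comparison on the dask side (a small made-up frame; the custom index labels are only there to show what drop_duplicates preserves):
import pandas as pd
import dask.dataframe as dd
df = pd.DataFrame({'x': [1, 2, 1, 2, 3]}, index=[10, 20, 30, 40, 50])
ddf = dd.from_pandas(df, npartitions=2)
print(ddf.x.unique().compute())           # a dask-backed Series, not the NumPy array Pandas returns
print(ddf.x.drop_duplicates().compute())  # a Series that keeps labels from the original index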
I am trying to use dask instead of pandas since I have a 2.6 GB CSV file.
I load it and I want to drop a column, but it seems that neither the drop method
df.drop('column') nor slicing df[:, :-1]
is implemented yet. Is this the case or am I just missing something?
We implemented the drop method in this PR. This is available as of dask 0.7.0.
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'x': [1, 2, 3], 'y': [3, 2, 1]})
In [3]: import dask.dataframe as dd
In [4]: ddf = dd.from_pandas(df, npartitions=2)
In [5]: ddf.drop('y', axis=1).compute()
Out[5]:
x
0 1
1 2
2 3
Previously one could also have used slicing with column names, though of course this is less attractive if you have many columns.
In [6]: ddf[['x']].compute()
Out[6]:
x
0 1
1 2
2 3
This should also work, where columns is a list of the column names to drop:
print(ddf.shape)
ddf = ddf.drop(columns, axis=1)
print(ddf.shape)
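Note that for a dask dataframe the number of rows in shape is lazy, so the prints above show a delayed value rather than a concrete row count. A small sketch of how to materialize it, if that is what you are after:
nrows, ncols = ddf.shape
print(nrows.compute(), ncols)  # the row count has to be computed; the column count is known immediately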