I have 33 multi-partition dataframes, each with its own metadata, all written with fastparquet. The structure looks something like:
- 20190101.parquet/
  - _common_metadata
  - _metadata
  - part.0.parquet
  - ...
  - part.n.parquet
- 20190102.parquet/
  - _common_metadata
  - _metadata
  - part.0.parquet
  - ...
  - part.n.parquet
- 20190103.parquet/
  - _common_metadata
  - _metadata
  - part.0.parquet
  - ...
  - part.n.parquet
I would like to join these all together.
I currently have:
import dask.dataframe as dd

dfs = []
for date in dates:
    df = dd.read_parquet(f'{date}.parquet', engine='fastparquet')
    dfs.append(df)
df = dd.concat(dfs)
This returns a dask dataframe named "concat" with 129,294 tasks in its graph.
I am then trying to write this out:
df.to_parquet('out.parquet', engine='fastparquet')
This last call never starts work. That is:
* my notebook cell is running
* dask system page shows a growing number of file descriptors and then flattens
* dask system page shows increasing memory and then still increasing but more slowly
* but tasks do not appear in the task stream
I have waited for up to 1 hour.
(Running on dask 2.3.0)
I sincerely hope that all of these have a sorted index column along which you are joining them. Otherwise this is likely to be very expensive.
If they do have such a column, you might want to call it out explicitly.
You can also pass a list of filenames (or paths) directly to fastparquet, which will read them as a single dataset that you can then load into a dask or pandas dataframe.
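For example, a minimal sketch combining both suggestions, assuming the per-day directories sit next to each other and that there really is a sorted index column ('timestamp' below is purely illustrative):

import dask.dataframe as dd

# dd.read_parquet accepts a list of paths, so all days can be read as one
# dataframe in a single call; replace 'timestamp' with the real sorted
# index column.
paths = [f'{date}.parquet' for date in dates]
df = dd.read_parquet(paths, engine='fastparquet', index='timestamp')
df.to_parquet('out.parquet', engine='fastparquet')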
This is how I read a directory of parquet files and scatter them onto a dask cluster:
output = ["some list of files..."]
df = client.scatter(dd.read_parquet(output, engine="fastparquet").reset_index().compute())
I'm trying to load a Dask dataframe from a SQL connection. Per the read_sql_table documentation, it is necessary to pass in an index_col. What should I do if there's a possibility that there are no good columns to act as index?
Could this be a suitable replacement?
import math
import dask
import pandas as pd
import dask.dataframe as dd

# Break SQL query into chunks
chunks = []
num_chunks = math.ceil(num_records / chunk_size)

# Run the query for each chunk on Dask workers
for i in range(num_chunks):
    query = 'SELECT * FROM ' + table + ' LIMIT ' + str(i * chunk_size) + ',' + str(chunk_size)
    chunk = dask.delayed(pd.read_sql)(query, sql_uri)
    chunks.append(chunk)

# Aggregate chunks into one dask dataframe
df = dd.from_delayed(chunks)
dfs[table] = df
Unfortunately, LIMIT/OFFSET is not in general a reliable way to partition a query in most SQL implementations. In particular, it is often the case that, to get to an offset and fetch later rows from a query, the engine must first parse through earlier rows, and thus the work to generate a number of partitions is much magnified. In some cases, you might even end up with missed or duplicated rows.
This was the reasoning behind requiring boundary values in dask's read_sql_table implementation.
However, there is nothing in principle wrong with the way you are setting up your dask dataframe. If you can show that your server does not suffer from the problems we were anticipating, then you are welcome to take that approach.
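For reference, a rough sketch of the boundary-value approach dask takes instead: range queries on an indexed column, so each partition is an independent, cheap query. The column name 'id' below is purely illustrative and would need to be a real indexed numeric or datetime column:

import dask.dataframe as dd

# Partition on an indexed column rather than LIMIT/OFFSET; dask works out
# boundary values and issues one range query per partition.
df = dd.read_sql_table(table, sql_uri, index_col='id', npartitions=num_chunks)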
I have two dataframes. One is a lookup table consisting of key/value pairs. The other is my main dataframe. The main dataframe has many more records than the lookup table. I need to construct a 'key' from existing columns in my main dataframe and then lookup a value matching that key in my lookup table. Here they are:
import pandas as pd
from datetime import datetime

lk = pd.DataFrame({'key': ['key10', 'key9'], 'value': [100, 90]})
lk.set_index('key', inplace=True)

date_today = datetime.now()
df = pd.DataFrame({'date1': [date_today, date_today, date_today],
                   'year': [1999, 2001, 2003],
                   'month': [10, 9, 10],
                   'code': [10, 4, 5],
                   'date2': [None, date_today, None],
                   'keyed_value': [0, 0, 0]})
This is how I get a value:
df['constructed'] = "key" + df['month'].astype('str')

def getKeyValue(lk, k):
    return lk.loc[k, 'value']

print(getKeyValue(lk, df['constructed']))
Here are my issues:
1) I don't want to use iteration or apply methods. My actual data has over 2 million rows and 200 columns, and apply was really slow (over 2 minutes), so I opted for an inner join, hence the need to create a new 'constructed' column, which I drop after the join. The join brought execution down to 48 seconds, but there has to be a faster way (I am hoping).
2) How do I vectorize this? I don't know how to even approach it. Is it even possible? I tried this but just got an error:
df['keyed_values'] = getKeyValue(lk, df['constructed'])
Any help or pointers are much appreciated.
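For reference, here is a minimal sketch of the join-based approach described in 1): build the key column, merge against the lookup table's index, then drop the helper column (the merged-in column is named 'value', after the lookup table's column).

# Vectorized lookup via a merge on the constructed key; no apply/iteration.
df['constructed'] = "key" + df['month'].astype('str')
df = df.merge(lk, left_on='constructed', right_index=True, how='inner')
df = df.drop(columns='constructed')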
I am saving a partitioned parquet file to an S3 bucket using Dask as follows:
dd.to_parquet(
    dd.from_pandas(df, npartitions=1),
    path='s3a://test/parquet',
    engine='fastparquet',
    partition_on='country',
    object_encoding='utf8',
    compression="gzip",
    write_index=False,
)
Parquet files are successfully created; here is the directory structure:
[directory structure screenshot]
I am successfully creating an Impala table from this parquet:
create external table tmp.countries_france
like parquet 's3a://test/parquet/_metadata'
partitioned by (country string)
stored as parquet location 's3a://test/parquet/'
As well as adding a partition to this table:
alter table tmp.countries_france add partition (sheet='belgium')
However, when I do a select * from tmp.countries_france, I get the following error:
File 's3a://test/parquet/sheet=france/part.0.parquet' is corrupt: metadata indicates a zero row count but there is at least one non-empty row group.
I guess the problem comes from Dask, because creating a non-partitioned parquet file works fine. I've tried setting write_index=True but had no luck.
I am not seeing this:
import numpy as np
import pandas as pd
import dask.dataframe as dd
import fastparquet

df = pd.DataFrame({'a': np.random.choice(['a', 'b', 'c'], size=1000),
                   'b': np.random.randint(0, 64000, size=1000),
                   'c': np.random.choice([True, False], size=1000)})
df = dd.from_pandas(df, npartitions=1)
df.to_parquet('.', partition_on=['a', 'c'], engine='fastparquet')

pf = fastparquet.ParquetFile('_metadata')
pf.count                    # 1000
len(pf.to_pandas())         # 1000
pf.row_groups[0].num_rows   # 171

pf = fastparquet.ParquetFile('a=a/c=False/part.0.parquet')
pf.count                    # 171
pf.row_groups[0].num_rows   # 171
Obviously, I cannot speak for what Impala might be doing, but perhaps the "like" mechanism is expecting to find the data in the _metadata file?
Note that pandas can read and write parquet without dask, with the same options.
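For illustration, a minimal pandas-only sketch (writing to a local path for simplicity; partition_cols is pandas' name for the corresponding option):

import pandas as pd

# pandas passes these options through to fastparquet; partition_cols
# produces the same country=... directory layout as partition_on above.
df.to_parquet('parquet_out', engine='fastparquet',
              partition_cols=['country'], compression='gzip', index=False)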
I am trying to load data from a Cassandra database into a Dask dataframe. I have tried querying the following with no success:
query = """SELECT * FROM document_table"""
df = man.session.execute(query)
df = dd.DataFrame(list(df))
TypeError Traceback (most recent call last)
<ipython-input-135-021507f6f2ab> in <module>()
----> 1 a = dd.DataFrame(list(df))
TypeError: __init__() missing 3 required positional arguments: 'name', 'meta', and 'divisions'
Does anybody know an easy way to load data directly from Cassandra into Dask? It is too much data to load into pandas first.
Some problems with your code:
- the line df = man.session.execute(query) presumably loads the whole data-set into memory; Dask is not invoked here and plays no part in it (someone with knowledge of the Cassandra driver can confirm this)
- list(df) produces a list of the column names of a dataframe and drops all the data
- dd.DataFrame, if you read the docs, is not constructed like this
What you probably want to do is a) make a function that returns one partition of the data, b) delay this function and call it with the various values of the partitions, and c) use dd.from_delayed to make the dask dataframe. E.g., assuming the table has a field partfield which handily has possible values 1..6 and a similar number of rows for each partition:
import dask
import dask.dataframe as dd
import pandas as pd

@dask.delayed
def part(x):
    session = ...  # construct a Cassandra session here
    q = "SELECT * FROM document_table WHERE partfield={}".format(x)
    rows = session.execute(q)
    return pd.DataFrame(list(rows))

parts = [part(x) for x in range(1, 7)]
df = dd.from_delayed(parts)
I am trying to collapse rows of a dataframe based on a key. My file is big and pandas throws a memory error, so I am currently trying to use dask. I am attaching a snippet of the code here:
def f(x):
    p = x.groupby(id).agg(''.join).reset_index()
    return p

metadf = pd.DataFrame(columns=['c1', 'p1', 'pd1', 'd1'])
df = df.groupby(idname).apply(f, meta=metadf).reset_index().compute()
p has the same structure as metadf, and both dataframes have the same shape.
When I execute this, I get the following error:
"ValueError: Length mismatch: Expected axis has 6 elements, new values have 5 elements"
What am I missing here? Is there any other way to collapse rows based on a key in dask?
The task at hand is to do the following (sample shown) in a dask dataframe.
Input csv file:
key,c1,c2,c3......,cn
1,car,phone,cat,.....,kite
2,abc,def,hij,.......,pot
1,yes,no,is,.........,hello
2,hello,yes,no,......,help
Output csv file:
key,c1,c2,c3,.......,cn
1,caryes,phoneno,catis,.....,kitehello
2,abchello,defyes,hijno,....,pothelp
In this case meta= corresponds to the output of df.groupby(...).apply(f) and not just to the output of f. Perhaps these differ in some subtle way?
I would address this by first not providing meta= at all. Dask.dataframe will give you a warning asking you to be explicit, but things should hopefully progress anyway if it is able to determine the right dtypes and columns by running some sample data through your function.
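A minimal sketch of that suggestion, reusing the names from the question (df, f and idname as defined there):

# Drop meta= and let dask infer the output columns/dtypes from sample data;
# dask will warn about the missing meta= but should proceed.
result = df.groupby(idname).apply(f).reset_index().compute()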