Dask groupby and apply: ValueError: Expected axis has 6 elements, new values have 5 elements

I am trying to collapse rows of a dataframe based on a key. My file is big and pandas throws a memory error. I am currently trying to use dask. I am attaching a snippet of the code here.
def f(x):
    p = x.groupby(id).agg(''.join).reset_index()
    return p

metadf = pd.DataFrame(columns=['c1', 'p1', 'pd1', 'd1'])
df = df.groupby(idname).apply(f, meta=metadf).reset_index().compute()
p has the same structure as metadf; the shapes of both dataframes are the same.
When I execute this, I get the following error:
"ValueError: Length mismatch: Expected axis has 6 elements, new values have 5 elements"
What am I missing here? Is there any other way to collapse rows based on a key in dask?
The task at hand is to perform the following transformation on a dask dataframe:
Input csv file :
key,c1,c2,c3......,cn
1,car,phone,cat,.....,kite
2,abc,def,hij,.......,pot
1,yes,no,is,.........,hello
2,hello,yes,no,......,help
Output csv file:
key,c1,c2,c3,.......,cn
1,caryes,phoneno,catis,.....,kitehello
2,abchello,defyes,hijno,....,pothelp

In this case meta= corresponds to the output of df.groupby(...).apply(f) and not just to the output of f. Perhaps these differ in some subtle way?
I would address this by first not providing meta= at all. Dask.dataframe will give you a warning asking you to be explicit but things should hopefully progress anyway if it is able to determine the right dtypes and columns by running some sample data through your function.
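For example, a minimal sketch of that suggestion with toy data shaped like the question's CSV (the column names and npartitions here are illustrative, not from the original):

import pandas as pd
import dask.dataframe as dd

# toy data: a key column plus a couple of string columns
pdf = pd.DataFrame({'key': [1, 2, 1, 2],
                    'c1': ['car', 'abc', 'yes', 'hello'],
                    'c2': ['phone', 'def', 'no', 'yes']})
df = dd.from_pandas(pdf, npartitions=2)

def f(x):
    # collapse the rows of one key by concatenating the strings in each column
    return x.groupby('key').agg(''.join).reset_index()

# no meta= here: dask infers the columns and dtypes from a sample (and warns about it)
result = df.groupby('key').apply(f).reset_index(drop=True).compute()
print(result)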

Related

How to fill missing values in a string column using SimpleImputer()?

I have a dataset that I read into a pandas dataframe.
Most of its columns are string columns.
Column structure of my dataframe:
['id', 'currently working', column3, column4, ....]
The column with missing data is 'currently working'. It contains only two values, YES and NO, and there are null values as well.
I have applied SimpleImputer() before, on an integer column containing salaries, where I used the mean strategy to preprocess the dataset and replace nulls, like below:
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
But in my current scenario the column is of string type, to which I certainly can't apply any numeric methods.
Could anyone let me know how I can preprocess the existing data and replace nulls in a string column of a pandas dataframe?
What preprocessing method should I follow when working on string columns?
You can use the most_frequent strategy: SimpleImputer will replace missing values with the most frequent value in the column. It may also be useful to pass add_indicator=True. In this case, the output of the imputer’s transform will stack an additional column with the value from the MissingIndicator, so your model will have a clue that the value was missing before.
Code example:
import numpy as np
from sklearn.impute import SimpleImputer

imp = SimpleImputer(missing_values=np.nan,
                    strategy='most_frequent',
                    add_indicator=True)
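A small usage sketch on a frame shaped like the question's column, reusing the imp defined above (the toy data and the 'was_missing' column name are illustrative, not from the original):

import pandas as pd

# toy frame: YES/NO values with some nulls, like the question's column
data = pd.DataFrame({'id': [1, 2, 3, 4],
                     'currently working': ['YES', np.nan, 'NO', 'YES']})

# fit_transform returns two columns here: the imputed values and the
# missing-value indicator added by add_indicator=True
out = imp.fit_transform(data[['currently working']])
data['currently working'] = out[:, 0]
data['was_missing'] = out[:, 1]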

Python vectorizing a dataframe lookup table

I have two dataframes. One is a lookup table consisting of key/value pairs. The other is my main dataframe. The main dataframe has many more records than the lookup table. I need to construct a 'key' from existing columns in my main dataframe and then look up a value matching that key in my lookup table. Here they are:
from datetime import datetime
import pandas as pd

lk = pd.DataFrame({'key': ['key10', 'key9'], 'value': [100, 90]})
lk.set_index('key', inplace=True)

date_today = datetime.now()
df = pd.DataFrame({'date1': [date_today, date_today, date_today],
                   'year': [1999, 2001, 2003],
                   'month': [10, 9, 10],
                   'code': [10, 4, 5],
                   'date2': [None, date_today, None],
                   'keyed_value': [0, 0, 0]})
This is how I get a value:
df['constructed'] = "key" + df['month'].astype('str')

def getKeyValue(lk, k):
    return lk.loc[k, 'value']

print(getKeyValue(lk, df['constructed']))
Here are my issues:
1) I don't want to use iteration or apply methods. My actual data is over 2 million rows and 200 columns. It was really slow (over 2 minutes) with apply, so I opted for an inner join, hence the need to create a new 'constructed' column. After the join I drop the 'constructed' column. The join helped by bringing execution down to 48 seconds, but there has to be a faster way (I am hoping).
2) How do I vectorize this? I don't know how to even approach it. Is it even possible? I tried this but just got an error:
df['keyed_values'] = getKeyValue(lk, df['constructed'])
Any help or pointers would be much appreciated.
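For what it's worth, a minimal sketch of one possible vectorized variant, reusing the lk and df defined above (a sketch only, not a benchmarked answer): build the key once and map it against the lookup Series, so no per-row apply or join is needed.

# vectorized lookup: construct the key column once, then map it
# against the 'value' Series of the keyed lookup table
constructed = 'key' + df['month'].astype('str')
df['keyed_value'] = constructed.map(lk['value'])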

disable errors while reading csv file

Does dask dataframe pass the error_bad_lines parameter through to pandas read_csv?
In other words, the following does not seem to work, because I get an error when I try to run a groupby query.
df = dd.read_csv('s3://todel162xx/some.csv', error_bad_lines=False, storage_options={'anon': False})
There are only 1 or 2 lines in the csv file that may have different datatypes.
Yes, dask.dataframe.read_csv passes the error_bad_lines keyword argument through to pandas.read_csv.
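So the call from the question should work as written; for reference, a sketch of the same call (same path and options as above):

import dask.dataframe as dd

# keyword arguments that dd.read_csv does not consume itself (such as
# error_bad_lines) are forwarded to pandas.read_csv for each partition
df = dd.read_csv('s3://todel162xx/some.csv',
                 error_bad_lines=False,
                 storage_options={'anon': False})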

Loading Cassandra Data into Dask Dataframe

I am trying to load data from a Cassandra database into a Dask dataframe. I have tried the following query with no success:
query="""SELECT * FROM document_table"""
df = man.session.execute(query)
df = dd.DataFrame(list(df))
TypeError Traceback (most recent call last)
<ipython-input-135-021507f6f2ab> in <module>()
----> 1 a = dd.DataFrame(list(df))
TypeError: __init__() missing 3 required positional arguments: 'name', 'meta', and 'divisions'
Does anybody know an easy way to load data directly from Cassandra into Dask? It takes too much memory to load into pandas first.
Some problems with your code:
The line df = man.session.execute(query) presumably loads the whole data set into memory. Dask is not invoked here; it plays no part in this. (Someone with knowledge of the Cassandra driver can confirm this.)
list(df) produces a list of the column names of a dataframe and drops all the data.
dd.DataFrame, if you read the docs, is not constructed like this.
What you probably want to do is a) make a function that returns one partition of the data, b) delay this function and call it with the various values of the partitions, and c) use dd.from_delayed to make the dask dataframe. E.g., assuming the table has a field partfield which handily has possible values 1..6 and a similar number of rows for each partition:
import dask
import dask.dataframe as dd
import pandas as pd

@dask.delayed
def part(x):
    session = ...  # construct a Cassandra session here (elided in the original)
    q = "SELECT * FROM document_table WHERE partfield={}".format(x)
    rows = session.execute(q)
    return pd.DataFrame(list(rows))  # each delayed piece is a plain pandas DataFrame

parts = [part(x) for x in range(1, 7)]
df = dd.from_delayed(parts)

dask dataframe to_parquet throws error

I am trying to save a dask dataframe to parquet on HDFS. However it fails with the error: Exception: TypeError('expected list of bytes',)
I am also providing the object_encoding argument as {"anomaly": "json", "sensor_name": "json"}.
Here are the columns in the dataframe: Index(['original_value', 'anomaly', 'anomaly_bin', 'sensor_name'], dtype='object')
Columns sensor_name and anomaly are strings; the other columns are floats.
e.g. a row looks like: [18.0 'N' 0.0 'settemp']
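Presumably the failing write looks roughly like this (a sketch only: the output path is a placeholder, df is the dataframe described above, and the fastparquet engine is assumed because object_encoding is a fastparquet option):

# sketch of the described call; the path below is a placeholder, not the real one
df.to_parquet('hdfs://.../output.parquet',
              engine='fastparquet',
              object_encoding={'anomaly': 'json', 'sensor_name': 'json'})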
I also tried to save it as CSV in HDFS, but the API failed with the error: Exception: ValueError('url type not understood:
The path to the CSV was of the form: hdfs://ip:port/some path
It would be great if someone could guide me in the right direction.
