disable errors while reading csv file - dask

Does dask dataframe pass the error bad lines parameter to pandas DataFrame class?
In other words, this does not seem to work because I get an error when I try to run groupby query.
df = dd.read_csv('s3://todel162xx/some.csv' , error_bad_lines=False, storage_options = {'anon':False})
There are only 1 or 2 lines in the csv file that may have different datatypes.

Yes, dask.dataframe.read_csv passes through the error_bad_lines keyword argument

Related

How to add missing values to a string column using SimpleImputer()?

I am having a dataset which I read into a pandas dataframe.
Most of them are string columns.
Column structure of my dataframe:
['id', 'currently working', column3, column4, ....]
The column that has missing data is 'currently working'. The column contains only two values -> YES, NO and there are null values as well.
I applied the SimpleImputer() in one of my previous learning and that is on an integer column which contain salaries, where I give strategy as mean to preprocess the dataset and replace nulls like below.
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
But in my current scenario, the column is of String type which I certainly can't apply any numeric function methods.
Could anyone let me know how can I preprocess the existing data and replace nulls in a string column of a pandas dataframe ?
What is the preprocessing method that should I follow when working on String columns ?
You can use most_frequent strategy. SimpleImputer will replace missing using the most frequent value. It may also be useful to use add_indicatorbool=True. In this case, the output of the imputer’s transform will stack an additional column with the value from the MissingIndicator. So, your model will have a clue that the value was missing before.
Code example:
from sklearn.impute import SimpleImputer
imp = SimpleImputer(missing_values=np.nan,
strategy='most_frequent',
add_indicator=True)

How to upload Polygons from GeoPandas to Snowflake?

I have a geometry column of a geodataframe populated with polygons and I need to upload these to Snowflake.
I have been exporting the geometry column of the geodataframe to file and have tried both CSV and GeoJSON formats, but so far I either always get an error the staging table always winds up empty.
Here's my code:
design_gdf['geometry'].to_csv('polygons.csv', index=False, header=False, sep='|', compression=None)
import sqlalchemy
from sqlalchemy import create_engine
from snowflake.sqlalchemy import URL
engine = create_engine(
URL(<Snowflake Credentials Here>)
)
with engine.connect() as con:
con.execute("PUT file://<path to polygons.csv> #~ AUTO_COMPRESS=FALSE")
Then on Snowflake I run
create or replace table DB.SCHEMA.DESIGN_POLYGONS_STAGING (geometry GEOGRAPHY);
copy into DB.SCHEMA."DESIGN_POLYGONS_STAGING"
from #~/polygons.csv
FILE_FORMAT = (TYPE = CSV FIELD_DELIMITER = '|' SKIP_HEADER = 1 compression = None encoding = 'iso-8859-1');
Generates the following error:
"Number of columns in file (6) does not match that of the corresponding table (1), use file format option error_on_column_count_mismatch=false to ignore this error File '#~/polygons.csv.gz', line 3, character 1 Row 1 starts at line 2, column "DESIGN_POLYGONS_STAGING"[6] If you would like to continue loading when an error is encountered, use other values such as 'SKIP_FILE' or 'CONTINUE' for the ON_ERROR option. For more information on loading options, please run 'info loading_data' in a SQL client."
Can anyone identify what I'm doing wrong?
Inspired by #Simeon_Pilgrim's comment I went back to Snowflake's documentation. There I found an example of converting a string literal to a GEOGRAPHY.
https://docs.snowflake.com/en/sql-reference/functions/to_geography.html#examples
select to_geography('POINT(-122.35 37.55)');
My polygons looked like strings describing Polygons more than actual GEOGRAPHYs so I decided I needed to be treating them as strings and then calling TO_GEOGRAPHY() on them.
I quickly discovered that they needed to be explicitly enclosed in single quotes and copied into a VARCHAR column in the staging table. This was accomplished by modifying the CSV export code:
import csv
design_gdf['geometry'].to_csv(<path to polygons.csv>,
index=False, header=False, sep='|', compression=None, quoting=csv.QUOTE_ALL, quotechar="'")
The staging table now looks like:
create or replace table DB.SCHEMA."DESIGN_POLYGONS_STAGING" (geometry VARCHAR);
I ran into further problems copying into the staging table related to the presence of a polygons.csv.gz file I must have uploaded in a previous experiment. I deleted this file using:
remove #~/polygons.csv.gz
Finally, converting the staging table to GEOGRAPHY
create or replace table DB.SCHEMA."DESIGN_GEOGRAPHY_STAGING" (geometry GEOGRAPHY);
insert into DB.SCHEMA."DESIGN_GEOGRAPHY"
select to_geography(geometry)
from DB.SCHEMA."DESIGN_POLYGONS_STAGING"
and I wound up with a DESIGN_GEOGRAPHY table with a single column of GEOGRAPHYs in it. Success!!!

Loading Cassandra Data into Dask Dataframe

I am trying to load data from a cassandra database into a Dask dataframe. I have tried querying the following with no success:
query="""SELECT * FROM document_table"""
df = man.session.execute(query)
df = dd.DataFrame(list(df))
TypeError Traceback (most recent call last)
<ipython-input-135-021507f6f2ab> in <module>()
----> 1 a = dd.DataFrame(list(df))
TypeError: __init__() missing 3 required positional arguments: 'name', 'meta', and 'divisions'
Does anybody know an easy way to load data directly from Cassandra into Dask? It is too much memory too load into pandas first.
Some problems with your code:
the line df = presumably loads the whole data-set into memory. Dask is not invoked here, it plays no part in this. Someone with knowledge of the Cassandra driver can confirm this.
list(df) produces a list of the column names of a dataframe and drops all the data
dd.DataFrame, if you read the docs is not constructed like this.
What you probably want to do is a) make a function that returns one partition of the data, b) delay this function and call with the various values of the partitions c) use dd.from_delayed to make the dask dataframe. E.g., assuming the table has a field partfield which handily has possible values 1..6 and similar number of rows for each partition:
#dask.delayed
def part(x):
session = # construct Cassandra session
q = "SELECT * FROM document_table WHERE partfield={}".format(x)
df = man.session.execute(query)
return dd.DataFrame(list(df))
parts = [part(x) for x in range(1, 7)]
df = dd.from_delayed(parts)

Dask groupby and apply : Value error Expected axis has 6 elements, new values have 5 elements

I am trying collapse rows of a dataframe based on a key. My file is big and pandas throws a memory error. I am currently trying to use dask. I am attaching the snippet of the code here.
def f(x):
p = x.groupby(id).agg(''.join).reset_index()
return p
metadf = pd.DataFrame(columns=['c1','p1','pd1','d1'])
df = df.groupby(idname).apply(f, meta=metadf).reset_index().compute()
p has the same structure as metadf. The shape of both the dataframes are same.
When I execute this, I get the following error:
"ValueError: Length mismatch: Expected axis has 6 elements, new values have 5 elements"
What am I missing here? Is there any other way to collapse rows based on a key in dask?
The task in hand, to do the following sample in a dask dataframe
Input csv file :
key,c1,c2,c3......,cn
1,car,phone,cat,.....,kite
2,abc,def,hij,.......,pot
1,yes,no,is,.........,hello
2,hello,yes,no,......,help
Output csv file:
key,c1,c2,c3,.......,cn
1,caryes,phoneno,catis,.....,kitehello
2,abchello,defyes,hijno,....,pothelp
In this case meta= corresponds to the output of df.groupby(...).apply(f) and not just to the output of f. Perhaps these differ in some subtle way?
I would address this by first not providing meta= at all. Dask.dataframe will give you a warning asking you to be explicit but things should hopefully progress anyway if it is able to determine the right dtypes and columns by running some sample data through your function.

dask dataframe to_parquet throws error

I am trying to save task dataframe to parquet on HDFS. However it fails with error :Exception: TypeError('expected list of bytes',)
I am also providing object_encoding argument as {"anomaly":"json","sensor_name":"json"}.
Here is the columns in dataframe: Index(['original_value', 'anomaly', 'anomaly_bin', 'sensor_name'], dtype='object')
Columns sensor_name and anomaly are string. Other columns are float.
eg: [18.0 'N' 0.0 'settemp']
I also tried to save it as CSV in HDFS but the api failed with error: Exception: ValueError('url type not understood:
Path to CSV as: hdfs://ip:port/some path
It will be great if some one can guide me in right direction.

Resources