Converting Avro to dask dataframe - dask

I am pretty new to Dask and have most of my files in Avro (migrated from PySpark). I tried using intake_avro (the sequence source); it reads fine, and a take(1) shows the right data.
I have quite a number of columns containing None, and when I convert to a dataframe it always gives me
ValueError: Cannot convert non-finite values (NA or inf) to integer
I am not sure whether I am doing this right.
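For context, a rough sketch of this kind of pipeline and one way around the integer error (not the poster's exact code: dask.bag.read_avro, the path, and the column name below are assumptions; pandas' nullable Int64 dtype keeps missing values where a plain int cast fails):
import dask.bag as db

# Read the Avro files into a bag (requires fastavro), then convert to a dataframe
bag = db.read_avro('data/*.avro')   # hypothetical path
df = bag.to_dataframe()

# Casting a column that contains None with .astype(int) raises
# "ValueError: Cannot convert non-finite values (NA or inf) to integer";
# the nullable Int64 extension dtype keeps the missing values instead.
df['some_int_col'] = df['some_int_col'].astype('Int64')   # hypothetical column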


Problems plotting time-series interactively with Altair

Description of the problem
My goal is quite basic: to plot time series in an interactive plot. After some research I decided to give Altair a try.
There are already QGIS plugins for time-series visualisation, but as far as I'm aware, none for plotting time series at the vector level, by interactively clicking on a map and selecting a polygon. So that's why I decided to go for a self-made solution using Altair, maybe combining it with Folium to add functionalities later on.
I'm totally new to the Altair library (as well as Vega and Vega-Lite), and quite new to data science and data visualisation as well... so apologies in advance for my ignorance!
There are already well-explained tutorials on how to plot time series with Altair (for example here, or on the official website). However, my study case has some particularities that, as far as I've seen, have not all been addressed together.
The data is produced using the Python API for Google Earth Engine and preprocessed with Python and the pandas/geopandas libraries:
In Google Earth Engine, a vegetation index (NDVI in the current case) is computed at pixel-level for a certain region of interest (ROI). Then the function image.reduceRegions() is mapped across the ImageCollection to compute the mean of the ndvi in every polygon of a FeatureCollection element, which represent agricultural parcels. The resulting vector file is exported.
Under a JupyterLab environment, the data is loaded into a geopandas GeoDataFrame object and preprocessed, transposing the DataFrame and creating a datetime column, among other steps, in order to have the data well-shaped for time-series representation with Altair.
Data overview after preprocessing:
My "final" goal would be to show, in the same graphic, an interactive line plot with a set of lines representing each one an agricultural parcel, with parcels categorized by crop types in different colours, e.g. corn in green, wheat in yellow, peer trees in brown... (the information containing the crop type of each parcel can be added to the DataFrame making a join with another DataFrame).
I am thinking of something looking more or less like the following example, with the legend's years replaced by parcels coloured by crop type:
But so far I haven't managed to make my data look this way... at all.
As you can see there are many nulls in the data (this is due to the application of a cloud masking function and to the fact that there are several Sentinel-2 orbits intersecting the ROI). I would like to just keep the non-null values for each column/parcel, but I don't know if this data configuration can pose problems (any advice on that?).
So far I got:
The generation of the preceding graphic, for a single parcel, already takes around 23 seconds, which is something that maybe should/could be improved (how?).
And more importantly, the expected line representing the item/polygon/parcel's values (NDVI) is not even shown in the plot (note that I chose a parcel containing rather few non-null values).
For sure I am doing many things wrong. It would be great to get some advice to solve (some of) them.
Sample of the data and code to reproduce the issue
Here's a text sample of the data in JSON format, and the code used to reproduce the issue is the following:
import pandas as pd
import geopandas as gpd
import altair as alt

df = pd.read_json(r"path\to\json\file.json")
df['date'] = pd.to_datetime(df['date'])
print(df.dtypes)
df
Output:
lines = alt.Chart(df).mark_line().encode(
    x='date:O',
    y='17811:Q',
    color=alt.Color(
        '17811:Q', scale=alt.Scale(scheme='redyellowgreen', domain=(-1, 1)))
)
lines.properties(width=700, height=600).interactive()
Output:
Thanks in advance for your help!
If I understand correctly, it is mostly the format of your dataframe that needs to be changed from wide to long, which you can do either via .melt in pandas or .transform_fold in Altair. With melt, the default names for the melted columns are 'variable' (the previous column names) and 'value' (the value from each column):
alt.Chart(df.melt(id_vars='date'), width=500).mark_line().encode(
    x='date:T',
    y='value',
    color=alt.Color('variable')
)
The gaps come from the NaNs; if you want Altair to interpolate missing values, you can drop the NaNs:
alt.Chart(df.melt(id_vars='date').dropna(), width=500).mark_line().encode(
    x='date:T',
    y='value',
    color=alt.Color('variable')
)
If you want to do it all in Altair, the following is equivalent to the last pandas example above (the transform uses 'key' instead of 'variable' as the name for the former columns). I also use an ordinal instead of a nominal type for the color encoding to show how to make the colors more similar to your example:
alt.Chart(df, width=500).mark_line().encode(
    x='date:T',
    y='value:Q',
    color=alt.Color('key:O')
).transform_fold(
    df.drop(columns='date').columns.tolist()
).transform_filter(
    'isValid(datum.value)'
)

Read HDF5 dataset of multiple data types

I have an HDF5 dataset which contains different data types (int and float).
When I read it into a NumPy array, it is detected as an array of type np.void.
import numpy as np
import h5py
f = h5py.File('Sample.h5', 'r')
array = np.array(f['/Group1/Dataset'])
print(array.dtype)
(Image of the dtype output from print(array.dtype).)
How can I read this dataset into arrays with each column having the same data type as the input? Thanks in advance for the reply.
Here are 2 simple examples showing both ways to slice a subset of the dataset using the HDF5 Field/Column names.
The first method extracts a subset of the data to a record array by slicing when accessing the dataset.
The second method follows your current method. It extracts the entire dataset to a record array, then slices a new view to access a subset of the data.
Print statements are used liberally so you can see what's going on.
Method 1
real_array = np.array(f['/Group1/Dataset'][:,'XR','YR','ZR'])
print(real_array.dtype)
print(real_array.shape)
Method 2
cmplx_array = np.array(f['/Group1/Dataset'])
print(cmplx_array.dtype)
print(cmplx_array.shape)
disp_real = cmplx_array[['XR','YR','ZR']]
print(disp_real.dtype)
print(disp_real.shape)
Review this SO topic for additional insights into copying values from a recarray to an ndarray, and back.
copy-numpy-recarray-to-ndarray
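As a small illustration of that copy (a sketch assuming NumPy >= 1.16 for structured_to_unstructured, and reusing the XR/YR/ZR field names from Method 1):
import numpy as np
from numpy.lib import recfunctions as rfn

# A tiny structured array standing in for the record array read from the HDF5 dataset
cmplx_array = np.zeros(5, dtype=[('XR', 'f8'), ('YR', 'f8'), ('ZR', 'f8'), ('ID', 'i4')])

# Copy the selected float fields into an ordinary (5, 3) float64 ndarray
real_2d = rfn.structured_to_unstructured(cmplx_array[['XR', 'YR', 'ZR']])
print(real_2d.dtype, real_2d.shape)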

Performance and data manipulation on Dask

I have imported a parquet file of approx. 800 MB with ~50 million rows into a Dask dataframe.
There are 5 columns: DATE, TICKER, COUNTRY, RETURN, GICS
Questions:
Can I specify data types in read_parquet, or do I have to do it with astype?
Can I parse dates within read_parquet?
I simply tried to do the following:
import dask.dataframe as dd

df = dd.read_parquet(r'.\abc.gzip')
df['INDUSTRY'] = df.GICS.str[0:4]
n = df.INDUSTRY.unique().compute()
and it takes forever to return. Am I doing anything wrong here? The number of partitions is automatically set to 1.
I am also trying to do something like df[df.INDUSTRY == '4010'].compute(), and it likewise takes forever to return, or crashes.
To answer your questions:
A parquet file has types stored, as noted in the Apache docs here, so you won't be able to change the data types as you read the file in; you'll have to use astype.
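For example, a quick sketch of casting after the read (category and float32 are just illustrative target dtypes for the columns you listed):
import dask.dataframe as dd

df = dd.read_parquet(r'.\abc.gzip')

# astype on a Dask dataframe is lazy and is applied per partition at compute time
df = df.astype({'TICKER': 'category', 'RETURN': 'float32'})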
You can't convert a string to a date within the read, but if you use the map_partitions function, documented here, you can convert the column to a date, as in this example:
import dask.dataframe as dd
import pandas as pd

df = dd.read_parquet(your_file)
meta = ('date', 'datetime64[ns]')

# you can add your own date format, or just let pandas guess
to_date_time = lambda x: pd.to_datetime(x, format='%Y-%m-%d')
df['date_clean'] = df.date.map_partitions(to_date_time, meta=meta)
The map_partitions function will convert the dates on each chunk of the parquet when the file is computed, making it functionally the same as converting the date when the file is read in.
Here I think you would again benefit from using the map_partitions function, so you might try something like this:
import dask.dataframe as dd

df = dd.read_parquet(r'.\abc.gzip')
df['INDUSTRY'] = df.GICS.map_partitions(lambda x: x.str[0:4], meta=('INDUSTRY', 'str'))
df[df.INDUSTRY == '4010']
Note that if you run compute, the result is converted to a pandas object. If the file is too large, Dask won't be able to compute it, and nothing will be returned. Without seeing the data it's hard to say more, but do check out these tools to profile your computations and see whether you are leveraging all your CPUs.
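For instance, with the single-machine scheduler the built-in diagnostics can show how the computation behaves (a sketch; visualize() needs bokeh installed, and df is the dataframe from the example above):
from dask.diagnostics import ProgressBar, Profiler, ResourceProfiler, visualize

# df is the Dask dataframe built in the example above
with ProgressBar(), Profiler() as prof, ResourceProfiler() as rprof:
    industries = df.INDUSTRY.unique().compute()

# Renders an interactive bokeh plot of task execution and CPU/memory usage
visualize([prof, rprof])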

GLM Poisson thinks I have negative values in my dataset, throws error

I am trying to fit a Poisson GLM, and yet I continue to get this error:
Poisson1 <- glm(Number.Flowers ~ Site, data = Flowering2, family="poisson")
Error in eval(family$initialize) : negative values not allowed for the 'Poisson' family
My data is count data and so is all positive values and zeros. What could be going on?
Is it possible for my CSV file to contain hidden negative values?
It's possible your CSV might be flawed in some way. Try a different method of importing it into R (fread, read.table, etc). Check for NA or NaN issues. Compare the number of rows.

Value Error on Dask DataFrames

I am using Dask to read a CSV file. However, I couldn't apply or compute any operation on it because of this error:
Do you have any idea what this error is about and how to fix it?
When reading a CSV file in Dask, this error comes up when Dask does not recognize the correct dtype of the columns.
For example, we read a CSV file using Dask as follows:
import dask.dataframe as dd

df = dd.read_csv(r'\data\file.txt', sep='\t', header='infer')
This prompts the error mentioned above.
To solve this problem, as suggested by @mrocklin in this comment, https://github.com/dask/dask/issues/1166, we need to determine the dtypes of the columns. We can do this by reading the CSV file in pandas, identifying the data types, and passing them as an argument when reading the CSV with Dask:
import pandas as pd
import dask.dataframe as dd

df_pd = pd.read_csv(r'\data\file.txt', sep='\t', header='infer')
dt = df_pd.dtypes.to_dict()
df = dd.read_csv(r'\data\file.txt', sep='\t', header='infer', dtype=dt)
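A variation on the same idea, if the file is too big to read fully into pandas: infer the dtypes from a sample of rows only (the 10,000-row sample size is arbitrary, and the inferred dtypes may still be wrong for rows further down the file):
import pandas as pd
import dask.dataframe as dd

# Infer dtypes from the first rows only instead of loading the whole file
sample = pd.read_csv(r'\data\file.txt', sep='\t', header='infer', nrows=10000)
dt = sample.dtypes.to_dict()

df = dd.read_csv(r'\data\file.txt', sep='\t', header='infer', dtype=dt)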
