python pandas read_csv() from google spreadsheet url

I want to load the data from this link (a Google Spreadsheet) into my Jupyter notebook using Python.
I tried different methods and pandas.read_csv() seems to be the easiest, but I cannot load the data in a proper format. Here is the code that I am using:
import pandas as pd
url = 'https://docs.google.com/spreadsheets/d/1itaohdPiAeniCXNlntNztZ_oRvjh0HsGuJXUJWET008/edit#gid=0'
df = pd.read_csv(url, error_bad_lines=False)
df
The output does not look like the spreadsheet at all.
Probably it is because of the data type, which I don't know how to fix. I have tried different approaches from other posts, but they didn't help. Here is one of them:
import pandas as pd
import requests
import io
url = requests.get('https://docs.google.com/spreadsheets/d/1itaohdPiAeniCXNlntNztZ_oRvjh0HsGuJXUJWET008/edit#gid=0').text
buffer = io.StringIO(url)
columns = ['ID', 'age', 'sex', 'city', 'province', 'country',
           'wuhan(0)_not_wuhan(1)', 'latitude', 'longitude',
           'geo_resolution', 'date_onset_symptoms', 'date_admission_hospital',
           'date_confirmation', 'symptoms', 'lives_in_Wuhan', 'travel_history_dates',
           'travel_history_location', 'reported_market_exposure', 'additional_information']
df = pd.read_csv(filepath_or_buffer=buffer, header=1, usecols=columns)
df
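One fix that usually works (not part of the original post): request the sheet's CSV export endpoint instead of the interactive edit URL, since the edit URL returns the HTML of the web editor rather than CSV. A minimal sketch, assuming the spreadsheet is shared so that anyone with the link can view it:
import pandas as pd

sheet_id = '1itaohdPiAeniCXNlntNztZ_oRvjh0HsGuJXUJWET008'
gid = 0
# format=csv tells Google Sheets to serve the given tab as plain CSV
export_url = f'https://docs.google.com/spreadsheets/d/{sheet_id}/export?format=csv&gid={gid}'
df = pd.read_csv(export_url)
df.head()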

Related

Python Dask Apply Function and Store Result in Same Column

Hello, I am a bit new to Dask and I am trying to do the following:
I have a CSV file that I am reading in, and everything works fine.
import pandas
import os
import json
import math
import numpy as np
import dask
from dask.distributed import Client
import dask.dataframe as df
import dask.multiprocessing
client = Client(n_workers=3, threads_per_worker=4, processes=False, memory_limit='2GB')
df = df.read_csv("netflix_titles.csv")
Now I have this function:
def toupper(x):
    return x.upper()
I would like to apply this to a column, but the issue is that I want to save the result in the same column, and it seems I cannot do that.
The following line works:
df["title"].map(toupper).compute()
But when I try to assign the result back to the column, it fails:
df["title"] = df["title"].map(toupper).compute()
ValueError: Not all divisions are known, can't align partitions. Please use set_index to set the index.
Maybe try this after read_csv.
df.title = df.title.map(toupper)
df.to_csv("netflix_titles.csv", index=False, single_file=True)
to_csv has an optional argument compute=True by default, so you don't need to call compute() explicitly.
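As an aside (not part of the original answer), dask series also expose the pandas string accessor, so the custom mapper is not strictly needed; a minimal sketch under that assumption:
df["title"] = df["title"].str.upper()  # stays lazy until to_csv (or compute) runs
df.to_csv("netflix_titles.csv", index=False, single_file=True)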

dask.ml.xgboost raises UnboundLocalError: local variable 'result' referenced before assignment

I am using dask_xgboost and I don't understand the error stated in the subject. I have successfully trained a model and saved it with joblib.dump.
Later on, during the prediction step I use it like this:
import dask
import dask.dataframe as dd
import dask.distributed as ddst
from dask_jobqueue import PBSCluster
from dask.distributed import Client
import dask_xgboost as dxgb
import geopandas as gp
from sklearn.externals import joblib
from typing import List
def predict(zs_files: List[str], model_name: str, client) -> None:
    delayed_dfs = [dask.delayed(gp.read_file)(zsf) for zsf in zs_files]
    model = joblib.load(model_name)
    delayed_predictions = [
        dxgb.predict(client, model, df).to_parquet(f"{fn}_predicted.parquet")
        for df, fn in zip(delayed_dfs, zs_files)
    ]
    delayed_predictions.compute()
I read a set of GeoJSON files with geopandas and then just feed the model with them. I am using a client on a PBS cluster.
Any help would be appreciated.
Thanks.
I found the issue. I was missing a from_delayed call to transform the geopandas dataframe into a dask one:
dxgb.predict(client, model, dd.from_delayed(df))
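In isolation (a minimal sketch using the question's own names, not verbatim from the answer), the wrapping looks like this for a single file:
delayed_df = dask.delayed(gp.read_file)(zs_files[0])   # one delayed geopandas GeoDataFrame
ddf = dd.from_delayed(delayed_df)                       # wrap it as a dask dataframe
predictions = dxgb.predict(client, model, ddf)          # dxgb.predict now receives a dask collection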

convert dask.bag of dictionaries to dask.dataframe using dask.delayed and pandas.DataFrame

I am struggling to convert a dask.bag of dictionaries, via dask.delayed pandas.DataFrames, into a final dask.dataframe.
I have one function (make_dict) that reads files into a rather complex nested dictionary structure and another function (make_df) that turns these dictionaries into a pandas.DataFrame (the resulting dataframe is around 100 MB per file). I would like to append all dataframes into a single dask.dataframe for further analysis.
Up to now I have been using dask.delayed objects to load, convert and append all data, which works fine (see the example below). However, for future work I would like to store the loaded dictionaries in a dask.bag using dask.persist().
I managed to load the data into a dask.bag, resulting in a list of dicts or a list of pandas.DataFrames that I can use locally after calling compute(). When I tried turning the dask.bag into a dask.dataframe using to_delayed(), however, I got stuck with an error (see below).
It feels like I am missing something rather simple here, or maybe my approach to dask.bag is wrong?
The example below shows my approach using simplified functions and throws the same error. Any advice on how to tackle this is appreciated.
import numpy as np
import pandas as pd
import dask
import dask.dataframe
import dask.bag
print(dask.__version__) # 1.1.4
print(pd.__version__) # 0.24.2
def make_dict(n=1):
    return {"name": "dictionary", "data": {'A': np.arange(n), 'B': np.arange(n)}}
def make_df(d):
    return pd.DataFrame(d['data'])
k = [1,2,3]
# using dask.delayed
dfs = []
for n in k:
    delayed_1 = dask.delayed(make_dict)(n)
    delayed_2 = dask.delayed(make_df)(delayed_1)
    dfs.append(delayed_2)
ddf1 = dask.dataframe.from_delayed(dfs).compute() # this works as expected
# using dask.bag and turning bag of dicts into bag of DataFrames
b1 = dask.bag.from_sequence(k).map(make_dict)
b2 = b1.map(make_df)
df = pd.DataFrame().append(b2.compute()) # <- I would like to do this using delayed dask.DataFrames like above
ddf2 = dask.dataframe.from_delayed(b2.to_delayed()).compute() # <- this fails
# error:
# ValueError: Expected iterable of tuples of (name, dtype), got [ A B
# 0 0 0]
what I ultimately would like to do using the distributed scheduler:
b = dask.bag.from_sequence(k).map(make_dict)
b = b.persist()
ddf = dask.dataframe.from_delayed(b.map(make_df).to_delayed())
In the bag case the delayed objects point to lists of elements, so you have a list of lists of pandas dataframes, which is not quite what you want (see the sketch after these recommendations). Two recommendations:
Just stick with dask.delayed. It seems to work well for you.
Use the Bag.to_dataframe method, which expects a bag of dicts and does the dataframe conversion itself.
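If you do want to go through the bag of DataFrames anyway, one hypothetical workaround (not given in the answer) is to concatenate each partition's list of DataFrames before handing it to from_delayed:
from dask import delayed
# b2.to_delayed() yields one delayed object per bag partition, each pointing to a
# Python list of pandas DataFrames, so concatenate within each partition first.
parts = [delayed(pd.concat)(part) for part in b2.to_delayed()]
ddf2 = dask.dataframe.from_delayed(parts)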

xarray merge slow seems not working to combine different lat lon netcdf

I have the code below, which uses xarray to merge a bunch of netcdfs with different variables and different lat/lon grids. It runs forever and never ends, but there are no error messages.
import xarray as xr
import glob
import numpy as np
datasets_WFHS = (xr.open_mfdataset(fname)
                 for fname in glob.glob(r'Z:\travelers\shp\test\*WFHS.nc'))
datasets_WFHS = xr.align(*datasets_WFHS, join='outer')
dataset_WFHS = np.maximum.reduce([d['AREASCORE'] for d in datasets_WFHS])
datasets_AF = (xr.open_mfdataset(fname)
               for fname in glob.glob(r'Z:\travelers\shp\test\*AF.nc'))
datasets_AF = xr.align(*datasets_AF, join='outer')
dataset_AF = np.maximum.reduce([d['AREAFUEL'] for d in datasets_AF])
datasets = xr.merge([dataset_WFHS, dataset_AF], join='outer')
datasets.to_netcdf(r"Z:\travelers\shp\test\results.nc")

Dask dataframe get second highest value and column name

This code gives me the highest value and column name.
import numpy as np
import pandas as pd
import dask.dataframe as dd
cols=[0,1,2,3,4]
df = pd.DataFrame(np.random.randn(1000, len(cols)), columns=cols)
ddf = dd.from_pandas(df, npartitions=4)
ddf['max_col'] = ddf[cols].idxmax(axis=1)
ddf['max_val'] = ddf[cols].max(axis=1)
I want to get the second highest as well. Something like:
ddf['max2_col'] = ddf[cols].idxmax2(axis=1)
ddf['max2_val'] = ddf[cols].max2(axis=1)
Are there functions like idxmax2 or max2? Or any other optimized way for doing this?
You should normally try to figure out how to do what you want with pandas first; if you cannot, posting that question instead, with the pandas tag, will get you a faster answer.
The following appears to work for pandas, although it may not be elegant:
import numpy as np
import pandas as pd
import dask.dataframe as dd
cols=[0,1,2,3,4]
df = pd.DataFrame(np.random.randn(1000, len(cols)), columns=cols)
def make_cols(df):
    df['max2_col'] = df[cols].values.argsort(axis=1)[:, -2]
    df2 = df[cols].values.copy()
    df2.sort(axis=1)
    df['max2_val'] = df2[:, -2]
    return df
so to apply it to the dask variant, you can do
ddf = dd.from_pandas(df, npartitions=4)
ddf = ddf.map_partitions(make_cols)
ddf.head()
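For completeness, a hedged pandas-only variant (not from the answer) that also maps the second-largest position back to the actual column label, in case the labels are not simply 0 to 4:
vals = df[cols].values
order = np.argsort(vals, axis=1)                # column positions, sorted ascending by value
df['max2_col'] = np.array(cols)[order[:, -2]]   # label of the second-largest column per row
df['max2_val'] = np.take_along_axis(vals, order[:, [-2]], axis=1).ravel()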
