Find the daily and monthly mean from daily data - mean

I am very new to Python so please bear with me.
So far I have a loop that identifies my netcdf files within a date range.
I now need to calculate the daily averages and then the monthly averages for each month and add it to my dataframe so I can plot a time series.
Here's my code so far:
#!/usr/bin/python
# MODIS LONDON TIMESERIES
print ('Initiating AQUA MODIS Time Series')
import pandas as pd
import xarray as xr
from glob import glob
import netCDF4
import numpy as np
import matplotlib.pyplot as plt
import os
print ('All packages imported')
#list of files
days = pd.date_range(start='4/7/2002', end='31/12/2002')
#create dataframe
df = final_data = pd.DataFrame(index = days, columns = ['day', 'night'])
print ('Data frame created')
# for loop iterating through the date range in 'days' to build each file path from a format string
for day in days:
path = "%i/%02d/%02d/ESACCI-LST-L3C-LST-MODISA-LONDON_0.01deg_1DAILY_DAY-%i%02d%02d000000-fv2.00.nc" % (day.year, day.month, day.day, day.year, day.month, day.day)
print(path)

Welcome to SO! As suggested, please try to make a minimal reproducible example.
If you are able to create an Xarray dataset, here is how to take monthly averages:
import xarray as xr
# tutorial dataset with air temperature every 6 hours
ds = xr.tutorial.open_dataset('air_temperature')
# resample along the time dimension to monthly means
ds_monthly = ds.resample(time='1MS').mean()
resample() is used for upscaling and downscaling the temporal resolution. If you are familiar with Pandas, it effectively works the same way.
resample(time='1MS') means: group along the time dimension, with '1MS' as the frequency. '1MS' means sample by one month (the '1M' part) and have the new time vector begin at the start of the month (the 'S' part). This is very powerful; you can supply different frequencies, see the Pandas offset documentation.
.mean() takes the average of the data over our desired frequency. In this case, each month.
You could replace mean() with min(), max(), median(), std(), var(), sum(), and maybe a few others.
Xarray has wonderful documentation; the resample() doc is here.
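Applied to your files, a minimal sketch might look like the following (this assumes the daily files can be concatenated along a 'time' dimension and that the spatial dimensions are named 'lat' and 'lon'; adjust the names to whatever your files actually use):
import pandas as pd
import xarray as xr

days = pd.date_range(start='4/7/2002', end='31/12/2002')
# build the list of daily file paths with your pattern
paths = ["%i/%02d/%02d/ESACCI-LST-L3C-LST-MODISA-LONDON_0.01deg_1DAILY_DAY-%i%02d%02d000000-fv2.00.nc"
         % (d.year, d.month, d.day, d.year, d.month, d.day) for d in days]

# open everything as one dataset, concatenated along time
ds = xr.open_mfdataset(paths, combine='by_coords')

# daily spatial means, then monthly means of those
daily = ds.mean(dim=['lat', 'lon'])
monthly = daily.resample(time='1MS').mean()

# convert to pandas for plotting a time series
df_monthly = monthly.to_dataframe()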

Related

SVD on huge dataset with dask and xarray

I use xarray and dask to open multiple netcdf4 files that altogether are around 200 GB via
import xarray as xr
ds = xr.open_mfdataset('/path/files*.nc', parallel=True)
The dimensions of this dataset "ds" are (longitude, latitude, height, time).
The files are automatically concatenated along time, which is okay.
Now I would like to apply the "svd_compressed" function from the dask library.
I would like to reshape the longitude, latitude, and height dimension into one dimension, such that I have a 2-d matrix on which I can apply the svd.
I tried using the
dask.array.reshape
function, but I get "'Dataset' object has no attribute 'shape'".
I can convert the xarray dataset to an array and use stack, which makes it 2-d, but if I then use
Dataset.to_dask_dataframe
to convert my xarray to a dask dataframe, my memory runs out.
Does somebody have an idea how I can tackle this problem?
Should I chunk my data differently for the "to_dask_dataframe" function?
Or can I use somehow the "dask svd_compressed" function on the loaded netcdf4 dataset without a reshape?
Thanks for the help.
Edit:
Here is a code example that is not working. I have downloaded data from ERA5 (https://cds.climate.copernicus.eu/cdsapp#!/dataset/reanalysis-era5-pressure-levels?tab=overview), which I load from disk.
After that I take the temperature data and stack the longitude, latitude, and level values in one dimension to have a time-space 2d-array.
Then I would like to apply an SVD on the data.
from dask.distributed import Client, progress
import xarray as xr
import dask
import dask.array as da
client = Client(processes=False, threads_per_worker=4,
                n_workers=1, memory_limit='9GB')
ds = xr.open_mfdataset('/home/user/Arbeit/ERA5/Data/era_5_m*.nc', parallel=True)
ds = ds['t']
ds = ds.stack(z=("longitude", "latitude", "level"))
u, s, v = da.linalg.svd_compressed(ds, k=5, compute=True)
I get an error "dot only operates on DataArrays."
I assume it's because I need to convert it to a dask array, so I do
da = ds.to_dask_dataframe()
which gives me "'DataArray' object has no attribute 'to_dask_dataframe'".
So I try
ds = ds.to_dataset(name="temperature")
da = ds.to_dask_dataframe()
which results in "Unable to allocate 89.4 GiB for an array with shape".
I guess I need to rechunk it?
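One avenue worth trying (a sketch, not tested on data of this size): svd_compressed expects a dask array, not a DataArray or a dask dataframe, and the backing dask array of the stacked DataArray can be pulled out directly with .data. The rechunking below is an assumption about what fits in memory:
import xarray as xr
import dask
import dask.array as da

ds = xr.open_mfdataset('/home/user/Arbeit/ERA5/Data/era_5_m*.nc', parallel=True)
t = ds['t'].stack(z=("longitude", "latitude", "level"))  # 2-d DataArray: (time, z)

arr = t.data                           # underlying dask array, no dask dataframe needed
arr = arr.rechunk({0: 'auto', 1: -1})  # one chunk across the stacked dimension

u, s, v = da.linalg.svd_compressed(arr, k=5)
u, s, v = dask.compute(u, s, v)
The point is only that the detour through to_dask_dataframe is not needed; whether the computation fits in memory still depends on the chunk sizes.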

How to convert time domain data into frequency domain data using python

I have a dataframe named dataTime with 4335 rows and 20 seconds between every two rows.
dataTime.shape
Out[630]: (4335,)
I want to plot my data into frequency domain
import numpy as np
from scipy.fftpack import fft
import matplotlib.pyplot as plt

dataT = dataTime.values
xf = fft(dataT)
freq = np.fft.fftfreq(len(dataT))  # frequency bins corresponding to each FFT sample
plt.plot(freq, xf)
I'm not sure whether this is the result it should be, or how to interpret the plot in the frequency domain.
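For reference, a sketch of a one-sided magnitude spectrum that accounts for the 20 s sampling interval (dataTime stands for the series described above) might look like this:
import numpy as np
import matplotlib.pyplot as plt
from scipy.fftpack import fft, fftfreq

dt = 20.0                   # sampling interval in seconds
x = dataTime.values
n = len(x)

xf = fft(x)
freq = fftfreq(n, d=dt)     # frequencies in Hz

keep = freq >= 0            # keep the non-negative half of the spectrum
plt.plot(freq[keep], np.abs(xf[keep]) / n)
plt.xlabel('Frequency [Hz]')
plt.ylabel('Amplitude')
plt.show()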

buffering points and overlapping them to calculate areas

Say if I have some drivers' geolocations, they are points with lat/lon, I have two tasks for this data.
1. I want to calculate each driver's driving area by buffering those points (for example, with distance = 10) and overlapping them.
2. After that, I also want to obtain the shared areas between each pair of drivers based on the buffered areas calculated in step 1.
I made some data for those tasks:
#------------------------------------------------------------------------------
# import libraries
#------------------------------------------------------------------------------
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
#------------------------------------------------------------------------------
# df to gdf
#------------------------------------------------------------------------------
d = {'driver':['a','a','a','a','a','b','b','b','b','b','c','c','c','c','c'],
     'lat':[41,46,39,43,51,43,45,58,49,42,40,42,48,50,46],
     'lon':[-78,-73,-66,-75,-80,-78,-70,-76,-68,-80,-72,-60,-62,-74,-72]}
df = pd.DataFrame(data=d)
gdf = gpd.GeoDataFrame(df, geometry=gpd.points_from_xy(df.lon, df.lat))  # points_from_xy takes x (lon) first, then y (lat)
gdf.head()
Specifically, my questions are:
1. How to calculate the driving areas for drivers a, b, and c using buffer (for example, distance=10) and overlay?
2. How to calculate the overlapping areas between drivers a and b, a and c, and b and c based on their driving areas from the last step?
Much appreciated.
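A sketch of one way to do both steps, continuing from the gdf above (note that distance=10 is in the same units as the coordinates, here degrees; with real data you would typically reproject to a metric CRS first):
from itertools import combinations

# 1. buffer each point, then dissolve the buffers per driver into one driving area
gdf['geometry'] = gdf.geometry.buffer(10)
driving_areas = gdf.dissolve(by='driver')
driving_areas['area'] = driving_areas.geometry.area
print(driving_areas[['area']])

# 2. shared area for each pair of drivers
for a, b in combinations(driving_areas.index, 2):
    shared = driving_areas.loc[a, 'geometry'].intersection(driving_areas.loc[b, 'geometry'])
    print(a, b, shared.area)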

dask dataframe: merge two dataframes, impute missing value and write to csv only use partial CPUs (20% in each CPU)

I want to merge two dask dataframes, impute missing values with column median and export the merged dataframe to csv files.
I have one problem: my current code cannot fully utilize the 8 CPUs (~20% of each CPU is used).
I am not sure which part limits the CPU usage. Here is a reproducible example:
import numpy as np
import pandas as pd
df1 = pd.DataFrame(
    np.c_[(np.random.randint(100, size=(10000, 1)), np.random.randn(10000, 3))],
    columns=['id', 'a', 'b', 'c'])
df2 = pd.DataFrame(
    np.c_[(np.array(range(100)), np.random.randn(100, 10000))],
    columns=['id'] + ['d_' + str(i) for i in range(10000)])
df1.id = df1.id.astype(int).astype(object)
df2.id = df2.id.astype(int).astype(object)
## some cells are missing in df2
df2.iloc[:, 1:] = df2.iloc[:,1:].mask(np.random.random(df2.iloc[:, 1:].shape) < .05)
## dask codes starts here
import dask.dataframe as dd
from dask.distributed import Client
ddf1 = dd.from_pandas(df1, npartitions=3)
ddf2 = dd.from_pandas(df2, npartitions=3)
ddf = ddf1.merge(ddf2, how='left', on='id')
ddf = ddf.fillna(ddf.quantile())
ddf.to_csv('train_*.csv', index=None, header=None)
Although all 8 CPUs are invoked, only ~20% of each CPU is utilized. How can I change the code to improve the CPU usage?
Firstly, note that if you don't specify otherwise, Dask will use threads for execution. With threads, only one Python operation can occur at a time (the "GIL"), except in some lower-level code which explicitly releases the lock. The "merge" operation involves a lot of shuffling of data in memory, and I suspect it releases the lock some of the time.
Secondly, all of the output is being written to the filesystem, so you will always have a bottleneck here: however fast other processing may be, you still need to feed all of it through the storage bus.
If the CPUs are working at ~20%, I daresay this is still faster than a single-core version? Put simply, some workloads just parallelise better than others.
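If the GIL really is the limit, one thing worth trying (a sketch continuing from the example above, with no guarantee it helps a job that is ultimately I/O-bound) is one single-threaded worker per core, so tasks run in separate processes:
import dask.dataframe as dd
from dask.distributed import Client

# one process per core, one thread each, so the GIL is not shared between tasks
client = Client(n_workers=8, threads_per_worker=1)

ddf1 = dd.from_pandas(df1, npartitions=8)
ddf2 = dd.from_pandas(df2, npartitions=8)
ddf = ddf1.merge(ddf2, how='left', on='id')
ddf = ddf.fillna(ddf.quantile())
ddf.to_csv('train_*.csv', index=None, header=None)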

error while passing data-frame through k-means

Although my data frame has float values everywhere, when passing it through k-means it complains that it couldn't convert a string to float.
How do I convert NaN values, if any, to float values in the entire data frame?
This would do the job: it converts all the string-format columns to categorical codes (alternatively, use one-hot encoding of the variables in these columns).
import numpy as np
from sklearn.cluster import KMeans
import pandas

df = pandas.read_csv('zipIncome.csv')
print(df)

# convert every string (object) column to categorical codes
for col_name in df.select_dtypes(include='object').columns:
    df[col_name] = df[col_name].astype('category')
    df[col_name] = df[col_name].cat.codes

kmeans = KMeans(n_clusters=4, init='k-means++', max_iter=600, algorithm='auto').fit(df)
print(kmeans.labels_)
print(kmeans.cluster_centers_)
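If you prefer the one-hot encoding route mentioned above, a sketch with pd.get_dummies could look like this (the median imputation line is only one way of handling the NaN values you asked about):
import pandas
from sklearn.cluster import KMeans

df = pandas.read_csv('zipIncome.csv')

# one-hot encode every string (object) column; numeric columns pass through unchanged
df_encoded = pandas.get_dummies(df, columns=df.select_dtypes(include='object').columns)

# impute any remaining NaNs, here with the column median
df_encoded = df_encoded.fillna(df_encoded.median())

kmeans = KMeans(n_clusters=4, init='k-means++', max_iter=600).fit(df_encoded)
print(kmeans.labels_)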
Based on your code, it would seem that you only instantiated the KMeans but haven't used it.
You'll need input data that is clean (i.e. no strings etc.); let's call it X.
kmeans = KMeans(n_clusters=4,init='k-means++', max_iter=600, algorithm = 'auto')
clusters = kmeans.fit_predict(X)
now clusters has the cluster number for each sample in X.
(alternatively, you can do the fit(X) and then later predict(X) separately, but ultimately it is the predict that will output the cluster labels that you will need)
If you later want to get clusters on new data, you should use kmeans.predict(new_data) rather than fit_predict(), so that KMeans uses what it learned from X and applies it to your new_data (or, depending on your needs, you might want to retrain it).
Hope this helps.
Finally, you can add another column to your pandas DataFrame by doing:
df['cluster'] = clusters
where 'cluster' is the name of your new column; you can of course call it whatever you want.

Resources