SVD on huge dataset with dask and xarray - dask

I use xarray and dask to open multiple netCDF4 files that together are around 200 GB via
import xarray as xr
ds = xr.open_mfdataset('/path/files*.nc', parallel=True)
The dimensions of this dataset "ds" are (longitude, latitude, height, time).
The files are automatically concatenated along time, which is okay.
Now I would like to apply the "svd_compressed" function from the dask library.
I would like to reshape the longitude, latitude, and height dimensions into a single dimension, so that I have a 2-D matrix on which I can apply the SVD.
I tried using the dask.array.reshape function, but I get "'Dataset' object has no attribute 'shape'".
I can convert the xarray Dataset to an array and use stack, which makes it 2-D, but if I then use Dataset.to_dask_dataframe to convert it to a dask dataframe, my memory runs out.
Does anybody have an idea how I can tackle this problem?
Should I chunk my data differently for the "to_dask_dataframe" function?
Or can I somehow use the dask svd_compressed function on the loaded netCDF4 dataset without reshaping?
Thanks for the help.
Edit:
Here is a code example that is not working. I have downloaded data from ERA5 (https://cds.climate.copernicus.eu/cdsapp#!/dataset/reanalysis-era5-pressure-levels?tab=overview), which I load from disk.
After that I take the temperature data and stack the longitude, latitude, and level values into one dimension to get a time-space 2-D array.
Then I would like to apply an SVD on the data.
from dask.distributed import Client, progress
import xarray as xr
import dask
import dask.array as da
client = Client(processes=False, threads_per_worker=4,
                n_workers=1, memory_limit='9GB')
ds = xr.open_mfdataset('/home/user/Arbeit/ERA5/Data/era_5_m*.nc', parallel=True)
ds = ds['t']
ds = ds.stack(z=("longitude", "latitude", "level"))
u, s, v = da.linalg.svd_compressed(ds, k=5, compute=True)
I get an error "dot only operates on DataArrays."
I assume it's because I need to convert it to a dask array, so I do
da = ds.to_dask_dataframe()
which gives me "'DataArray' object has no attribute 'to_dask_dataframe'".
So I try
ds = ds.to_dataset(name="temperature")
da = ds.to_dask_dataframe()
which results in "Unable to allocate 89.4 GiB for an array with shape".
I guess I need to rechunk it?
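One way around the dataframe conversion entirely might be to hand the dask array that already backs the stacked DataArray (available via its .data attribute) straight to svd_compressed. A minimal sketch, untested on the full 200 GB dataset; the 'auto' rechunking is only a placeholder and may need tuning:
import xarray as xr
import dask.array as da
ds = xr.open_mfdataset('/home/user/Arbeit/ERA5/Data/era_5_m*.nc', parallel=True)
t = ds['t'].stack(z=("longitude", "latitude", "level"))
# .data exposes the dask array backing the DataArray; rechunk so each block stays small
arr = t.data.rechunk('auto')
u, s, v = da.linalg.svd_compressed(arr, k=5)
u, s, v = da.compute(u, s, v)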

Related

Find the daily and monthly mean from daily data

I am very new to Python, so please bear with me.
So far I have a loop that identifies my netcdf files within a date range.
I now need to calculate the daily averages and then the monthly averages for each month and add it to my dataframe so I can plot a time series.
Heres my code so far
#!/usr/bin/python
# MODIS LONDON TIMESERIES
print ('Initiating AQUA MODIS Time Series')
import pandas as pd
import xarray as xr
from glob import glob
import netCDF4
import numpy as np
import matplotlib.pyplot as plt
import os
print ('All packages imported')
#list of files
days = pd.date_range (start='4/7/2002', end='31/12/2002')
#create dataframe
df = final_data = pd.DataFrame(index = days, columns = ['day', 'night'])
print ('Data frame created')
#for loop iterating through %daterange stated in 'days' to find path using string
for day in days:
    path = "%i/%02d/%02d/ESACCI-LST-L3C-LST-MODISA-LONDON_0.01deg_1DAILY_DAY-%i%02d%02d000000-fv2.00.nc" % (day.year, day.month, day.day, day.year, day.month, day.day)
    print(path)
Welcome to SO! As suggested, please try to make a minimal reproducible example.
If you are able to create an Xarray dataset, here is how to take monthly averages:
import xarray as xr
# tutorial dataset with air temperature every 6 hours
ds = xr.tutorial.open_dataset('air_temperature')
# resample along the time dimension
ds_monthly = ds.resample(time='1MS').mean()
resample() is used for upscaling and downscaling the temporal resolution. If you are familiar with Pandas, it effectively works the same way.
resample(time='1MS') groups the data along the time dimension, where 1MS is the frequency: sample by one month (the 1M part) and have the new time vector begin at the start of each month (the S part). This is very powerful; you can supply different frequencies, see the Pandas offset documentation.
.mean() takes the average of the data over our desired frequency. In this case, each month.
You could replace mean() with min(), max(), median(), std(), var(), sum(), and maybe a few others.
Xarray has wonderful documentation, the resample() doc is here
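Since the question asks for daily means first and then monthly means, the same pattern can simply be chained. A small sketch following on from the tutorial dataset above (variable names are mine):
# daily means from the sub-daily data, then monthly means from the daily values
ds_daily = ds.resample(time='1D').mean()
ds_monthly = ds_daily.resample(time='1MS').mean()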

Can I visualize the output values of my linear regression model, If I have got 3 predictor variables and 1 target variable?

I am trying to understand whether I can visualize a 4-dimensional graph by breaking it down into smaller dimensions.
For example, when we have a 2-D plane as the prediction for a 3-D graph, we can just choose a 2-D graph that shows our prediction as a line. Can I do the same for a 4-D graph? If yes, then how?
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn import linear_model
data = pd.read_csv('housing.csv')
data = data[:50]  # take just the first 50 rows of the CSV file
model = linear_model.LinearRegression() #loading the model from the library
model.fit(data[['median_income','total_rooms','households']],data.median_house_value)
# Pls add code here for visualizations
Actually you can do one funny thing: since your model is a function from R^3 -> R, you could, in principle, take your input space as a 3-D cube (I am guessing your data is somewhat bounded) and then use colour to encode your prediction. This way you get a 3-D coloured point cloud. You will probably need transparency to see through it, plus some interactive investigation to rotate/move around, but 4-D is the highest "visualisable" dimension (as long as one dimension is "special" and can thus be encoded as a colour).
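A rough sketch of that coloured point cloud, reusing the column names from the question's housing.csv; the transparency and figure settings are just guesses:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import linear_model
data = pd.read_csv('housing.csv')[:50]
features = data[['median_income', 'total_rooms', 'households']]
model = linear_model.LinearRegression()
model.fit(features, data.median_house_value)
pred = model.predict(features)
fig = plt.figure()
ax = fig.add_subplot(projection='3d')
sc = ax.scatter(data.median_income, data.total_rooms, data.households,
                c=pred, alpha=0.6)  # colour encodes the predicted house value
ax.set_xlabel('median_income')
ax.set_ylabel('total_rooms')
ax.set_zlabel('households')
fig.colorbar(sc, label='predicted median_house_value')
plt.show()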

Best way to pick numerous slices from a Dask array

I'm generating a large (65k x 65k x 3) 3D signal distributed among several nodes using Dask arrays.
In the next step, I need to extract a few thousand tiles from this array using slices stored in a Dask bag. My code looks like this:
import numpy as np
import dask.array as da
import dask.bag as db
from dask.distributed import Client

def pick_tile(window, signal):
    return np.array(signal[window])

def computation_on_tile(signal_tile):
    # do some rather short computation on a (n x n x 3) signal tile.
    ...

dask_client = Client(....)
signal_array = generate_signal(...)  # returns a dask array
signal_slices = db.from_sequence(generate_slices(...))  # fixed-size slices
signal_tiles = signal_slices.map(pick_tile, signal=signal_array)
result = dask_client.compute(signal_tiles.map(computation_on_tile), sync=True)
My issue is that the computation takes a lot of time. I tried to scatter my signal array using:
signal_array = dask_client.scatter(generate_signal(...))
But it doesn't help performance (~12 min. to compute). In comparison, the computation of the full signal and the stdev of the first layer takes approximately 2 minutes.
Is there an efficient way to pick a lot of slices from a distributed Dask array ?
If you have only a few thousand slices then I recommend using a normal Python list rather than Dask Bag. It will likely be much faster and much simpler.
Then you can slice your array many times:
tiles = [dask_array[slc] for slc in slices]
And compute these if you want:
tiles = dask.compute(*tiles)
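A self-contained sketch of that list-based approach; the array shape, chunking, and tile size below are made up for illustration:
import dask
import dask.array as da
# stand-in for the real distributed signal array
dask_array = da.random.random((10_000, 10_000, 3), chunks=(1_000, 1_000, 3))
# stand-in for the precomputed fixed-size slices
slices = [(slice(i, i + 256), slice(j, j + 256))
          for i in range(0, 2_000, 256) for j in range(0, 2_000, 256)]
tiles = [dask_array[slc] for slc in slices]   # lazy views, nothing computed yet
tiles = dask.compute(*tiles)                  # materialise all tiles at once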

error while passing data-frame through k-means

Although my data-frame has float values everywhere, when I pass it through k-means it says it couldn't convert a string to float.
How do I convert NaN values, if any, to float values in the entire data-frame?
This should do the job: it converts all the string-format columns to categorical codes (alternatively, you could one-hot encode the variables in those columns).
import numpy as np
from sklearn.cluster import KMeans
import pandas
df = pandas.read_csv('zipIncome.csv')
print(df)
for col_name in df.select_dtypes(include='object').columns:
    df[col_name] = df[col_name].astype('category')
    df[col_name] = df[col_name].cat.codes
kmeans = KMeans(n_clusters=4,init='k-means++', max_iter=600, algorithm = 'auto').fit(df)
print (kmeans.labels_)
print(kmeans.cluster_centers_)
Based on your code, it would seem that you only instantiated the KMeans but haven't used it.
You'll need input data that is clean (i.e. no strings etc.); let's call it X:
kmeans = KMeans(n_clusters=4,init='k-means++', max_iter=600, algorithm = 'auto')
clusters = kmeans.fit_predict(X)
Now clusters holds the cluster number for each sample in X.
(alternatively, you can do the fit(X) and then later predict(X) separately, but ultimately it is the predict that will output the cluster labels that you will need)
If you later want to get clusters for new data, you should use kmeans.predict(new_data) rather than fit_predict(), so that KMeans uses what it learned from X and applies it to your new_data (or, depending on your needs, you might want to retrain it).
Hope this helps.
Finally, you can add another column to your pandas DataFrame by doing:
df['cluster'] = clusters
where 'cluster' is the name of your new column; you can of course call it whatever you want.
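Putting the pieces above together, a minimal runnable sketch with a synthetic all-numeric frame standing in for zipIncome.csv:
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
# synthetic numeric data standing in for the cleaned CSV
X = pd.DataFrame(np.random.rand(100, 3), columns=['a', 'b', 'c'])
kmeans = KMeans(n_clusters=4, init='k-means++', max_iter=600)
clusters = kmeans.fit_predict(X)   # fit and return one label per row
X['cluster'] = clusters            # attach the labels to the frame
new_data = pd.DataFrame(np.random.rand(10, 3), columns=['a', 'b', 'c'])
new_labels = kmeans.predict(new_data)  # reuse the fit on unseen rows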

Understanding the process of loading multiple file contents into Dask Array and how it scales

Using the example on http://dask.pydata.org/en/latest/array-creation.html
from glob import glob
import h5py
import dask.array as da

filenames = sorted(glob('2015-*-*.hdf5'))
dsets = [h5py.File(fn)['/data'] for fn in filenames]
arrays = [da.from_array(dset, chunks=(1000, 1000)) for dset in dsets]
x = da.concatenate(arrays, axis=0)  # concatenate arrays along the first axis
I'm having trouble understanding the next line: is x a dask array of "dask arrays", or a "normal" NumPy array that points to as many dask arrays as there were datasets in all the HDF5 files?
Is there any increase in performance (thread- or memory-based) during the file-read stage as a result of da.from_array, or is it only when you concatenate into the dask array x that you should expect improvements?
The objects in the arrays list are all dask arrays, one for each file.
The x object is also a dask array that combines all of the results of the dask arrays in the arrays list. It isn't a dask.array of dask arrays, it's just a single dask array with a larger first dimension.
There will probably not be an increase in performance for reading data. You're likely to be I/O bound by your disk bandwidth. Most people in this situation are using dask.array because they have more data than can conveniently fit into RAM. If this isn't valuable to you then I would stick with NumPy.
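If it helps to convince yourself of this, a quick check (sketch) on the concatenated array:
print(type(x))       # dask.array.core.Array, not a container of arrays
print(x.shape)       # first axis length is the sum of the per-file lengths
print(x.numblocks)   # number of chunks along each axis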
