How to convert time-domain data into frequency-domain data using Python

I have a dataframe named dataTime with 4335 rows and 20 seconds between every two rows.
dataTime.shape
Out[630]: (4335,)
I want to plot my data in the frequency domain:
import numpy as np
from scipy.fftpack import fft
import matplotlib.pyplot as plt
dataT = dataTime.values
xf = fft(dataT)
# geek was never defined in the original; numpy's fftfreq does the same,
# and d=20 (the 20 s sampling interval) gives frequencies in Hz
freq = np.fft.fftfreq(len(dataT), d=20)
plt.plot(freq, np.abs(xf))  # xf is complex, so plot its magnitude
plt.show()
I'm not sure whether this is the result I should get, or how to interpret the plot in the frequency domain.
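For interpretation it often helps to plot only the non-negative frequencies with a labelled axis. A minimal sketch, reusing dataT from above and assuming the 20-second spacing stated in the question:
import numpy as np
import matplotlib.pyplot as plt
n = len(dataT)
# a 20 s sample spacing gives a Nyquist frequency of 1/(2*20) = 0.025 Hz
freq = np.fft.rfftfreq(n, d=20.0)
amp = np.abs(np.fft.rfft(dataT)) / n
plt.plot(freq, amp)
plt.xlabel('frequency [Hz]')
plt.ylabel('amplitude')
plt.show()
Peaks in this plot indicate periodic components; a peak at frequency f corresponds to a cycle repeating every 1/f seconds.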

Related

SVD on huge dataset with dask and xarray

I use xarray and dask to open multiple netcdf4 files that together are around 200 GB via
import xarray as xr
ds = xr.open_mfdataset('/path/files*.nc', parallel=True)
The dimensions of this dataset "ds" are (longitude, latitude, height, time).
The files are automatically concatenated along time, which is okay.
Now I would like to apply the "svd_compressed" function from the dask library.
I would like to reshape the longitude, latitude, and height dimensions into one dimension, so that I have a 2-d matrix to which I can apply the SVD.
I tried using the
dask.array.reshape
function, but I get "'Dataset' object has no attribute 'shape'".
I can convert the xarray dataset to an array and use stack, which makes it 2-d, but if I then use
Dataset.to_dask_dataframe
to convert my xarray to a dask dataframe, my memory runs out.
Does anybody have an idea how I can tackle this problem?
Should I chunk my data differently for the "to_dask_dataframe" function?
Or can I somehow use the "dask svd_compressed" function on the loaded netcdf4 dataset without a reshape?
Thanks for the help.
Edit:
Here is a code example that is not working. I have downloaded data from ERA5 (https://cds.climate.copernicus.eu/cdsapp#!/dataset/reanalysis-era5-pressure-levels?tab=overview), which I load from disk.
After that I take the temperature data and stack the longitude, latitude, and level values into one dimension to get a time-space 2-d array.
Then I would like to apply an SVD on the data.
from dask.distributed import Client, progress
import xarray as xr
import dask
import dask.array as da
client = Client(processes=False, threads_per_worker=4,
                n_workers=1, memory_limit='9GB')
ds = xr.open_mfdataset('/home/user/Arbeit/ERA5/Data/era_5_m*.nc', parallel=True)
ds = ds['t']
ds = ds.stack(z=("longitude", "latitude", "level"))
u, s, v = da.linalg.svd_compressed(ds, k=5, compute=True)
I get an error "dot only operates on DataArrays."
I assume it's because I need to convert it to a dask array, so I do
da = ds.to_dask_dataframe()
which gives me "DataArray' object has no attribute 'to_dask_dataframe".
So I try
ds = ds.to_dataset(name="temperature")
da = ds.to_dask_dataframe()
which results in "Unable to allocate 89.4 GiB for an array with shape".
I guess I need to rechunk it?
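One approach that may avoid the dataframe conversion entirely: a chunked DataArray already wraps a dask array, which you can reach via .data, so svd_compressed can be applied to it directly. A minimal sketch, untested against these exact files, with the chunking scheme as an assumption:
import xarray as xr
import dask.array as da
ds = xr.open_mfdataset('/home/user/Arbeit/ERA5/Data/era_5_m*.nc', parallel=True)
t = ds['t'].stack(z=("longitude", "latitude", "level"))
# .data exposes the dask array backing the DataArray, so no conversion
# to a dask dataframe (and no 89 GiB allocation) is needed
arr = t.data.rechunk({0: 'auto', 1: -1})  # keep each stacked row in one chunk
u, s, v = da.linalg.svd_compressed(arr, k=5)
u, s, v = da.compute(u, s, v)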

Find the daily and monthly mean from daily data

I am very new to python, so please bear with me.
So far I have a loop that identifies my netcdf files within a date range.
I now need to calculate the daily averages and then the monthly averages for each month and add it to my dataframe so I can plot a time series.
Here's my code so far:
# !/usr/bin/python
# MODIS LONDON TIMESERIES
print ('Initiating AQUA MODIS Time Series')
import pandas as pd
import xarray as xr
from glob import glob
import netCDF4
import numpy as np
import matplotlib.pyplot as plt
import os
print ('All packages imported')
#list of files
days = pd.date_range(start='4/7/2002', end='31/12/2002')
#create dataframe
df = final_data = pd.DataFrame(index = days, columns = ['day', 'night'])
print ('Data frame created')
#for loop iterating through %daterange stated in 'days' to find path using string
for day in days:
    path = "%i/%02d/%02d/ESACCI-LST-L3C-LST-MODISA-LONDON_0.01deg_1DAILY_DAY-%i%02d%02d000000-fv2.00.nc" % (day.year, day.month, day.day, day.year, day.month, day.day)
    print(path)
Welcome to SO! As suggested, please try to make a minimal reproducible example.
If you are able to create an Xarray dataset, here is how to take monthly averages:
import xarray as xr
# tutorial dataset with air temperature every 6 hours
ds = xr.tutorial.open_dataset('air_temperature')
# resamples along the time dimension
ds_monthly = ds.resample(time='1MS').mean()
resample() is used for upscaling and downscaling the temporal resolution. If you are familiar with Pandas, it effectively works the same way.
What resample(time='1MS') means is: group along time, where '1MS' is the frequency. '1M' means sample by one month, and 'S' means the new time vector begins at the start of each month. This is very powerful; you can supply different frequencies, see the Pandas offset documentation.
.mean() takes the average of the data over our desired frequency. In this case, each month.
You could replace mean() with min(), max(), median(), std(), var(), sum(), and maybe a few others.
Xarray has wonderful documentation; see the resample() docs.
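Since the question asks for daily and then monthly means, the same pattern can be chained. A short sketch using the tutorial dataset again:
import xarray as xr
ds = xr.tutorial.open_dataset('air_temperature')   # 6-hourly air temperature
ds_daily = ds.resample(time='1D').mean()           # daily averages
ds_monthly = ds_daily.resample(time='1MS').mean()  # monthly averages of the daily means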

Can I visualize the output values of my linear regression model, If I have got 3 predictor variables and 1 target variable?

I am trying to understand whether I can visualize a 4-dimensional graph by breaking it down into smaller dimensions.
For example, when we have a 2-d plane as a prediction for a 3-d graph, we can just choose a 2-d graph that shows our prediction as a line. Can I do the same for a 4-d graph? If yes, then how?
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn import linear_model
data = pd.read_csv('housing.csv')
data = data[:50]  # taking just the first 50 rows from the csv file
model = linear_model.LinearRegression() #loading the model from the library
model.fit(data[['median_income','total_rooms','households']],data.median_house_value)
# Pls add code here for visualizations
Actually you can do one funny thing - since your object is a function from R^3->R, you could, in principle, take your input space as a 3d cube (I am guessing your data is somewhat bounded), and then use colour to code your prediction. This way you will get a 3d coloured point cloud. You will probably need transparency to see through it + some interactive investigation to rotate/move around, but 4d is the highest "visualisable" dimension (as long as one dimension is "special" and thus can be coded as a colour).
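A minimal sketch of that coloured point cloud, reusing data and model from the question (the colormap and transparency are arbitrary choices):
from mpl_toolkits.mplot3d import Axes3D  # registers the 3d projection on older matplotlib
import matplotlib.pyplot as plt
X = data[['median_income', 'total_rooms', 'households']]
pred = model.predict(X)
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
sc = ax.scatter(X['median_income'], X['total_rooms'], X['households'],
                c=pred, cmap='viridis', alpha=0.5)
ax.set_xlabel('median_income')
ax.set_ylabel('total_rooms')
ax.set_zlabel('households')
fig.colorbar(sc, label='predicted median_house_value')
plt.show()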

error while passing data-frame through k-means

Although my data-frame has float values everywhere, passing it through k-means raises an error that a string couldn't be converted to float.
How do I convert NaN values, if any, to float values in the entire data-frame?
This would do the job: convert all string-format columns to categorical codes (or use one-hot encoding of the variables in these columns).
import numpy as np
from sklearn.cluster import KMeans
import pandas
df = pandas.read_csv('zipIncome.csv')
print(df)
# convert every string (object) column to categorical codes
for col_name in df.select_dtypes(include='object').columns:
    df[col_name] = df[col_name].astype('category')
    df[col_name] = df[col_name].cat.codes
kmeans = KMeans(n_clusters=4, init='k-means++', max_iter=600, algorithm='auto').fit(df)
print(kmeans.labels_)
print(kmeans.cluster_centers_)
Based on your code, it would seem that you only instantiated the KMeans but haven't used it.
You'll need input data that is clean (i.e. no strings etc.); let's call it X:
kmeans = KMeans(n_clusters=4,init='k-means++', max_iter=600, algorithm = 'auto')
clusters = kmeans.fit_predict(X)
Now clusters holds the cluster number for each sample in X.
(alternatively, you can do the fit(X) and then later predict(X) separately, but ultimately it is the predict that will output the cluster labels that you will need)
If you want to later get clusters on data, you should use kmeans.predict(new_data) rather than fit_predict() so that KMeans uses the learning from X, and applies it to your new_data (or depending on your needs, you might want to retrain it).
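A short sketch of that two-step form (new_data is a placeholder for your own clean dataframe with the same columns as X):
kmeans = KMeans(n_clusters=4, init='k-means++', max_iter=600)
kmeans.fit(X)                          # learn the centroids from X
labels_train = kmeans.predict(X)       # same labels fit_predict(X) would give
labels_new = kmeans.predict(new_data)  # apply the learned clustering to new data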
Hope this helps.
Finally, you can add another column to your pandas DataFrame by doing:
df['cluster'] = clusters
where 'cluster' is the name of your new column; you can of course call it whatever you want.

How to create a char dataset like the mnist_digits dataset

I have a data set of 62000 font images (0-9, A-Z, and a-z), with 1000 images per character. I have created a CSV file of 62000 rows containing the normalized pixel values and labels. I don't know how to split this CSV file into training, validation, and testing datasets so that I can get better accuracy.
You can use SciKit-Learn's train_test_split.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
X, y = your.data, your.target  # input your own data here
# hold out 20% of the samples (and their labels) for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
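Since you also want a validation set, you can split twice; a sketch with an arbitrary 60/20/20 split:
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)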
Also, read this sklearn tutorial
