Buffering points and overlapping them to calculate areas

Say I have some drivers' geolocations as points with lat/lon. I have two tasks for this data.
First, I want to calculate each driver's driving area by buffering those points (for example, with distance = 10) and merging the overlapping buffers.
Second, I want to obtain the shared areas between each pair of drivers, based on the buffered areas from the first task.
I made some data for these tasks:
#------------------------------------------------------------------------------
# import libraries
#------------------------------------------------------------------------------
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
#------------------------------------------------------------------------------
# df to gdf
#------------------------------------------------------------------------------
d = {'driver':['a','a','a','a','a','b','b','b','b','b','c','c','c','c','c'],
'lat':[41,46,39,43,51,43,45,58,49,42,40,42,48,50,46],
'lon':[-78,-73,-66,-75,-80,-78,-70,-76,-68,-80,-72,-60,-62,-74,-72]}
df = pd.DataFrame(data=d)
gdf = gpd.GeoDataFrame(df, geometry=gpd.points_from_xy(df.lon, df.lat))  # x = longitude, y = latitude
gdf.head()
Specifically, my questions are: 1. How do I calculate the driving areas for drivers a, b, and c using buffer (for example, distance=10) and overlay? 2. How do I calculate the overlapping areas between drivers a and b, a and c, and b and c, based on their driving areas from the previous step? Much appreciated.
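One possible approach to both tasks, as a minimal sketch rather than a definitive answer: buffer every point, dissolve the buffers per driver, then intersect the dissolved shapes pairwise. It assumes the data above and no CRS, so the distance of 10 and the resulting areas are in degrees and square degrees. The names `areas` and `shared` are only illustrative.

```python
from itertools import combinations

import geopandas as gpd
import pandas as pd

d = {'driver': ['a'] * 5 + ['b'] * 5 + ['c'] * 5,
     'lat': [41, 46, 39, 43, 51, 43, 45, 58, 49, 42, 40, 42, 48, 50, 46],
     'lon': [-78, -73, -66, -75, -80, -78, -70, -76, -68, -80, -72, -60, -62, -74, -72]}
df = pd.DataFrame(d)
gdf = gpd.GeoDataFrame(df, geometry=gpd.points_from_xy(df.lon, df.lat))

# Task 1: buffer every point, then merge each driver's buffers into one shape.
gdf["geometry"] = gdf.geometry.buffer(10)
areas = gdf.dissolve(by="driver")      # one (multi)polygon per driver
areas["area"] = areas.geometry.area    # square degrees, since no CRS is set

# Task 2: intersect the dissolved shapes for every pair of drivers.
shared = {
    (a, b): areas.geometry[a].intersection(areas.geometry[b]).area
    for a, b in combinations(areas.index, 2)
}

print(areas["area"])
print(shared)
```

If the areas are needed in metric units, the points would first have to be given a CRS and reprojected to a projected one (for example with `set_crs` and `to_crs`) before buffering.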


SVD on huge dataset with dask and xarray

I use xarray and dask to open multiple netcdf4 files that all together are around 200Gb via
import xarray as xr
ds = xr.open_mfdataset('/path/files*.nc', parallel=True)
The dimensions of this dataset "ds" are (longitude, latitude, height, time).
The files are automatically concatenated along time, which is okay.
Now I would like to apply the "svd_compressed" function from the dask library.
I would like to reshape the longitude, latitude, and height dimensions into one dimension, so that I have a 2-D matrix on which I can apply the SVD.
I tried using the
dask.array.reshape
function, but I get "'Dataset' object has no attribute 'shape'".
I can convert the xarray dataset to an array and use stack, which makes it 2-D, but if I then use
Dataset.to_dask_dataframe
to convert my xarray object to a dask dataframe, I run out of memory.
Does anybody have an idea how I can tackle this problem?
Should I chunk my data differently for the "to_dask_dataframe" function?
Or can I somehow use the dask "svd_compressed" function on the loaded netCDF4 dataset without a reshape?
Thanks for the help.
Edit:
Here is a code example that is not working. I downloaded data from ERA5 (https://cds.climate.copernicus.eu/cdsapp#!/dataset/reanalysis-era5-pressure-levels?tab=overview), which I load from disk.
After that I take the temperature data and stack the longitude, latitude, and level values into one dimension to get a 2-D time-space array.
Then I would like to apply an SVD to the data.
from dask.distributed import Client, progress
import xarray as xr
import dask
import dask.array as da
client = Client(processes=False, threads_per_worker=4,
n_workers=1, memory_limit='9GB')
ds = xr.open_mfdataset('/home/user/Arbeit/ERA5/Data/era_5_m*.nc', parallel=True)
ds = ds['t']
ds = ds.stack(z=("longitude", "latitude", "level"))
u, s, v = da.linalg.svd_compressed(ds, k=5, compute=True)
I get the error "dot only operates on DataArrays."
I assume it's because I need to convert it to a dask array, so I do
da = ds.to_dask_dataframe()
which gives me "'DataArray' object has no attribute 'to_dask_dataframe'".
So I try
ds = ds.to_dataset(name="temperature")
da = ds.to_dask_dataframe()
which results in "Unable to allocate 89.4 GiB for an array with shape".
I guess I need to rechunk it?
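A possible way around the dataframe conversion, sketched on a small synthetic array rather than the real ERA5 files: the `.data` attribute of a dask-backed DataArray is already a dask array, so it can be handed to `svd_compressed` after stacking. The array sizes, chunk sizes, and variable names below are assumptions chosen only to keep the example small.

```python
import dask
import dask.array as da
import numpy as np
import xarray as xr

# Small synthetic stand-in for the ERA5 temperature field (hypothetical sizes).
temp = xr.DataArray(
    da.random.random((40, 3, 6, 5), chunks=(10, 3, 6, 5)),
    dims=("time", "level", "latitude", "longitude"),
    coords={
        "time": np.arange(40),
        "level": np.arange(3),
        "latitude": np.arange(6),
        "longitude": np.arange(5),
    },
    name="t",
)

# Collapse the three spatial dimensions into one, giving a (time, z) array.
stacked = temp.stack(z=("longitude", "latitude", "level"))

# .data exposes the lazy dask array behind the DataArray; nothing is loaded yet.
# Rechunk so each block holds a slab of time steps across all of z.
matrix = stacked.data.rechunk({0: 10, 1: -1})

u, s, v = da.linalg.svd_compressed(matrix, k=5)
u, s, v = dask.compute(u, s, v)
print(u.shape, s.shape, v.shape)
```

On the real dataset the same pattern would apply to `ds['t']`, with chunk sizes picked to fit the worker memory limit. Whether that is enough for 200 GB depends on the chunking, which is not tested here.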

Find the daily and monthly mean from daily data

I am very new to Python, so please bear with me.
So far I have a loop that identifies my netCDF files within a date range.
I now need to calculate the daily averages and then the monthly averages for each month and add them to my dataframe so I can plot a time series.
Here's my code so far:
# !/usr/bin/python
# MODIS LONDON TIMESERIES
print ('Initiating AQUA MODIS Time Series')
import pandas as pd
import xarray as xr
from glob import glob
import netCDF4
import numpy as np
import matplotlib.pyplot as plt
import os
print ('All packages imported')
#list of files
days = pd.date_range(start='2002-07-04', end='2002-12-31')  # ISO dates avoid day/month ambiguity
#create dataframe
df = final_data = pd.DataFrame(index = days, columns = ['day', 'night'])
print ('Data frame created')
#for loop iterating through %daterange stated in 'days' to find path using string
for day in days:
    path = "%i/%02d/%02d/ESACCI-LST-L3C-LST-MODISA-LONDON_0.01deg_1DAILY_DAY-%i%02d%02d000000-fv2.00.nc" % (day.year, day.month, day.day, day.year, day.month, day.day)
    print(path)
Welcome to SO! As suggested, please try to make a minimal reproducible example.
If you are able to create an Xarray dataset, here is how to take monthly averages:
import xarray as xr
# tutorial dataset with air temperature every 6 hours
ds = xr.tutorial.open_dataset('air_temperature')
# resample along the time dimension
ds_monthly = ds.resample(time='1MS').mean()
resample() is used for upsampling and downsampling the temporal resolution. If you are familiar with Pandas, it effectively works the same way.
resample(time='1MS') groups the data along time, with 1MS as the frequency. 1MS means sample by 1 month (this is the 1M part) and have the new time vector begin at the start of the month (this is the S part). This is very powerful: you can supply different frequencies, see the Pandas offset documentation.
.mean() takes the average of the data over the desired frequency, in this case each month.
You could replace mean() with min(), max(), median(), std(), var(), sum(), and maybe a few others.
Xarray has wonderful documentation, the resample() doc is here
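For the daily-then-monthly case in the question, the same idea can be sketched in pandas. This is only an illustration on a made-up 6-hourly series whose value is the day of the month. The series and the names `daily` and `monthly` are assumptions, not part of the original code.

```python
import pandas as pd

# Synthetic 6-hourly series covering the question's date range (made-up values).
idx = pd.date_range("2002-07-04", "2002-12-31 18:00", freq=pd.Timedelta(hours=6))
series = pd.Series(idx.day.astype(float), index=idx)

# Step 1: average all samples that fall on the same calendar day.
daily = series.resample("D").mean()

# Step 2: average the daily means per month, labelled by the month start.
monthly = daily.resample("MS").mean()

print(monthly)
```

Averaging the daily means, rather than the raw samples, gives every day the same weight even if some days have fewer observations.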

Can I visualize the output values of my linear regression model if I have 3 predictor variables and 1 target variable?

I am trying to understand whether I can visualize a 4-dimensional graph by breaking it down into lower dimensions.
For example, when we have a 2-D plane as the prediction in a 3-D graph, we can just choose a 2-D graph that shows our prediction as a line. Can I do the same for a 4-D graph? If yes, then how?
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn import linear_model
data = pd.read_csv('housing.csv')
data = data[:50] #taking just 50 rows from the CSV file
model = linear_model.LinearRegression() #loading the model from the library
model.fit(data[['median_income','total_rooms','households']],data.median_house_value)
# Pls add code here for visualizations
Actually you can do one funny thing - since your object is a function from R^3->R, you could, in principle, take your input space as a 3d cube (I am guessing your data is somewhat bounded), and then use colour to code your prediction. This way you will get a 3d coloured point cloud. You will probably need transparency to see through it + some interactive investigation to rotate/move around, but 4d is the highest "visualisable" dimension (as long as one dimension is "special" and thus can be coded as a colour).
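The colour-coding idea can be sketched as below. Since `housing.csv` is not available here, the example fits the model on synthetic data with three predictors. The coefficients, the random data, and the plotting choices are all assumptions for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn import linear_model

# Synthetic stand-in for three predictors and one target (made-up relationship).
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(50, 3))
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] - 1.0 * X[:, 2] + rng.normal(0, 0.05, size=50)

model = linear_model.LinearRegression().fit(X, y)
pred = model.predict(X)

# The three predictors span the 3-D axes; the prediction is encoded as colour.
fig = plt.figure(figsize=(6, 5))
ax = fig.add_subplot(111, projection="3d")
points = ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=pred, cmap="viridis", alpha=0.7)
ax.set_xlabel("predictor 1")
ax.set_ylabel("predictor 2")
ax.set_zlabel("predictor 3")
fig.colorbar(points, ax=ax, label="predicted target")
```

Calling `plt.show()` in an interactive session lets you rotate the cloud. With the real data, the three columns passed to `model.fit` would take the place of `X`.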

how to plot three or even more dimensional multivariate gaussian distribution

In machine learning and pattern recognition, if a sample i has a two-dimensional feature like (length, weight), and both length and weight follow a Gaussian distribution, we can use a multivariate Gaussian distribution to describe it.
It's just a 3D plot that looks like this:
where the z axis is the probability density,
but what if this sample i has three or more features, x1, x2, x3, ..., xn? How do we correctly plot it in one plot?
You can use dimensionality reduction methods to visualize higher-dimensional data.
https://scikit-learn.org/stable/auto_examples/manifold/plot_compare_methods.html#sphx-glr-auto-examples-manifold-plot-compare-methods-py
Convert the D-dimensional data into 2- or 3-dimensional data.
Plot the transformed data points in 2D or 3D, depending on the dimension to which the data was reduced.
Let's consider an example. Take a 10-dimensional Gaussian:
import matplotlib.pyplot as plt
import numpy as np
DIMENSION = 10
mean = np.zeros((DIMENSION,))
cov = np.eye(DIMENSION)
X = np.random.multivariate_normal(mean, cov, 5000)
Then perform dimensionality reduction (I used PCA; you can choose any other method, depending on what you already know about how well the algorithm works for a particular type of data):
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn.decomposition import PCA
X_2d = PCA(n_components=2).fit_transform(X)
X_3d = PCA(n_components=3).fit_transform(X)
Then Plot them
fig = plt.figure(figsize=(12,4))
ax = fig.add_subplot(121, projection='3d')
ax.scatter(X_3d[:,0],X_3d[:,1],X_3d[:,2])
plt.title('3D')
fig.add_subplot(122)
plt.scatter(X_2d[:,0], X_2d[:,1])
plt.title('2D')
You can play with other algorithms as well. Each offers a different kind of advantage.
I hope this answers your question.
Note: In higher dimensions, phenomena like the "curse of dimensionality" also come into play, so an accurate projection into a lower dimension may not be possible. It is similar to why Greenland appears about the same size as Africa on some map projections.
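One way to see how much a projection loses is to check how much of the variance the kept components explain. The sketch below repeats the 10-dimensional example with a fixed seed (an assumption, for reproducibility) and reads scikit-learn's `explained_variance_ratio_`.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
DIMENSION = 10
X = rng.multivariate_normal(np.zeros(DIMENSION), np.eye(DIMENSION), size=5000)

# Fraction of the total variance that survives a 2-component projection.
pca = PCA(n_components=2).fit(X)
kept = pca.explained_variance_ratio_.sum()
print(f"variance kept by 2 components: {kept:.1%}")
```

For an isotropic Gaussian every direction carries about the same variance, so two of ten components keep only roughly a fifth of it. That is why the 2D scatter above is a heavily flattened view of the distribution.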

How to apply low-pass filter to a sound record on python?

I have to reduce white noise in a sound recording, so I used a Fourier transform. But I don't know how to use the values returned by the fft function, which are in the frequency domain. How can I use the FFT data to reduce the noise?
Here is my code:
from scipy.io import wavfile
import matplotlib.pyplot as plt
import numpy as np
import simpleaudio as sa
from numpy.fft import fft,fftfreq,ifft
#reading wav file (returns the sample rate and the samples)
fs,data=wavfile.read("a.wav")
n=len(data)
#frequency axis in Hz; the sample spacing is 1/fs
freqs=fftfreq(n,d=1/fs)
mask=freqs>0
#calculating raw fft values
fft_vals=fft(data)
#calculating theoretical fft values
fft_theo=2*np.abs(fft_vals/n)
#plotting
plt.plot(freqs[mask],fft_theo[mask])
plt.show()
It is better for such questions to build a synthetic example, so you don't have to post a big datafile and people can still follow your question (MCVE).
It is also important to plot intermediate results since we are talking about operations on complex numbers, so we often have to take re, im parts, or absolutes and angles respectively.
The Fourier transform of a real function is complex but conjugate-symmetric between positive and negative frequencies. One can also look at that from an information-theoretical viewpoint: you wouldn't want N independent real numbers in time to result in 2N independent real numbers describing the spectrum.
While you normally plot the absolute or absolute squared (voltage vs. power) of the spectrum, you can leave it complex when you apply the filter. After back-conversion to time via the IFFT, to plot it, you'll have to convert it to a real number again, in this case by taking the absolute.
If you design the filter kernel in the time domain (FFT of a Gaussian will be a Gaussian), the IFFT of the product of the FFT of the filter and the spectrum will have only very small imaginary parts and you can then take the real part (which makes more sense from a physics viewpoint, you started with real part, end with real part).
import numpy as np
import matplotlib.pyplot as p
# %matplotlib inline  (uncomment when running in a Jupyter notebook)
T=3 # secs
d=0.04 # secs
n=int(T/d)
print(n)
t=np.arange(0,T,d)
fr=1 # Hz
y1= np.sin(2*np.pi*fr*t) +1 # dc offset helps with backconversion, try setting it to zero
y2= 1/5*np.sin(2*np.pi*7*fr*t+0.5)
y=y1+y2
f=np.fft.fftshift(np.fft.fft(y))
freq=np.fft.fftshift(np.fft.fftfreq(n,d))
filter=np.exp(- freq**2/6) # simple Gaussian filter in the frequency domain
filtered_spectrum=f*filter # apply the filter to the spectrum
filtered_data = np.fft.ifft(np.fft.ifftshift(filtered_spectrum)) # undo the fftshift, then backtransform to time domain
p.figure(figsize=(24,16))
p.subplot(321)
p.plot(t,y1,'.-',color='red', lw=0.5, ms=1, label='signal')
p.plot(t,y2,'.-',color='blue', lw=0.5,ms=1, label='noise')
p.plot(t,y,'.-',color='green', lw=4, ms=4, alpha=0.3, label='noisy signal')
p.xlabel('time (sec)')
p.ylabel('amplitude (Volt)')
p.legend()
p.subplot(322)
p.plot(freq,np.abs(f)/n, label='raw spectrum')
p.plot(freq,filter,label='filter')
p.xlabel(' freq (Hz)')
p.ylabel('amplitude (Volt)');
p.legend()
p.subplot(323)
p.plot(t, np.absolute(filtered_data),'.-',color='green', lw=4, ms=4, alpha=0.3, label='cleaned signal')
p.legend()
p.subplot(324)
p.plot(freq,np.abs(filtered_spectrum), label = 'filtered spectrum')
p.legend()
p.subplot(326)
p.plot(freq,np.log( np.abs(filtered_spectrum)), label = 'filtered spectrum')
p.legend()
p.title(' in the log plot the noise is still visible');
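For comparison, a hard-cutoff low-pass can be sketched with the real-input FFT pair `rfft`/`irfft`, which avoids the shift bookkeeping and returns a real signal directly. The sampling rate, cutoff, and test signal here are assumptions chosen so that both tones fall exactly on FFT bins.

```python
import numpy as np

fs = 100                  # sampling rate in Hz (assumed)
n = 3 * fs                # three seconds of samples
t = np.arange(n) / fs

clean = np.sin(2 * np.pi * 1 * t)               # 1 Hz signal to keep
noise = 0.2 * np.sin(2 * np.pi * 7 * t + 0.5)   # 7 Hz disturbance to remove
noisy = clean + noise

cutoff = 3.0              # Hz; everything above this is discarded

spectrum = np.fft.rfft(noisy)             # one-sided complex spectrum
freqs = np.fft.rfftfreq(n, d=1 / fs)      # matching frequency axis in Hz
spectrum[freqs > cutoff] = 0              # zero the bins above the cutoff
filtered = np.fft.irfft(spectrum, n=n)    # back to a real time series

print(np.max(np.abs(filtered - clean)))
```

For a recording read with `wavfile.read`, `fs` and `noisy` would come from the file. White noise is spread over all frequencies, so a low-pass only removes the part above the cutoff, and a smooth filter such as the Gaussian above causes less ringing than a hard cutoff.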
