Xarray Distributed Failed to serialize - dask

I need to upsample, through linear interpolation, some satellite images organized in a DataArray.
As long as I run the code locally I have no issue, but if I try to replicate the interpolation over a distributed system, I get back this error:
`Could not serialize object of type tuple`
To replicate the problem, all that is needed is to switch between a distributed and a local environment.
Here is the distributed version of the code.
import numpy as np
import pandas as pd
import xarray as xr
from dask.distributed import Client

client = Client()  # distributed environment; omit this line for the local case

n_time = 365
px = 2000
lat = np.linspace(19., 4., px)
lon = np.linspace(34., 53., px)
time = pd.date_range('1/1/2019', periods=n_time, freq='D')
data = xr.DataArray(np.random.random((n_time, px, px)), dims=('time', 'lat', 'lon'),
                    coords={'time': time, 'lat': lat, 'lon': lon})
data = data.chunk({'time': 1})

# upsampling
nlat = np.linspace(19., 4., px * 2)
nlon = np.linspace(34., 53., px * 2)
interp = data.interp(lat=nlat, lon=nlon)
computed = interp.compute()
Does anyone have an idea of how to work around the problem?
EDIT 1:
Since it seems I wasn't clear enough in my first MRE, I have rewritten it with all the input received so far.
I need to upsample a satellite dataset from 500 m to 250 m resolution. Because chunking along the dimensions to be interpolated is not yet supported**, the goal is to figure out a workaround and upsample each image of the 500 m dataset.
import numpy as np
import pandas as pd
import xarray as xr
import dask.array as dsa

px = 2000
n_time = 365
time = pd.date_range('1/1/2019', periods=n_time, freq='D')

# dataset to be upsampled
lat_500 = np.linspace(19., 4., px)
lon_500 = np.linspace(34., 53., px)
da_500 = xr.DataArray(dsa.random.random((n_time, px, px), chunks=(1, 1000, 1000)),
                      dims=('time', 'lat', 'lon'),
                      coords={'time': time, 'lat': lat_500, 'lon': lon_500})

# reference dataset
lat_250 = np.linspace(19., 4., px * 2)
lon_250 = np.linspace(34., 53., px * 2)
da_250 = xr.DataArray(dsa.random.random((n_time, px * 2, px * 2), chunks=(1, 1000, 1000)),
                      dims=('time', 'lat', 'lon'),
                      coords={'time': time, 'lat': lat_250, 'lon': lon_250})

# upsampling
da_250i = da_500.interp(lat=lat_250, lon=lon_250)

# fake index
fNDVI = (da_250i - da_250) / (da_250i + da_250)
fNDVI.to_netcdf(r'c:\temp\output.nc')  # to_netcdf triggers the computation by default
This should recreate the problem and avoid the memory impact, as suggested by Rayan. In any case, the two datasets can be dumped to disk and then reloaded.
**Note: it seems work is underway to support interpolation along a chunked dimension, but it is not fully available yet. Details here: https://github.com/pydata/xarray/pull/4155

I believe there are two things that cause this example to crash, both likely related to memory usage:
1) You populate your original dataset with a large numpy array (np.random.random((n_time, px, px))) and then call .chunk after the fact. This forces Dask to pass a large object around in its graphs. Solution: use a lazy loading method.
2) Your object interp requires 47 GB of memory. This is too much for most computers to handle. Solution: add a reduction step before calling compute. This allows you to check whether your interpolation is working properly without simultaneously loading all the results into RAM.
With these modifications, the code looks like this:
import numpy as np
import dask.array as dsa
import pandas as pd
import xarray as xr

n_time = 365
px = 2000
lat = np.linspace(19., 4., px)
lon = np.linspace(34., 53., px)
time = pd.date_range('1/1/2019', periods=n_time, freq='D')

# use dask to lazily create the random data, not numpy
# this avoids populating the dask graph with large objects
data = xr.DataArray(dsa.random.random((n_time, px, px), chunks=(1, px, px)),
                    dims=('time', 'lat', 'lon'),
                    coords={'time': time, 'lat': lat, 'lon': lon})

# upsampling
nlat = np.linspace(19., 4., px*2)
nlon = np.linspace(34., 53., px*2)

# this object requires 47 GB of memory
# computing it directly is not an option on most computers
interp = data.interp(lat=nlat, lon=nlon)

# instead, we reduce in the time dimension before computing
interp.mean(dim='time').compute()
This ran in a few minutes on my laptop.

In response to your edited question, I have a new solution.
In order to interpolate across the lat / lon dimensions, you need to rechunk the data. I added this line before the interpolation step
da_500 = da_500.chunk({'lat': -1, 'lon': -1})
After doing that, the computation executed without errors for me in distributed mode.
from dask.distributed import Client
client = Client()
fNDVI.to_netcdf(r'~/tmp/test.nc')  # to_netcdf writes and computes by default
I did notice that the computation was rather memory intensive. I recommend monitoring the dask dashboard to see if you are running out of memory.
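If it helps, here is a minimal sketch (my addition, not part of the original answer) of how to find that dashboard; the memory_limit value is only an illustrative assumption:

from dask.distributed import Client

client = Client(memory_limit='4GB')   # hypothetical per-worker memory cap
print(client.dashboard_link)          # open this URL to watch memory use and task progress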


My speaker recognition neural network doesn’t work well

I have a final project for my first degree and I want to build a neural network that takes the first 13 MFCC coefficients of a WAV file and returns who talked in the audio file, out of a bank of speakers.
I want you to notice that:
My audio files are text independent, therefore they have different lengths and words.
I have trained the machine on about 35 audio files of 10 speakers (the first speaker had about 15, the second 10, and the third and fourth about 5 each).
I defined:
X = mfcc(sound_voice)
Y = zero_array + 1 in the i-th position (where the i-th position is 0 for the first speaker, 1 for the second, 2 for the third...)
And then trained the machine, and then checked the output of the machine for some files...
So that's what I did... but unfortunately it looks like the results are completely random...
Can you help me understand why?
This is my code in Python:
from sklearn.neural_network import MLPClassifier
import python_speech_features
import scipy.io.wavfile as wav
import numpy as np
from os import listdir
from os.path import isfile, join
from random import shuffle
import matplotlib.pyplot as plt
from tqdm import tqdm

winner = []  # this array counts how many times we got a "Bingo" when testing the NN
for TestNum in tqdm(range(5)):  # in every round we build a NN with X, Y and check 50 samples held out from them
    X = []
    Y = []
    onlyfiles = [f for f in listdir("FinalAudios/") if isfile(join("FinalAudios/", f))]  # files in dir
    names = []  # names of the speakers
    for file in onlyfiles:  # for each wav sound
        # UNNECESSARY TO UNDERSTAND THE CODE
        if " " not in file.split("_")[0]:
            names.append(file.split("_")[0])
        else:
            names.append(file.split("_")[0].split(" ")[0])
    names = list(dict.fromkeys(names))  # names of speakers
    vector_names = []  # one-hot vector for each name
    i = 0
    vector_for_each_name = [0] * len(names)
    for name in names:
        vector_for_each_name[i] += 1
        vector_names.append(np.array(vector_for_each_name))
        vector_for_each_name[i] -= 1
        i += 1
    for f in onlyfiles:
        if " " not in f.split("_")[0]:
            f_speaker = f.split("_")[0]
        else:
            f_speaker = f.split("_")[0].split(" ")[0]
        (rate, sig) = wav.read("FinalAudios/" + f)  # read the file
        try:
            mfcc_feat = python_speech_features.mfcc(sig, rate, winlen=0.2, nfft=512)  # mfcc coeffs
            for index in range(len(mfcc_feat)):  # adding each mfcc coeff to X, meaning if there are 50000 coeffs then
                # X will be [first coeff, second, ..., 50000'th coeff] and Y will be [f_speaker_vector] * 50000
                X.append(np.array(mfcc_feat[index]))
                Y.append(np.array(vector_names[names.index(f_speaker)]))
        except IndexError:
            pass
    Z = list(zip(X, Y))
    shuffle(Z)  # WE SHUFFLE X, Y TO RANDOMIZE THE TEST SPLIT
    X, Y = zip(*Z)
    X = list(X)
    Y = list(Y)
    X = np.asarray(X)
    Y = np.asarray(Y)
    Y_test = Y[:50]  # CHOOSE 50 FOR TEST, OTHERS FOR TRAIN
    X_test = X[:50]
    X = X[50:]
    Y = Y[50:]
    clf = MLPClassifier(solver='lbfgs', alpha=1e-2, hidden_layer_sizes=(5, 3), random_state=2)  # create the NN
    clf.fit(X, Y)  # train it
    for sample in range(len(X_test)):  # append 1 to winner if we are correct and 0 if not; at the end it is plotted
        if list(clf.predict([X[sample]])[0]) == list(Y_test[sample]):
            winner.append(1)
        else:
            winner.append(0)

# plot winner
plot_x = []
plot_y = []
for i in range(1, len(winner)):
    plot_y.append(sum(winner[0:i]) * 1.0 / len(winner[0:i]))
    plot_x.append(i)
plt.plot(plot_x, plot_y)
plt.xlabel('x - axis')
# naming the y axis
plt.ylabel('y - axis')
# giving a title to my graph
plt.title('My first graph!')
# function to show the plot
plt.show()
This is my zip file that contains the code and the audio file : https://ufile.io/eggjm1gw
You have a number of issues in your code and it will be close to impossible to get it right in one go, but let's give it a try. There are two major issues:
Currently you're trying to teach your neural network with very few training examples, as few as a single one per speaker (!). It's impossible for any machine learning algorithm to learn anything from that.
To make matters worse, what you do is feed the ANN only the MFCC for the first 25 ms of each recording (25 comes from the winlen parameter of python_speech_features). In each of these recordings, the first 25 ms will be close to identical. Even if you had 10k recordings per speaker, with this approach you'd not get anywhere.
I will give you concrete advice, but won't do all the coding - it's your homework after all.
Use all the MFCCs, not just the first 25 ms. Many of these should be skipped, simply because there's no voice activity. Normally there would be a VAD (Voice Activity Detector) telling you which ones to take, but in this exercise I'd skip it for starters (you need to learn the basics first).
Don't use dictionaries. Not only will it not fly with more than one MFCC vector per speaker, it's also a very inefficient data structure for your task. Use numpy arrays; they're much faster and more memory efficient. There's a ton of tutorials, including scikit-learn's, that demonstrate how to use numpy in this context. In essence, you create two arrays: one with the training data, a second with the labels. Example: if the speaker omersk "produces" 50000 MFCC vectors, you will get a (50000, 13) training array. The corresponding label array would be 50000 entries with a single constant value (id) that corresponds to the speaker (say, omersk is 0, lucas is 1 and so on). I'd consider taking longer windows (perhaps 200 ms; experiment!) to reduce the variance.
Don't forget to split your data into training, validation and test sets. You will have more than enough data. Also, for this exercise I'd watch out for not feeding too much data from any single speaker - or take steps to make sure the algorithm is not biased.
Later, when you make a prediction, you will again compute MFCCs for the speaker. With a 10 s recording, 200 ms windows and 100 ms overlap, you'll get 99 MFCC vectors, shape (99, 13). The model should run on each of the 99 vectors, producing a probability for each. When you sum them (and normalise, to make it nice) and take the top value, you'll get the most likely speaker.
There's a dozen other things that would typically be taken into account, but in this case (homework) I'd focus on getting the basics right.
EDIT: I decided to take a stab at creating the model with your idea at heart, but with the basics fixed. It's not exactly clean Python, because it's adapted from a Jupyter Notebook I was running.
import python_speech_features
import scipy.io.wavfile as wav
import numpy as np
import glob
import os
from collections import defaultdict
from sklearn.neural_network import MLPClassifier
from sklearn import preprocessing
from sklearn.model_selection import cross_validate
from sklearn.ensemble import RandomForestClassifier

audio_files_path = glob.glob('audio/*.wav')
win_len = 0.04  # in seconds
step = win_len / 2
nfft = 2048

mfccs_all_speakers = []
names = []
data = []
for path in audio_files_path:
    fs, audio = wav.read(path)
    if audio.size > 0:
        mfcc = python_speech_features.mfcc(audio, samplerate=fs, winlen=win_len,
                                           winstep=step, nfft=nfft, appendEnergy=False)
        filename = os.path.splitext(os.path.basename(path))[0]
        speaker = filename[:filename.find('_')]
        data.append({'filename': filename,
                     'speaker': speaker,
                     'samples': mfcc.shape[0],
                     'mfcc': mfcc})
    else:
        print(f'Skipping {path} due to 0 file size')

speaker_sample_size = defaultdict(int)
for entry in data:
    speaker_sample_size[entry['speaker']] += entry['samples']

person_with_fewest_samples = min(speaker_sample_size, key=speaker_sample_size.get)
print(person_with_fewest_samples)

max_accepted_samples = int(speaker_sample_size[person_with_fewest_samples] * 0.8)
print(max_accepted_samples)

training_idx = []
test_idx = []
accumulated_size = defaultdict(int)
for entry in data:
    if entry['speaker'] not in accumulated_size:
        training_idx.append(entry['filename'])
        accumulated_size[entry['speaker']] += entry['samples']
    elif accumulated_size[entry['speaker']] < max_accepted_samples:
        accumulated_size[entry['speaker']] += entry['samples']
        training_idx.append(entry['filename'])

X_train = []
label_train = []
X_test = []
label_test = []
for entry in data:
    if entry['filename'] in training_idx:
        X_train.append(entry['mfcc'])
        label_train.extend([entry['speaker']] * entry['mfcc'].shape[0])
    else:
        X_test.append(entry['mfcc'])
        label_test.extend([entry['speaker']] * entry['mfcc'].shape[0])

X_train = np.concatenate(X_train, axis=0)
X_test = np.concatenate(X_test, axis=0)
assert (X_train.shape[0] == len(label_train))
assert (X_test.shape[0] == len(label_test))
print(f'Training: {X_train.shape}')
print(f'Testing: {X_test.shape}')

le = preprocessing.LabelEncoder()
y_train = le.fit_transform(label_train)
y_test = le.transform(label_test)

clf = MLPClassifier(solver='lbfgs', alpha=1e-2, hidden_layer_sizes=(5, 3), random_state=42, max_iter=1000)
cv_results = cross_validate(clf, X_train, y_train, cv=4)
print(cv_results)
{'fit_time': array([3.33842635, 4.25872731, 4.73704267, 5.9454329 ]),
'score_time': array([0.00125694, 0.00073504, 0.00074005, 0.00078583]),
'test_score': array([0.40380048, 0.52969121, 0.48448687, 0.46043165])}
The test_score isn't stellar. There's a lot to improve (for starters, the choice of algorithm), but the basics are there. Notice how I get the training samples: it's not random, I only consider recordings as a whole. You can't put samples from a given recording into both training and test, as the test set is supposed to be novel.
What was not working in your code? I'd say a lot. You were taking 200 ms samples and yet a very short FFT. python_speech_features most likely complained that the FFT should be longer than the frame you're processing.
I leave testing the model to you. It won't be good, but it's a starting point.
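As a rough starting point for that testing step, here is a sketch of my own (building on the clf, X_train, y_train, X_test, y_test, le, data and training_idx variables defined above; treat it as an illustration, not a polished evaluation):

clf.fit(X_train, y_train)
print(f'Per-frame accuracy: {clf.score(X_test, y_test):.3f}')

# per-recording prediction: aggregate frame probabilities as suggested earlier
for entry in data:
    if entry['filename'] not in training_idx:        # held-out recordings only
        proba = clf.predict_proba(entry['mfcc'])      # shape (n_frames, n_speakers)
        votes = proba.sum(axis=0) / proba.sum()       # normalised, summed probabilities
        predicted = le.inverse_transform([votes.argmax()])[0]
        print(f"{entry['filename']} -> {predicted} (true: {entry['speaker']})")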

Apply function along time dimension of XArray

I have an image stack stored in an xarray DataArray with dimensions time, x, y, on which I'd like to apply a custom function along the time axis of each pixel, such that the output is a single image of dimensions x, y.
I have tried apply_ufunc, but it fails, stating that I need to first load the data into RAM (i.e. it cannot use a Dask array). Ideally, I'd like to keep the DataArray backed by Dask arrays internally, as it isn't possible to load the entire stack into RAM. The exact error message is:
ValueError: apply_ufunc encountered a dask array on an argument, but handling for dask arrays has not been enabled. Either set the dask argument or load your data into memory first with .load() or .compute()
My code currently looks like this:
import numpy as np
import xarray as xr
import pandas as pd

def special_mean(x, drop_min=False):
    s = np.sum(x)
    n = len(x)
    if drop_min:
        s = s - x.min()
        n -= 1
    return s/n

times = pd.date_range('2019-01-01', '2019-01-10', name='time')
data = xr.DataArray(np.random.rand(10, 8, 8), dims=["time", "y", "x"], coords={'time': times})
data = data.chunk({'time': 10, 'x': 1, 'y': 1})

res = xr.apply_ufunc(special_mean, data, input_core_dims=[["time"]], kwargs={'drop_min': True})
If I do load the data into RAM using .compute, I still end up with an error, which states:
ValueError: applied function returned data with unexpected number of dimensions: 0 vs 2, for dimensions ('y', 'x')
I'm not entirely sure what I am missing or doing wrong.
def special_mean(x, drop_min=False):
    s = np.sum(x)
    n = len(x)
    if drop_min:
        s = s - x.min()
        n -= 1
    return s/n

times = pd.date_range('2019-01-01', '2019-01-10', name='time')
data = xr.DataArray(np.random.rand(10, 8, 8), dims=["time", "y", "x"], coords={'time': times})
data = data.chunk({'time': 10, 'x': 1, 'y': 1})

res = xr.apply_ufunc(special_mean, data, input_core_dims=[["time"]], kwargs={'drop_min': True},
                     dask='allowed', vectorize=True)
The code above using the vectorize argument should work.
My aim was also to use xarray's apply_ufunc to compute the special mean across x and y.
I enjoyed Ales' example, of course omitting the line related to the chunking; otherwise:
ValueError: applied function returned data with unexpected number of dimensions. Received 0 dimension(s) but expected 2 dimensions with names: ('y', 'x')
Interestingly, I realized that, in situations where you want the output of apply_ufunc to be 3D instead of 2D, you need to add output_core_dims=[["time"]] to the apply_ufunc call.
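For what it's worth, here is a small sketch of what that looks like (my illustration, assuming the data object above has not been chunked, as suggested): when the applied function returns one value per time step, the time dimension is kept and the output stays 3D.

def demean(x):
    # returns an array with the same length as the input time axis
    return x - x.mean()

res3d = xr.apply_ufunc(demean, data,
                       input_core_dims=[["time"]],
                       output_core_dims=[["time"]],   # keep 'time' in the result -> 3D output
                       vectorize=True)
print(res3d.dims)   # ('y', 'x', 'time')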

Xarray Dask.delayed slow: how to quickly select/interpolate between two datasets

I have two datasets (called satdata and atmosdata). Atmosdata is evenly gridded in latitude and longitude. Atmosdata has the dimensions (latitude: 713, level: 37, longitude: 1440, time: 72) and a total size of 12 GB. Atmosdata has several variables such as temperature, humidity, etc. Humidity has the shape (time, level, latitude, longitude).
Satdata contains satellite observations and has the dimensions (across_track: 90, channel: 3, time: 32195), with 90*3*32195 = 8692650 data points. Across_track is the satellite FOV across-track position. Satdata is not evenly gridded in latitude/longitude. For example, satdata.latitude has dimensions (time, channel, across_track), and the same holds for satdata.longitude and satdata.sft.
The 'time' variable in atmosdata and satdata covers the same day, but with different values in the two datasets. I need to find the atmosdata (humidity and temperature, for example) that has the same latitude, longitude and time as satdata.
To achieve this, I iterate over satdata to find the location and time of each observation; then I find the corresponding atmosdata (first the grid point nearest to the satellite data location, then interpolated to the satellite time). Finally I concatenate the resulting atmosdata from all the iterations into one dataset.
A part of my code is as follows, using a small piece of data.
import xarray as xr
import numpy as np
import dask
import pandas as pd

# define a simple satdata dataset
lon = np.array([[18.717, 18.195, 17.68],
                [18.396, 17.87, 17.351]])
lat = np.array([[-71.472, -71.635, -71.802],
                [-71.52, -71.682, -71.847]])
sft = np.array([[1, 1, 1],
                [1, 1, 1]])
time = np.array(['2010-09-07T00:00:01.000000000', '2010-09-07T00:00:03.000000000'], dtype='datetime64[ns]')
satdata = xr.Dataset({'sft': (['time', 'across_track'], sft)},
                     coords={'longitude': (['time', 'across_track'], lon),
                             'latitude': (['time', 'across_track'], lat),
                             'time': time})

# atmosdata
atmoslat = np.array([-71.75, -71.5, -71.25, -71., -70.75, -70.5, -70.25, -70., -69.75])
atmoslon = np.array([17.25, 17.5, 17.75, 18., 18.25, 18.5, 18.75, 19., 19.25])
atmostime = np.array(['2010-09-07T00:00:00.000000000', '2010-09-07T01:00:00.000000000'], dtype='datetime64[ns]')
atmosq = np.random.rand(2, 9, 9)
atmosdata = xr.Dataset({'q': (['time', 'latitude', 'longitude'], atmosq)},
                       coords={'longitude': (['longitude'], atmoslon),
                               'latitude': (['latitude'], atmoslat),
                               'time': (['time'], atmostime)})

# do the matching:
matched = dask.compute(match(atmosdata, satdata), scheduler='processes', num_workers=20)[0]
The matching function is as below:
@dask.delayed
def match(atmosdata, satdata):
    newatmos = []
    newind = 0
    # iterate over satdata
    for i in np.ndenumerate(satdata.sft):
        if i[1] != np.nan:
            # find one latitude and longitude of satellite data
            temp_lat = satdata.latitude.isel(time=[i[0][0]], across_track=[i[0][1]])
            temp_lon = satdata.longitude.isel(time=[i[0][0]], across_track=[i[0][1]])
            # find the atmosdata in the grid nearest to this location
            temp_loc = atmosdata.sel(latitude=temp_lat.values.ravel()[0],
                                     longitude=temp_lon.values.ravel()[0],
                                     method='nearest')
            if temp_loc.q.all() > 0:
                # find atmosdata at the satellite time by interpolation
                temp_time = satdata.time.isel(time=[i[0][0]])
                newatmos.append(temp_loc.interp(time=temp_time.data.ravel()))
                newind += 1
    return xr.concat(newatmos, dim=pd.Index(range(newind), name='NewInd'))
1) When I launch the code with this small dataset, it works. But when I use my original data (with the dimensions mentioned above) instead, the computation fails:
---> 52 matched = dask.compute(match(ecmwfdata, ssmis_land), scheduler='processes', num_workers=20 )
error: 'i' format requires -2147483648 <= number <= 2147483647
2) If I use datasets with other dimensions, satdata (across_track: 90, channel: 3, time: 100) and atmosdata (latitude: 71, level: 37, longitude: 1440, time: 72), the computation takes a very long time. I guess my code is not optimal for using Dask to solve this problem quickly.
3) Is there a better way than using the for loop? The for loop may not take advantage of Dask for the purpose of quick computation.
4) Would it be a good idea to chunk the satdata, then find the limits of latitude and longitude in each chunk of satdata, chunk the atmosdata according to these limits, and finally apply the match function to each pair of satdata/atmosdata chunks? If this is a good idea, I do not know yet how to iterate over each chunk of satdata manually...
5) The function uses two arguments, satdata and atmosdata. Since these two datasets can be quite big (12 GB for atmosdata), would the computation be slower?
6) In the function I had to use .values in the selection; would this slow down the computation when using large input data?
Thanks in advance!
Best regards
Xiaoni
I'm not really sure that I can help, other than to say that I think you need to use a resampling package that uses distance trees.
https://github.com/pytroll/pyresample
Pyresample has a swath grid that should do what you want.
I have used https://github.com/storpipfugl/pykdtree to find nearest points.
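For what it's worth, here is a rough sketch (my own, not tested on the poster's data) of the kind of vectorised nearest-neighbour lookup these packages enable, here with pykdtree; note that treating lat/lon as planar coordinates is only an approximation, which pyresample's swath utilities handle more properly:

import numpy as np
from pykdtree.kdtree import KDTree

# build a KD-tree over the regular atmosdata grid (lat/lon flattened to points)
grid_lat, grid_lon = np.meshgrid(atmosdata.latitude.values, atmosdata.longitude.values, indexing='ij')
tree = KDTree(np.column_stack([grid_lat.ravel(), grid_lon.ravel()]))

# query every satellite pixel at once instead of looping one by one
sat_points = np.column_stack([satdata.latitude.values.ravel(),
                              satdata.longitude.values.ravel()])
dist, idx = tree.query(sat_points, k=1)   # index of the nearest grid cell for each pixel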

Why does FFT not have an effect on my smoothed signal?

I'm playing with FFT at the moment and I'm trying to extract periods from noisy signals by recreating this example. While experimenting, I've noticed that after smoothing a quite noisy signal, the result of fft() is actually the same signal again - which is what I don't understand.
Here is a full example which can be run in an IPython Notebook (you can create a notebook here and run the code if you want).
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

figsize = (16, 8)
n = 500
ls = np.linspace(0, 2*np.pi, n)

x_target = np.sin(12*ls) + np.sin(52*ls)

x = np.sin(12*ls) + np.sin(52*ls) + np.random.rand(n) * 3.5
x = x - np.mean(x)

x_smooth = pd.DataFrame(x).rolling(14).mean().replace(np.nan, 0.0).values
x_smooth = x_smooth - np.mean(x_smooth)
x_smooth = np.roll(x_smooth, -7)

# Getting shwifty and showing what we've got
plt.figure(figsize=(16, 8))
plt.scatter(ls, x, s=3, c=[1.0, 0.0, 0.0, 1.0])
plt.plot(ls, x_target, color=[1.0, 0.0, 0.0, 0.3])
plt.plot(ls, x_smooth)
plt.legend(["Noisy Data", "Target", "Smooth"])

# Target
x_fft = np.abs(np.fft.fft(x_target))
pd.DataFrame(x_fft).plot(figsize=figsize)
# Looks like it should

x_fft = np.abs(np.fft.fft(x))
pd.DataFrame(x_fft).plot(figsize=figsize)

# Plots the same signal?
x_fft = np.abs(np.fft.fft(x_smooth))
pd.DataFrame(x_fft).plot(figsize=figsize)
Below you find the resulting plots of this script.
Noisy data with smoothed signal:
FFT of the target function
FFT of the noisy data
FFT of the smoothed data
I don't really get why this is the case here. Can somebody explain this to me or am I doing something wrong here?
The critical difference is between:
x_fft = np.abs(np.fft.fft(x_smooth))
and
x_fft = np.abs(np.fft.fft(x_smooth.flatten()))
because it seems that x_smooth has gotten itself all 2-dimensional somewhere along the way. Its shape is (500, 1), and because np.fft.fft works by default along axis=-1 (i.e. the highest dimension), it is taking 500 separate FFTs of 500 different 1-sample signals. (Unsurprisingly enough, that returns only the DC component for each, so put them all together and you end up with the same signal you started with.)
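To see the axis effect in isolation, here is a tiny sketch (my addition, not from the original answer):

import numpy as np

sig = np.random.rand(500, 1)                     # same shape as x_smooth
print(np.abs(np.fft.fft(sig)).shape)             # (500, 1): 500 FFTs of 1-sample signals -> DC only
print(np.abs(np.fft.fft(sig.flatten())).shape)   # (500,): one FFT of the whole signal
print(np.abs(np.fft.fft(sig, axis=0)).shape)     # (500, 1): equivalent fix, transform along axis 0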
The FFT of the flattened smoothed signal then shows the expected peaks (plot not reproduced here).

Calculate the spatial dimension of a graph

Given a graph (say fully-connected), and a list of distances between all the points, is there an available way to calculate the number of dimensions required to instantiate the graph?
E.g. by construction, say we have graph G with points A, B, C and distances AB=BC=CA=1. Starting from A (0 dimensions) we add B at distance 1 (1 dimension), now we find that a 2nd dimension is needed to add C and satisfy the constraints. Does code exist to do this and spit out (in this case) dim(G) = 2?
E.g. if the points are photos, and the distances between them calculated by the Gist algorithm (http://people.csail.mit.edu/torralba/code/spatialenvelope/), I would expect the derived dimension to match the number image parameters considered by Gist.
Added: here is a 5-d python demo based on the suggestion - seemingly perfect!
'similarities' is the distance matrix.
import numpy as np
from sklearn import manifold

similarities = [[0., 1., 1., 1., 1., 1.],
                [1., 0., 1., 1., 1., 1.],
                [1., 1., 0., 1., 1., 1.],
                [1., 1., 1., 0., 1., 1.],
                [1., 1., 1., 1., 0., 1.],
                [1., 1., 1., 1., 1., 0.]]

seed = np.random.RandomState(seed=3)
for i in [1, 2, 3, 4, 5]:
    mds = manifold.MDS(n_components=i, max_iter=3000, eps=1e-9, random_state=seed,
                       dissimilarity="precomputed", n_jobs=1)
    print("%d %f" % (i, mds.fit(similarities).stress_))
Output:
1 3.333333
2 1.071797
3 0.343146
4 0.151531
5 0.000000
I find that when I apply this method to a subset of my data (distances between 329 pictures with '11' in the file name, using two different metrics), the stress doesn't decrease to 0 as I'd expect from the above - it levels off after about 5 dimensions. (On the SURF results I tried doubling max_iter and varying eps by an order of magnitude each way, without changing the results in the first four digits.)
It turns out the distances do not satisfy the triangle inequality in ~0.02% of the triangles, with the average violation roughly equal to 8% of the average distance, for one metric examined.
Overall I prefer the fractal dimension of the sorted distances, since it doesn't require picking a cutoff. I'm marking the MDS response as the answer because it works for the consistent case. My results for the fractal dimension and the MDS case are below.
Another descriptive statistic turns out to be the triangle violations; results for this are further below. If anyone could generalize either to higher dimensions, that would be very interesting (see my results and first-timer Python below :-).
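For reference, here is a minimal sketch of Higuchi's fractal-dimension estimate applied to a 1-D series such as the sorted distances (my own illustration; the Frac_d values in the tables below came from the IQM implementation, not from this code):

import numpy as np

def higuchi_fd(x, kmax=10):
    x = np.asarray(x, dtype=float)
    N = len(x)
    L = []
    for k in range(1, kmax + 1):
        Lm = []
        for m in range(k):
            idx = np.arange(m, N, k)               # subsampled curve starting at offset m
            if len(idx) < 2:
                continue
            length = np.abs(np.diff(x[idx])).sum()
            norm = (N - 1) / ((len(idx) - 1) * k)  # Higuchi's normalisation factor
            Lm.append(length * norm / k)
        L.append(np.mean(Lm))
    # fractal dimension = slope of log L(k) versus log(1/k)
    slope, _ = np.polyfit(np.log(1.0 / np.arange(1, kmax + 1)), np.log(L), 1)
    return slope

# e.g. applied to the sorted upper-triangle distances of a distance matrix `sim`:
# frac_d = higuchi_fd(np.sort(sim[np.triu_indices_from(sim, k=1)]))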
MDS results, ignoring the triangle inequality issue:
N_dim   stress_ (SURF_match)     stress_ (GIST_match)
1       83859853704.027344       913512153794.477295
2       24402474549.902721       238300303503.782837
3       14335187473.611954       107098797170.304825
4       10714833228.199451       67612051749.697998
5       9451321873.828577        49802989323.714806
6       8984077614.154467        40987031663.725784
7       8748071137.806602        35715876839.391762
8       8623980894.453981        32780605791.135693
9       8580736361.368249        31323719065.684353
10      8558536956.142039        30372127335.209297
100     8544120093.395177        28786825401.178596
1000    8544192695.435946        28786840008.666389
Forging ahead with that to devise a metric to compare the dimensionality of the two results, an ad hoc choice is to set the criterion to
1.1 * stress_at_dim=100
resulting in the proposition that SURF_match has a quasi-dimension of 5..6, while GIST_match has a quasi-dimension of 8..9. I'm curious whether anyone thinks that means anything :-). Another question is whether there is any meaningful interpretation of the relative magnitudes of stress at any dimension for the two metrics. Here are some results to put it in perspective. Frac_d is the fractal dimension of the sorted distances, calculated according to Higuchi's method using code from IQM; Dim is the dimension as described above.
Method Frac_d Dim stress(100) stress(1)
Lab_CIE94 1.1458 3 2114107376961504.750000 33238672000252052.000000
Greyscale 1.0490 8 42238951082.465477 1454262245593.781250
HS_12x12 1.0889 19 33661589105.972816 3616806311396.510254
HS_24x24 1.1298 35 16070009781.315575 4349496176228.410645
HS_48x48 1.1854 64 7231079366.861403 4836919775090.241211
GIST 1.2312 9 28786830336.332951 997666139720.167114
HOG_250_words 1.3114 10 10120761644.659481 150327274044.045624
HOG_500_words 1.3543 13 4740814068.779779 70999988871.696045
HOG_1k_words 1.3805 15 2364984044.641845 38619752999.224922
SIFT_1k_words 1.5706 11 1930289338.112194 18095265606.237080
SURFFAST_200w 1.3829 8 2778256463.307569 40011821579.313110
SRFFAST_250_w 1.3754 8 2591204993.421285 35829689692.319153
SRFFAST_500_w 1.4551 10 1620830296.777577 21609765416.960484
SURFFAST_1k_w 1.5023 14 949543059.290031 13039001089.887533
SURFFAST_4k_w 1.5690 19 582893432.960562 5016304129.389058
Looking at the Pearson correlation between columns of the table:
Pearson correlation 2-tailed p-value
FracDim, Dim: (-0.23333296587402277, 0.40262625206429864)
Dim, Stress(100): (-0.24513480360257348, 0.37854224076180676)
Dim, Stress(1): (-0.24497740363489209, 0.37885820835053186)
Stress(100),S(1): ( 0.99999998200931084, 8.9357374620135412e-50)
FracDim, S(100): (-0.27516440489210137, 0.32091019789264791)
FracDim, S(1): (-0.27528621200454373, 0.32068731053608879)
I naively wonder how all correlations but one can be negative, and what conclusions can be drawn. Using this code:
import sys
import numpy as np
from scipy.stats.stats import pearsonr

file = sys.argv[1]
col1 = int(sys.argv[2])
col2 = int(sys.argv[3])
arr1 = []
arr2 = []
with open(file, "r") as ins:
    for line in ins:
        words = line.split()
        arr1.append(float(words[col1]))
        arr2.append(float(words[col2]))
narr1 = np.array(arr1)
narr2 = np.array(arr2)
# normalize
narr1 -= narr1.mean(0)
narr2 -= narr2.mean(0)
# standardize
narr1 /= narr1.std(0)
narr2 /= narr2.std(0)
print(pearsonr(narr1, narr2))
On to the number of violations of the triangle inequality by the various metrics, all for the 329 pics with '11' in their sequence:
(1) n_violations/triangles
(2) avg violation
(3) avg distance
(4) avg violation / avg distance
n_vio (1) (2) (3) (4)
lab 186402 0.031986 157120.407286 795782.437570 0.197441
grey 126902 0.021776 1323.551315 5036.899585 0.262771
600px 120566 0.020689 1339.299040 5106.055953 0.262296
Gist 69269 0.011886 1252.289855 4240.768117 0.295298
RGB
12^3 25323 0.004345 791.203886 7305.977862 0.108295
24^3 7398 0.001269 525.981752 8538.276549 0.061603
32^3 5404 0.000927 446.044597 8827.910112 0.050527
48^3 5026 0.000862 640.310784 9095.378790 0.070400
64^3 3994 0.000685 614.752879 9270.282684 0.066314
98^3 3451 0.000592 576.815995 9409.094095 0.061304
128^3 1923 0.000330 531.054082 9549.109033 0.055613
RGB/600px
12^3 25190 0.004323 790.258158 7313.379003 0.108057
24^3 7531 0.001292 526.027221 8560.853557 0.061446
32^3 5463 0.000937 449.759107 8847.079639 0.050837
48^3 5327 0.000914 645.766473 9106.240103 0.070915
64^3 4382 0.000752 634.000685 9272.151040 0.068377
128^3 2156 0.000370 544.644712 9515.696642 0.057236
HueSat
12x12 7882 0.001353 950.321873 7555.464323 0.125779
24x24 1740 0.000299 900.577586 8227.559169 0.109459
48x48 1137 0.000195 661.389622 8653.085004 0.076434
64x64 1134 0.000195 697.298942 8776.086144 0.079454
HueSat/600px
12x12 6898 0.001184 943.319078 7564.309456 0.124707
24x24 1790 0.000307 908.031844 8237.927256 0.110226
48x48 1267 0.000217 693.607735 8647.060308 0.080213
64x64 1289 0.000221 682.567106 8761.325172 0.077907
hog
250 53782 0.009229 675.056004 1968.357004 0.342954
500 18680 0.003205 559.354979 1431.803914 0.390665
1k 9330 0.001601 771.307074 970.307130 0.794910
4k 5587 0.000959 993.062824 650.037429 1.527701
sift
500 26466 0.004542 1267.833182 1073.692611 1.180816
1k 16489 0.002829 1598.830736 824.586293 1.938949
4k 10528 0.001807 1918.068294 533.492373 3.595306
surffast
250 38162 0.006549 630.098999 1006.401837 0.626091
500 19853 0.003407 901.724525 830.596690 1.085635
1k 10659 0.001829 1310.348063 648.191424 2.021545
4k 8988 0.001542 1488.200156 419.794008 3.545072
Anyone capable of generalizing to higher dimensions? Here is my first-timer code:
import sys
import time
import math
import numpy as np
import sortedcontainers
from sortedcontainers import SortedSet
from sklearn import manifold

seed = np.random.RandomState(seed=3)
pairs = sys.argv[1]
ss = SortedSet()
print(time.strftime("%H:%M:%S"), "counting/indexing")
sys.stdout.flush()
with open(pairs, "r") as ins:
    for line in ins:
        words = line.split()
        ss.add(words[0])
        ss.add(words[1])
N = len(ss)
print(time.strftime("%H:%M:%S"), "size ", N)
sys.stdout.flush()
sim = np.diag(np.zeros(N))
dtot = 0.0
with open(pairs, "r") as ins:
    for line in ins:
        words = line.split()
        i = ss.index(words[0])
        j = ss.index(words[1])
        # val = math.log(float(words[2]))
        # val = math.sqrt(float(words[2]))
        val = float(words[2])
        sim[i][j] = val
        sim[j][i] = val
        dtot += val
avgd = dtot / (N * (N - 1))
ntri = 0
nvio = 0
vio = 0.0
for i in range(1, N):
    for j in range(i + 1, N):
        d1 = sim[i][j]
        for k in range(j + 1, N):
            ntri += 1
            d2 = sim[i][k]
            d3 = sim[j][k]
            dd = d1 + d2
            diff = d3 - dd
            if diff > 0.0:
                nvio += 1
                vio += diff
avgvio = 0.0
if nvio > 0:
    avgvio = vio / nvio
print("tot: %d %f %f %f %f" % (nvio, (float(nvio) / ntri), avgvio, avgd, (avgvio / avgd)))
Here is how I tried sklearn's Isomap:
from sklearn.metrics.pairwise import euclidean_distances

for i in [1, 2, 3, 4, 5]:
    # nbrs < points
    iso = manifold.Isomap(n_neighbors=nbrs, n_components=i,
                          eigen_solver="auto", tol=1e-9, max_iter=3000,
                          path_method="auto", neighbors_algorithm="auto")
    dis = euclidean_distances(iso.fit(sim).embedding_)
    stress = ((dis.ravel() - sim.ravel()) ** 2).sum() / 2
Given a graph (say fully-connected), and a list of distances between all the points, is there an available way to calculate the number of dimensions required to instantiate the graph?
Yes. The more general topic this problem would be part of, in terms of graph theory, is called "Graph Embedding".
E.g. by construction, say we have graph G with points A, B, C and distances AB=BC=CA=1. Starting from A (0 dimensions) we add B at distance 1 (1 dimension), now we find that a 2nd dimension is needed to add C and satisfy the constraints. Does code exist to do this and spit out (in this case) dim(G) = 2?
This is almost exactly the way that Multidimensional Scaling works.
Multidimensional scaling (MDS) would not exactly answer the question "How many dimensions would I need to represent this point cloud / graph?" with a number, but it returns enough information to approximate it.
Multidimensional scaling methods will attempt to find a "good mapping" to reduce the number of dimensions, say from 120 (in the original space) down to 4 (in another space). So, in a way, you can iteratively try different embeddings for an increasing number of dimensions and look at the "stress" (or error) of each embedding. The number of dimensions you are after is the first number for which there is an abrupt minimisation of the error.
Due to the way it works, classical MDS can return a vector of eigenvalues for the new mapping. By examining this vector of eigenvalues you can determine how many of its entries you would need to retain to achieve a (good enough, or low-error) representation of the original dataset.
The key concept here is the "similarity" matrix, which is a fancy name for a graph's distance matrix (which you already seem to have), irrespective of its semantics.
Embedding algorithms, in general, try to find an embedding that may look different, but at the end of the day the point cloud in the new space will end up having a similar (depending on how much error we can afford) distance matrix.
In terms of code, I am sure that there is something available in all major scientific computing packages but off the top of my head I can point you towards Python and MATLAB code examples.
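For instance, here is a minimal numpy sketch (my own illustration, not one of the examples linked above) of the classical-MDS eigenvalue check just described:

import numpy as np

def classical_mds_eigenvalues(D):
    # D is an (n, n) matrix of pairwise distances
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                # double-centred Gram matrix
    return np.linalg.eigvalsh(B)[::-1]         # eigenvalues, largest first

D = np.ones((6, 6)) - np.eye(6)                # the six equidistant points used above
print(classical_mds_eigenvalues(D))            # five positive eigenvalues -> 5 dimensions needed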
E.g. if the points are photos, and the distances between them calculated by the Gist algorithm (http://people.csail.mit.edu/torralba/code/spatialenvelope/), I would expect the derived dimension to match the number image parameters considered by Gist
Not exactly. This is a very good use case though. In this case, what MDS would return, or what you would be probing with dimensionality reduction in general, is how many of these features seem to be required to represent your dataset. Therefore, depending on the scenes, or depending on the dataset, you might realise that not all of these features are necessary for a good enough representation of the whole dataset. (In addition, you might want to have a look at this link as well.)
Hope this helps.
First, you can assume that any dataset has a dimensionality of at most 4 or 5. To get more relevant dimensions, you would need on the order of a million elements (or something like that).
Apparently, you have already computed a distance. Are you sure it is actually a relevant metric? Is it efficient for images that are quite distant? Perhaps you can try Isomap (geodesic distance, starting from only close neighbors) and see whether your embedded space may not actually be Euclidean.
