Pyro changes dimension of Discrete latent variable when using NUTS (MCMC) sampler - probabilistic-programming

Thanks for taking the time to read my issue, given below.
The issue I need help with is that the dimension of my Binomial distribution output changes (automatically) during the second iteration when I run the model with the NUTS sampler. Because of this, the rest of my code (not given here) throws a dimension mismatch error.
If I run the model function by just calling it directly (without the sampler), it works fine, even if I call it repeatedly. It only fails when I use the sampler.
I replicated the issue with the smaller, simpler code below (it isn't my actual code, but it reproduces the issue).
The packages I imported:
import pyro
import pyro.distributions as dist
import torch
import pyro.poutine as poutine
from pyro.infer import MCMC, NUTS
The version of Pyro is 1.5 and PyTorch is 1.7
The Model
def model():
    print("***** Start ****")
    prior = torch.ones(5) / 5
    print("Prior", prior)
    a = pyro.sample("a", dist.Binomial(1, prior))
    print("A", a)
    b = pyro.sample("b", dist.Binomial(1, a))
    print("B", b)
    print("***** End *****")
    return b

def conditioned_model(model, data):
    print("**** Condition Model **** ")
    return poutine.condition(model, data={"b": data})()
data = model()
Output when I call the model directly to generate simulated data
***** Start ****
Prior tensor([0.2000, 0.2000, 0.2000, 0.2000, 0.2000])
A tensor([0., 1., 0., 0., 0.])
B tensor([0., 1., 0., 0., 0.])
***** End *****
MCMC Sampler code
nuts_kernel = NUTS(conditioned_model, jit_compile=False)
mcmc = MCMC(nuts_kernel,
            num_samples=1,
            warmup_steps=1,
            num_chains=1)
mcmc.run(model, data)
Output when I run MCMC sampler (above code)
Warmup: 0%| | 0/2 [00:00, ?it/s]
**** Condition Model ****
***** Start ****
Prior tensor([0.2000, 0.2000, 0.2000, 0.2000, 0.2000])
A tensor([1., 0., 0., 0., 1.])
B tensor([0., 1., 0., 0., 0.])
***** End *****
**** Condition Model ****
***** Start ****
Prior tensor([0.2000, 0.2000, 0.2000, 0.2000, 0.2000])
A tensor([0., 1.])
B tensor([0., 1., 0., 0., 0.])
***** End *****
In the above output, please observe the dimension of variable A. Initially it has size 5, and later it becomes 2. Because of this, the rest of my code in the DINA model gives an error.
In the above code, variable A is based on the prior variable, and the dimension of prior is 5. As I understand it, A should therefore always have size 5. Please help me understand why it changes to 2 and how I can prevent this from happening.
Also, what I am not able to understand is why the dimension of B always remains 5. In the above code, B takes A as input, but B's dimension doesn't change even when A's does.
Thanks a lot for the help.

I found another discussion regarding this issue.
It seems to me that the issue in my code is that NUTS tries to integrate out discrete random variables. Hence, I cannot apply control flow based on a discrete random variable. See here for more information: Error with NUTS when random variable is used in control flow
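For what it's worth, here is a minimal sketch of one thing I plan to try (my own guess, not a confirmed fix): declaring the size-5 dimension with pyro.plate, so that when NUTS enumerates "a" the enumerated values are broadcast along a new dimension instead of replacing the size-5 one. The plate name "items" is just a placeholder.
def model():
    prior = torch.ones(5) / 5
    # Declare the 5 entries as a batch dimension; enumeration then adds
    # its own dimension to the left rather than reshaping this one.
    with pyro.plate("items", 5):
        a = pyro.sample("a", dist.Binomial(1, prior))
        b = pyro.sample("b", dist.Binomial(1, a))
    return b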

Related

Restricting prediction range of sklearn regressor

Let's say I have the dataframe below, which describes the course of two cases.
import pandas as pd
data = {
    'case': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    'duration': [2, 4, 6, 7, 9, 1, 5, 6, 9, 13],
    'total_duration': [10, 10, 10, 10, 10, 14, 14, 14, 14, 14],
    'stage': ['1', '2', '2', '3', '4', '1', '1', '3', '4', '4']
}
df = pd.DataFrame(data)
Imagine I want to predict the duration of case 2 based on the duration of case 1. For this I could set up the following code.
train = df[df['case'] == 1]
test = df[df['case'] == 2]
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor()
X = ['duration','stage']
y = ['total_duration']
train_X, train_y = train[X], train[y]
test_X, test_y = test[X], test[y]
model.fit(train_X,train_y)
model.predict(test_X)
output: array([10., 10., 10., 10., 10.])
Because the dataset is so small, the model naively predicts the total duration of case 2 to be the same as that of case 1. However, the prediction is not feasible for one data point, where the current duration of the case is already 13, which exceeds the predicted total duration of 10.
Is there a way to restrict the model so that it does not predict a total duration lower than the current duration? That would give the following output:
output: array([10., 10., 10., 10., 13.])
This may not be an ideal way to predict such a feature, and an alternative might be to predict duration_left. But that would add a trend to my target variable, which is what I want to avoid.
Is there a way I can achieve the goal mentioned above in sklearn?
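For reference, the output I want can at least be produced by post-processing the predictions outside the model; here is a minimal sketch (it simply clips each prediction to that row's current duration, which may or may not be acceptable modelling-wise):
import numpy as np

# Clip each predicted total_duration so it is never below the row's current duration.
pred = model.predict(test_X)
pred = np.maximum(pred, test_X['duration'].to_numpy())
print(pred)  # array([10., 10., 10., 10., 13.])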

How to do InverseDynamics with a floating base robot?

I tried CalcInverseDynamics, but the returned tau is an 18-dimensional vector, 6 (floating base) + 12 (actuators), whereas I expected it to be 12 (equal to the number of actuators). Is there any example of doing InverseDynamics with a floating-base robot using known_vdot and contact force trajectories?
I tried with the LittLeDog.urdf model. My code is:
def DoID():
    legs = [plant.GetBodyByName("front_left_lower_leg"),
            plant.GetBodyByName("front_right_lower_leg"),
            plant.GetBodyByName("back_left_lower_leg"),
            plant.GetBodyByName("back_right_lower_leg")]
    contacts = [foot_frame[0].CalcPoseInBodyFrame(plant_context).translation()
                for i in range(4)]
    F_expected = np.array([0., 0., 0., 0., 0., 0.])
    forces = MultibodyForces(plant)
    # add SpatialForce applied to legs into MultibodyForces
    for i in range(4):
        legs[i].AddInForce(
            plant_context, p_BP_E=contacts[i],
            F_Bp_E=SpatialForce(F=F_expected),
            frame_E=plant.world_frame(), forces=forces)
    nv = plant.num_velocities()
    vd_d = np.zeros(nv)
    tau = plant.CalcInverseDynamics(plant_context, vd_d, forces)
    return tau
Update:
The CalcInverseDynamics API documentation writes:
tau = M(q)v̇ + C(q, v)v - tau_app - ∑ J_WBᵀ(q) Fapp_Bo_W
This should also work for a floating-base robot, where tau splits into a floating-base part (the first 6 entries) and an actuated part (the remaining 12); another reference I found uses different notation but the same equation. My hope is that when the contact forces and the known_vdot (or qddot) are 'reasonable', the floating-base entries become zero and the actuated entries become the joint torque commands. I will use APIs like CalcMassMatrix, CalcBiasTerm and CalcGravityGeneralizedForces to get M(q), C(q, v)v and the gravity generalized forces.
After getting the joint commands, I will use a PD controller (or another controller) to apply them to the robot. A fully functional solution to 'control a desired acceleration' may still need to formulate a QP like http://groups.csail.mit.edu/robotics-center/public_papers/Kuindersma13.pdf, but I will try the simpler way first.
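A rough sketch of what I have in mind (my own guess, not a verified recipe; it uses the plant and plant_context from the code above and a desired acceleration vd_d, and it still ignores the Jᵀ·F contact term, which would have to be subtracted when the feet carry load):
M = plant.CalcMassMatrix(plant_context)                    # mass matrix M(q)
Cv = plant.CalcBiasTerm(plant_context)                     # bias term C(q, v) v
tau_g = plant.CalcGravityGeneralizedForces(plant_context)  # generalized gravity forces
tau_full = M @ vd_d + Cv - tau_g   # 18 entries: 6 floating base + 12 actuated
tau_fb = tau_full[:6]              # should be ~0 if contacts and accelerations are consistent
tau_act = tau_full[6:]             # candidate joint torque commands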
My guess is that you are trying to find a controller that will (approximately) follow a desired acceleration of the entire state vector using only the actuators (for littledog, you have 12 actuators, but 19 positions / 18 velocities)?
In addition, with a legged robot like littledog, you have to think about the contact forces (and their friction cones).
The most common generalization of the inverse dynamics control for situations like this involves solving a quadratic program (using a linearization of the friction cone constraints). See for instance http://groups.csail.mit.edu/robotics-center/public_papers/Kuindersma13.pdf

Image classifier “ValueError: Found array with dim 3. Estimator expected <= 2.”

I get this error when I try to run this code with 3 folders in the dataset. Previously, the code ran perfectly with only 2 folders inside the dataset folder, one called normal and the other covid. In this case I added another one called pneumonia to make it a 3-category image classifier. I'm new to machine learning, so I looked into how to fix this, but every solution I found was for different code. I tried them and they didn't work, which is why I'm asking for your help.
This code doesn't belong to me; it's Adrian Rosebrock's code, and all credit goes to him. It classifies X-ray images as COVID or normal cases, and the idea is to improve it by adding a new category for images with ordinary (non-COVID) pneumonia. That's why I added a new folder to the dataset. Hope you can help me, thanks!
# USAGE
# python train.py --dataset dataset

# import the necessary packages
from tensorflow.keras.utils import to_categorical  # needed for the to_categorical call below
from sklearn.preprocessing import LabelBinarizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from imutils import paths
import numpy as np
import argparse
import cv2
import os

# construct the argument parser and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-d", "--dataset", required=True,
    help="path to input dataset")
ap.add_argument("-p", "--plot", type=str, default="plot.png",
    help="path to output loss/accuracy plot")
ap.add_argument("-m", "--model", type=str, default="covid19.model",
    help="path to output loss/accuracy plot")
args = vars(ap.parse_args())

# initialize the initial learning rate, number of epochs to train for,
# and batch size
INIT_LR = 1e-3
EPOCHS = 1
BS = 8

# grab the list of images in our dataset directory, then initialize
# the list of data (i.e., images) and class labels
print("[INFO] loading images...")
imagePaths = list(paths.list_images(args["dataset"]))
data = []
labels = []

# loop over the image paths
for imagePath in imagePaths:
    # extract the class label from the filename
    label = imagePath.split(os.path.sep)[-2]
    # load the image, swap color channels, and resize it to be a fixed
    # 224x224 pixels while ignoring aspect ratio
    image = cv2.imread(imagePath)
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    image = cv2.resize(image, (224, 224))
    # update the data and labels lists, respectively
    data.append(image)
    labels.append(label)

# convert the data and labels to NumPy arrays while scaling the pixel
# intensities to the range [0, 1]
data = np.array(data) / 255.0
labels = np.array(labels)

# perform one-hot encoding on the labels
lb = LabelBinarizer()
labels = lb.fit_transform(labels)
labels = to_categorical(labels)

# partition the data into training and testing splits using 80% of
# the data for training and the remaining 20% for testing
(trainX, testX, trainY, testY) = train_test_split(data, labels,
    test_size=0.20, stratify=labels, random_state=42)
This is the error message:
[INFO] loading images...
Traceback (most recent call last):
File "train_covid19.py", line 77, in <module>
test_size=0.20, stratify=labels, random_state=42)
File "C:\Users\KQ\anaconda3\lib\site-packages\sklearn\model_selection\_split.py", line 2143, in train_test_split
train, test = next(cv.split(X=arrays[0], y=stratify))
File "C:\Users\KQ\anaconda3\lib\site-packages\sklearn\model_selection\_split.py", line 1737, in split
y = check_array(y, ensure_2d=False, dtype=None)
File "C:\Users\KQ\anaconda3\lib\site-packages\sklearn\utils\validation.py", line 574, in check_array
% (array.ndim, estimator_name))
ValueError: Found array with dim 3. Estimator expected <= 2.
I do not think you need both LabelBinarizer and to_categorical. They do the same thing, so you only need one or the other. Try removing the call to to_categorical.
lb = LabelBinarizer()
labels = lb.fit_transform(labels)
labels = to_categorical(labels) # Remove this line.
You will also need to update the number of categories in your model.
# Change the size from 2 to 3.
headModel = Dense(3, activation="softmax")(headModel)
To avoid the need to change this every time you add or remove categories, you could count the unique labels.
n_labels = len(set(labels))
headModel = Dense(n_labels, activation="softmax")(headModel)
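Since labels has already been binarized into a 2-D array by this point, set(labels) may not work directly; an alternative sketch is to read the class count from the fitted binarizer:
n_labels = len(lb.classes_)
headModel = Dense(n_labels, activation="softmax")(headModel)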
Update
Also note that to_categorical will only work on integer labels. That makes it more like OneHotEncoder than LabelBinarizer.
Here is what it looks like to call everything.
labels = [0, 1, 0, 2]
lb = LabelBinarizer()
binarized = lb.fit_transform(labels)
binarized
# array([[1, 0, 0],
# [0, 1, 0],
# [1, 0, 0],
# [0, 0, 1]])
to_categorical(labels)
# array([[1., 0., 0.],
# [0., 1., 0.],
# [1., 0., 0.],
# [0., 0., 1.]], dtype=float32)
to_categorical(binarized)
# array([[[0., 1.],
# [1., 0.],
# [1., 0.]],
#
# [[1., 0.],
# [0., 1.],
# [1., 0.]],
#
# [[0., 1.],
# [1., 0.],
# [1., 0.]],
#
# [[1., 0.],
# [1., 0.],
# [0., 1.]]], dtype=float32)
Note the 3-dimensional output for the labels, as it tries to one-hot encode each of the 3 parts of the already one-hot encoded data, adding an additional dimension that train_test_split does not know how to handle.
That is why you got a ValueError: Found array with dim 3.

What is the proper way to design a data generator for a multi-input Keras model?

I am trying to design a multi-input Keras model. We are working with images (128x128x3) of mathematical knots. I have set up a model that takes three inputs. The three inputs are:
1) A non-rotated picture of a knot
2) The same knot as 1, but rotated 90 degrees about its y axis
3) The same knot as 1, but rotated 90 degrees about its x axis
The model compiles properly. The problem I am having is that I am using fit_generator to train my model, and I cannot seem to get my data generator to work properly. Here is the code for my data generator:
def DataGen(in1, in2, in3, in1_label, in2_label, in3_label, batch_size):
    in1 = np.array(in1)
    in1 = np.reshape(in1, (in1.shape[0], 128, 128, 3))
    in2 = np.array(in2)
    in2 = np.reshape(in2, (in2.shape[0], 128, 128, 3))
    in3 = np.array(in3)
    in3 = np.reshape(in3, (in3.shape[0], 128, 128, 3))
    L = len(in1)
    batch_start = 0
    batch_end = batch_size
    gen = ImageDataGenerator(rescale=1.0/255)
    genX1 = gen.flow(in1, in1_label, batch_size=batch_size, seed=1)
    genX2 = gen.flow(in2, in2_label, batch_size=batch_size, seed=1)
    genX3 = gen.flow(in3, in3_label, batch_size=batch_size, seed=1)
    # this loop is just to make the generator infinite, keras needs that
    while True:
        limit = min(batch_end, L)
        # in1
        X1i = genX1.next()
        print(X1i[0].shape)
        # in2
        X2i = genX2.next()
        # in3
        X3i = genX3.next()
        #print(Y.shape)
        #print(Y1.shape)
        #print(Y2.shape)
        label = np.concatenate([X1i[1], X2i[1], X3i[1]])
        #print(label.shape)
        #print(X1i[1].shape)
        yield [X1i[0], X2i[0], X3i[0]], np.array(label)  # a tuple with two numpy arrays with batch_size samples
        batch_start += batch_size
        batch_end += batch_size
        if batch_start > L - batch_size:
            batch_start = 0
            batch_end = batch_size
If I run the neural network with this code, it generates the following error message:
Input arrays should have the same number of samples as target arrays. Found 30 input samples and 90 target samples.
This makes me think I should not concatenate the labels, and should instead return a list of labels along with the list of batches. If I yield the following:
yield [X1i[0],X2i[0],X3i[0]],[X1i[1],X2i[1],X3i[1]]
I get the following error message:
Error when checking model target: the list of Numpy arrays that you are passing to your model is not the size the model expected. Expected to see 1 array(s), but instead got the following list of 3 arrays: [array([[0., 1., 0.],
[0., 0., 1.],
[1., 0., 0.],
[0., 1., 0.],
[1., 0., 0.],
[1., 0., 0.],
[0., 0., 1.],
[1., 0., 0.],
[0., 1., 0.],
[0....
So I do not know how to fix it. The model has 45,000 input images, with 15,000 going to each of the three inputs. The model will also have 45,000 validation images, again with 15,000 going to each input.
Each input (a set of 15,000 images) has a corresponding label list with the following structure: the first 5,000 elements are label 0, the second 5,000 are label 1, and the third 5,000 are label 2. The labels are all one-hot encoded as well.
Any suggestions are greatly appreciated.
My data generator is based off of:
https://stackoverflow.com/a/49405175/5432071
My Jupyter notebook code can be viewed here:
https://uofstthomasmn-my.sharepoint.com/:u:/g/personal/ward0001_stthomas_edu/EcHhuXpXl1VJu8yKZaWRdKkBdwVrD6AEs3hd4Kuwk2Cl3g?e=tlQTGx
For the Jupyter notebook link, you will have to click the link, then click download, as the html file will not be displayed on OneDrive.
To summarize my question: What is wrong with my data generator for multi-input models?
Thanks in advance.
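A hedged sketch of one possible fix, assuming the model has a single softmax output and that the three generators stay in step (they share seed=1 and the same sample ordering, so each batch's labels coincide): yield exactly one label array of length batch_size per batch, for example the labels from the first generator, instead of concatenating all three:
# Hypothetical change to the yield inside the while-loop above: all three
# inputs show the same knot, so a single label array per batch is enough.
yield [X1i[0], X2i[0], X3i[0]], X1i[1]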

calculate the spatial dimension of a graph

Given a graph (say fully-connected), and a list of distances between all the points, is there an available way to calculate the number of dimensions required to instantiate the graph?
E.g. by construction, say we have graph G with points A, B, C and distances AB=BC=CA=1. Starting from A (0 dimensions) we add B at distance 1 (1 dimension), now we find that a 2nd dimension is needed to add C and satisfy the constraints. Does code exist to do this and spit out (in this case) dim(G) = 2?
E.g. if the points are photos, and the distances between them calculated by the Gist algorithm (http://people.csail.mit.edu/torralba/code/spatialenvelope/), I would expect the derived dimension to match the number image parameters considered by Gist.
Added: here is a 5-d python demo based on the suggestion - seemingly perfect!
'similarities' is the distance matrix.
import numpy as np
from sklearn import manifold

similarities = [[0., 1., 1., 1., 1., 1.],
                [1., 0., 1., 1., 1., 1.],
                [1., 1., 0., 1., 1., 1.],
                [1., 1., 1., 0., 1., 1.],
                [1., 1., 1., 1., 0., 1.],
                [1., 1., 1., 1., 1., 0.]]
seed = np.random.RandomState(seed=3)
for i in [1, 2, 3, 4, 5]:
    mds = manifold.MDS(n_components=i, max_iter=3000, eps=1e-9, random_state=seed,
                       dissimilarity="precomputed", n_jobs=1)
    print("%d %f" % (i, mds.fit(similarities).stress_))
Output:
1 3.333333
2 1.071797
3 0.343146
4 0.151531
5 0.000000
I find that when I apply this method to a subset of my data (distances between 329 pictures with '11' in the file name, using two different metrics), the stress doesn't decrease to 0 as I'd expect from the above - it levels off after about 5 dimensions. (On the SURF results I tried doubling max_iter, and varying eps by an order of magnitude each way, without changing the results in the first four digits.)
It turns out the distances do not satisfy the triangle inequality in ~0.02% of the triangles, with the average violation roughly equal to 8% of the average distance, for one metric examined.
Overall I prefer the fractal dimension of the sorted distances since it doesn't require picking a cutoff. I'm marking the MDS response as an answer because it works for the consistent case. My results for the fractal dimension and the MDS case are below.
Another descriptive statistic turns out to be the triangle violations. Results for this further below. If anyone could generalize to higher dimensions, that would be very interesting (results and learning python :-).
MDS results, ignoring the triangle inequality issue:
N_dim   stress_ (SURF_match)   stress_ (GIST_match)
1 83859853704.027344 913512153794.477295
2 24402474549.902721 238300303503.782837
3 14335187473.611954 107098797170.304825
4 10714833228.199451 67612051749.697998
5 9451321873.828577 49802989323.714806
6 8984077614.154467 40987031663.725784
7 8748071137.806602 35715876839.391762
8 8623980894.453981 32780605791.135693
9 8580736361.368249 31323719065.684353
10 8558536956.142039 30372127335.209297
100 8544120093.395177 28786825401.178596
1000 8544192695.435946 28786840008.666389
Forging ahead with that to devise a metric to compare the dimensionality of the two results, an ad hoc choice is to set the criterion to
1.1 * stress_at_dim=100
resulting in the proposition that the SURF_match has a quasi-dimension in 5..6, while GIST_match has a quasi-dimension in 8..9. I'm curious if anyone thinks that means anything :-). Another question is whether there is any meaningful interpretation for the relative magnitudes of stress at any dimension for the two metrics. Here are some results to put it in perspective. Frac_d is the fractal dimension of the sorted distances, calculated according to Higuchi's method using code from IQM, Dim is the dimension as described above.
Method Frac_d Dim stress(100) stress(1)
Lab_CIE94 1.1458 3 2114107376961504.750000 33238672000252052.000000
Greyscale 1.0490 8 42238951082.465477 1454262245593.781250
HS_12x12 1.0889 19 33661589105.972816 3616806311396.510254
HS_24x24 1.1298 35 16070009781.315575 4349496176228.410645
HS_48x48 1.1854 64 7231079366.861403 4836919775090.241211
GIST 1.2312 9 28786830336.332951 997666139720.167114
HOG_250_words 1.3114 10 10120761644.659481 150327274044.045624
HOG_500_words 1.3543 13 4740814068.779779 70999988871.696045
HOG_1k_words 1.3805 15 2364984044.641845 38619752999.224922
SIFT_1k_words 1.5706 11 1930289338.112194 18095265606.237080
SURFFAST_200w 1.3829 8 2778256463.307569 40011821579.313110
SRFFAST_250_w 1.3754 8 2591204993.421285 35829689692.319153
SRFFAST_500_w 1.4551 10 1620830296.777577 21609765416.960484
SURFFAST_1k_w 1.5023 14 949543059.290031 13039001089.887533
SURFFAST_4k_w 1.5690 19 582893432.960562 5016304129.389058
Looking at the Pearson correlation between columns of the table:
Pearson correlation 2-tailed p-value
FracDim, Dim: (-0.23333296587402277, 0.40262625206429864)
Dim, Stress(100): (-0.24513480360257348, 0.37854224076180676)
Dim, Stress(1): (-0.24497740363489209, 0.37885820835053186)
Stress(100),S(1): ( 0.99999998200931084, 8.9357374620135412e-50)
FracDim, S(100): (-0.27516440489210137, 0.32091019789264791)
FracDim, S(1): (-0.27528621200454373, 0.32068731053608879)
I naively wonder how all correlations but one can be negative, and what conclusions can be drawn. Using this code:
import sys
import numpy as np
from scipy.stats.stats import pearsonr

file = sys.argv[1]
col1 = int(sys.argv[2])
col2 = int(sys.argv[3])
arr1 = []
arr2 = []
with open(file, "r") as ins:
    for line in ins:
        words = line.split()
        arr1.append(float(words[col1]))
        arr2.append(float(words[col2]))
narr1 = np.array(arr1)
narr2 = np.array(arr2)
# normalize
narr1 -= narr1.mean(0)
narr2 -= narr2.mean(0)
# standardize
narr1 /= narr1.std(0)
narr2 /= narr2.std(0)
print pearsonr(narr1, narr2)
On to the number of violations of the triangle inequality by the various metrics, all for the 329 pics with '11' in their sequence:
(1) n_violations/triangles
(2) avg violation
(3) avg distance
(4) avg violation / avg distance
n_vio (1) (2) (3) (4)
lab 186402 0.031986 157120.407286 795782.437570 0.197441
grey 126902 0.021776 1323.551315 5036.899585 0.262771
600px 120566 0.020689 1339.299040 5106.055953 0.262296
Gist 69269 0.011886 1252.289855 4240.768117 0.295298
RGB
12^3 25323 0.004345 791.203886 7305.977862 0.108295
24^3 7398 0.001269 525.981752 8538.276549 0.061603
32^3 5404 0.000927 446.044597 8827.910112 0.050527
48^3 5026 0.000862 640.310784 9095.378790 0.070400
64^3 3994 0.000685 614.752879 9270.282684 0.066314
98^3 3451 0.000592 576.815995 9409.094095 0.061304
128^3 1923 0.000330 531.054082 9549.109033 0.055613
RGB/600px
12^3 25190 0.004323 790.258158 7313.379003 0.108057
24^3 7531 0.001292 526.027221 8560.853557 0.061446
32^3 5463 0.000937 449.759107 8847.079639 0.050837
48^3 5327 0.000914 645.766473 9106.240103 0.070915
64^3 4382 0.000752 634.000685 9272.151040 0.068377
128^3 2156 0.000370 544.644712 9515.696642 0.057236
HueSat
12x12 7882 0.001353 950.321873 7555.464323 0.125779
24x24 1740 0.000299 900.577586 8227.559169 0.109459
48x48 1137 0.000195 661.389622 8653.085004 0.076434
64x64 1134 0.000195 697.298942 8776.086144 0.079454
HueSat/600px
12x12 6898 0.001184 943.319078 7564.309456 0.124707
24x24 1790 0.000307 908.031844 8237.927256 0.110226
48x48 1267 0.000217 693.607735 8647.060308 0.080213
64x64 1289 0.000221 682.567106 8761.325172 0.077907
hog
250 53782 0.009229 675.056004 1968.357004 0.342954
500 18680 0.003205 559.354979 1431.803914 0.390665
1k 9330 0.001601 771.307074 970.307130 0.794910
4k 5587 0.000959 993.062824 650.037429 1.527701
sift
500 26466 0.004542 1267.833182 1073.692611 1.180816
1k 16489 0.002829 1598.830736 824.586293 1.938949
4k 10528 0.001807 1918.068294 533.492373 3.595306
surffast
250 38162 0.006549 630.098999 1006.401837 0.626091
500 19853 0.003407 901.724525 830.596690 1.085635
1k 10659 0.001829 1310.348063 648.191424 2.021545
4k 8988 0.001542 1488.200156 419.794008 3.545072
Anyone capable of generalizing to higher dimensions? Here is my first-timer code:
import sys
import time
import math
import numpy as np
import sortedcontainers
from sortedcontainers import SortedSet
from sklearn import manifold

seed = np.random.RandomState(seed=3)
pairs = sys.argv[1]
ss = SortedSet()
print time.strftime("%H:%M:%S"), "counting/indexing"
sys.stdout.flush()
with open(pairs, "r") as ins:
    for line in ins:
        words = line.split()
        ss.add(words[0])
        ss.add(words[1])
N = len(ss)
print time.strftime("%H:%M:%S"), "size ", N
sys.stdout.flush()
sim = np.diag(np.zeros(N))
dtot = 0.0
with open(pairs, "r") as ins:
    for line in ins:
        words = line.split()
        i = ss.index(words[0])
        j = ss.index(words[1])
        #val = math.log(float(words[2]))
        #val = math.sqrt(float(words[2]))
        val = float(words[2])
        sim[i][j] = val
        sim[j][i] = val
        dtot += val
avgd = dtot / (N * (N-1))
ntri = 0
nvio = 0
vio = 0.0
for i in xrange(1, N):
    for j in xrange(i+1, N):
        d1 = sim[i][j]
        for k in xrange(j+1, N):
            ntri += 1
            d2 = sim[i][k]
            d3 = sim[j][k]
            dd = d1 + d2
            diff = d3 - dd
            if (diff > 0.0):
                nvio += 1
                vio += diff
avgvio = 0.0
if (nvio > 0):
    avgvio = vio / nvio
print("tot: %d %f %f %f %f" % (nvio, (float(nvio)/ntri), avgvio, avgd, (avgvio/avgd)))
Here is how I tried sklearn's Isomap:
for i in [1, 2, 3, 4, 5]:
    # nbrs < points
    iso = manifold.Isomap(n_neighbors=nbrs, n_components=i,
                          eigen_solver="auto", tol=1e-9, max_iter=3000,
                          path_method="auto", neighbors_algorithm="auto")
    dis = euclidean_distances(iso.fit(sim).embedding_)
    stress = ((dis.ravel() - sim.ravel()) ** 2).sum() / 2
Given a graph (say fully-connected), and a list of distances between all the points, is there an available way to calculate the number of dimensions required to instantiate the graph?
Yes. The more general topic this problem would be part of, in terms of graph theory, is called "Graph Embedding".
E.g. by construction, say we have graph G with points A, B, C and distances AB=BC=CA=1. Starting from A (0 dimensions) we add B at distance 1 (1 dimension), now we find that a 2nd dimension is needed to add C and satisfy the constraints. Does code exist to do this and spit out (in this case) dim(G) = 2?
This is almost exactly the way that Multidimensional Scaling works.
Multidimensional scaling (MDS) would not exactly answer the question of "How many dimensions would I need to represent this point cloud / graph?" with a number but it returns enough information to approximate it.
Multidimensional Scaling methods will attempt to find a "good mapping" to reduce the number of dimensions, say from 120 (in the original space) down to 4 (in another space). So, in a way, you can iteratively try different embeddings for increasing number of dimensions and look at the "stress" (or error) of each embedding. The number of dimensions you are after is the first number for which there is an abrupt minimisation of the error.
Due to the way it works, Classical MDS, can return a vector of eigenvalues for the new mapping. By examining this vector of eigenvalues you can determine how many of its entries you would need to retain to achieve a (good enough, or low error) representation of the original dataset.
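As a rough illustration of that last point, here is a minimal numpy-only sketch (not from the original answer; D stands for the distance matrix, e.g. the similarities array above): classical MDS double-centres the squared distances and examines the eigenvalues of the result, and the number of large positive eigenvalues approximates the dimension needed.
import numpy as np

def classical_mds_eigenvalues(D):
    # Double-centre the squared distance matrix to obtain the Gram matrix B,
    # then return its eigenvalues in descending order.
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J.dot(D ** 2).dot(J)
    return np.linalg.eigvalsh(B)[::-1]

# The number of eigenvalues carrying most of the positive spectrum is a
# rough estimate of the embedding dimension.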
The key concept here is the "similarity" matrix which is a fancy name for a graph's distance matrix (which you already seem to have), irrespectively of its semantics.
Embedding algorithms, in general, try to find an embedding that may look different, but at the end of the day the point cloud in the new space will end up having a similar distance matrix (similar to within however much error we can afford).
In terms of code, I am sure that there is something available in all major scientific computing packages but off the top of my head I can point you towards Python and MATLAB code examples.
E.g. if the points are photos, and the distances between them calculated by the Gist algorithm (http://people.csail.mit.edu/torralba/code/spatialenvelope/), I would expect the derived dimension to match the number image parameters considered by Gist
Not exactly, though this is a very good use case. What MDS returns, or what you would be probing with dimensionality reduction in general, is how many of these features seem to be required to represent your dataset. So, depending on the scenes, or on the dataset, you might find that not all of these features are necessary for a good enough representation of the whole dataset. (In addition, you might want to have a look at this link as well.)
Hope this helps.
First, you can assume that any dataset has a dimensionality of at most 4 or 5. To get more relevant dimensions, you would need one million elements (or something like that).
Apparently, you have already computed a distance. Are you sure it is actually a relevant metric? Is it effective for images that are quite distant? Perhaps you can try Isomap (geodesic distances, built from only close neighbors) and see whether your embedded space may not actually be Euclidean.
