Gephi/NodeXL - measurement difference (betweenness/closeness)

I am trying to understand how Gephi and NodeXL measure betweenness and closeness centrality, or rather why there is a difference between their measurements.
A Twitter network with 1004 nodes and 2314 edges was sent to me. I measured centrality in both Gephi and NodeXL (as a directed network). Unfortunately, I got two different results:
Gephi:
   Betweenness | Closeness
A: 358.0       | 1.0
B: 0.0         | 0.0
C: 0.0         | 0.0
NodeXL:
   Betweenness | Closeness
A: 295472.785  | 0.001
B: 91827.372   | 0.000
C: 92674.065   | 0.000
At first I thought the network itself caused the strange results, so I tried measuring the statistics on other networks. It turned out that the network was not the problem. Can anyone explain how this is possible?

Gephi uses a modified version of closeness centrality (CC): the higher the CC value, the closer the node is to the center of the graph. This modified version is an inverse CC and also takes graph size into account, whereas the standard CC does not depend on graph size and is normalized so it can be compared between graphs of different sizes.
As for Betweenness Centrality, I get the same output from NodeXL and Gephi.
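To see how much the normalization convention alone can shift closeness values, here is a minimal sketch (assuming the networkx library; it only illustrates the two conventions and does not reproduce either tool's exact implementation):
import networkx as nx

# Toy directed graph standing in for the Twitter network in the question.
G = nx.gnp_random_graph(50, 0.1, seed=1, directed=True)

for v in sorted(G):
    dist = nx.shortest_path_length(G, source=v)   # distances from v
    total = sum(d for u, d in dist.items() if u != v)
    if total > 0:
        reachable = len(dist) - 1
        raw = 1.0 / total               # unnormalized inverse closeness (small values)
        normalized = reachable / total  # normalized by reachable nodes (values near 1)
        print(v, round(raw, 4), round(normalized, 4))
With roughly a thousand reachable nodes, the normalized value is about a thousand times the raw inverse, which is the same order of difference as the 1.0 versus 0.001 above.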

Related

Can a SpatialInertia object be constructed from a 6x6 matrix?

The SpatialInertia object in Drake has a method called CopyToFullMatrix6(), which outputs the 6x6 representation of the SpatialInertia object. Is there a function to do the opposite? I.e. I have a 6x6 matrix that I want to make into a SpatialInertia object. (I am using Pydrake locally in Ubuntu 18.04.)
The context is: I'm working with some hydrodynamic modeling. Due to the model of the added mass, the "mass" submatrix of my Spatial Inertia matrix, which typically has three identical mass values on the diagonal, like so:
---------
| m 0 0 |
| 0 m 0 |
| 0 0 m |
---------
actually has different "masses" in the different directions, like so:
----------
| m1 0 0 |
| 0 m2 0 |
| 0 0 m3 |
----------
The consequence of this is that the usual constructors (MakeFromCentralInertia() or just SpatialInertia()) won't work because they take one mass value as an input.
My process right now is to query a body's SpatialInertia, get the 6x6 representation, and add the added-mass spatial inertia matrix to it as a NumPy array. Now I need a way to make that matrix into a SpatialInertia again so I can apply it back to the body.
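For concreteness, the first half of that process looks roughly like the sketch below (only an illustration of the NumPy addition step; body and context are assumed to come from an existing MultibodyPlant setup, the exact accessor may differ by Drake version, and the added-mass entries are placeholders):
import numpy as np

# Query the body's rigid-body spatial inertia and get its 6x6 representation.
M_rigid = body.CalcSpatialInertiaInBodyFrame(context).CopyToFullMatrix6()

# Placeholder added-mass matrix; fill in the hydrodynamic terms here.
M_added = np.zeros((6, 6))

# The combined matrix is just a plain 6x6 numpy array, not a SpatialInertia.
M_total = M_rigid + M_added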
Any thoughts on the process or the conversion are appreciated. Thanks!
A SpatialInertia can only have a scalar mass (it is actually represented that way internally, not as a 6x6 matrix). That is, the class inherently represents the mass properties of a rigid body and can't be used for anything more general. The multibody system does have a more general class, ArticulatedBodyInertia, which can represent masses that appear different in different directions due to joint articulation. It is not clear to me how that could be used for hydrodynamics, but at least it can represent the varying masses.

Preprocessing categorical data already converted into numbers

I'm fairly new to machine learning, so I don't know the correct terminology, but I converted two categorical columns into numbers in the following way. These columns are part of my feature inputs, akin to the sex column in the Titanic dataset.
(They are not the target data y, which I have already created.)
changed p_changed
Date
2010-02-17 0.477182 0 0
2010-02-18 0.395813 0 0
2010-02-19 0.252179 1 1
2010-02-22 0.401321 0 1
2010-02-23 0.519375 1 1
Now the rest of my data X looks something like this:
Open High Low Close Volume Adj Close log_return \
Date
2010-02-17 2.07 2.07 1.99 2.03 219700.0 2.03 -0.019513
2010-02-18 2.03 2.03 1.99 2.03 181700.0 2.03 0.000000
2010-02-19 2.03 2.03 2.00 2.02 116400.0 2.02 -0.004938
2010-02-22 2.05 2.05 2.02 2.04 188300.0 2.04 0.009852
2010-02-23 2.05 2.07 2.01 2.05 255400.0 2.05 0.004890
close_open Daily_Change 30_Avg_Vol 20_Avg_Vol 15_Avg_Vol \
Date
2010-02-17 0.00 -0.04 0.909517 0.779299 0.668242
2010-02-18 0.00 0.00 0.747470 0.635404 0.543015
2010-02-19 0.00 -0.01 0.508860 0.417706 0.348761
2010-02-22 0.03 -0.01 0.817274 0.666903 0.562414
2010-02-23 0.01 0.00 1.078411 0.879007 0.742730
As you can see, the rest of my data is continuous (with many variables), as opposed to the two categorical columns, which only take two values (0 and 1).
I was planning to preprocess all of this data in one shot via this simple preprocessing call:
X_scaled = preprocessing.scale(X)
I was wondering if this is a mistake. Is there something else I need to do to the categorical values before using this simple preprocessing?
EDIT: I tried two ways. First I tried scaling the full data, including the categorical data converted to 1s and 0s.
Full_X = OPK_df.iloc[:-5, 0:-5]
Full_X_scaled = preprocessing.scale(Full_X)  # First way: scales everything in one shot
Then I tried dropping the last two columns, scaling, and then adding the dropped columns back, via this code.
X = OPK_df.iloc[:-5, 0:-7]  # The column slice now ends at -7 instead of -5, so the two categorical columns are dropped
I created another DataFrame containing the two columns I dropped:
x2 = OPK_df.iloc[:-5, -7:-5]
x2 = np.array(x2)  # convert it to an array
# Preprocessing the data without the last two columns
from sklearn import preprocessing
X_scaled = preprocessing.scale(X)
# Then concatenate X_scaled with x2 (the originally dropped columns)
X = np.concatenate((X_scaled, x2), axis=1)
#Creating a classifier
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5)
knn2 = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_scaled, y)
knn2.fit(X,y)
knn.score(Full_X_scaled, y)
0.71396522714526078
knn2.score(X, y)
0.71789119461581608
So there is a higher score when I do indeed drop the two columns during standardization.
You're doing pretty well so far. Do not scale your classification data. Since those appear to be binary classifications, think of this as "Yes" and "No". What does it mean to scale these?
Even worse, consider that you might have classifications such as flower types: you've coded Zinnia=0, Rose=1, Orchid=2, etc. What does it mean to scale those? It doesn't make any sense to re-code these as Zinnia=-0.257, Rose=+0.448, etc.
Scaling your input data is the necessary part: it keeps the values within comparable ranges (mathematical influence), allowing you to readily use a single treatment for your loss function. Otherwise, the feature with the largest spread of values would have the greatest influence on training, until your model's weights learned how to properly discount the large values.
For your beginning explorations, don't do any other preprocessing: just scale the input data and start your fitting exercises.
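If you do want to keep the preprocessing in a single step while leaving the 0/1 columns untouched, one option is scikit-learn's ColumnTransformer. A minimal sketch, assuming a DataFrame with the column names shown in the question (adjust the lists to your real columns):
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Continuous columns get standardized; the binary columns pass through as-is.
continuous_cols = ['Open', 'High', 'Low', 'Close', 'Volume', 'Adj Close',
                   'log_return', 'close_open', 'Daily_Change',
                   '30_Avg_Vol', '20_Avg_Vol', '15_Avg_Vol']
binary_cols = ['changed', 'p_changed']

preprocess = ColumnTransformer(
    [('scale', StandardScaler(), continuous_cols)],
    remainder='passthrough')  # leaves binary_cols (and anything else) unscaled

X_prepared = preprocess.fit_transform(OPK_df[continuous_cols + binary_cols])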

REPEATED in SPSS linear mixed model

I am analyzing data from an experiment.
I have three groups (GROUP, one between-subjects factor) to compare on a cognitive task.
The task has a three-way full factorial design (2x3x3): all subjects are presented with two stimuli (factor 1), for each stimulus there are three conditions (factor 2), and for each condition three positions on the screen (factor 3). For each combination of factors there are N trials, which are averaged to give mean accuracy (ACC) and mean reaction time (RT).
I want to build a model in SPSS using a linear mixed model.
I tried the following syntax in SPSS 22:
MIXED ACC BY GROUP FACTOR1 FACTOR2 FACTOR3 GENDER WITH RT Age
/FIXED = GROUP FACTOR1 FACTOR2 FACTOR3 GROUP*FACTOR1 GROUP*FACTOR2 GROUP*FACTOR3 GENDER AGE RT | SSTYPE(3)
/RANDOM= INTERCEPT | SUBJECT(SUBID) COVTYPE(VC)
Given that I have averaged accuracy rates across trials for each combination, should I include a REPEATED statement as well? If so, what is the difference between the following
/REPEATED= FACTOR1 FACTOR2 FACTOR3 | SUBJECT(SUBID) COVTYPE(UN)
and the following nomenclature?
/REPEATED= FACTOR1*FACTOR2*FACTOR3 | SUBJECT(SUBID) COVTYPE(UN)
In other words, what difference does it make whether the factors are joined by asterisks or not?
Thanks for your comments,
Alessandro
You have two questions here: (1) a statistical question about what type of analysis is appropriate, and (2) a code question.
(1) Very briefly, if you're going to use linear mixed models, I think you should use all the data, and not average across your N trials within each combination of factors. Those N trials are your repeated measurements.
The IBM Knowledge Center page on the REPEATED subcommand states:
"Specify a list of variable names (of any type) connected by asterisks (repeated measure) following the REPEATED subcommand."
which suggests that
/REPEATED= FACTOR1 FACTOR2 FACTOR3 | SUBJECT(SUBID) COVTYPE(UN)
should be a syntax error. It isn't, so I looked at the Model Information table in the output. For both REPEATED specifications, the Repeated Effects section of that table lists FACTOR1*FACTOR2*FACTOR3 as the effect.
Based on this, it's safe to say that the SPSS syntax parser interprets
/REPEATED= FACTOR1 FACTOR2 FACTOR3 | SUBJECT(SUBID) COVTYPE(UN)
to be equivalent to
/REPEATED= FACTOR1*FACTOR2*FACTOR3 | SUBJECT(SUBID) COVTYPE(UN)

How to find clusters in a matrix

I have no clue about data mining, data analysis, or statistical analysis, but I think what I need is to find "clusters in a matrix". I have a data set of ~20k records, and each has ~40 characteristics, all of which are either turned on or off.
+--------+------+------+------+------+------+------+
| record | hasA | hasB | hasC | hasD | hasE | hasF |
+--------+------+------+------+------+------+------+
| foo | 1 | 0 | 1 | 0 | 0 | 0 |
| bar | 1 | 1 | 0 | 0 | 1 | 1 |
| baz | 1 | 1 | 1 | 0 | 0 | 0 |
+--------+------+------+------+------+------+------+
I'm quite convinced most of those 20k records have characteristics that fall into one of several categories. There must be a way to determine how similar record 'foo' is to record 'bar'.
So, what is it that I'm actually looking at? What algorithm am I looking for?
Transform each record r into a binary vector v(r) so that the i-th component of v(r) is 1 if r has the i-th characteristic, and 0 otherwise.
Now run hierarchical clustering algorithm on this set of vectors under the Hamming distance or Jaccard distance, whichever you think is more appropriate; also make sure there's a notion of distance between clusters defined in terms of the underlying distance (see linkage criteria).
Then decide where to cut the resulting dendrogram based on common sense. Where you cut the dendrogram will affect the number of clusters.
One downside of hierarchical clustering is that it's rather slow. It takes O(n^3) time in general, so it would take quite a while on a large data set. For single- and complete-linkages you can bring the time down to O(n^2).
Hierarchical clustering is very easy to implement in languages such as Python. You can also use the implementation from the scipy library.
Example: Hierarchical Clustering in Python
Here's a code snippet to get you started. I assume S is the set of records transformed into binary vectors (i.e. each list in S corresponds to a record from your data set).
import numpy as np
import scipy
import scipy.cluster.hierarchy as sch
import matplotlib.pylab as plt

# This is the set of binary vectors, each of which would
# correspond to a record in your case.
S = [
    [0, 0, 0, 1, 1],  # 0
    [0, 0, 0, 0, 1],  # 1
    [0, 0, 0, 1, 0],  # 2
    [1, 1, 1, 0, 0],  # 3
    [1, 0, 1, 0, 0],  # 4
    [0, 1, 1, 0, 0]]  # 5

# Use Hamming distance with complete linkage.
Z = sch.linkage(sch.distance.pdist(S, metric='hamming'), 'complete')

# Compute and plot the dendrogram.
P = sch.dendrogram(Z)
plt.show()
The result is as you'd expect: cut at 0.5 to get two clusters, one containing the first three vectors (which have zeros at the beginning and ones at the end) and the other containing the last three vectors (which have ones at the beginning and zeros at the end).
Hierarchical clustering starts with each vector being its own cluster. In each successive step it merges the two closest clusters, and it repeats this until there is a single cluster left.
The dendrogram essentially encodes the whole clustering process. At the beginning each vector is its own cluster. Then {3} and {5} merge into {3,5} and {0} and {2} merge into {0,2}. Next, {4} and {3,5} merge into {3,4,5}, and {1} and {0,2} merge into {0,1,2}. Finally, {0,1,2} and {3,4,5} merge into {0,1,2,3,4,5}.
From the dendrogram you can usually see at which point it makes the most sense to cut---this will define your clusters.
I encourage you to experiment with various distances (e.g. Hamming distance, Jaccard distance) and linkages (e.g. single linkage, complete linkage), and various representations (e.g. binary vectors).
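For example, swapping in Jaccard distance and single linkage only requires changing the arguments in the snippet above (this reuses S, sch, and plt from there):
# Same data, different distance and linkage criterion.
Z_alt = sch.linkage(sch.distance.pdist(S, metric='jaccard'), 'single')
sch.dendrogram(Z_alt)
plt.show()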
Are you sure you want cluster analysis?
To find similar records you don't need cluster analysis. Simply compare records with any distance measure suited to binary data, such as Jaccard or Hamming distance. Or use cosine distance, so that you can use e.g. Lucene to find similar records fast.
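As a rough sketch of that idea (using scipy, with the three example records from the question standing in for the full 20k x 40 matrix):
import numpy as np
from scipy.spatial.distance import cdist

# Toy binary records; rows are foo, bar, baz from the question's table.
records = np.array([[1, 0, 1, 0, 0, 0],   # foo
                    [1, 1, 0, 0, 1, 1],   # bar
                    [1, 1, 1, 0, 0, 0]])  # baz

# Pairwise Jaccard distances: 0 means identical sets of "on" characteristics.
D = cdist(records, records, metric='jaccard')
print(np.round(D, 2))

# Most similar record to foo (index 0), excluding foo itself.
others = [i for i in range(len(records)) if i != 0]
print("closest to foo:", min(others, key=lambda i: D[0, i]))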
To find common patterns, the use of frequent itemset mining may yield much more meaningful results, because these can work on a subset of attributes only. For example, in a supermarket, the columns Noodles, Tomato, Basil, Cheese may constitute a frequent pattern.
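To make that concrete, here is a minimal sketch using the third-party mlxtend library (an assumption on my part; any frequent-itemset implementation would do), again with the toy records from the question:
import pandas as pd
from mlxtend.frequent_patterns import apriori

# One-hot table in the same shape as the question's record matrix.
df = pd.DataFrame([[1, 0, 1, 0, 0, 0],
                   [1, 1, 0, 0, 1, 1],
                   [1, 1, 1, 0, 0, 0]],
                  columns=['hasA', 'hasB', 'hasC', 'hasD', 'hasE', 'hasF'],
                  index=['foo', 'bar', 'baz']).astype(bool)

# Itemsets present in at least two thirds of the records.
print(apriori(df, min_support=0.66, use_colnames=True))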
Most clustering algorithms attempt to divide the data into k groups. While this at first appears to be a good idea (you get k target groups), it rarely matches what real data contains. For example, customers: why would every customer belong to exactly one audience? What if the audiences are e.g. car lovers, gun lovers, football lovers, soccer moms - are you sure you don't want to allow overlap between these groups?
Furthermore, a problem with cluster analysis is that it's incredibly easy to use badly. It does not "fail hard" - you always get a result, and you might not realize that it's a bad result...

Finding standard deviation using only mean, min, max?

I want to find the standard deviation:
Minimum = 5
Mean = 24
Maximum = 84
Overall score = 90
I just want to find out my grade by using the standard deviation
Thanks,
A standard deviation cannot in general be computed from just the min, max, and mean. This can be demonstrated with two sets of scores that have the same min, max, and mean but different standard deviations:
1 2 4 5 : min=1, max=5, mean=3, stdev≈1.5811
1 3 3 5 : min=1, max=5, mean=3, stdev≈1.4142
(using the population standard deviation)
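A quick check with Python's statistics module (pstdev computes the population standard deviation):
from statistics import pstdev
print(pstdev([1, 2, 4, 5]))  # 1.5811...
print(pstdev([1, 3, 3, 5]))  # 1.4142...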
Also, what does an 'overall score' of 90 mean if the maximum is 84?
I actually did a quick-and-dirty calculation of the type M Rad mentions. It involves assuming that the distribution is Gaussian or "normal." This does not apply to your situation, but it might help others asking the same question. (You can tell your distribution is not normal because the distance from the mean to the max is nowhere near the distance from the mean to the min.) Even if it were normal, you would need something you don't mention: the number of samples (the number of tests taken, in your case).
Those readers who DO have a normal population can use the table below to get a rough estimate of the standard deviation: divide the difference between your measured minimum and your calculated mean by the expected distance for your sample size. On average, the estimate will be off by the given fraction of a standard deviation. (I have no idea whether it is biased; change the code below and compute the error without the abs to get a guess.)
Num Samples    Expected distance    Expected error
     10              1.55                0.25
     20              1.88                0.20
     30              2.05                0.18
     40              2.16                0.17
     50              2.26                0.15
     60              2.33                0.15
     70              2.38                0.14
     80              2.43                0.14
     90              2.47                0.13
    100              2.52                0.13
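As a purely hypothetical illustration of using the table (the question's numbers, pretending they came from a normal sample of 50 tests):
# Purely illustrative: pretend the question's scores were a normal sample of 50.
mean_minus_min = 24 - 5           # mean = 24, minimum = 5
expected_distance = 2.26          # table row for 50 samples
stdev_estimate = mean_minus_min / expected_distance
print(round(stdev_estimate, 2))   # about 8.41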
This experiment shows that the "rule of thumb" of dividing the range by 4 to get the standard deviation is in general incorrect -- even for normal populations. In my experiment it only holds for sample sizes between 20 and 40 (and then loosely). This rule may have been what the OP was thinking about.
You can modify the following Python code to generate the table for different values (change max_sample_size), for more accuracy (change num_simulations), or to remove the limitation to multiples of 10 (change the parameters to xrange in the for loop over idx).
#!/usr/bin/python
import random

# Return the distance of the minimum of samples from its mean
#
# Samples must have at least one entry
def min_dist_from_estd_mean(samples):
    total = 0
    sample_min = samples[0]
    for sample in samples:
        total += sample
        sample_min = min(sample, sample_min)
    estd_mean = total / len(samples)
    return estd_mean - sample_min  # Pos bec min cannot be greater than mean

num_simulations = 4095
max_sample_size = 100

# Calculate expected distances
sum_of_dists = [0] * (max_sample_size + 1)  # +1 so can index by sample size
for iternum in xrange(num_simulations):
    samples = [random.normalvariate(0, 1)]
    while len(samples) <= max_sample_size:
        sum_of_dists[len(samples)] += min_dist_from_estd_mean(samples)
        samples.append(random.normalvariate(0, 1))
expected_dist = [total / num_simulations for total in sum_of_dists]

# Calculate average error using that distance
sum_of_errors = [0] * len(sum_of_dists)
for iternum in xrange(num_simulations):
    samples = [random.normalvariate(0, 1)]
    while len(samples) <= max_sample_size:
        ave_dist = expected_dist[len(samples)]
        if ave_dist > 0:
            sum_of_errors[len(samples)] += \
                abs(1 - (min_dist_from_estd_mean(samples) / ave_dist))
        samples.append(random.normalvariate(0, 1))
expected_error = [total / num_simulations for total in sum_of_errors]

cols = " {0:>15}{1:>20}{2:>20}"
print(cols.format("Num Samples", "Expected distance", "Expected error"))
cols = " {0:>15}{1:>20.2f}{2:>20.2f}"
for idx in xrange(10, len(expected_dist), 10):
    print(cols.format(idx, expected_dist[idx], expected_error[idx]))
You can obtain an estimate of the geometric mean, sometimes called the geometric mean of the extremes or GME, from the Min and the Max by calculating GME = $\sqrt{Min \cdot Max}$. The SD can then be estimated from your arithmetic mean (AM) and the GME as:
$$SD = \frac{AM}{GME}\sqrt{AM^2 - GME^2}$$
This approach works well for log-normal distributions, or as long as the GME, GM, or median is smaller than the AM.
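Plugging in the numbers from the question purely as an illustration (Min = 5, Max = 84, AM = 24):
from math import sqrt

am = 24.0                    # arithmetic mean
gme = sqrt(5 * 84)           # geometric mean of the extremes, about 20.49
sd = (am / gme) * sqrt(am**2 - gme**2)
print(round(gme, 2), round(sd, 2))   # about 20.49 and 14.63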
In principle you can make an estimate of the standard deviation from the mean/min/max and the number of elements in the sample. The min and max of a sample are, if you assume normality, random variables whose statistics follow from the mean/stddev/number of samples. So given the observed min, max, and mean together with the sample size, one can compute (after slogging through the math or running a bunch of Monte Carlo scripts) a confidence interval for the standard deviation (e.g. that it is 80% probable the stddev is between 20 and 40, or something like that).
That said, it probably isn't worth doing except in extreme situations.
