I am trying to understand an error I am receiving, and the difference between two approaches. I followed this tutorial to practice KNN with my own data (https://towardsdatascience.com/create-a-similarity-graph-from-node-properties-with-neo4j-2d26bb9d829e)
During the process we project our graph of interest; mine contains three node properties: bd_load, weight, and length of organisms. The example uses the code below to build a scaledProperties vector from the 3 variables.
Project graph
//(5) project graph of interest
CALL gds.graph.project('bd_graph',
'node_sim',
'*',
{nodeProperties:['bd_load', 'weight', 'length']})
Scale variables of interest between 0-1 for future Euclidean distance calculation
//(6) add scalar 0-1
CALL gds.alpha.scaleProperties.mutate('bd_graph',
{nodeProperties:['bd_load', 'weight', 'length'],
scaler:'MinMax',
mutateProperty:'scaledProperties'})
YIELD nodePropertiesWritten
We then can run KNN based on euclidean distance
//(8) project relationship to graph
CALL gds.knn.mutate("bd_graph",
{nodeProperties: {scaledProperties: "EUCLIDEAN"},
topK: 15,
mutateRelationshipType: "IS_SIMILAR",
mutateProperty: "similarity",
similarityCutoff: 0.6409912109375,
sampleRate:1,
randomSeed:42,
concurrency:1}
)
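To make the two steps above concrete, here is a minimal plain-Python sketch of what MinMax scaling followed by a Euclidean-based similarity does to a handful of nodes. The node values are hypothetical, and the 1 / (1 + distance) mapping is my understanding of how GDS turns a Euclidean distance into a similarity score, so treat it as an assumption rather than a description of the GDS internals.

```python
import math

def min_max_scale(rows):
    """Scale each column of `rows` to [0, 1]; constant columns map to 0.0."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]

def euclidean_similarity(a, b):
    """Map a Euclidean distance to a score in (0, 1] (assumed GDS-style)."""
    return 1.0 / (1.0 + math.dist(a, b))

# Hypothetical (bd_load, weight, length) rows, not taken from the real data.
nodes = [[0.0, 10.0, 2.0], [50.0, 20.0, 4.0], [100.0, 30.0, 6.0]]
scaled = min_max_scale(nodes)
print(scaled)
print(euclidean_similarity(scaled[0], scaled[1]))
```

Without the scaling step, the property with the largest numeric range (here the first column) would dominate the distance.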
However, as I continue up the learning curve with Neo4j and FastRP, I am trying to understand the difference between the scaled property and FastRP. Today I tried to create graph embeddings for my 3 variables using FastRP with 8 dimensions on my projected graph, without running the scaled property step. My thought was that increasing the dimensions would be better for finding similarities between nodes. The code below runs fine and produces an embedding vector with 8 elements.
FastRP
CALL gds.fastRP.mutate(
'bd_graph',
{
embeddingDimension: 8,
mutateProperty: 'fastrp-embedding',
featureProperties: ['bd_load', 'weight', 'length']
}
)
YIELD nodePropertiesWritten
But when I run the below code
CALL gds.knn.stats("bd_graph",
{
nodeProperties:{fastrp-embedding:"EUCLIDEAN"},
topK:10,
sampleRate:1,
randomSeed:42,
concurrency:1
}
) YIELD similarityDistribution
RETURN similarityDistribution
I receive an error:
Invalid input '{': expected "+" or "-" (line 4, column 22 (offset: 97))
nodeProperties:{fastrp-embedding:"EUCLIDEAN"},
Does the embedding length have to match the number of variables in the node? Am I using FastRP correctly, and is my understanding right that I create embeddings on the nodes and then calculate Euclidean distance on them for a similarity score?
I am glad you are finding the tutorial helpful and getting into GDS!
Map keys in Cypher must be strings. https://neo4j.com/docs/cypher-manual/current/syntax/maps/
The - in your property name fastrp-embedding is not a valid character in an unquoted map key, so the parser does not read it as part of the name. If you enclose that property name in backticks, Cypher will treat the special character as part of the map key. This should work for you.
CALL gds.knn.stats("bd_graph",
{
nodeProperties:{`fastrp-embedding`:"EUCLIDEAN"},
topK:10,
sampleRate:1,
randomSeed:42,
concurrency:1
}
) YIELD similarityDistribution
RETURN similarityDistribution
The recommended format for Neo4j property names is camel case. If you name your property fastrpEmbedding instead of fastrp-embedding, you would not need to use the back ticks.
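If the query text is generated from a client program, the quoting rule above can be applied automatically. The helper below is a sketch only: cypher_key and knn_node_properties are hypothetical names, and the "plain identifier" test is a simplified assumption (ASCII letters, digits, and underscores, not starting with a digit) rather than Cypher's full identifier grammar.

```python
import re

# Simplified assumption: names matching this pattern need no quoting.
_PLAIN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*\Z")

def cypher_key(name):
    """Return `name` as a map key, wrapped in backticks when it is not plain."""
    if _PLAIN.match(name):
        return name
    # A literal backtick inside a quoted name is escaped by doubling it.
    return "`" + name.replace("`", "``") + "`"

def knn_node_properties(prop, metric="EUCLIDEAN"):
    """Build the nodeProperties map literal used in the call above."""
    return "{" + cypher_key(prop) + ': "' + metric + '"}'

print(knn_node_properties("fastrp-embedding"))
print(knn_node_properties("fastrpEmbedding"))
```

The first call prints the map with a backtick-quoted key, the second leaves the camel-case name unquoted.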
My goal is basically to implement global top-k subsampling. Gradient sparsification is quite simple and I have already done it by building on the stateful clients example, but now I would like to use encoders as you have recommended here at page 28. Additionally, I would like to average only the non-zero gradients: say we have 10 clients but only 4 have non-zero gradients at a given position in a communication round, then I would like to divide the sum of these gradients by 4, not 10. I am hoping to achieve this by summing gradients in the numerator and masks (1s and 0s) in the denominator. Moving forward I will also add randomness to the gradient selection, so it is imperative that I create those masks concurrently with gradient selection. The code I have right now is
import tensorflow as tf
from tensorflow_model_optimization.python.core.internal import tensor_encoding as te


@te.core.tf_style_adaptive_encoding_stage
class GradientSparsificationEncodingStage(te.core.AdaptiveEncodingStageInterface):
    """An example custom implementation of an `EncodingStageInterface`.

    Note: This is likely not what one would want to use in practice. Rather, this
    serves as an illustration of how a custom compression algorithm can be
    provided to `tff`.

    This encoding stage is expected to be run in an iterative manner, and
    alternatively zeroes out values corresponding to odd and even indices. Given
    the determinism of the non-zero indices selection, the encoded structure does
    not need to be represented as a sparse vector, but only the non-zero values
    are necessary. In the decode method, the state (i.e., params derived from the
    state) is used to reconstruct the corresponding indices.

    Thus, this example encoding stage can realize representation saving of 2x.
    """

    ENCODED_VALUES_KEY = 'stateful_topk_values'
    INDICES_KEY = 'indices'
    SHAPES_KEY = 'shapes'
    ERROR_COMPENSATION_KEY = 'error_compensation'

    def encode(self, x, encode_params):
        shapes_list = [tf.shape(y) for y in x]
        flattened = tf.nest.map_structure(lambda y: tf.reshape(y, [-1]), x)
        gradients = tf.concat(flattened, axis=0)
        error_compensation = encode_params[self.ERROR_COMPENSATION_KEY]
        gradients_and_error_compensation = tf.math.add(gradients, error_compensation)
        percentage = tf.constant(0.1, dtype=tf.float32)
        k_float = tf.multiply(percentage, tf.cast(tf.size(gradients_and_error_compensation), tf.float32))
        k_int = tf.cast(tf.math.round(k_float), dtype=tf.int32)
        values, indices = tf.math.top_k(tf.math.abs(gradients_and_error_compensation), k=k_int, sorted=False)
        indices = tf.expand_dims(indices, 1)
        sparse_gradients_and_error_compensation = tf.scatter_nd(indices, values, tf.shape(gradients_and_error_compensation))
        new_error_compensation = tf.math.subtract(gradients_and_error_compensation, sparse_gradients_and_error_compensation)
        state_update_tensors = {self.ERROR_COMPENSATION_KEY: new_error_compensation}
        encoded_x = {self.ENCODED_VALUES_KEY: values,
                     self.INDICES_KEY: indices,
                     self.SHAPES_KEY: shapes_list}
        return encoded_x, state_update_tensors

    def decode(self,
               encoded_tensors,
               decode_params,
               num_summands=None,
               shape=None):
        del num_summands, decode_params, shape  # Unused.
        flat_shape = tf.math.reduce_sum([tf.math.reduce_prod(shape) for shape in encoded_tensors[self.SHAPES_KEY]])
        sizes_list = [tf.math.reduce_prod(shape) for shape in encoded_tensors[self.SHAPES_KEY]]
        scatter_tensor = tf.scatter_nd(
            indices=encoded_tensors[self.INDICES_KEY],
            updates=encoded_tensors[self.ENCODED_VALUES_KEY],
            shape=[flat_shape])
        nonzero_locations = tf.nest.map_structure(lambda x: tf.cast(tf.where(tf.math.greater(x, 0), 1, 0), tf.float32), scatter_tensor)
        reshaped_tensor = [tf.reshape(flat_tensor, shape=shape) for flat_tensor, shape in
                           zip(tf.split(scatter_tensor, sizes_list), encoded_tensors[self.SHAPES_KEY])]
        reshaped_nonzero = [tf.reshape(flat_tensor, shape=shape) for flat_tensor, shape in
                            zip(tf.split(nonzero_locations, sizes_list), encoded_tensors[self.SHAPES_KEY])]
        return reshaped_tensor, reshaped_nonzero

    def initial_state(self):
        return {self.ERROR_COMPENSATION_KEY: tf.constant(0, dtype=tf.float32)}

    def update_state(self, state, state_update_tensors):
        return {self.ERROR_COMPENSATION_KEY: state_update_tensors[self.ERROR_COMPENSATION_KEY]}

    def get_params(self, state):
        encode_params = {self.ERROR_COMPENSATION_KEY: state[self.ERROR_COMPENSATION_KEY]}
        decode_params = {}
        return encode_params, decode_params

    @property
    def name(self):
        return 'gradient_sparsification_encoding_stage'

    @property
    def compressible_tensors_keys(self):
        return False

    @property
    def commutes_with_sum(self):
        return False

    @property
    def decode_needs_input_shape(self):
        return False

    @property
    def state_update_aggregation_modes(self):
        return {}
I have run some simple tests manually following the steps you outlined here at page 45. It works, but I have some questions/problems.
1. When I use a list of tensors of the same shape (e.g. two 2x25 tensors) as the input x of encode, it works without any issues, but when I try to use a list of tensors of different shapes (2x20 and 6x10) it gives an error saying
InvalidArgumentError: Shapes of all inputs must match: values[0].shape = [2,20] != values[1].shape = [6,10] [Op:Pack] name: packed
How can I resolve this issue? As I said, I want to use global top-k, so it is essential that I encode all of the trainable model weights at once. Take the CNN model used here: all the tensors have different shapes.
2. How can I do the averaging I described at the beginning? For example, here you have done
mean_factory = tff.aggregators.MeanFactory(
    tff.aggregators.EncodedSumFactory(mean_encoder_fn),  # numerator
    tff.aggregators.EncodedSumFactory(mean_encoder_fn),  # denominator
)
Is there a way to repeat this with one output of decode going to the numerator and the other going to the denominator? How can I handle dividing 0 by 0? TensorFlow has a divide_no_nan function; can I use it somehow, or do I need to add eps to each?
3. How is partitioning handled when I use encoders? Does each client get a unique encoder holding a unique state for it? As you have discussed here at page 6, client states are used in cross-silo settings, yet what happens if the client ordering changes?
4. Here you have recommended using the stateful clients example. Can you explain this a bit further? I mean, in run_one_round, where exactly do the encoders go, and how are they used/combined with the client update and aggregation?
5. I have some additional information, such as sparsity, that I want to pass to encode. What is the suggested method for doing that?
Here are some answers, hope they help:
1. If you want to treat all of the aggregated structure as a single tensor, use concat_factory as the outermost aggregator. That will concatenate the entire structure into a rank-1 Tensor at the clients, and then unpack it back to the original structure at the end. Example use: tff.aggregators.concat_factory(tff.aggregators.MeanFactory(...))
Note that the encoding stage objects are meant to work with a single tensor, so what you describe with identical tensors probably works only accidentally.
2. There are two options.
a. Modify the client training code such that the weights being passed to the weighted aggregator are already what you want them to be (a zero/one mask). In the stateful clients example you link, that would be here. You will then get what you need by default (by summing the numerator).
b. Modify UnweightedMeanFactory to do exactly the variant of averaging you describe and use that. A good place to start would be modifying this.
3. (and 4.) I think that is what you would need to implement. In the same way that the existing client states are initialized in the example here, you would need to extend it to contain the aggregator states, and make sure those are sampled together with the clients, as done here. Then, to integrate the aggregators in the example, you would need to replace this hard-coded tff.federated_mean. An example of such integration is in the implementation of tff.learning.build_federated_averaging_process, primarily here.
5. I am not sure what the question is. Perhaps get the previous items working (they seem like a prerequisite to me), and then clarify and ask in a new post?
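To make the averaging described in the question concrete (sum of sparsified gradients divided by the sum of 0/1 masks, with 0/0 treated as 0, which mirrors what tf.math.divide_no_nan does), here is a plain-Python sketch. It deliberately avoids TFF and the tensor_encoding API: top_k_sparsify and masked_mean are hypothetical helper names, and the client gradients are made-up values.

```python
def top_k_sparsify(grad, k):
    """Keep the k largest-magnitude entries of `grad`; return (sparse, mask)."""
    order = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)
    keep = set(order[:k])
    sparse = [g if i in keep else 0.0 for i, g in enumerate(grad)]
    mask = [1.0 if i in keep else 0.0 for i in range(len(grad))]
    return sparse, mask

def masked_mean(sparse_grads, masks):
    """Per-position sum of gradients over sum of masks; 0/0 yields 0."""
    numerator = [sum(col) for col in zip(*sparse_grads)]
    denominator = [sum(col) for col in zip(*masks)]
    return [n / d if d else 0.0 for n, d in zip(numerator, denominator)]

# Hypothetical flattened gradients from three clients.
clients = [
    [0.5, -2.0, 0.1, 0.0],
    [1.5, 0.2, -0.3, 0.0],
    [-0.1, 4.0, 0.2, 0.05],
]
sparse, masks = zip(*(top_k_sparsify(g, k=1) for g in clients))
print(masked_mean(sparse, masks))
```

Note that the mask here is built from the selected indices rather than by testing whether a value is greater than zero, so negative gradients that survive the top-k selection are still counted in the denominator.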
Is there any way to find the Jacobian of a frame with respect to the joints of a given model (as opposed to the whole plant), or alternatively to determine which columns of the full plant Jacobian correspond to a given model’s joints? I’ve found MultibodyPlant.CalcJacobian*, but I’m not sure if those are the right methods.
I also tried mapping the JointIndex of each joint in the model to a column of MultibodyPlant.CalcJacobian*, but the results didn't make sense -- the joint indices are sequential (all of one model followed by all of the other), but the Jacobian columns look interleaved (a column corresponding to one model followed by one corresponding to the other).
Assuming you are computing with respect to velocities, you'll want to use Joint.velocity_start() and Joint.num_velocities() to create a mask or set of indices. If you are in Python, then you can use NumPy's array slicing to select the desired columns of your Jacobian.
(If you compute w.r.t. position, then make sure you use Joint.position_start() and Joint.num_positions().)
Example notebook:
https://nbviewer.jupyter.org/github/EricCousineau-TRI/repro/blob/eb7f11d/drake_stuff/notebooks/multibody_plant_jacobian_subset.ipynb
(TODO: Point to a more official source.)
Main code to pay attention to:
import numpy as np


def get_velocity_mask(plant, joints):
    """
    Generates a mask according to supplied set of ``joints``.

    The binary mask is unable to preserve ordering for joint indices, thus
    ``joints`` is required to be a ``set`` (for simplicity).
    """
    assert isinstance(joints, set)
    # The `np.bool` alias was removed in recent NumPy releases; use `bool`.
    mask = np.zeros(plant.num_velocities(), dtype=bool)
    for joint in joints:
        start = joint.velocity_start()
        end = start + joint.num_velocities()
        mask[start:end] = True
    return mask


def get_velocity_indices(plant, joints):
    """
    Generates a list of indices according to supplied list of ``joints``.

    The indices are generated according to the order of ``joints``, thus
    ``joints`` is required to be a list (for simplicity).
    """
    indices = []
    for joint in joints:
        start = joint.velocity_start()
        end = start + joint.num_velocities()
        for i in range(start, end):
            indices.append(i)
    return indices
...
# print(Jv1_WG1) # Prints 7 dof from a 14 dof plant
[[0.000 -0.707 0.354 0.707 0.612 -0.750 0.256]
[0.000 0.707 0.354 -0.707 0.612 0.250 0.963]
[1.000 -0.000 0.866 -0.000 0.500 0.612 -0.079]
[-0.471 0.394 -0.211 -0.137 -0.043 -0.049 0.000]
[0.414 0.394 0.162 -0.137 0.014 0.008 0.000]
[0.000 -0.626 0.020 0.416 0.035 -0.064 0.000]]
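For readers without Drake at hand, the column selection can be tried out with stand-in objects. In this sketch FakeJoint is a hypothetical class that only mimics the velocity_start() and num_velocities() accessors used above, and the "Jacobian" is a made-up nested list; it is meant to illustrate the interleaved-column situation from the question, not Drake's actual behavior.

```python
from collections import namedtuple

# Hypothetical stand-in for a joint; NOT a Drake class.
_JointData = namedtuple("_JointData", ["start", "count"])

class FakeJoint(_JointData):
    def velocity_start(self):
        return self.start

    def num_velocities(self):
        return self.count

def velocity_indices(joints):
    """Collect velocity indices in the order the joints are given."""
    indices = []
    for joint in joints:
        start = joint.velocity_start()
        indices.extend(range(start, start + joint.num_velocities()))
    return indices

def select_columns(matrix, indices):
    """Pick the given columns from a row-major nested list."""
    return [[row[i] for i in indices] for row in matrix]

# One model's joints own velocity columns 0 and 2; another model owns 1 and 3.
model_a = [FakeJoint(0, 1), FakeJoint(2, 1)]
full_jacobian = [[10, 11, 12, 13],
                 [20, 21, 22, 23]]
print(select_columns(full_jacobian, velocity_indices(model_a)))
```

With a NumPy array the last step would simply be indexing the columns with the index list.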
I have a distributed dask array with shape (2400,2400) and chunksize (100,100). I thought I could use topk(-n) to find the smallest n values. However, it appears to return an array of shape (2400,n), so it looks like it finds the smallest n in each row. Is there a way to use topk to get the smallest n values across all rows (the entire array)?
One idea is to call topk twice, once for each axis.
>>> dist
dask.array<pow, shape=(2400, 2400), dtype=float64, chunksize=(100, 100)>
>>> dist.topk(-5,axis=0).topk(-5,axis=1).compute()
array([[ 0. , 2620.09503644, 2842.15200157, 2955.08409356,
3163.49458669],
[3660.67698657, 3670.4457495 , 3700.09837707, 3717.09052889,
4002.86497399],
[4125.89820524, 4139.44658137, 4250.50420539, 4331.01304547,
4402.14606754],
[4328.22966119, 4378.25193428, 4507.94409903, 4522.4913488 ,
4555.06860541],
[4441.58755402, 4560.95625938, 4576.39333974, 4682.06215251,
4765.11531865]])
"One idea is to call topk twice, once for each axis."
Sounds good to me!
You might consider flattening the array first, but I can't see an advantage to this over what you've already found.
x.flatten().topk(...)
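One thing worth noting about the two-axis approach: it leaves an n-by-n block of candidates (the 5x5 array above) rather than n values, so a final reduction over the flattened candidates is still needed. The reasoning is that any value among the n smallest overall is also among the n smallest of its own row. The plain-Python sketch below checks that idea with heapq; it is an illustration of the logic only and does not use dask.

```python
import heapq
import random

def smallest_n_global(rows, n):
    """Smallest n values over the whole 2-D structure, ascending."""
    flat = [v for row in rows for v in row]
    return heapq.nsmallest(n, flat)

def smallest_n_two_stage(rows, n):
    """Take n candidates per row first, then reduce the candidates to n."""
    candidates = [v for row in rows for v in heapq.nsmallest(n, row)]
    return heapq.nsmallest(n, candidates)

random.seed(0)
data = [[random.random() for _ in range(50)] for _ in range(50)]
print(smallest_n_global(data, 5) == smallest_n_two_stage(data, 5))
```

The per-row stage only shrinks the problem; the flattened reduction at the end is what produces the global answer.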
I would like to use a clustering algorithm to cluster a big digraph, and I would like to remove noise from this graph too. So I was thinking of using the DBSCAN approach, because I saw that we can give the algorithm a distance function for determining the distance/similarity between two different nodes.
My question is: how can I define a distance function that increases the similarity between two nodes that are close in terms of hops, and decreases it when a node is isolated?
I don't have coordinates or node attributes, so I cannot use those. I only have the topology of the graph.
The expected output will be something like this:
I'm really concerned about the complexity of the solution. How can I approximate a clustering with linear complexity ...
What is wrong with the obvious?
Distance(a,b) = length of shortest path, or infinity if there is none.
You probably should take directions into account, so a0 to a3 is 1.
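A minimal sketch of that distance, assuming an unweighted directed graph stored as an adjacency dict, is a breadth-first search that counts hops and returns infinity when the target is unreachable. The graph below is hypothetical and is not the one from the question.

```python
import math
from collections import deque

def hop_distance(adjacency, source, target):
    """Length of the shortest directed path, or math.inf if there is none."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in adjacency.get(node, ()):
            if nxt == target:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return math.inf

# Hypothetical directed graph: x -> y -> z, plus an isolated node w.
graph = {"x": ["y"], "y": ["z"], "z": [], "w": []}
print(hop_distance(graph, "x", "z"))
print(hop_distance(graph, "z", "x"))
```

Each query costs time linear in the number of nodes and edges reached, so computing all pairwise distances this way would not be linear overall.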
The distance metric suggested by @Anony-Mousse is a good
and natural one, but I question the use of dbscan. Using
the proposed
distance = length of shortest path, or infinity if there is none
Any two nodes that are directly linked would be at distance 1.
If you used dbscan with epsilon < 1, all points would be noise
points. So you will want epsilon > 1. From your example, it looks
like if there is even one point at distance 1, you want them in
the same component so
it looks like you want minNumPts = 2. This will give the
result that if two points are connected by a path of any length
they would be in the same cluster. It looks to me like what
you are after has nothing to do with density and clustering,
rather, I think that what you want is connected components.
If two nodes are connected by a path of any length, they are
in the same component. Finding this via dbscan or some other clustering
method may be possible, but that is probably the
wrong way to think about this. You have a graph and a graph
theoretic problem. You should probably use methods from graph
theory.
I will illustrate using R and igraph. There are other tools
if you don't care for these.
Most of the work is simply setting up your problem.
library(igraph)
to = c("a1", "a2", "a3", "a0", "b1", "b2", "b3", "b0")
from = c("a0", "a1", "a2", "a3", "b0", "b1", "b2", "b3")
EL = data.frame(from, to)
Vert = c("a0", "a1", "a2", "a3", "b0", "b1", "b2", "b3", "c0", "d0")
Vdf = data.frame(Vert)
g = graph_from_data_frame(d = EL, vertices=Vdf)
LO = matrix(c(1.2,1,1,1.2, 2.2,2,2,2.2, 0, 3, 4,3,2,1,4,3,2,1,4,4),
ncol=2)
plot(g, layout=LO)
Now we can use a one-liner to get everything that we need
about the components.
Comp = components(g, mode="weak")
Comp
$membership
a0 a1 a2 a3 b0 b1 b2 b3 c0 d0
1 1 1 1 2 2 2 2 3 4
$csize
[1] 4 4 1 1
$no
[1] 4
This is telling us component membership of the nodes,
the number of nodes per component and the number of
components. Since you wanted to call the single node
components "noise" in the style of dbscan, you can
see that components 3 and 4 have one node each.
They are the noise. The others are "real" components.
To show how to use this and to come to closure with a
pretty picture, I will plot the graph coloring the
components and use light gray for the "noise".
ColorMap = rainbow(Comp$no)
ColorMap[Comp$csize == 1] = "lightgray"
plot(g, layout=LO, vertex.color=ColorMap[Comp$membership])
I encourage you to think about your graph problem as a graph.
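For readers not using R, the same idea can be sketched in plain Python. This is an assumption-laden illustration rather than a translation of igraph: weak_components is a hypothetical helper that ignores edge direction and labels components with a breadth-first search, which runs in time linear in the number of vertices plus edges. The vertex and edge lists are the ones from the R example above.

```python
from collections import Counter, defaultdict, deque

def weak_components(vertices, edges):
    """Map each vertex to a component id, ignoring edge direction."""
    neighbors = defaultdict(set)
    for src, dst in edges:
        neighbors[src].add(dst)
        neighbors[dst].add(src)
    membership = {}
    next_id = 0
    for v in vertices:
        if v in membership:
            continue
        next_id += 1
        membership[v] = next_id
        queue = deque([v])
        while queue:
            node = queue.popleft()
            for nb in neighbors[node]:
                if nb not in membership:
                    membership[nb] = next_id
                    queue.append(nb)
    return membership

vertices = ["a0", "a1", "a2", "a3", "b0", "b1", "b2", "b3", "c0", "d0"]
edges = [("a0", "a1"), ("a1", "a2"), ("a2", "a3"), ("a3", "a0"),
         ("b0", "b1"), ("b1", "b2"), ("b2", "b3"), ("b3", "b0")]
membership = weak_components(vertices, edges)
sizes = Counter(membership.values())
# Single-node components play the role of dbscan-style "noise".
noise = [v for v in vertices if sizes[membership[v]] == 1]
print(membership)
print(noise)
```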
Refer to the Julia documentation:
hist(v[, n]) → e, counts
Compute the histogram of v, optionally using approximately n bins. The return values are a range e, which correspond to the edges of the bins, and counts containing the number of elements of v in each bin. Note: Julia does not ignore NaN values in the computation.
I chose a sample range of data:
testdata=0:1:10;
then used the hist function to calculate the histogram for 1 to 5 bins:
hist(testdata,1) # => (-10.0:10.0:10.0,[1,10])
hist(testdata,2) # => (-5.0:5.0:10.0,[1,5,5])
hist(testdata,3) # => (-5.0:5.0:10.0,[1,5,5])
hist(testdata,4) # => (-5.0:5.0:10.0,[1,5,5])
hist(testdata,5) # => (-2.0:2.0:10.0,[1,2,2,2,2,2])
As you can see, when I ask for 1 bin it calculates 2 bins, and when I ask for 2 bins it calculates 3.
Why does this happen?
As the person who wrote the underlying function: the aim is to get bin widths that are "nice" in terms of a base-10 counting system (i.e. 10^k, 2×10^k, 5×10^k). If you want more control you can also specify the exact bin edges.
The key word in the doc is approximate. You can check what hist is actually doing for yourself in Julia's base module here.
When you do hist(test,3), you're actually calling
hist(v::AbstractVector, n::Integer) = hist(v,histrange(v,n))
That is, in a first step the n argument is converted into a FloatRange by the histrange function, the code of which can be found here. As you can see, the calculation of these steps is not entirely straightforward, so you should play around with this function a bit to figure out how it is constructing the range that forms the basis of the histogram.
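To get a feel for how "approximately n bins" can turn into n + 1, here is a Python sketch of a nice-width rule for integer data. It is my reading of the behavior described above, written so that it reproduces the five ranges shown in the question; it is not a verified port of histrange, and nice_bin_range is a hypothetical name. The assumption is that the raw width (hi - lo) / n is rounded up to 1, 2, 5, or 10 times a power of ten, and that the first edge is placed one step below the minimum so that the minimum itself falls inside a bin.

```python
import math

def nice_bin_range(lo, hi, n):
    """Return (start, step, stop) for roughly n bins with a 'nice' step."""
    if hi == lo:
        step = 1
    else:
        raw_width = (hi - lo) / n
        power = 10 ** max(0, math.floor(math.log10(raw_width)))
        ratio = raw_width / power
        if ratio <= 1:
            step = power
        elif ratio <= 2:
            step = 2 * power
        elif ratio <= 5:
            step = 5 * power
        else:
            step = 10 * power
    # First edge sits strictly below lo, which adds a bin for the minimum.
    start = step * (math.ceil(lo / step) - 1)
    num_bins = math.ceil((hi - start) / step)
    return start, step, start + num_bins * step

for n in range(1, 6):
    print(n, nice_bin_range(0, 10, n))
```

Under these assumptions the extra bin in the question comes from the first edge sitting below the minimum: the counts [1, 10] for n = 1 are consistent with 0 landing alone in the first bin and 1 through 10 in the second.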