I want to understand how to compute big-O for a dense versus sparse graph.
"Algorithms in a nutshell" says that for sparse graph, O(E) is O(V) and for dense graph O(E) is closer to O(V^2). Does anyone know how is that derived?
Assuming the graph is simple, in the worst case every node can be connected to all |V|-1 other nodes. In an undirected graph this gives |E| = (|V|-1) + (|V|-2) + ... + 1 = |V| * (|V|-1) / 2 = O(|V|^2), and in a directed graph: |E| = |V| * (|V|-1) = O(|V|^2).
A good example of a dense graph is a clique, which has all possible edges.
For a sparse graph, we assume the number of edges connected to each vertex is bounded by a constant. Let this constant be k. Thus |E| <= k * |V|, and we get |E| = O(|V|).
A good example of a sparse graph is the internet, where every URL is a node and every link is an edge.
Note that if the graph is not simple, you cannot bound |E| with any function of |V|: for example, a multigraph on just two vertices can have arbitrarily many parallel edges.
It's not derived, it's a definition. In a fully connected (directed) graph with self-loops, the number of edges |E| = |V|² so the definition of a dense graph is reasonable. The definition of a sparse graph is one where O(|E|) = O(|V|), so there's a constant maximum number of edges per vertex.
Note that if the number of edges is much smaller, e.g. O(lg |V|), then it's still O(|V|) as well. One could imagine a "semi-sparse" class of graphs with |E| = O(|V| lg |V|) or something like that, but I personally have never encountered such a class in practice.
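For intuition, here is a small Python sketch (illustrative, not from the book) that computes where a simple undirected graph falls between the two regimes:

def density(num_vertices, num_edges):
    # Fraction of the maximum possible edges in a simple undirected graph
    max_edges = num_vertices * (num_vertices - 1) // 2
    return num_edges / max_edges

# A clique on 100 vertices is dense: every possible edge exists
print(density(100, 100 * 99 // 2))  # 1.0
# With at most k = 3 edges per vertex, |E| <= 3|V| and density shrinks as |V| grows
print(density(100, 3 * 100))  # ~0.06

A density near 1 signals the O(|V|^2) regime; a density that keeps shrinking as |V| grows signals the O(|V|) regime.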
I am trying to understand the difference between these two steps, and the error I am receiving. I followed this tutorial to practice KNN with my own data (https://towardsdatascience.com/create-a-similarity-graph-from-node-properties-with-neo4j-2d26bb9d829e).
During the process we project our graph of interest, which in my case contains three node properties: bd_load, weight, and length of organisms. Following the example, we use the code below to create a scaledProperties vector from the 3 variables.
Project graph
//(5) project graph of interest
CALL gds.graph.project('bd_graph',
'node_sim',
'*',
{nodeProperties:['bd_load', 'weight', 'length']})
Scale variables of interest between 0-1 for future Euclidean distance calculation
//(6) add scalar 0-1
CALL gds.alpha.scaleProperties.mutate('bd_graph',
{nodeProperties:['bd_load', 'weight', 'length'],
scaler:'MinMax',
mutateProperty:'scaledProperties'})
YIELD nodePropertiesWritten
We can then run KNN based on Euclidean distance.
//(8) project relationship to graph
CALL gds.knn.mutate("bd_graph",
{nodeProperties: {scaledProperties: "EUCLIDEAN"},
topK: 15,
mutateRelationshipType: "IS_SIMILAR",
mutateProperty: "similarity",
similarityCutoff: 0.6409912109375,
sampleRate:1,
randomSeed:42,
concurrency:1}
)
However, as I continue the learning curve with Neo4j and FastRP, I am trying to understand the difference between the scaled properties and FastRP. Today I tried to create graph embeddings for my 3 variables using FastRP with 8 dimensions on my projected graph, without running the scaled-properties step. My thought was that increasing the dimensions would be better for finding similarities between nodes. The code below runs fine and produces an embedding vector with 8 elements.
FastRP
CALL gds.fastRP.mutate(
'bd_graph',
{
embeddingDimension: 8,
mutateProperty: 'fastrp-embedding',
featureProperties: ['bd_load', 'weight', 'length']
}
)
YIELD nodePropertiesWritten
But when I run the below code
CALL gds.knn.stats("bd_graph",
{
nodeProperties:{fastrp-embedding:"EUCLIDEAN"},
topK:10,
sampleRate:1,
randomSeed:42,
concurrency:1
}
) YIELD similarityDistribution
RETURN similarityDistribution
I receive an error:
Invalid input '{': expected "+" or "-" (line 4, column 22 (offset: 97))
nodeProperties:{fastrp-embedding:"EUCLIDEAN"},
Does the embedding vector length have to match the number of variables on the node? Am I using FastRP correctly, and is my understanding right that I can create embeddings within nodes and then calculate Euclidean distance for a similarity score?
I am glad you are finding the tutorial helpful and getting into GDS!
Map keys in Cypher must be strings. https://neo4j.com/docs/cypher-manual/current/syntax/maps/
The - in your property name fastrp-embedding is not recognized as part of an unquoted map key. If you enclose that property name in backticks, GDS will treat the special character as part of the map key. This should work for you:
CALL gds.knn.stats("bd_graph",
{
nodeProperties:{`fastrp-embedding`:"EUCLIDEAN"},
topK:10,
sampleRate:1,
randomSeed:42,
concurrency:1
}
) YIELD similarityDistribution
RETURN similarityDistribution
The recommended format for Neo4j property names is camel case. If you name your property fastrpEmbedding instead of fastrp-embedding, you would not need to use the back ticks.
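As a hedged sketch (connection details are placeholders, and the Cypher is just the FastRP call from above with a camelCase property name), this is how the backtick-free variant could be run from Python with the official neo4j driver:

from neo4j import GraphDatabase

# Placeholder connection details; adjust to your deployment
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # mutateProperty uses camelCase, so later calls need no backticks
    session.run("""
        CALL gds.fastRP.mutate('bd_graph', {
          embeddingDimension: 8,
          mutateProperty: 'fastrpEmbedding',
          featureProperties: ['bd_load', 'weight', 'length']
        })
    """)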
I am using the function cv.matchTemplate to try to find template matches.
result = cv.matchTemplate(img, templ, match_method)
After I run the function I have a bunch of answers in the array result. I want to filter it to find the best n matches. The data in result is just a large array of numbers, so I don't know what criteria to filter on. Using extremes = cv.minMaxLoc(result, None) filters the results in an undesired way before converting them to locations.
The match_method is cv.TM_SQDIFF. I want to:
filter the results down to the best matches
Use the results to obtain the locations
How can I achieve this?
You can threshold the result of matchTemplate to find locations with a sufficient match. This tutorial should get you started; read the bottom of the page for finding multiple matches.
import cv2 as cv
import numpy as np

# With cv.TM_SQDIFF(_NORMED), lower values mean better matches; a threshold
# of 0.2 assumes the normalized variant cv.TM_SQDIFF_NORMED is used
threshold = 0.2
loc = np.where(result <= threshold)  # filter the results
h, w = templ.shape[:2]  # template height and width
for pt in zip(*loc[::-1]):  # pt marks the (x, y) top-left corner of a match
    cv.rectangle(img, pt, (pt[0] + w, pt[1] + h), (0, 0, 255), 2)
Keep in mind that the matching method you use determines how you filter. cv.TM_SQDIFF tends toward zero as the match quality increases, so setting the threshold closer to zero filters out worse matches. The opposite is true for the cv.TM_CCORR, cv.TM_CCORR_NORMED, cv.TM_CCOEFF and cv.TM_CCOEFF_NORMED matching methods (better matches tend toward 1).
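For a single best match, a minimal sketch (assuming result and match_method as in the question) branches on the method when picking the extremum:

import cv2 as cv

# cv.minMaxLoc returns (minVal, maxVal, minLoc, maxLoc) for a single-channel array
min_val, max_val, min_loc, max_loc = cv.minMaxLoc(result)
if match_method in (cv.TM_SQDIFF, cv.TM_SQDIFF_NORMED):
    best_loc = min_loc  # squared-difference methods: lower is better
else:
    best_loc = max_loc  # correlation/coefficient methods: higher is better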
The above answer does not find the best N matches, as the question asked. It filters answers based on a threshold, leaving open the (likely) possibility that you still have more than N results, or zero results that beat the threshold.
To find the N 'best matches' we're looking for the N lowest numbers in a 2D array (the question uses cv.TM_SQDIFF, where lower values are better) and retrieving their indexes so we know the locations. We can use numpy.argpartition to find the lowest N indexes in a 1D array and numpy.ndarray.flatten with numpy.unravel_index to go back and forth between a 2D and 1D array, like so:
import numpy as np

find_num = 5
result = cv.matchTemplate(img, templ, match_method)
# cv.TM_SQDIFF: lower is better, so take the indexes of the find_num smallest values
idx_1d = np.argpartition(result.flatten(), find_num)[:find_num]
idx_2d = np.unravel_index(idx_1d, result.shape)
From here you have the locations of the top 5 matches; note that idx_2d is in (row, column) order, i.e. (y, x).
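As a usage sketch (assuming img and templ from the question, and remembering that numpy indexes are (row, column)):

h, w = templ.shape[:2]  # template height and width
for y, x in zip(*idx_2d):
    cv.rectangle(img, (int(x), int(y)), (int(x) + w, int(y) + h), (0, 0, 255), 2)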
I am trying to use osmnx to find distances between an origin point (lat/lon) and the nearest infrastructure, such as railways, water or parks.
1) I get the entire graph from an area with network_type='walk'.
2) Get the needed infrastructure, e.g. railway for that same area.
3) Compose the two graphs into one.
4) Find the nearest node from origin point in the original graph.
5) Find the nearest node from the origin point in the infrastructure graph
6) Find the shortest route length between the two nodes.
If you run the example below, you will see that it is missing 20% of the data because it cannot find a route between the nodes. For infrastructure='way["leisure"~"park"]' or infrastructure='way["natural"~"wood"]' this is even worse, with 80-90% of nodes not being connected.
Minimal reproducible example:
import osmnx as ox
import networkx as nx
bbox = [55.5267243, 55.8467243, 12.4100724, 12.7300724]
g = ox.graph_from_bbox(bbox[0], bbox[1], bbox[2], bbox[3],
retain_all=True,
truncate_by_edge=True,
simplify=False,
network_type='walk')
points = [(55.6790884456018, 12.568493971506154),
(55.6790884456018, 12.568493971506154),
(55.6867418740291, 12.58232314016353),
(55.6867418740291, 12.58232314016353),
(55.6867418740291, 12.58232314016353),
(55.67119624894504, 12.587201455313153),
(55.677406927839506, 12.57651997656002),
(55.6856574907879, 12.590500429002823),
(55.6856574907879, 12.590500429002823),
(55.68465359365924, 12.585474365063224),
(55.68153666806675, 12.582594757267945),
(55.67796979175, 12.583111746311117),
(55.68767346629932, 12.610040871066179),
(55.6830855237578, 12.575431380892427),
(55.68746749645466, 12.589488615911913),
(55.67514254640597, 12.574308210656602),
(55.67812748568291, 12.568454119053886),
(55.67812748568291, 12.568454119053886),
(55.6701733527419, 12.58989203029166),
(55.677700136266616, 12.582800629527789)]
railway = ox.graph_from_bbox(bbox[0], bbox[1], bbox[2], bbox[3],
retain_all=True,
truncate_by_edge=True,
simplify=False,
network_type='walk',
infrastructure='way["railway"]')
g_rail = nx.compose(g, railway)
l_rail = []
for point in points:
    nearest_node = ox.get_nearest_node(g, point)
    rail_nn = ox.get_nearest_node(railway, point)
    if nx.has_path(g_rail, nearest_node, rail_nn):
        l_rail.append(nx.shortest_path_length(g_rail, nearest_node, rail_nn, weight='length'))
    else:
        l_rail.append(-1)
There are 2 things that caught my attention.
The OSMnx documentation specifies that the ox.graph_from_bbox parameters be given in the order north, south, east, west (https://osmnx.readthedocs.io/en/stable/osmnx.html). I mention this because when I tried to run your code, I was getting empty graphs.
The parameter retain_all=True is the key, as you may already know. When set to True, it retains all nodes in the graph, even if they are not connected to any of the other nodes. This happens primarily due to the incompleteness of OpenStreetMap, which contains voluntarily contributed geographic information. I suggest you set retain_all=False, so that your graph contains only the connected nodes. In this way, you get a complete list without any -1.
I hope this helps.
g = ox.graph_from_bbox(bbox[1], bbox[0], bbox[3], bbox[2],
retain_all=False,
truncate_by_edge=True,
simplify=False,
network_type='walk')
railway = ox.graph_from_bbox(bbox[1], bbox[0], bbox[3], bbox[2],
retain_all=False,
truncate_by_edge=True,
simplify=False,
network_type='walk',
infrastructure='way["railway"]')
g_rail = nx.compose(g, railway)
l_rail = []
for point in points:
    nearest_node = ox.get_nearest_node(g, point)
    rail_nn = ox.get_nearest_node(railway, point)
    if nx.has_path(g_rail, nearest_node, rail_nn):
        l_rail.append(nx.shortest_path_length(g_rail, nearest_node, rail_nn, weight='length'))
    else:
        l_rail.append(-1)
print(l_rail)
Out[60]:
[7182.002999999995,
7182.002999999995,
5060.562000000002,
5060.562000000002,
5060.562000000002,
6380.099999999999,
7127.429999999996,
4707.014000000001,
4707.014000000001,
5324.400000000003,
6153.250000000002,
6821.213000000002,
8336.863999999998,
6471.305,
4509.258000000001,
5673.294999999996,
6964.213999999994,
6964.213999999994,
6213.673,
6860.350000000001]
I would like to use a clustering algorithm to find a clustering for a big digraph, and I would also like to remove noise from this graph. So I was thinking of using the DBSCAN approach, because I saw that we can give the algorithm a distance function for determining the distance/similarity between two different nodes.
My question is: how can I define a distance function that increases the similarity between two nodes that are close in terms of hops, and decreases it when a node is isolated?
I don't have coordinates or node attributes, so I can not use those. I only have the topology of the graph.
The expected output will be something like this:
I'm really concerned about the complexity of the solution. How can I approximate a clustering with linear complexity ...
What is wrong with the obvious?
Distance(a,b) = length of shortest path, or infinity if there is none.
You probably should take directions into account, so a0 to a3 is 1.
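In Python with networkx, a minimal sketch of that metric could look like this (g is assumed to be your DiGraph; the helper name is illustrative):

import math
import networkx as nx

def hop_distance(g, a, b):
    # Length of the shortest directed path from a to b; infinity if unreachable
    try:
        return nx.shortest_path_length(g, a, b)
    except nx.NetworkXNoPath:
        return math.inf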
The distance metric suggested by @Anony-Mousse is a good and natural one, but I question the use of dbscan. Using the proposed
distance = length of shortest path, or infinity if there is none
any two nodes that are directly linked would be at distance 1. If you used dbscan with epsilon < 1, all points would be noise points, so you will want epsilon > 1. From your example, it looks like if there is even one point at distance 1, you want them in the same component, so it looks like you want minNumPts = 2. This gives the result that if two points are connected by a path of any length, they would be in the same cluster. It looks to me like what you are after has nothing to do with density and clustering; rather, I think that what you want is connected components. If two nodes are connected by a path of any length, they are in the same component. Finding this via dbscan or some other clustering method may be possible, but that is probably the wrong way to think about this. You have a graph and a graph theoretic problem. You should probably use methods from graph theory.
I will illustrate using R and igraph. There are other tools if you don't care for these.
Most of the work is simply setting up your problem.
library(igraph)
to = c("a1", "a2", "a3", "a0", "b1", "b2", "b3", "b0")
from = c("a0", "a1", "a2", "a3", "b0", "b1", "b2", "b3")
EL = data.frame(from, to)
Vert = c("a0", "a1", "a2", "a3", "b0", "b1", "b2", "b3", "c0", "d0")
Vdf = data.frame(Vert)
g = graph_from_data_frame(d = EL, vertices=Vdf)
LO = matrix(c(1.2,1,1,1.2, 2.2,2,2,2.2, 0, 3, 4,3,2,1,4,3,2,1,4,4),
ncol=2)
plot(g, layout=LO)
Now we can use a one-liner to get everything that we need about the components.
Comp = components(g, mode="weak")
Comp
$membership
a0 a1 a2 a3 b0 b1 b2 b3 c0 d0
1 1 1 1 2 2 2 2 3 4
$csize
[1] 4 4 1 1
$no
[1] 4
This is telling us the component membership of the nodes, the number of nodes per component, and the number of components. Since you wanted to call the single node components "noise" in the style of dbscan, you can see that components 3 and 4 have one node each. They are the noise. The others are "real" components.
To show how to use this and to come to closure with a pretty picture, I will plot the graph, coloring the components and using light gray for the "noise".
ColorMap = rainbow(Comp$no)
ColorMap[Comp$csize == 1] = "lightgray"
plot(g, layout=LO, vertex.color=ColorMap[Comp$membership])
I encourage you to think about your graph problem as a graph.
I am trying to extract Rotation matrix and Translation vector from the essential matrix.
SVD svd(E,SVD::MODIFY_A);
Mat svd_u = svd.u;
Mat svd_vt = svd.vt;
Mat svd_w = svd.w;
Matx33d W(0,-1,0,
1,0,0,
0,0,1);
Mat_<double> R = svd_u * Mat(W).t() * svd_vt; //or svd_u * Mat(W) * svd_vt;
Mat_<double> t = svd_u.col(2); //or -svd_u.col(2)
However, when I am using R and T (e.g. to obtain rectified images), the result does not seem to be right (black images or some obviously wrong outputs), even though I used different combinations of the possible R and T.
I suspect E itself. According to the textbooks, my calculation is right if we have:
E = U*diag(1, 1, 0)*Vt
In my case svd.w, which is supposed to be diag(1, 1, 0) [at least up to scale], is not so. Here is an example of my output:
svd.w = [21.47903827647813; 20.28555196246256; 5.167099204708699e-010]
Also, two of the eigenvalues of E should be equal and the third one should be zero. In the same case the result is:
eigenvalues of E = 0.0000 + 0.0000i, 0.3143 +20.8610i, 0.3143 -20.8610i
As you see, two of them are complex conjugates.
Now, the questions are:
Is the decomposition of E and the calculation of R and T done in the right way?
If the calculation is right, why are the internal rules of the essential matrix not satisfied by the results?
If everything about E, R, and T is fine, why are the rectified images obtained from them not correct?
I get E from the fundamental matrix, which I suppose is right. I draw epipolar lines on both the left and right images, and they all pass through the related points (for all 16 points used to calculate the fundamental matrix).
Any help would be appreciated.
Thanks!
I see two issues.
First, discounting the negligible value of the third diagonal term, your E is about 6% off the ideal one: err_percent = (21.48 - 20.29) / 20.29 * 100. Sounds small, but translated into pixel error it may be an altogether larger amount.
So I'd start by replacing E with the ideal one after SVD decomposition: Er = U * diag(1,1,0) * Vt.
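In numpy terms, a minimal sketch of that replacement (assuming E holds the 3x3 essential matrix; this mirrors, rather than replaces, the OpenCV code above):

import numpy as np

# Re-decompose E and force the ideal singular values (1, 1, 0)
U, S, Vt = np.linalg.svd(E)
Er = U @ np.diag([1.0, 1.0, 0.0]) @ Vt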
Second, the textbook decomposition admits 4 solutions, only one of which is physically plausible (i.e. with 3D points in front of the camera). You may be hitting one of the non-physical ones. See http://en.wikipedia.org/wiki/Essential_matrix#Determining_R_and_t_from_E .
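A hedged sketch of enumerating the four candidate (R, t) pairs from that construction (U and Vt as above; the physical solution is the one that puts triangulated 3D points in front of both cameras):

W = np.array([[0., -1., 0.],
              [1.,  0., 0.],
              [0.,  0., 1.]])
R1 = U @ W @ Vt    # candidate rotation 1
R2 = U @ W.T @ Vt  # candidate rotation 2
t = U[:, 2]        # translation, up to sign and scale
# If a candidate has det(R) == -1, negate it before testing all four combinations
candidates = [(R1, t), (R1, -t), (R2, t), (R2, -t)]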