I have two DiGraphs, say G and H, and I would like to count how many paths of G are also present in H.
For any node pair (src, dst) I can generate the paths between them using the 'all_simple_paths' function to get the generators:
G_gen = nx.all_simple_paths(G, src, dst)
H_gen = nx.all_simple_paths(H, src, dst)
Since the number of paths is considerably high (the graphs typically have 100 nodes), I cannot resort to building lists (e.g. list(G_gen)), so I am wondering if there are smarter ways to deal with it. In addition, I would also like to distinguish the counts by path length.
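For concreteness, here is a minimal sketch of the kind of streaming check I have in mind: it counts the paths of G whose edges all exist in H, bucketed by length, without ever materializing a list of paths (though it still has to enumerate every path of G, which is the part I would like to avoid):

from collections import Counter
import networkx as nx

def count_paths_of_G_in_H(G, H, src, dst):
    # Stream the simple paths of G; never build list(G_gen).
    counts = Counter()
    for path in nx.all_simple_paths(G, src, dst):
        # A path of G is also a path of H iff every edge exists in H.
        if all(H.has_edge(u, v) for u, v in zip(path, path[1:])):
            counts[len(path) - 1] += 1  # key = path length in edges
    return counts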
Or maybe a better solution can be found with a different module?
Thanks in advance for any help on this.
Thierry
I wonder if there is some reason why nx.intersection (see here) wouldn't work? I'm not sure whether it checks edge direction under the hood, but it doesn't seem to force the output to an undirected Graph either. The below might work:
import networkx as nx

# Create a couple of random preferential attachment graphs
G = nx.barabasi_albert_graph(100, 5)
H = nx.barabasi_albert_graph(100, 5)
# Convert to directed
G = G.to_directed()
H = H.to_directed()
# Get intersection
intersection = nx.intersection(G, H)
# Print info for each
print(nx.info(G))
print(nx.info(H))
print(nx.info(intersection))
which outputs:
>>> DiGraph with 100 nodes and 950 edges
>>> DiGraph with 100 nodes and 950 edges
>>> DiGraph with 100 nodes and 176 edges
The nodes are all shared in the example since the node ids are just simple integers and so they follow the same generation index. With real data I suppose your node sets might not be equivalent like here and you probably will see differences there too.
On the path lengths, I'm not quite sure how you would go about that. The intersection just checks which nodes and edges are shared between the two graphs and returns those present in both, unaware of any other conditions, I suspect. There might be a way to impose additional constraints by adapting the source code of the intersection function with some conditional checks.
I guess this doesn't check the number of paths but rather the number of edges, so I suppose you're looking for something more specific than this. But at the very least, no path can exist outside of the intersection: a shared path must use edges present in both graphs, so if any of its edges is missing from either graph, it cannot be a shared path.
Hope this helps in some way shape or form, though I feel I've oversimplified your question quite a bit.
EDIT: Intuitively, the full solution to your question might be to simply enumerate all possible paths in the intersection.
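Following that edit, a minimal sketch (assuming, as nx.intersection requires, that G and H share the same node set): every simple path of the intersection uses only edges present in both graphs, so it is a path of both G and H, and vice versa.

from collections import Counter
import networkx as nx

def shared_path_counts(G, H, src, dst):
    shared = nx.intersection(G, H)   # edges present in both G and H
    counts = Counter()
    for path in nx.all_simple_paths(shared, src, dst):
        counts[len(path) - 1] += 1   # key = path length in edges
    return counts

This also covers the path-length part of the question: the Counter keys are the lengths in edges.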
I am struggling to find an efficient algorithm that will give me all possible paths between 2 nodes in a directed graph.
I found the RGL gem, the fastest so far in terms of calculations. I am able to find the shortest path using Dijkstra's shortest path algorithm from the gem.
I googled and, despite finding many solutions (Ruby and non-Ruby), I either couldn't convert the code or found that it took forever to calculate (inefficient).
I am here primarily to ask if someone can suggest how to find all paths by using or tweaking the various algorithms from the RGL gem itself (if possible), or in some other efficient way.
The input for the directed graph can be an array of arrays:
[[1,2], [2,3], ..]
P.S.: Just to avoid negative votes/comments: unfortunately I don't have the inefficient code snippet to show, as I discarded it days ago and didn't save it anywhere to reproduce here.
The main problem is that the number of paths between two nodes grows exponentially with the overall number of nodes. Thus any algorithm that finds all paths between two nodes will be very slow on larger graphs.
Example:
As an example, imagine a grid of n x n nodes, each connected to its 4 neighbors. Now you want to find all paths from the bottom-left node to the top-right node. Even when you only allow moves to the right (r) and moves up (u), each resulting path can be described by a string of length 2n with an equal number of r's and u's. This gives you "2n choose n" possible paths (ignoring other moves and cycles).
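To make both points concrete, here is a minimal sketch in Python (the question is about Ruby, but the algorithm translates directly): a depth-first enumeration of all simple paths over the edge-list input format from the question, plus the "2n choose n" count from the grid example:

import math
from collections import defaultdict

def all_simple_paths(edges, src, dst):
    # Depth-first enumeration of all simple paths from src to dst.
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)

    path, seen = [src], {src}

    def dfs(node):
        if node == dst:
            yield list(path)
            return
        for nxt in adj[node]:
            if nxt not in seen:        # keep the path simple (no revisits)
                seen.add(nxt)
                path.append(nxt)
                yield from dfs(nxt)
                path.pop()
                seen.remove(nxt)

    yield from dfs(src)

print(list(all_simple_paths([[1, 2], [2, 3], [1, 3]], 1, 3)))
# -> [[1, 2, 3], [1, 3]]

# The grid example: even restricted to right/up moves, an n x n grid
# already has "2n choose n" corner-to-corner paths.
for n in (5, 10, 20):
    print(n, math.comb(2 * n, n))      # 252, 184756, 137846528820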
Here is the problem I have been trying to solve for one complete year, with no success. I have to seek help and a concrete solution from the Stack Overflow experts.
My problem statement:
I have been working with some design patterns for which I want to determine, programmatically, whether an Eulerian path exists (as shown in the GIFs below). Below are the patterns and the way I want to draw them (GIFs).
What I want to achieve:
Given the design pattern images as input, I want to trace each design pattern image in a single stroke, as shown in the GIFs (the GIF animations are just examples of how the patterns are drawn in a single stroke). Once I have the x and y coordinates of the image in single-stroke fashion (an Eulerian path), I will feed those coordinates to my program to trace them.
Things to note in the animation:
1) Basically it is an undirected graph (the nodes being the vertices of the shapes, the edges, where they exist, being the strokes between 2 vertices). (Eulerian path)
Here are the 15 unique shapes I used to build the patterns:
I have more than 400 patterns (3 patterns are shown below), and so far I have not been able to find a generic solution for this. I have manually obtained the x/y coordinates of the patterns and placed them in sequence, but that is not at all scalable.
How to trace the patterns such that each node is visited only once? (A sketch follows the patterns below.)
1st kind of pattern and the way it should be drawn:
2nd kind of pattern and the way it should be drawn:
3rd kind of pattern and the way it should be drawn:
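For reference, here is a minimal sketch of the graph check I am after, assuming a pattern has already been reduced to vertex coordinates and strokes between them (the sample strokes below are hypothetical, not one of my actual patterns):

import networkx as nx

# Hypothetical pattern: vertices are (x, y) points, strokes are edges.
strokes = [((0, 0), (1, 1)), ((1, 1), (2, 0)), ((2, 0), (0, 0)),
           ((0, 0), (1, -1)), ((1, -1), (2, 0))]

G = nx.Graph()
G.add_edges_from(strokes)

if nx.has_eulerian_path(G):
    # Each edge is used exactly once; the visited vertices give the
    # x/y coordinates to feed to the tracing program, in stroke order.
    edges = list(nx.eulerian_path(G))
    coordinates = [edges[0][0]] + [v for _, v in edges]
    print(coordinates)
else:
    print("This pattern cannot be drawn in a single stroke.")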
Perhaps you can look into the travelling salesman problem if you're still struggling with the above. TSP visits each city only once, and if in your case each node is a crossing for your stroke, then this might help.
Check here for the Python code to look at. I've checked, and the print output looks nice and structured. Well done cMinor!
Edit based on discussion: file 1, file 2, file 3.
Suppose we choose 20 topics in LDA, and then we choose 30 topics. My question is: will the 30-topic result contain those 20 topics and produce similar results?
Short answer: no. The way LDA works is that it uses a Gibbs sampler to obtain a Dirichlet distribution over document vectors. Allocations are then made on this sample, and hence they will always be different, both because of sampling randomness and allocation uncertainty, unless you define an explicit random seed and run with the same number of topics k. Take a look at the original paper, Blei et al. 2003, to see how k is defined.
UPDATE (with regard to the comment): Hierarchical LDA (hLDA) tries to solve the problem of retaining topics and subtopics by constructing levels of topics following the Chinese restaurant process. But it's still not perfect.
The way flat LDA works, however, is that it looks at documents rather than topics to produce further results. Say you get topic 0 (the first table in the restaurant) and all documents try to sit there, but there isn't really enough space, so you create another topic 1 where some docs feel more comfortable, and so on. You are right about how these tables get created.
But there is one big thing that's critical: topic 0 CHANGES when you create a new table/topic 1, because some documents have left the first table and taken their words (or the co-occurrence probabilities thereof) with them to the new table, and all the words in topic 0 get reshuffled given the new situation. The same happens when you create more tables/topics: all the previous ones are re-estimated too. Hence, you will never get the same 20 topics when rerunning with 30.
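To see this empirically, here is a minimal sketch using gensim (one common LDA implementation; the question doesn't name a library, and the toy corpus here is made up): even with a fixed seed, the k=30 run re-estimates everything rather than reusing the k=20 topics.

from gensim import corpora, models

# Toy corpus, just for illustration.
texts = [["graph", "path", "node"], ["topic", "model", "word"],
         ["graph", "edge", "node"], ["word", "topic", "model"]] * 50
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

lda20 = models.LdaModel(corpus, id2word=dictionary, num_topics=20, random_state=0)
lda30 = models.LdaModel(corpus, id2word=dictionary, num_topics=30, random_state=0)

# The word distributions of "the same" topic differ between the two runs.
print(lda20.show_topic(0))
print(lda30.show_topic(0))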
If you compare two sets of data (such as two files), the differences between these sets can be displayed in two columns or two panes, as WinMerge does.
But are there any visual paradigms to display the differences between multiple data sets?
Update
The starting point of my question was the assumption that displaying the differences between 2 files is relatively easy (I mentioned WinMerge), whereas comparing 3 or more text files turns out to be more complicated, as there will be more and more differences between, say, different versions of a document created over time.
How would you highlight parts of the file that are the same in 2 versions, but different from other versions?
The data sets I have in mind are objects (A, B, C, ...) which may or may not exist and have properties (a, b, c, ...) which may be set or not set.
Example:
Set 1: A(a, b, c), B(b, c), C(c)
Set 2: A(a, b, c), B(b), C(c)
Set 3: A(a, b), B(b)
If you compare 2 sets, e.g. 1 and 2, the difference would be in B(c). Comparing sets 2 and 3 results in the difference A(c) and C().
If you compare all 3 sets, you end up with 3 pairwise comparisons (n * (n - 1) / 2).
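To make the example concrete, here is a minimal sketch of the pairwise comparison, with objects represented as dicts mapping names to property sets (it reproduces B(c) for sets 1 vs 2, and A(c) plus C() for sets 2 vs 3):

from itertools import combinations

sets = {
    1: {"A": {"a", "b", "c"}, "B": {"b", "c"}, "C": {"c"}},
    2: {"A": {"a", "b", "c"}, "B": {"b"}, "C": {"c"}},
    3: {"A": {"a", "b"}, "B": {"b"}},
}

for i, j in combinations(sets, 2):              # n * (n - 1) / 2 pairs
    diff = {}
    for obj in sets[i].keys() | sets[j].keys():
        props_i = sets[i].get(obj, set())
        props_j = sets[j].get(obj, set())
        delta = props_i ^ props_j               # properties in exactly one
        if delta or (obj in sets[i]) != (obj in sets[j]):
            diff[obj] = delta
    print(f"sets {i} vs {j}: {diff}")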
I have a different view from some of those who provided answers, namely those saying that you need to further specify the problem. The abstraction level is about right: further specification would make the problem easier, but the solution less useful.
A couple of years ago, I saw a graphic on ProgrammableWeb. It compared the results of a search on Yahoo with the results of the same search on Google. There's a lot of information to convey: some results are in both sets, some in just one, and the common results have different positions in the respective engines' result lists, which somehow has to be shown.
I liked the graphic and reimplemented it in Matplotlib (a Python scientific plotting library). Below is an example using some random points, as well as the Python code I used to generate it:
import numpy as NP
from matplotlib import pyplot as PLT

# paired x-positions of each item in the two result lists
xvals = NP.array([(2, 3), (5, 7), (8, 6), (1.5, 1.8), (3.0, 3.8), (5.3, 5.2),
                  (3.7, 4.1), (2.9, 3.7), (8.4, 6.1), (7.1, 6.4)])
# each pair of points sits on the two horizontal baselines y=5 and y=3
yvals = NP.tile(NP.array([5, 3]), [10, 1])

fig = PLT.figure()
ax1 = fig.add_subplot(111)
# draw the two baselines, one per data set
x = NP.array([0, 10])
ax1.plot(x, [5, 5], "-", lw=3, color='b')
ax1.plot(x, [3, 3], "-", lw=3, color='b')
# connect each paired item across the two baselines
for a, b in zip(xvals, yvals):
    ax1.plot(a, b, '-o', ms=8, mfc='orange', color='g')
PLT.axis("off")
PLT.show()
This model has some interesting features: (i) it actually deals with 'similarity' on a per-item basis (the vertically-oriented line connecting the dots) rather than aggregate similarity; (ii) the degree of similarity between two data points is proportional to the angle of the line connecting them--90 degrees if they are equal, with a decreasing angle as the difference increases; this is very intuitive; (iii) cases in which a point in one data set is not present in the second data set are easy to show--a point will appear on one of the two lines but without a line connecting it to a point on the other line.
This model works well for comparing search results because each result has a 'score' (its index, or order, in the results list). For other types of data, you might have to assign a score to each data point; a similarity metric might work, I suppose (in a sense, that's actually what the search-result order is: a distance from the top of the list).
Since there has been so much work on displaying a diff of two files, you might start by expressing your 'multiple data sets' in an appropriate text format, then using whatever tool you like to show a diff between those text formats.
But you should tell us more about your data sets!
I experimented a bit, and implemented two displays:
Matrix
Timeline
I agree with Peter, you should specify what type your data is and what you wish to bring out in the comparison.
Depending on the nature of the data/comparison, you can consider different visualisations. Is your data ordered or unordered? How many things are you comparing, i.e., is it a fine-grained or a gross comparison?
Examples:
Visualizing a comparison of unordered data could be as simple as plotting the two histograms of your sets (i.e. their distributions).
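For example, a minimal sketch with matplotlib (the sample data here is made up):

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
set1 = rng.normal(0.0, 1.0, 1000)   # made-up sample data
set2 = rng.normal(0.5, 1.2, 1000)

# overlaid, semi-transparent histograms make the distributions comparable
bins = np.linspace(-4, 5, 40)
plt.hist(set1, bins=bins, alpha=0.5, label="set 1")
plt.hist(set2, bins=bins, alpha=0.5, label="set 2")
plt.legend()
plt.show()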
On the other hand, comparing a huge ordered dataset like DNA can be done innovatively.
Also, check out visual complexity; it's a great resource for interesting visualizations.