A couple of years ago I had occasion to work a bit with the Ruby "nested_set" gem. With some helpful explanation from the chief technologist where I worked, I was able to appreciate how it works, with its columns:
parent
left
right
I've had no occasion to revisit it in the past couple of years, since I don't work with Rails regularly, but now I'd like to implement it myself on another platform to structure some data as a tree. So I am looking for a cogent explanation of how it works, whether as a link or links, or as a fleshed-out answer.
Thanks in advance
Nested sets are similar to adjacency lists, but they support additional operations that can't easily be performed when parents and children only know about their immediate relations via a parent column.
For instance, if we were given the following data model:
Graph            Table
                 node, parent
    A            A,
   / \           B, A
  B   E          C, B
 / \             D, B
C   D            E, A
We could easily retrieve node A's immediate children, but where it gets tricky is if we wanted to determine whether node C is in node A's hierarchy, or if we wanted to retrieve node A's entire tree rather than just its immediate children. It's tricky because node C is not an immediate child of node A, and without knowing the depth of the tree, a recursive query (which isn't an option in some databases), or some kind of SQL voodoo, we're pretty much out of luck. Another problematic case is wanting to destroy or update every record in the node A tree.
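To make that concrete, with a plain adjacency list (a sketch, assuming a hypothetical nodes table with node and parent columns):

SELECT * FROM nodes WHERE parent = 'A';   -- immediate children: one easy query

-- the whole subtree needs a recursive CTE, which some databases don't support
WITH RECURSIVE subtree AS (
  SELECT * FROM nodes WHERE node = 'A'
  UNION ALL
  SELECT n.* FROM nodes n JOIN subtree s ON n.parent = s.node
)
SELECT * FROM subtree;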
Nested sets introduce "left" and "right" attributes in addition to our initial parent attribute. Each node is now numbered twice, according to when it is visited during a depth-first traversal of the tree, and the numbers are maintained as records are inserted or moved. Using the previous example, a nested set would look something like this:
+-----------------------------------------+     id, text, lft, rgt
| A                                       |     1,  A,    1,   10
|   +-----------------------+   +-----+   |     2,  B,    2,   7
|   | B                     |   | E   |   |     3,  C,    3,   4
|   |                       |   |     |   |     4,  D,    5,   6
|   |   +-----+   +-----+   |   +-----+   |     5,  E,    8,   9
|   |   | C   |   | D   |   |             |
|   |   |     |   |     |   |             |
|   |   +-----+   +-----+   |             |
|   +-----------------------+             |
+-----------------------------------------+
1   2   3     4   5     6   7   8     9   10
With the above example we can see that node A has left and right values of 1 and 10 respectively, so anything within its hierarchy will have left and right values somewhere between those two. With that said, querying for node A's entire tree now becomes trivial:
SELECT c.id, c.text, c.lft, c.rgt FROM nodes c, nodes p WHERE p.lft < c.lft AND p.rgt > c.rgt AND p.id = 1;
Gives us:
id, text, lft, rgt
2, B, 2, 7
3, C, 3, 4
4, D, 5, 6
5, E, 8, 9
See Recursive data structures with rails for source. As discussed in the question's comments, there might be better/more efficient solutions depending on your requirements - the linked article covers this in more detail.
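The same lft/rgt range check covers the other awkward cases mentioned at the start of this answer. For example, testing whether node C sits anywhere inside node A's hierarchy (a sketch against the same hypothetical nodes table, not taken from the linked article):

SELECT 1
FROM nodes p
JOIN nodes c ON c.lft > p.lft AND c.rgt < p.rgt
WHERE p.id = 1   -- A
  AND c.id = 3;  -- C: a row comes back only if C falls inside A's lft/rgt range

Deleting or updating node A's entire subtree works the same way: match every row whose lft and rgt fall between A's two values.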
I am working with a spreadsheet where I store the books I read. The format is as follows:
A | B | C | D | E | F
year | book | author | mark | language | country of the author
With entries like:
A | B | C | D | E | F
-------------------------------------------------------------
2004 | Hamlet | Shakespeare | 8 | ES | UK
2005 | Crimen y punishment | Dostoevsky | 9 | CAT | Russia
2007 | El mundo es ansí | Baroja | 8 | ES | Spain
2011 | Dersu Uzala | Arsenyev | 8 | EN | Russia
2015 | Brothers Karamazov | Dostoevsky | 8 | ES | Russia
2019 | ... Shanti Andía | Baroja | 7 | ES | Spain
I have several pivot tables to get different data, such as top countries, top books, etc. In one of them I want to group by authors and order by number of books I have read from each one of them.
So I defined:
ROWS
author (column C) with
order: Desc for COUNT of author
VALUES
author
summation by: COUNT
show as Default
mark
summation by: AVERAGE
show as Default
This way, the data above is displayed like this:
author | COUNT of author | AVERAGE of mark
-------------------------------------------------------------
Baroja | 2 | 7,5
Dostoevsky | 2 | 8,5
Shakespeare | 1 | 8
Arsenyev | 1 | 8
It is fine, since it puts the most-read authors on top. However, I would also like to order by AVERAGE of mark, so that when COUNT of author matches, it uses AVERAGE of mark to break the tie and puts on top the author with the better average on their books.
On my sample data, Dostoevsky would go above Baroja (8,5 > 7,5).
I have been looking at different options, but I could not find any that doesn't involve adding an extra column to the pivot table.
How can I use a second criterion to break the ties when the first one gives the same value?
You can achieve a customized sort order on a pivot table without any extra columns in the source range. However... you'd definitely need an extra field added to the pivot.
In the Pivot table editor go to Values and add a Calculated Field.
Use any formula that describes the sort order you want, e.g. multiply the count by 100 so it dominates as the first criterion:
=COUNTA(author) * 100 + AVERAGE(mark)
Do note that it is important to set Summarize by to Custom for this calculated field.
Now, just add this new calculated field as your row's Sort by field, and you're done!
Notice though, you do get an extra column added to the pivot.
Of course, you could hide it.
Translated from my answer to the cross-posted question on es.SO.
try:
=QUERY(A2:F,
"select C,count(C),avg(D)
where A is not null
group by C
order by count(C) desc
label C'author',count(C)'COUNT of author',avg(D)'AVERAGE of mark'")
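If you also want the average mark to break ties between authors with the same count (as asked above), the order by clause can take a second sort key. An untested variant of the same query:
=QUERY(A2:F,
 "select C,count(C),avg(D)
 where A is not null
 group by C
 order by count(C) desc, avg(D) desc
 label C'author',count(C)'COUNT of author',avg(D)'AVERAGE of mark'")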
I want to write a Neo4j query that finds a 'knitting' pattern. By that I basically mean a set of four nodes with three special edges between them. Continuing from these four, there can be a next set of four nodes connected by three other edges, like this (in some Cypher-like syntax, with the vertical edges drawn on their own lines):
(n1)-[:e1]-(n2)-[:e2]-(n3)-[:e3]-(n4)
| | | |
[:ew] [:ex] [:ey] [:ez]
| | | |
(n5)-[:e1]-(n6)-[:e2]-(n7)-[:e3]-(n8)
| | | |
[:ew] [:ex] [:ey] [:ez]
| | | |
(n?)-[:e3]-(n!)-[:e3]-(n&)-[:e3]-(n$)
. . . .
: : : :
I can write a query for exactly 8, 12, 16, ... nodes that are connected in this way, but I would like to write it more generally, to get the longest connected components that are knitted like this.
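For reference, the fixed query for the first two rows (8 nodes) looks roughly like this (just a sketch, with the relationship types e1, e2, e3 and ew, ex, ey, ez taken from the diagram above):

MATCH (n1)-[:e1]-(n2)-[:e2]-(n3)-[:e3]-(n4),
      (n5)-[:e1]-(n6)-[:e2]-(n7)-[:e3]-(n8),
      (n1)-[:ew]-(n5), (n2)-[:ex]-(n6),
      (n3)-[:ey]-(n7), (n4)-[:ez]-(n8)
RETURN n1, n2, n3, n4, n5, n6, n7, n8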
Can you give me a hint how to go about that? I'm totally new to Neo4j.
Problem: Compute the natural join of R and S. Which of the following tuples is in the result? Assume each tuple has schema (A,B,C,D).
Relation R
| A | C |
|---|---|
| 3 | 3 |
| 6 | 4 |
| 2 | 3 |
| 3 | 5 |
| 7 | 1 |
Relation S
| B | C | D |
|---|---|---|
| 5 | 1 | 6 |
| 1 | 5 | 8 |
| 4 | 3 | 9 |
I'm not quite sure what it means by "assume each tuple has a schema of A,B,C,D". Does this mean the R relation has a scheme of ABCD although it only lists A and C? I should assume there's also B and D but columns B and D are blank?
Operating under that assumption, I got the answer wrong. The explanation says there's no (7,5) in R which there clearly is under column A. Could someone explain to me what I'm doing wrong or if I'm missing something? Thank you!
The answer feedback is misleading and wrong; that would be the feedback if you had chosen (7,1,5,8).
Your answer is right.
For thoroughness: in a natural join you connect tuples on their common attributes; in this case C is the attribute in common.
Your return tuples are:
R S
A,C B,C,D A,B,C,D
(7,1) & (5,1,6) = (7,5,1,6)
(3,5) & (1,5,8) = (3,1,5,8)
(2,3) & (4,3,9) = (2,4,3,9)
(3,3) & (4,3,9) = (3,4,3,9) --Your answer, correct
I even found a Stanford doc defining a natural join, just in case they lived in a different universe than the rest of us, but they don't. It's just a bug in the quiz.
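For concreteness, the same join written out in SQL (just a sketch, assuming tables literally named R(A, C) and S(B, C, D)):

-- the natural join pairs rows that agree on the shared column C
SELECT R.A, S.B, R.C, S.D
FROM R
JOIN S ON R.C = S.C;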
The question doesn't say R has that scheme. It says the natural join of R & S has that scheme.
(There are many variations on what a relation is, what relational operators are available, how they work & what their symbols are. They are telling you to expect that the schema for the join of those two relations has columns A, B, C & D. You should already know that from the definitions in the course, but since they give it nobody should get that part wrong.)
You seem to be saying that your choice of a row in the natural join was 2. That's correct. The explanation says that a wrong choice can't be right because tuple (7,5) is not in R. They do not mean that (7,5) is a list of values "under column A". But that feedback is for choice 3, not choice 2. So the answer checking seems to have a bug. Let them know.
I have a graph with about 1.2 million nodes, and roughly 3.85 million relationships between them. I need to find the top x weighted edges - that is to say, the top x tuples of a, b, n where a and b are unique pairs of vertices and n is the number of connections between them. I am currently getting this data via the following Cypher query:
MATCH (n)-[r]->(x)
WITH n, x, count(r) as weight
ORDER BY weight DESC
LIMIT 50
RETURN n.screen_name, x.screen_name, weight;
This query takes about 25 seconds to run, which is far too slow for my needs. I ran the query with the profiler, which returned the following:
ColumnFilter(0)
|
+Extract
|
+ColumnFilter(1)
|
+Top
|
+EagerAggregation
|
+TraversalMatcher
+------------------+---------+---------+-------------+------------------------------------------------------------------------------------------------+
| Operator | Rows | DbHits | Identifiers | Other |
+------------------+---------+---------+-------------+------------------------------------------------------------------------------------------------+
| ColumnFilter(0) | 50 | 0 | | keep columns n.twitter, x.twitter, weight |
| Extract | 50 | 200 | | n.twitter, x.twitter |
| ColumnFilter(1) | 50 | 0 | | keep columns n, x, weight |
| Top | 50 | 0 | | { AUTOINT0}; Cached( INTERNAL_AGGREGATE01a74d75-74df-42f8-adc9-9a58163257d4 of type Integer) |
| EagerAggregation | 3292734 | 0 | | n, x |
| TraversalMatcher | 3843717 | 6245164 | | x, r, x |
+------------------+---------+---------+-------------+------------------------------------------------------------------------------------------------+
My questions, then, are these:
1. Are my expectations way off, and is this the sort of thing that's just going to be slow? It's functionally a map/reduce problem, but this isn't that large of a data set - and it's just test data. The real thing will have a lot more (but be filterable by relationship properties; I'm working on a representative sample).
2. What the heck can I do to make this run faster? I've considered using a start statement but that doesn't seem to help. Actually, it seems to make it worse.
3. What don't I know here, and where can I go to find out that I don't know it?
Thanks,
Chris
You're hitting the database 6,245,164 times in the first statement: MATCH (n)-[r]->(x).
It looks like what you're attempting is a graph-global query -- that is, you're running the query over the entire graph. By doing this you're not taking advantage of indexing that would reduce the number of hits to the database.
In order to do this at the level of performance you're needing, an unmanaged extension may be required.
http://docs.neo4j.org/chunked/stable/server-unmanaged-extensions.html
Also, a good community resource to learn about unmanaged extensions: http://www.maxdemarzi.com/
The approach here would be to create a REST API method that extends the Neo4j server. This requires some experience programming with Java.
Alternatively you may be able to run an iterative Cypher query that updates an aggregate count on each [r] between (n) and (x).
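To make that last suggestion concrete, here is a rough sketch of the idea; it assumes a newer Cypher version with MERGE, and the :AGGREGATED relationship type and weight property are made-up names:

// pre-compute one aggregated relationship per distinct pair, carrying the count
MATCH (n)-[r]->(x)
WITH n, x, count(r) AS weight
MERGE (n)-[agg:AGGREGATED]->(x)
SET agg.weight = weight;

// the top-50 query then only has to sort the pre-computed weights
MATCH (n)-[agg:AGGREGATED]->(x)
RETURN n.screen_name, x.screen_name, agg.weight AS weight
ORDER BY weight DESC
LIMIT 50;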
A     | B   | C     | D      | E      | F | G
name  | num | quant | item   | quant2 |   |
car   | 5   | 100   |        |        |   |
      |     |       | wheel  | 4      |   |
      |     |       | axel   | 2      |   |
      |     |       | engine | 1      |   |
truck | 2   | 20    |        |        |   |
      |     |       | wheel  | 6      |   |
      |     |       | bed    | 1      |   |
      |     |       | axel   | 2      |   |
I need a formula which will do B*C*E. Since the table looks like the above, it needs to be something like
=b$2*c$2*e3 and then dragged, and then for the next set =b$6*c$6*e7 and dragged, etc., but I wasn't sure how to get that "ceiling" sort of behaviour: if b5 is empty, look at each row above it until it finds one that isn't empty.
I am trying to use this to get total quantity of parts per car, truck etc.... and then group by part.
I don't have a set of DB tables to do this, just a spreadsheet.
I had to add some additional information to resolve this.
I was thinking there would be a way to write a Google Apps Script that would do this and update the file, but I couldn't seem to find one.
I first summed each group item:
=b$3*e4
and dragged for that grouping.
Then I went to an empty area of the sheet and wrote up a query:
=query(D:F, "select D,sum(F) group by D")
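For the "look upward until a filled cell" part of the question, something like this might also work directly in column F, avoiding the per-group edits (a sketch only, not tested against this exact sheet; it assumes the data starts on row 3 as in =b$3*e4, and uses LOOKUP to grab the last non-empty value above in column B):

=IF(D4="", "", E4 * ARRAYFORMULA(LOOKUP(2, 1/($B$3:B4<>""), $B$3:B4)))

Entered in row 4 and dragged down, the $B$3:B4 range grows with each row, so each item row picks up the num value of the nearest car/truck row above it.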