I have a tree structure where each node has a quantity property.
I want to build up sums: add the quantities of the child nodes, then multiply the sum by the quantity of the parent. This calculated quantity is then used by the next parent up when it collects its children's quantities. For example, if a parent with quantity 2 has children with quantities 3 and 4, its calculated quantity is 2 * (3 + 4) = 14, and 14 is what its own parent sees when collecting child quantities.
I cannot modify a property within the node, because the structure is used for calculating quantities in different sections of the tree.
I attached virtual nodes to the existing tree containing copies of the quantities. The problem is: I cannot execute matches on the virtual nodes and their relationships. Is there a way to use a mixture of "real" nodes and virtual nodes as a database and execute Cypher queries on them?
I am open to alternative solutions...
Thanks
I am using Google DataFlow Java SDK 2.2.0. Use case as follows:
PCollection pEmployees: employees and the corresponding department name. May contain up to 10 million elements.
PCollection pDepartments: department name and the number of elements to be published per department. Will contain a few hundred elements.
Task: collect elements from pEmployees, taking from each department the number of elements specified for it in pDepartments. This will be a big collection (up to a few hundred thousand elements, or a few GB).
We cannot use the Top transform here, as it would work on pEmployees one key at a time, whereas we have multiple departments, and those arrive in a PCollection as well. We could assign a row number to each of the elements from pEmployees, join it with pDepartments, and discard the records where row_number > the target number from pDepartments, but this would require a global ranking.
Question: how can we assign ranks/row numbers to the elements of a PCollection?
This is very close to the Sample transform, but not quite, because it applies the same threshold to all keys when used as .perKey(). Generally, Beam currently doesn't support per-key combines with different combine function parameters.
I'd recommend emulating it by using CoGroupByKey to join pEmployees and pDepartments and obtain tuples (CoGbkResult) containing the department name, N = the number of elements to publish, and all employees in that department. Then simply iterate through the employees, emit the first N, and discard the rest.
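Roughly, that could look like the sketch below. It assumes pEmployees is a PCollection<KV<String, String>> keyed by department name with an employee id as the value, and pDepartments is a PCollection<KV<String, Integer>>; the tag names and element shapes are illustrative, not from your pipeline.

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.join.CoGbkResult;
import org.apache.beam.sdk.transforms.join.CoGroupByKey;
import org.apache.beam.sdk.transforms.join.KeyedPCollectionTuple;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TupleTag;

// Tags to tell the two inputs apart in the join result.
final TupleTag<String> employeeTag = new TupleTag<String>() {};
final TupleTag<Integer> limitTag = new TupleTag<Integer>() {};

// Join both collections on the department name.
PCollection<KV<String, CoGbkResult>> joined =
    KeyedPCollectionTuple.of(employeeTag, pEmployees)
        .and(limitTag, pDepartments)
        .apply(CoGroupByKey.create());

// Emit the first N employees per department, discard the rest.
PCollection<KV<String, String>> selected = joined.apply(
    ParDo.of(new DoFn<KV<String, CoGbkResult>, KV<String, String>>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        String department = c.element().getKey();
        CoGbkResult result = c.element().getValue();
        int limit = result.getOnly(limitTag);  // one limit per department
        int emitted = 0;
        for (String employee : result.getAll(employeeTag)) {
          if (emitted++ >= limit) {
            break;
          }
          c.output(KV.of(department, employee));
        }
      }
    }));

One caveat: after the group-by, all elements for one key are processed by a single worker, so an unusually large department means more work on that one worker; with a few hundred departments over 10 million employees this should be manageable.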
I have a tree-like structure and I'm trying to write a Cypher query that will replace a parent node with its child if the parent node does not have a certain relation.
For example, the query MATCH (c)-[:CHILD_OF*]->(p {id:"123"}) RETURN c returns a structure like so (we don't care about what the other nodes are; the structure is the only thing that needs to be preserved):
()<-(A)
()<-()<-(B)<-()<-(C)
()<-(D)<-(E)<-()<-(F)
\-(G)<-()<-(H)
How could I get the query to ignore all nodes without a certain property but keep the same structure, like so:
(A)
(B)<-(C)
(D)<-(E)<-(F)
(G)<-(H)
You should take a look at the procedures for creating virtual nodes and relationships in APOC Procedures.
These will allow you to create virtual relationships that will not be saved to the graph, but will be present and viewable in your query.
The tricky part will be creating those new virtual relationships. You'll likely be filtering the nodes in each path down to just the ones you're interested in. At that point you may need to use apoc.coll.pairsMin() to get each adjacent pair of nodes in the collection onto a row, so you can create the virtual relationships between them.
After all the virtual relationships are created (in the same Cypher query), match from the root node using those virtual relationships, and you should see the graph you want.
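Roughly, that could look like this sketch (assuming the tree uses :CHILD_OF relationships as in your question, with `keep` standing in for whatever property you're filtering on):

MATCH path = (c)-[:CHILD_OF*]->(p {id: "123"})
// keep only the interesting nodes, in path order
WITH [n IN nodes(path) WHERE exists(n.keep)] AS kept
// pairsMin turns [A, B, C] into [[A, B], [B, C]]
UNWIND apoc.coll.pairsMin(kept) AS pair
WITH DISTINCT pair[0] AS child, pair[1] AS parent
CALL apoc.create.vRelationship(child, 'CHILD_OF', {}, parent) YIELD rel
RETURN child, rel, parent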
My business requirement says I need to add an arbitrary number of well-defined (AKA not dynamic, not unknown) attributes to certain types of nodes. I am pretty sure that while there could be 30 or 40 different attributes, a node will probably have no more than 4 or 5 of them. Of course there will be corner cases...
In this context, I am generically using 'attribute' as a tag wanted by the business, and not in the Neo4J sense.
I'll be expected to report on which nodes have which attributes. For example, I might have to report on which nodes have the "detention", "suspension", or "double secret probation" attributes.
One way is to simply have an array of the applicable attributes on each entity, but then each query would require a scan of all nodes. Or I could create an explicit property for each attribute on each node; those could at least be indexed. I'm not seriously considering either of these approaches.
Another way is to implement each attribute as a singleton Neo node, and allow many (tens of thousands?) of other nodes to relate to these nodes. This implementation would have 10,000 nodes but 40,000 relationships.
Finally, the attribute nodes could be created and used by specific entity nodes on an as-needed basis. In this case, if 10,000 entities had an average of 4 attributes each, I'd have a total of 50,000 nodes.
As I type this, I realize that in the as-needed case I still have 40,000 relationships; the 'truth' of the situation did not change.
Is there a reason to avoid the 'singleton' implementation? I could put timestamps on the relationships. But those wouldn't be indexed...
For your simple use case, I'd suggest an approach you didn't list -- which is to use a node label for each "attribute".
Nodes can have multiple labels, and neo4j can quickly iterate through all the nodes with the same label -- making it very quick and easy to find all the nodes with a specific label.
For example:
MATCH (n:Detention)
RETURN n;
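Adding or removing an attribute is then just a label operation; the :Student label and id property below are made up for illustration:

MATCH (n:Student {id: 42})
SET n:Detention:DoubleSecretProbation
REMOVE n:Suspension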
I would like to represent millions of products that belong to one or more of dozens of categories.
I'm contemplating a few approaches:
1. Indexed Category Nodes - Create nodes for each category and create an auto_index on category_name. Then create "isCategoryOf" relationships between each of my product nodes and their respective category nodes.
2. Individual Category Relationship Types - Create respective "isCategoryGames", "isCategoryFood", "isCategoryLifestyle", etc... relationships between products and the root node.
3. Storing Categories as a Property of One Relationship Type - Create "isCategory" relationships between product nodes and the root node and store their respective category types in a property of the relationship, e.g. relationship "isCategory" {categoryName: "food"}.
Which of these approaches is the most efficient and/or scalable? Are there limits or performance implications to having almost every node in the database connect to the root node?
If you attach millions of nodes to the root node, you make the root node a supernode. This can be problematic.
The general concept of Option 1 shows promise. If you were modeling food, you might have nodes with a name property like "Nuts", "Dairy Products", "Desserts", or "Produce" and a type property of "Category". You would then have other nodes with a name property like "Cherry Cheesecake" with outgoing "category" edges to the "Dairy Products" and "Desserts" nodes.
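For example, a small instance of that model could be created like this (the type property of "Product" is my own addition, just to keep the example symmetrical):

CREATE (dairy {name: "Dairy Products", type: "Category"}),
       (desserts {name: "Desserts", type: "Category"}),
       (cheesecake {name: "Cherry Cheesecake", type: "Product"}),
       (cheesecake)-[:category]->(dairy),
       (cheesecake)-[:category]->(desserts)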
Whether this structure is going to be performant enough depends on your queries. If you have broad categories like 'food', you could end up with a supernode, and you'll take a linear scan through the connected nodes to find a node with a given property. A linear scan over thousands of things might be fast enough for your purposes, but a scan over 1M things might not.
To find out, I would recommend creating a quick prototype where you generate some random product and category nodes, then connect each product node to a random number of category nodes. Indexing the product and category nodes by name will help you find individual products or categories, but it's the traversals that will cause performance problems if you hit supernodes. Experiment with a few of the Gremlin traversals or Cypher queries that you think might be most problematic. Try scaling the number of nodes up through 1K, 10K, 100K, and 1M with a proportionate number of edges. How do your traversal / query times change?
I modelled a tree structure using the Neo4j graph database. All nodes represent a category with a characterising name, so I have to traverse my tree very often from the root to a specific node/category. Which node that is depends on a list coming in as input; it contains strings representing the names of the categories on the path from the root to the target node.
I wonder if it would be effective to store these names as the types of the edges instead of as a name property on the individual nodes.
My thought was that Neo4j would then no longer have to check the name property of every child node each time it goes a step deeper in the tree; instead, it could look up the name in the map that contains the outgoing edges.
What do you think?
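To make the comparison concrete, the two alternatives would be traversed like this (the :CHILD relationship type, the root anchor, and the category names are placeholders):

// names as node properties: each step inspects the children's name properties
MATCH ({name: "Root"})-[:CHILD]->({name: "Electronics"})-[:CHILD]->(target {name: "Phones"})
RETURN target

// names as relationship types: each step is a lookup of a typed outgoing edge
MATCH ({name: "Root"})-[:Electronics]->()-[:Phones]->(target)
RETURN target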
Sounds sensible. How many different names do you have? If they are just categories, there shouldn't be millions of them.
Did you load your data into the graph and run a performance comparison between the two approaches? Is this a performance-critical part of your graph?