When I use the Neo4j REST API, there seems to be a bug:
A node was indexed by some index. After I deleted some properties of that node, unindexed it, and then indexed it again, those properties came back.
This happens once in a while, not every time.
I'm sure those properties were deleted: I verified by querying that node in the Cypher console after the delete operation.
Also, some posts reported this without a satisfying answer: the number of nodes/relationships/properties reported by the Neo4j webadmin looks crazy. I have 5 nodes (including id 0), but it shows 932 nodes and 4213 properties. This happens every time. Some people say it's the highest ID in use. I don't think it makes any sense semantically to show the highest ID under the "nodes" label. In addition, the highest ID for my nodes is 466, not 932.
I assume you're judging the properties off the count, instead of off a query?
Neo4j's web console uses metadata to display information like node count, property count, and relationship count. This metadata is not always up to date, but it's much faster to use it than to scan the entire graph database for this information every time.
Neo4j will adjust these counts every now and then, but it doesn't defragment its information all the time.
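If you need exact figures rather than the dashboard's cached metadata, you can count directly with Cypher (this scans the store, so it is slower on large graphs — queries shown in newer Cypher syntax):

```cypher
// exact counts, bypassing the dashboard's cached metadata
MATCH (n) RETURN count(n) AS nodes;
MATCH ()-[r]->() RETURN count(r) AS relationships;
```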
Related
I have a 50K-node graph with 10 properties per node. Each node is of the same type, just with different values. Each of the properties is indexed, and I have increased the heap and page cache memory sizes for the database. However, using the browser console, creating the nodes takes 6 minutes!
A query for all the properties also takes a very long time (~2 minutes) to appear in the browser console, but when the results do appear, the bottom of the browser says that the result of 50K node properties took only 2500 ms.
How do I improve the performance of importing/querying tens of thousands of unique instances of a single node type, with 10 properties each and no relationships?
It takes time to update 10 different indexes for each node that you create. Do you really have use cases that require an index for every single property? If not, get rid of the indexes you do not need. Remember, indexes can speed up finding the first node(s) to initiate a query, but they do not help at all when traversing paths through a graph.
If you really need all 10 indexes, then to speed up the importing step, you can: drop all the indexes, import all 50K nodes, and then create each index one at a time (which will take some time for each index). The overall time will be about the same, but the import itself should be much faster.
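That drop-import-recreate sequence can be sketched like this, assuming a hypothetical :Item label with properties prop1 through prop10 (repeat the DROP/CREATE statements for each property):

```cypher
// 1) drop the existing indexes before the import
DROP INDEX ON :Item(prop1);

// 2) import the 50K nodes with your existing import mechanism

// 3) recreate the indexes one at a time; each build scans the data once
CREATE INDEX ON :Item(prop1);
```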
It takes the neo4j browser a very long time to generate and display the visualization for a very large result (e.g., 10's of thousands of nodes). The browser is not intended for viewing that much data at one time.
1) Check that you are running a recent version of Neo4j. 3+ has optimised the way that properties are stored and indexed.
2) Check how you're running the query. Maybe your query is not optimised or is problematic in some way. Note in particular that each MATCH generates a 'row'. Multiple MATCH clauses will yield the Cartesian product of all matched sets, which could be problematic with large amounts of data.
3) Check that each of these properties needs to be attached to a node. Neo4j is optimised for searching for relationships, not for properties.
Consider turning nodes that look like this:
(:Train {
maxSpeedInKPH: 350,
fuelType: 'Diesel',
numberOfEngines: 3
})
to
(:Train)
-[:USES_FUEL_TYPE]->(:Fuel {type: 'Diesel'}),
-[:HAS_MAX_SPEED]->(:MaxSpeed {value: 350, unit: 'km/h'}),
-[:HAS_ENGINE]->(:Engine),
-[:HAS_ENGINE]->(:Engine),
-[:HAS_ENGINE]->(:Engine)
There is generally a benefit to spinning properties out into relationships, even if the uniqueness is low. For example if you have a property which has a unique value per node, generally keep that in the node. But if your 50000 nodes have less, say, 25000 unique values in that property, it would probably still be beneficial to spin them out into relationships. This is absolutely the case with integer-type properties, where you'll also be able to add additional "bucket relationships" to provide a form of indexing. In the example above, the max speed was 350. After spinning the property out into a relationship, you could also put an additional relationship of the type [:HAS_MAX_SPEED_ABOVE]-> 300. This would complicate your querying, but should make it faster.
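As a sketch of the "bucket relationship" idea, assuming the :Train/:MaxSpeed model from the example above and a hypothetical :SpeedBucket node:

```cypher
// link every train whose max speed exceeds 300 to a shared bucket node
MERGE (b:SpeedBucket {threshold: 300})
WITH b
MATCH (t:Train)-[:HAS_MAX_SPEED]->(s:MaxSpeed)
WHERE s.value > 300
MERGE (t)-[:HAS_MAX_SPEED_ABOVE]->(b)
```

A query for fast trains can then start from the bucket node and traverse outward, instead of filtering the value property on every :MaxSpeed node.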
4) If none of the above apply to you, cannot be implemented or do not help, consider switching to a more traditional relational database like SQL. SQL would be a perfect candidate for your use case, i.e. 50k different nodes (rows) with only 10 different properties (columns) and no relationships (joins).
How do you quickly get the maximum (or minimum) value for a property of all instances of a relationship? You can assume the machine I'm running this on is well within the recommended specs for CPU and memory for a graph of this size, and that the heap size is set accordingly.
Facts:
Using Neo4j v2.2.3
Only have access to modify the graph via the Cypher query language, which I'm hitting via PHP or in the web interface. I'd love to avoid any solution that requires Java coding.
I've got a relationship, call it likes, that has a single integer property, id.
There's about 100 million of these relationships and growing
Every day I grab new likes from a MySQL table to add to the graph in Neo4j.
The relationship property id is actually the primary key (auto incrementing integer) from the raw MySQL table.
I only want to add new likes so before querying MySQL for the new entries I want to get the max id from the likes, so I can use it in my SQL query as SELECT * FROM likes_table WHERE id > max_neo4j_like_property_id
How can I accomplish getting the max id property from Neo4j in an optimal way? Please indicate the CREATE statement needed for any index, as well as the query you'd use to get the final result.
I've tried creating an index as follows:
CREATE INDEX ON :likes(id);
After the index is online I've tried:
MATCH ()-[r:likes]-() RETURN r.id ORDER BY r.id DESC LIMIT 1
as well as:
MATCH ()-[r:likes]->() RETURN MAX(r.id)
They work but take freaking forever, as the explain plans for both indicate that no indexes are being used.
UPDATE: Holy $?##$?!!!! It looks like the new schema indexes aren't functional for relationships even though you can create them and show them with :schema. It also looks as if there's no way with cypher directly to create Legacy Indexes which look like they might solve this issue.
If you need to query relationship properties, it is generally a sign of a model issue.
The need for this query suggests that you would be better off extracting those properties into a node, which you would then be able to query faster.
I won't say it is 100% the case, but 99% of the people seen so far with the same problem have demonstrated this modeling concern.
What is your model right now?
Also, you don't use labels at all in your query; likes have a context bound to the nodes.
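One way to apply that advice here, sketched with hypothetical :User/:Post labels and without claiming it is the only possible remodeling (running this over 100 million existing relationships would need batching; the sketch only shows the shape):

```cypher
// schema indexes work on labeled nodes, so mirror the id onto a :Like node
CREATE INDEX ON :Like(id);

MATCH (u:User)-[r:likes]->(p:Post)
MERGE (l:Like {id: r.id})
MERGE (u)-[:MADE]->(l)
MERGE (l)-[:ON]->(p);

// the max id is then a label-scoped query
MATCH (l:Like) RETURN max(l.id);
```

Whether the planner actually serves max() from the index depends on the Neo4j version, but a label scan over :Like nodes is already far cheaper than scanning every relationship.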
I have a full DB (a graph clustered by country) that contains ALL countries, and I have various single-country test DBs that contain exactly the same schema but only one given country.
My query's "start" node is identified via a match on a given value for a property, e.g.
match (country:Country{name:"UK"})
and then proceeds to the main query defined by the variable country. So I am expecting the query times to be similar given that we are starting from the same known node and it will be traversing the same number of nodes related to it in both DBs.
But I am getting very different performance for my query depending on whether I run it in the full DB or in a single-country DB.
I immediately thought that I must have some kind of "Cartesian Relationship" issue going on so I profiled the query in the full DB and a single country DB but the profile is exactly the same for each step in the plan. I was assuming that the profile would reveal a marked increase in db hits at some point in the plan, but the values are the same. Am I mistaken in what profile is displaying?
Some sizing:
The full DB has about 70k nodes and the test DB 672 nodes; the query takes 218764 ms to complete in the full DB versus circa 3407 ms in the test DB.
While writing this I realised that there will be an increase in the number of outgoing relationships on certain nodes (suppliers can supply different countries) which I think is probably the cause, but the question remains as to why I am not seeing any indication of this in the profiling.
Any thoughts welcome.
What version are you using?
Both query times are way too long for your dataset size.
So you might check your configuration / disk.
Did you create an index/constraint for :Country(name) and is that index online?
And please share your query and your query plans.
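For reference, the checks in the last two points can look like this (assuming the :Country(name) model from the question):

```cypher
// create the index if it is missing, then verify it shows as ONLINE via :schema
CREATE INDEX ON :Country(name);

// profile the query to confirm the start node is found via the index,
// not via a full label or store scan
PROFILE MATCH (country:Country {name: "UK"}) RETURN country;
```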
Environment: Neo4j Community 1.8.2, Node.js 0.10.22, Debian Squeeze, JDK 1.6.x
The problem I'm about to describe is very sporadic, and we are at a loss to figure out what in our code could be causing it. So this is a shot in the dark...
All of our nodes are assigned a GUID property on creation via a TransactionEventHandler plugin, unless they have an existing GUID property. We have auto-indexing enabled for this GUID property. This seems to work fine. The majority of our queries are GUID-based; that is, we often find nodes by GUID as all or part of the query. We've noticed that, rarely, an existing node with guidA is overwritten with the properties of a just-created node with guidB. Note that in this case the GUIDs were actually generated by a foreign system (we're importing users from one system into another). We can see this happening because we keep a version history for each GUID, and we can see that at the time the problem occurs, both guidA and guidB share the same Neo4j node id. It also might be the case that a node with guidB had been created and then deleted some time in the past. We have to do more experimentation to confirm this.
One hypothesis is that:
the node with guidB was created in the past and had Neo4j id = 1234.
It was then deleted which allowed id 1234 to be reused at some time in the future. However, the guidB --> 1234 record still existed in the index.
The node with guidA was then created and was given Neo4j id 1234.
The user with guidB was then re-imported into the system, looked up by GUID, and because the original record in the index still remained, the node with id 1234 was found.
The properties of the node with id 1234 were then overwritten with guidB's user properties.
The only reason I came up with this is because I know that the Lucene records are not immediately deleted when the associated node is deleted. Again, this happens infrequently and the key may be the deletion of the node.
Any possibility that this is an indexing bug?
This issue with auto-indexing was fixed at some point.
It only happens across server restarts that occur after the deletion and before the new node is created; that's why it is so rare.
What you can do is query the index for the just-deleted GUID; the stale entry will then be removed. For safety you can also add a check that compares the GUID of the node returned from the index with the GUID you searched for.
Probably a good idea to have a job go over your data and check the index / re-index the data by re-setting the guid property.
And since it is a GUID, you could probably use the unique node creation features with the GUID to create the nodes in the first place?
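In Cypher from Neo4j 2.0 on, the get-or-create pattern for this is MERGE (on 1.8, unique creation goes through the index API instead). A sketch, with a hypothetical :User label and example GUID:

```cypher
// get-or-create: matches the existing node for this GUID, or creates it atomically
MERGE (u:User {guid: "123e4567-e89b-12d3-a456-426614174000"})
ON CREATE SET u.created = timestamp()
RETURN u;
```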
I've created a Neo4j graph DB via Java with the embedded database. When I looked at my IDs, I thought something was wrong, since they are way too high.
The graph has only inserts, no deletes or updates yet. And I see that the Overview Dashboard reports 13182 nodes and 24785 relationships. Any idea why this is so high?
When querying all my nodes and relationships, I see what I expect. I just find it strange that the IDs are so high without any deletes. Is this normal behaviour?
P.s.: I'm running Neo4j 2.0.0 M003
The dashboard is not accurate, as it does not show the real count of nodes, relationships, and properties. Instead it shows the highest ID in use for nodes, relationships, and properties. Neo4j 2.0.0-M03 seems to allocate IDs blockwise; that's why you see the difference. If you delete some stuff in the DB, the highest ID in use might stay at the same level, which is another source of inaccuracy.
If you need the real count of nodes, you can use
start n=node(*) return count(n)
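On newer Neo4j versions, where the start clause is deprecated, the equivalent is:

```cypher
MATCH (n) RETURN count(n)
```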