I am a little bit confused about Solr report terminology; can anyone help me? I am using https://localhost:8443/solr4/admin/cores?action=REPORT&wt=xml to generate the report. In this report I am confused about two tags: one is name="Index error count" and the other is name="Index unindexed count". Can anyone tell me whether they are the same or different?
They are different. As the documentation says:
Index error count: Specifies the count of the error documents (ID starts with ERROR-) in the index. It is used to mark nodes that failed to be indexed. If the value of this parameter is not zero, then there is an issue with the index.
Index unindexed count: Specifies the count of the unindexed documents (ID starts with UNINDEXED-) in the index. It is created for nodes that have the PROP_IS_INDEXED property set to false in the metadata. This property is set to control the indexing process, so it can be > 0. For example, hidden and rendition nodes have this property set to FALSE.
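If you want to inspect the offending documents yourself, you could query the core directly for the two ID prefixes. This is only a sketch: the core name (`alfresco`) and the use of the `ID` field for a prefix query are assumptions that depend on your setup.

```
# List documents that failed to index (assumed core name and field)
https://localhost:8443/solr4/alfresco/select?q=ID:ERROR-*&wt=xml

# List documents deliberately left unindexed
https://localhost:8443/solr4/alfresco/select?q=ID:UNINDEXED-*&wt=xml
```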
My DB has around 2 million nodes (10 different types of nodes) and 3 million relationships in total.
Problem:
I want to run a query on the DB and select a set of nodes (of a similar type) based on the presence of a keyword (or phrase) in one of their fields, called 'Description'.
The 'Description' field is plain text (around 10 lines of text). So I tried to create a full-text search index on this field using the following command:
CREATE FULLTEXT INDEX Descriptions FOR (n:Nodename) ON EACH [n.Description]
This command completes after only 1 second without any errors, and I can see the newly created index in the list of indexes. But when I try to run my query, nothing is returned. I suspect something is wrong with the index, because index creation should take some time on a massive DB like mine.
I used the following command to search and return the nodes:
CALL db.index.fulltext.queryNodes("Descriptions", "Keyword or Phrase") YIELD node, score
RETURN node.Description, score
No records were retrieved after the above command, but I am sure that I have hundreds of matches in my DB. Any idea about this problem?
Or any other solution for a fuzzy text search on a field based on a keyword/phrase?
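One thing worth checking (a sketch; `SHOW INDEXES` assumes a recent Neo4j 4.x, which the `CREATE FULLTEXT INDEX` syntax above also requires) is whether the index has actually finished populating, since index creation returns immediately and population happens in the background:

```cypher
// Check that the full-text index is ONLINE and fully populated
SHOW INDEXES YIELD name, state, populationPercent
WHERE name = "Descriptions"
RETURN name, state, populationPercent;
```

If `state` is not `ONLINE` or `populationPercent` is below 100, queries against the index can return nothing even though matching nodes exist.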
Using Neo4j 3.2.3 inside a Docker instance. Need help interpreting the profiler.
Importing relational data into Neo4j. Initially importing a status column as a relationship to the Respondent node (i.e. (r:Respondent)-[:FROM]->(s:Status) where status has a unique id corresponding to the status column in the flat file). Respondent nodes have a property visit_date which is an integer YYYYMMDD. Using Status as a relationship in:
MATCH (s:Status {id: 1})<-[:FROM]-(r:Respondent)
WHERE r.visit_date >= 20160101 and r.visit_date <= 20161231
RETURN COUNT(r)
first does a node index seek by range on the Respondent node's visit_date property (yielding 101,057 DB hits, identical to the number of respondents with status equal to 1). There is an index on the visit_date property. Next, Neo4j expands all against the Status. This expansion does 303,168 DB hits, equal to the number of all respondents, and a filter is applied to each. I would have expected the number of hits to be lower than in the first range seek; instead, Neo4j fans out.
If I put the status as a property in Respondent and query:
MATCH (r:Respondent)
WHERE r.visit_date >= 20160101 and r.visit_date <= 20161231 and r.status = 1
RETURN COUNT(r);
the range seek is done first on visit_date, and then a filter is applied on status (just on the range of 101,057 respondents). The query is faster and has fewer total DB hits.
I am surprised by the results in the profiler (in particular the fan-out in the expand all after the Respondents have been range-seeked). One caveat is that I am profiling on my laptop with DDR3 RAM limited to 8 GB, so there is only 4 GB dedicated to the Docker container and 3072 MB to the heap, which doesn't leave much over for page caching (500 MB). The speed difference in querying might be due to a poor configuration, but the profiling (i.e. expanding or not) shouldn't be affected by the Neo4j configuration. The question is: why the fan-out when Status is a relationship, given that the range seek has already reduced the result set?
Update 1 (with explain images)
Adds explain images (note: I couldn't figure out how people dump explains/profiles as text). It is weird that the index hint on Status, when it is a relationship, is just as fast despite the actual hits being higher.
Matching Status relationship
Indexing status as property of Respondent
Matching Status relationship using index hint which is just as fast as indexing on status as a property
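For reference, the hinted variant mentioned above might look like this (a sketch reusing the labels and properties from the question; the `USING INDEX` hint syntax is as in Neo4j 3.x):

```cypher
MATCH (s:Status {id: 1})<-[:FROM]-(r:Respondent)
USING INDEX r:Respondent(visit_date)
WHERE r.visit_date >= 20160101 AND r.visit_date <= 20161231
RETURN COUNT(r);
```

Running the statement with a PROFILE prefix in cypher-shell (rather than the browser) prints the plan as plain text, which may help when sharing profiles.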
I am trying to link two types of documents in my Solr index. The parent is named "house" and the child is named "available". So, I want to return a list of houses that have available documents with some filtering. However, the following query gives me around 18 documents, which is wrong. It should return 0 documents.
q=*:*
&fq={!join from=house_id_fk to=house_id}doctype:available AND discount:[1 TO *] AND start_date:[NOW/DAY TO NOW/DAY%2B21DAYS]
&fq={!join from=house_id_fk to=house_id}doctype:available AND sd_year:2014 AND sd_month:11
To debug it, I tried first to check whether there is any available documents with the given filter queries. So, I tried the following query:
q=*:*
&fq=doctype:available AND discount:[1 TO *] AND start_date:[NOW/DAY TO NOW/DAY%2B21DAYS]
&fq=doctype:available AND sd_year:2014 AND sd_month:11
The query gives 0 results, which is correct. As you can see, both queries are the same; the difference is the use of the join query parser. I am a bit confused why the first query returns results. My understanding is that this should not happen, because the second query shows that there are no available documents that satisfy the given filter queries.
I have figured it out.
The reason is simply the type of join in Solr: it is an outer join. Since both filter queries are executed separately, a house that has available documents matching discount:[1 TO *] (with the date range) or matching (sd_year:2014 AND sd_month:11) will be returned, even though my intention was to apply both conditions at the same time.
However, in the second case, both conditions are applied at the same time to find available documents, and then houses are returned based on the matching available documents. Since no available document satisfies both conditions, there is no matching house, which gives zero results.
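One way to get the intended behaviour (a sketch, reusing the field names from the question) is to put all the conditions into a single join filter query, so they are applied together to the child documents before the join:

```
q=*:*
&fq={!join from=house_id_fk to=house_id}(doctype:available AND discount:[1 TO *] AND start_date:[NOW/DAY TO NOW/DAY%2B21DAYS] AND sd_year:2014 AND sd_month:11)
```

With a single fq, only houses whose available documents satisfy every condition at once survive the join.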
It really took some time to figure this out; I hope it helps someone else.
I have nodes with multiple source IDs in one array-valued property called "sourceIds", because a node could be derived from multiple resources (I'm assembling multiple databases into one Neo4j model).
I want to be able to look up nodes by any of their source IDs. With legacy indexing this was no problem, I would just add a node to the index associated with each element of the sourceIds property array.
Now I wanted to switch to indexing with labels and I'm wondering how that kind of index works here. I can do
CREATE INDEX ON :<label>(sourceIds)
but what does that actually do? I hoped it would just create index entries for each array element, but that doesn't seem to be the case. With
MATCH (n:<label>) WHERE "testid" IN n.sourceIds RETURN n
the query takes between 300ms and 500ms which is too long for an index lookup (other schema indexes work three to five times faster). With
MATCH (n:<label>) WHERE n.sourceIds = "testid" RETURN n
I don't get a result. That's clear, because it's an array property, but I gave it a try since it would make sense if array properties were broken down into their elements for indexing purposes.
So, is there a way to handle array properties with schema indexing, are there plans for one, or will I just have to stick to legacy indexing here? My problem with the legacy Lucene index was that I hit the maximum number of boolean clauses (1024). Another question, then: can I raise this number? Lucene allows it, but can I do this with the Lucene index used by Neo4j?
Thanks and best regards!
Edit: A bit more elaboration on why I hit the boolean clauses max limit: I need to export specific parts of the database into custom file formats for text-processing pipelines. These pipelines use components I cannot change (be it for lack of access or time) to query Neo4j directly, so I'd rather stick with the defined required file format(s). I do the export via the pattern "give me all IDs in the DB; now, for batches of IDs, query the desired information (e.g. specific paths) from Neo4j and store the results to file". Why do I use batches at all? Well, if I don't, things are slowed down significantly by the connection overhead. Thus, large batches are a kind of optimization here.
Schema indexes can only do exact matches right now. Your "testid" in n.sourceIds does not use the index (as shown by your query times). I think there are plans to make this behave better, but I'm waiting for them just as eagerly as you are.
I've actually hit a lower max in the Lucene query: 512. If there is a way to increase it, I'd love to hear of it. The way I got around it is just doing more than one query when I have one of the rare cases that actually goes over 512 IDs. What query are you doing where you need more?
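The batching workaround could be sketched like this (Python, purely illustrative; the field name "sourceIds" and the limit of 512 are taken from the discussion above):

```python
# Purely illustrative: split a large ID list into batches so each generated
# Lucene query stays under the boolean clause limit (512, per the answer above).
def chunk_ids(ids, max_clauses=512):
    """Yield successive sub-lists of at most max_clauses IDs."""
    for start in range(0, len(ids), max_clauses):
        yield ids[start:start + max_clauses]

def build_queries(ids, field="sourceIds", max_clauses=512):
    """Build one Lucene OR-query string per batch, e.g. 'sourceIds:a OR sourceIds:b'."""
    return [
        " OR ".join(f"{field}:{value}" for value in batch)
        for batch in chunk_ids(ids, max_clauses)
    ]
```

Each returned string is then run as its own legacy-index query, and the per-batch results are concatenated.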
When I use the Neo4j REST API, there seems to be a bug:
A node was indexed by some index. After I deleted some properties of that node, unindexed it, and then indexed it again, those properties came back.
This happens once in a while, not every time.
I'm sure those properties are deleted, by querying that node in the cypher console after the delete operation.
Also, some posts have reported this without a satisfying answer: the number of nodes/relationships/properties reported by the neo4j webadmin looks crazy. I have 5 nodes (including id 0), but it shows 932 nodes and 4213 properties. This happens every time. Some people say it's the highest ID in use. I don't think it makes any sense semantically to show the highest ID under the "nodes" label. In addition, the highest ID among my nodes is 466, not 932.
I assume you're judging the properties off the count, instead of off a query?
Neo4j's web console uses metadata to display information like node count, property count, and relationship count. This metadata is not always up to date, but it's much faster to use it than to scan the entire graph database for this information every time.
Neo4j will adjust these counts every now and then, but it doesn't defragment its information all the time.
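To get exact numbers instead of the webadmin estimates, you can count explicitly (straightforward Cypher, though it scans the store and may be slow on large graphs):

```cypher
// Exact node and relationship counts
MATCH (n) RETURN count(n) AS nodes;
MATCH ()-[r]->() RETURN count(r) AS relationships;
```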