Suppose you have a list of items (instructions in a function, posts on a blog, episodes in a TV series etc) that need to be kept in order, what is the recommended way to store them in Neo4j? Two possibilities that come to mind:
Assuming the items don't already have a suitable property for sorting by, assign them incrementing sequence numbers.
Use a linked list of nodes.
Which of these is typically recommended? Or is there a third option I'm missing?
Use a linked list.
Sequence numbers still have to be sorted, which is unnecessary overhead. And to do the sort, neo4j has to iterate through every node in the sequence, even if you are only interested in a small part of the sequence.
Related
I have some questions about the ideas proposed in this video.
The speaker shows an array that holds values and pointers, and he also shows a separate "free" linked list, that is updated whenever an item is added/removed.
Why are these used? Doesn't using an array / limiting yourself to a set of free nodes defeat the purpose of a linked list?
Isn't one of the perk of using a linked list the ability to traverse fragmented data?
Why use these free nodes, when you can dynamically allocate storage?
The proposed structure, to me, doesn't seem dynamic at all, and is in fact a convoluted and inefficient array.
The approach you mention makes sense in certain use cases. For example if the common case is that the array is 90% full and most of the time is spent iterating over it, you can very quickly loop over an array and just skip the few empty items. This can be much, much faster than "pointer chasing" which plain linked lists use, because the CPU's hardware prefetcher can predict which memory you will need in advance.
And compared with a plain array and no free list, it has the advantage of O(1) allocation of an element into an empty slot.
I know that in a linked list you dont need preallocated memory and the insertion and deletion method is really easy to do and the only thing I really know about stack is the push and pop method.
Linked lists are good for inserting and removing elements at random positions. In a stack, we only append to or remove from the end.
Linked List vs Array(Stack)
Both Arrays and Linked List can be used to store linear data of similar types, but they both have some advantages and disadvantages over each other.
Following are the points in favour of Linked Lists.
(1) The size of the arrays is fixed: So we must know the upper limit on the number of elements in advance. Also, generally, the allocated memory is equal to the upper limit irrespective of the usage, and in practical uses, upper limit is rarely reached.
(2) Inserting a new element in an array of elements is expensive, because room has to be created for the new elements and to create room existing elements have to shifted.
For example, suppose we maintain a sorted list of IDs in an array id[].
id[] = [1000, 1010, 1050, 2000, 2040, …..].
And if we want to insert a new ID 1005, then to maintain the sorted order, we have to move all the elements after 1000 (excluding 1000).
Deletion is also expensive with arrays until unless some special techniques are used. For example, to delete 1010 in id[], everything after 1010 has to be moved.
So Linked list provides following two advantages over arrays
1) Dynamic size
2) Ease of insertion/deletion
Linked lists have following drawbacks:
1) Random access is not allowed. We have to access elements sequentially starting from the first node. So we cannot do binary search with linked lists.
2) Extra memory space for a pointer is required with each element of the list.
3) Arrays have better cache locality that can make a pretty big difference in performance.
I have nodes with multiple "sourceIds" in one array-valued property called "sourceIds", just because there could be multiple resources a node could be derived from (I'm assembling multiple databases into one Neo4j model).
I want to be able to look up nodes by any of their source IDs. With legacy indexing this was no problem, I would just add a node to the index associated with each element of the sourceIds property array.
Now I wanted to switch to indexing with labels and I'm wondering how that kind of index works here. I can do
CREATE INDEX ON :<label>(sourceIds)
but what does that actually do? I hoped it would just create index entries for each array element, but that doesn't seem to be the case. With
MATCH n:<label> WHERE "testid" in n.sourceIds RETURN n
the query takes between 300ms and 500ms which is too long for an index lookup (other schema indexes work three to five times faster). With
MATCH n:<label> WHERE n.sourceIds="testid" RETURN n
I don't get a result. That's clear because it's an array property but I just gave it a try since it would make sense if array properties would be broken down to their elements for indexing purposes.
So, is there a way to handle array properties with schema indexing or are there plans or will I just have to stick to legacy indexing here? My problem with the legacy Lucene index was that I hit the max number of boolean clauses (1024). Another question thus would be: Can I raise this number? Lucene allows that, but can I do this with the Lucene index used by Neo4j?
Thanks and best regards!
Edit: A bit more elaboration on why I hit the boolean clauses max limit: I need to export specific parts of the database into custom file formats for text processing pipelines. These pipelines use components I cannot (be it for the sake of accessibility or time) change to query Neo4j directly, so I'd rather stay with the defined required file format(s). I do the export via the pattern "give me all IDs in the DB; now, for batches of IDs, query the desired information (e.g. specific paths) from Neo4j and store the results to file". Why I use batches at all? Well, if I don't, things are slowed down significantly via the connection overhead. Thus, large batches are a kind of optimization here.
Schema indexes can only do exact matches right now. Your "testid" in n.sourceIds does not use the index (as shown by your query times). I think there are plans to make this behave better, but I'm waiting for them just as eagerly as you are.
I've actually hit a lower max in the lucene query: 512. If there is a way to increase it I'd love to hear of it. The way I got around it is just doing more than one query if I have one of the rare cases that actually goes over 512 ids. What query are you doing where you need more?
Given two linked lists of integers. I was asked to return a linked list which contains the non-common elements. I know how to do it in O(n^2), any way to do it in O(n)?
Use a hash table.
Iterate through the first linked list, entering the values you come across into a hash table.
Iterate through the second linked list, adding any element not found into the hash table into your list of non-common elements.
This solution should be O(n), assuming no collisions in the hash table.
create a new empty list. have a hash table and populate it with elements of both lists. complexity n. then iterate over each list sequentially and while iterating, put those elements in the new list which are not present in the hash table. complexity n. overall complexity=n
If they're unsorted, then I don't believe it is possible to get better than O(n^2). However, you can do better by sorting them... you can sort in reasonably fast time, and then get something like O(nlogn) (I'm not certain that's what it would be, but I think it can be that fast if you use the right algorithm).
This question already has answers here:
Closed 11 years ago.
Possible Duplicate:
Linked list interview question
This is an interview question for which I don't have an answer.
Given Two lists, You cannot change list and you dont know the length.
Give best possible algorithm to:
Check if two lists are merging at any point?
If merging, at what point they are merging?
If I allow you to change the list how would you modify your algorithm?
I'm assuming that we are talking about simple linked lists and we can safely create a hash table of the list element pointers.
Q1: Iterate to end of both lists, If the respective last elements are the same, the lists merge at some point.
Complexity - O(N), space complexity - O(1)
Q2:
Put all elements of one list into a hash table
Iterate over 2nd list, probing the hash table for each element of the list. The first hit (if any) is the merge point, and we have the position in the 2nd list.
To get the position in the 1st list, iterate over the first list again looking for the element found in the previous step.
Time complexity - O(N). Space complexity - O(N)
Q3:
As Q1, but also reverse the direction of the list pointers.
Then iterate the reversed lists looking for the last common element - that is the merge point - and restoring the list to the original order.
Time complexity - O(N). Space complexity - O(1)
Number 1: Just iterate both and then check if they end with the same element. Thats O(n) and it cant be beaten (as it might possibly be the last element that is common, and getting there always takes O(n)).
Walk those two lists parallel by one element, add each element to Set of visited nodes (can be hash map, or simple set, you only need to check if you visited that node before). At each step check if you visited that node (if yes, then it's merging point), and add it to set of nodes if you visit it first time. Another version (as pointed by #reinier) is to walk only first list, store its nodes in Set and then only check second list against that Set. First approach is faster when your lists merge early, as you don't need to store all nodes from first list. Second is better at worst case, where both list don't merge at all, since it didn't store nodes from second list in Set
see 1.
Instead of Set, you can try to mark each node, but if you cannot modify structure, then it's not so helpful. You could also try unlink each visited node and link it to some guard node (which you check at each step if you encountered it while traversing). It saves memory for Set if list is long enough.
Traverse both the list and have a global variable for finding the number of NULL encountered . If they merge at some point there will be only 1 NULL else there will be two NULL.