Is there a more efficient way to find the middle of a singly-linked list? (Any language) - linked-list

Let's set the context/limitations:
A linked-list consists of Node objects.
Nodes only have a reference to their next node.
A reference to the list is only a reference to the head Node object.
No preprocessing or indexing has been done on the linked-list other than construction (there are no other references to internal nodes or statistics collected, i.e. length).
The last node in the list has a null reference for its next node.
Below is some code for my proposed solution.
Node cursor = head;
Node middle = head;
while (cursor != null) {
cursor = cursor.next;
if (cursor != null) {
cursor = cursor.next;
middle = middle.next;
}
}
return middle;
Without changing the linked-list architecture (not switching to a doubly-linked list or storing a length variable), is there a more efficient way to find the middle element of singly-linked list?
Note: When this method finds the middle of an even number of nodes, it always finds the left middle. This is ideal as it gives you access to both, but if a more efficient method will always find the right middle, that's fine, too.

No, there is no more efficient way, given the information you have available to you.
Think about it in terms of transitions from one node to the next. You have to perform N transitions to work out the list length. Then you have to perform N/2 transitions to find the middle.
Whether you do this as a full scan followed by a half scan based on the discovered length, or whether you run the cursor (at twice speed) and middle (at normal speed) pointers in parallel is not relevant here, the total number of transitions remains the same.
The only way to make this faster would be to introduce extra information to the data structure which you've discounted but, for the sake of completeness, I'll include it here. Examples would be:
making it a doubly-linked list with head and tail pointers, so you could find it in N transitions by "squeezing" in from both ends to the middle. That doubles the storage requirements for pointers however so may not be suitable.
having a skip list with each node pointing to both it's "child" and its "grandchild". This would speed up the cursor transitions resulting in only about N in total (that's N/2 for each of cursor and middle). Like the previous point, there's an extra pointer per node required for this.
maintaining the length of the list separately so you could find the middle in N/2 transitions.
same as the previous point but caching the middle node for added speed under certain circumstances.
That last point bears some extra examination. Like many optimisations, you can trade space for time and the caching shows one way to do it.
First, maintain the length of the list and a pointer to the middle node. The length is initially zero and the middle pointer is initially set to null.
If you're ever asked for the middle node when the length is zero, just return null. That makes sense because the list is empty.
Otherwise, if you're asked for the middle node and the pointer is null, it must be because you haven't cached the value yet.
In that case, calculate it using the length (N/2 transitions) and then store that pointer for later, before returning it.
As an aside, there's a special case here when adding to the end of the list, something that's common enough to warrant special code.
When adding to the end when the length is going from an even number to an odd number, just set middle to middle->next rather than setting it back to null.
This will save a recalculation and works because you (a) have the next pointers and (b) you can work out how the middle "index" (one-based and selecting the left of a pair as per your original question) changes given the length:
Length Middle(one-based)
------ -----------------
0 none
1 1
2 1
3 2
4 2
5 3
: :
This caching means, provided the list doesn't change (or only changes at the end), the next time you need the middle element, it will be near instantaneous.
If you ever delete a node from the list (or insert somewhere other than the end), set the middle pointer back to null. It will then be recalculated (and re-cached) the next time it's needed.
So, for a minimal extra storage requirement, you can gain quite a bit of speed, especially in situations where the middle element is needed more often than the list is changed.

Related

How to get top K elements from count-min-sketch?

I'm reading how the probabilistic data structure count-min-sketch is used in finding the top k elements in a data stream. But I cannot seem to wrap my head around the step where we maintain a heap to get our final answer.
The problem:
We have a stream of items [B, C, A, B, C, A, C, A, A, ...]. We are asked to find out the top k most frequently appearing
items.
My understanding is that, this can be done using micro-batching, in which we accumulate N items before we start doing some real work.
The hashmap+heap approach is easy enough for me to understand. We traverse the micro-batch and build a frequency map (e.g. {B:34, D: 65, C: 9, A:84, ...}) by counting the elements. Then we maintain a min-heap of size k by traversing the frequency map, adding to and evicting from the heap with each [item]:[freq] as needed. Straightforward enough and nothing fancy.
Now with CMS+heap, instead of a hashmap, we have this probabilistic lossy 2D array, which we build by traversing the micro-batch. The question is: how do we maintain our min-heap of size k given this CMS?
The CMS only contains a bunch of numbers, not the original items. Unless I also keep a set of unique elements from the micro-batch, there is no way for me to know which items I need to build my heap against at the end. But if I do, doesn't that defeat the purpose of using CMS to save memory space?
I also considered building the heap in real-time when we traverse the list. With each item coming in, we can quickly update the CMS and get the cumulative frequency of that item at that point. But the fact that this frequency number is cumulative does not help me much. For example, with the example stream above, we would get [B:1, C:1, A:1, B:2, C:2, A:2, C:3, A:3, A:4, ...]. If we use the same logic to update our min-heap, we would get incorrect answers (with duplicates).
I'm definitely missing something here. Please help me understand.
Keep a hashmap of size k, key is id, value is Item(id, count)
Keep a minheap of size k with Item
As events coming in, update the count-min 2d array, get the min, update Item in the hashmap, bubble up/bubble down the heap to recalculate the order of the Item. If heap size > k, poll min Item out and remove id from hashmap as well
Below explanation comes from a comment from this Youtube video:
We need to store the keys, but only K of them (or a bit more). Not all.
When every key comes, we do the following:
Add it to the count-min sketch.
Get key count from the count-min sketch.
Check if the current key is in the heap. If it presents in the heap, we update its count value there. If it not present in the heap, we check if heap is already full. If not full, we add this key to the heap. If heap is full, we check the minimal heap element and compare its value with the current key count value. At this point we may remove the minimal element and add the current key (if current key count > minimal element value).

Why is splitting a Rust's std::collections::LinkedList O(n)?

The .split_off method on std::collections::LinkedList is described as having a O(n) time complexity. From the (docs):
pub fn split_off(&mut self, at: usize) -> LinkedList<T>
Splits the list into two at the given index. Returns everything after the given index, including the index.
This operation should compute in O(n) time.
Why not O(1)?
I know that linked lists are not trivial in Rust. There are several resources going into the how's and why's like this book and this article among several others, but I haven't got the chance to dive into those or the standard library's source code yet.
Is there a concise explanation about the extra work needed when splitting a linked list in (safe) Rust?
Is this the only way? And if not why was this implementation chosen?
The method LinkedList::split_off(&mut self, at: usize) first has to traverse the list from the start (or the end) to the position at, which takes O(min(at, n - at)) time. The actual split off is a constant time operation (as you said). And since this min() expression is confusing, we just replace it by n which is legal. Thus: O(n).
Why was the method designed like that? The problem goes deeper than this particular method: most of the LinkedList API in the standard library is not really useful.
Due to its cache unfriendliness, a linked list is often a bad choice to store sequential data. But linked lists have a few nice properties which make them the best data structure for a few, rare situations. These nice properties include:
Inserting an element in the middle in O(1), if you already have a pointer to that position
Removing an element from the middle in O(1), if you already have a pointer to that position
Splitting the list into two lists at an arbitrary position in O(1), if you already have a pointer to that position
Notice anything? The linked list is designed for situations where you already have a pointer to the position that you want to do stuff at.
Rust's LinkedList, like many others, just store a pointer to the start and end. To have a pointer to an element inside the linked list, you need something like an Iterator. In our case, that's IterMut. An iterator over a collection can function like a pointer to a specific element and can be advanced carefully (i.e. not with a for loop). And in fact, there is IterMut::insert_next which allows you to insert an element in the middle of the list in O(1). Hurray!
But this method is unstable. And methods to remove the current element or to split the list off at that position are missing. Why? Because of the vicious circle that is:
LinkedList lacks almost all features that make linked lists useful at all
Thus (nearly) everyone recommends not to use it
Thus (nearly) no one uses LinkedList
Thus (nearly) no one cares about improving it
Goto 1
Please note that are a few brave souls occasionally trying to improve the situations. There is the tracking issue about insert_next, where people argue that Iterator might be the wrong concept to perform these O(1) operations and that we want something like a "cursor" instead. And here someone suggested a bunch of methods to be added to IterMut (including cut!).
Now someone just has to write a nice RFC and someone needs to implement it. Maybe then LinkedList won't be nearly useless anymore.
Edit 2018-10-25: someone did write an RFC. Let's hope for the best!
Edit 2019-02-21: the RFC was accepted! Tracking issue.
Maybe I'm misunderstanding your question, but in a linked list, the links of each node have to be followed to proceed to the next node. If you want to get to the third node, you start at the first, follow its link to the second, then finally arrive at the third.
This traversal's complexity is proportional to the target node index n because n nodes are processed/traversed, so it's a linear O(n) operation, not a constant time O(1) operation. The part where the list is "split off" is of course constant time, but the overall split operation's complexity is dominated by the dominant term O(n) incurred by getting to the split-off point node before the split can even be made.
One way in which it could be O(1) would be if a pointer existed to the node after which the list is split off, but that is different from specifying a target node index. Alternatively, an index could be kept mapping the node index to the corresponding node pointer, but it would be extra space and processing overhead in keeping the index updated in sync with list operations.
pub fn split_off(&mut self, at: usize) -> LinkedList<T>
Splits the list into two at the given index. Returns everything after the given index, including the index.
This operation should compute in O(n) time.
The documentation is either:
unclear, if n is supposed to be the index,
pessimistic, if n is supposed to be the length of the list (the usual meaning).
The proper complexity, as can be seen in the implementation, is O(min(at, n - at)) (whichever is smaller). Since at must be smaller than n, the documentation is correct that O(n) is a bound on the complexity (reached for at = n / 2), however such a large bound is unhelpful.
That is, the fact that list.split_off(5) takes the same time if list.len() is 10 or 1,000,000 is quite important!
As to why this complexity, this is an inherent consequence of the structure of doubly-linked list. There is no O(1) indexing operation in a linked-list, after all. The operation implemented in C, C++, C#, D, F#, ... would have the exact same complexity.
Note: I encourage you to write a pseudo-code implementation of a linked-list with the split_off operation; you'll realize this is the best you can get without altering the data-structure to be something else.

Swift 3 and Index of a custom linked list collection type

In Swift 3 Collection indices have to conform to Comparable instead of Equatable.
Full story can be read here swift-evolution/0065.
Here's a relevant quote:
Usually an index can be represented with one or two Ints that
efficiently encode the path to the element from the root of a data
structure. Since one is free to choose the encoding of the “path”, we
think it is possible to choose it in such a way that indices are
cheaply comparable. That has been the case for all of the indices
required to implement the standard library, and a few others we
investigated while researching this change.
In my implementation of a custom linked list collection a node (pointing to a successor) is the opaque index type. However, given two instances, it is not possible to tell if one precedes another without risking traversal of a significant part of the chain.
I'm curious, how would you implement Comparable for a linked list index with O(1) complexity?
The only idea that I currently have is to somehow count steps while advancing the index, storing it within the index type as a property and then comparing those values.
Serious downside of this solution is that indices must be invalidated when mutating the collection. And while that seems reasonable for arrays, I do not want to break that huge benefit linked lists have - they do not invalidate indices of unchanged nodes.
EDIT:
It can be done at the cost of two additional integers as collection properties assuming that single linked list implements front insert, front remove and back append. Any meddling around in the middle would anyway break O(1) complexity requirement.
Here's my take on it.
a) I introduced one private integer type property to my custom Index type: depth.
b) I introduced two private integer type properties to the collection: startDepth and endDepth, which both default to zero for an empty list.
Each front insert decrements the startDepth.
Each front remove increments the startDepth.
Each back append increments the endDepth.
Thus all indices startIndex..<endIndex have a reflecting integer range startDepth..<endDepth.
c) Whenever collection vends an index either by startIndex or endIndex it will inherit its corresponding depth value from the collection. When collection is asked to advance the index by invoking index(_ after:) I will simply initialize a new Index instance with incremented depth value (depth += 1).
Conforming to Comparable boils down to comparing left-hand side depth value to the right-hand side one.
Note that because I expand the integer range from both sides as well, all the depth values for the middle indices remain unchanged (thus are not invalidated).
Conclusion:
Traded benefit of O(1) index comparisons at the cost of minor increase in memory footprint and few integer increments and decrements. I expect index lifetime to be short and number of collections relatively small.
If anyone has a better solution I'd gladly take a look at it!
I may have another solution. If you use floats instead of integers, you can gain kind of O(1) insertion-in-the-middle performance if you set the sortIndex of the inserted node to a value between the predecessor and the successor's sortIndex. This would require to store (and update) the predecessor's sortIndex on your nodes (I imagine this should not be to hard since it is only changed on insertion or removal and it can always be propagated 'up').
In your index(after:) method you need to query the successor node, but since you use your node as index, that is be straightforward.
One caveat is the finite precision of floating points, so if on insertion you the distance between the two sort indices are two small, you need to reindex at least part of the list. Since you said you only expect small scale, I would just go through the hole list and use the position for that.
This approach has all the benefits of your own, with the added benefit of good performance on insertion in the middle.

Interview: System/API design

This question was asked in one of the big software company. I have come up with a simple solution and I want to know what others feel about the solution.
You are supposed to design an API and a backend for a system that can
allot phone numbers to people living in a city. The phone numbers will
start from 111-111-1111 and end at 999-999-9999. The API should enable
the clients (people in the city) to do the following:
When a client requests for a phone number, it allots one of the available numbers to them.
Some clients may want fancy numbers, so they can specifically ask for a number to be alloted to them. If the requested number is
available then the system allots it to them, otherwise the system
allots any available number.
The system need not have to know which number is alloted to which
client. The same client may make successive requests and get multiple
phone numbers for himself, but the system is not bothered. At any
point of time, the system only knows which phone numbers are alloted
and which phone numbers are free.
The numbers from 111-111-1111 to 999-999-9999 roughly corresponds to 8 billion numbers. Assuming that memory is not a constraint, I can think of the following two approaches (which are almost similar).
Maintain a huge boolean array of length 8 billion and have a next pointer that points to an array index (next is initialized to zero). If the value pointed by next is not free, then forward next until a free number is found. When fancy numbers are requested, just check whether the corresponding index position is free and return the number. The downside of this approach is, when allocating numbers in a regular way, if there is a huge chunk (say 1 billion) numbers in the middle that was allocated by fancy allocation, then the next pointer has to be moved 1 billion times.
To overcome the difficulty mentioned in the previos design, we can use some sort of a linked hashmap. We maintain a doubly linked list (this replaces the array in the previous design) and another array of the same length as the list where each element of the array points to a corresponding element in the list. So when allocating numbers in regular method, we advance a pointer in the linked list and mark nodes as and when we allocate (same as the previous method). When allocating fancy numbers, we can directly find the node in the list that corresponds to the special number requested by first indexing into the array and the following the pointer. Once the node is identified, short circuit the previous node and the next node so that we do not have to skip the used numbers one by one (which was the problem with the previous approach) when doing a regular allocation.
Let me know whether I am on the right track. Please enlighten me with any important details that I am missing.
You can do significantly better in the anser to this question.
First you should design you API. The one recommended by Icarus3 is perfectly good:
string acquireNextAvailableNumber();
boolean acquireRequestedNumber(string special);
The second one returns true (and reserves the number) if it is available, otherwise returns false.
The question doesn't specify how you allocate phone numbers, so allocate them to suit yourself. Make the first 'next available' request return "111-111-1111", the next "111-111-1112" etc. This means you can record all the numbers allocated through 'next' by just remembering the last one allocated. (You'll need to ask whether '111-111-1119" is followed by "111-111-1120" or 111-111-1121", but that's the sort of thing you should be asking anyway. In any case, the important thing is you can work out what is the next number knowing the last allocated one.)
Special requests you will need to store individually. A hash table work, but so does a BST or simply an ordered list. It depends on what tradeoffs you want between space and speed, and how often special numbers are likely to be requested. I'll use a BST (ordered by the number) in the rest of this, for reasons I'll come to.
So, how do you code this? For the next allocated number:
Look at the last allocated number, and find the next in sequence.
Check that number hasn't been allocated as a special number. You can do this very quickly with a BST because if it's there, it will be the lowest entry in the BST.
If the number was in the 'special numbers' database, increment the 'allocated numbers' value (to include that number) and remove the entry from the special numbers. Then repeat this process until you get a number that isn't in the special numbers.
Note that this process ensures that all 'special numbers' lower than the last one allocated by 'next' do not appear in the special numbers database. As the 'last normal number allocated' increases, it absorbs any special numbers allocated that were less than that, removing them from the table. This is what ensures that when we ask whether the next number in sequence is in the special numbers database, we only have to look at the lowest entry.
Checking for a special number is easy. If it is lower than the last 'normal' number allocated it isn't available. Otherwise you check to see if it exists in the BST. If it doesn't, you add it to the BST.
You can optimize this process by storing not just single numbers in the BST, but storing ranges of numbers. If the allocated special numbers are dense, then it reduces the amount of space in the tree and the number of accesses to find if one is in there. During the test to find if the 'next' number discovers a rnage of size n, then you can immediately increment the highest normal number by n, instead of having to go round the loop n times.
First, you did not prototype your APIs. For example, if I have to design these APIs I will publish 2 APIs.
string acquireNextAvailableNumber();
string acquireRequestedNumber(string special);
Second, you need to decide how you are going to implement it. code driven or data driven ?
You can maintain hash for all these numbers ( it will consume memory ) and quickly query the availability of the number. Or
you could maintain single list to store only distributed numbers ( less memory ). So, whenever request comes, you start searching 1 to n numbers in that list ( increased time-complexity ). if any first (or requested) number isn't there then you allocate it to client and add that entry in the list.
As, there are billion numbers, you will need to consider the trade-off between space and time.
You could also take the advantage of the database.
To enhance previous answers, any BST may not be good enough as insertions or deletions can make it unbalanced. A balanced BST, e.g. Red-Black Tree, should be a good choice.
So, a Red-Black Tree can be created and filled in the beginning to represent available numbers, and each allocation should remove an element from it.
init(from, to) - can be done in O(n) time, a straightforward implementation would be O(n log n). But that is a one-time initialization on your server's start
acquireNextAvailableNumber() - should remove smallest element, time cost O(1)
acquireRequestedNumber(special) - should make a search and remove element if found, guaranteed time cost O(log n)
In Java, a TreeSet<String> or TreeSet<Integer> could be used since it is implemented with Red-Black Tree.
The next question would probably have been that several request-processing threads would access your API, so since Java's TreeSet is not thread-safe, you should have wrapped it at initialization like so:
TreeSet numbers = init(...);
SortedSet availableNumbers = Collections.synchronizedSortedSet(numbers);

Time Complexity of Doubly Linked List Element Removal?

A lot of what I'm reading says that removing an internal element in a doubly linked list (DLL) is O(1); but why is this the case?
I understand why it's O(n) for SLLs; traverse the list O(n) and remove O(1) but don't you still need to traverse the list in a DLL to find the element?
For a doubly linked list, it's constant time to remove an element once you know where it is.
For a singly linked list, it's constant time to remove an element once you know where it and its predecessor are.
Since that link you point to shows a singly linked list removal as O(n) and a doubly linked one as O(1), it's certain that's once you already know where the element is that you want to remove, but not anything else.
In that case, for a doubly linked list, you can just use the prev and next pointers to remove it, giving you O(1). Ignoring the edge cases where you're at the head or tail, that means something like:
corpse->prev->next = corpse->next
corpse->next->prev = corpse->prev
free (corpse)
However, in a singly linked list where you only know the node you want deleted, you can't use corpse->prev to get the one preceding it because there is no prev link.
You have to instead find the previous item by traversing the list from the head, looking for one which has a next of the element you want to remove. That will take O(n), after which it's once again O(1) for the actual removal, such as (again, ignoring the edge cases for simplicity):
lefty = head
while lefty->next != corpse:
lefty = lefty-> next
lefty->next = corpse->next
free (corpse)
That's why the two complexities are different in that article.
As an aside, there are optimisations in a singly-linked list which can make the deletion O(n) (the deletion being effectively O(1) once you've found the item you want to delete, and the previous item). In code terms, that goes something like:
# Delete a node, returns true if found, otherwise false.
def deleteItem(key):
# Special cases (empty list and deleting head).
if head == null: return false
if head.data == key:
curr = head
head = head.next
free curr
return true
# Search non-head part of list (so prev always exists).
prev = head
curr = head.next
while curr != null:
if curr.data == key:
# Found it so delete (using prev).
prev.next = curr.next
free curr
return true
# Advance to next item.
prev = curr
curr = curr.next
# Not found, so fail.
return false
As it's stated where your link points to:
The cost for changing an internal element is based on already having a pointer to it, if you need to find the element first, the cost for retrieving the element is also taken.
So, for both DLL and SLL linear search is O(n), and removal via pointer is O(1).
The complexity of removal in DLL is O(1).
It can also be O(1) in SLL if provided pointer to preceding element and not to the element itself.
This complexity is assuming you know where the element is.
I.e. the operation signature is akin to remove_element(list* l, link* e)
Searching for the element is O(n) in both cases.
#Matuku: You are correct.
I humbly disagree with most answers here trying to justify how delete operation for DLL is O(1). It's not.
Let me explain.
Why are we considering the scenario that we 'would' have the pointer to the node that is being deleted? LinkedLists (Singly/Doubly) are traversed linearly, that's their definition. They have pointers only to the head/tail. How can we suddenly have a pointer to some node in between? That defeats the purpose of this data structure. And going by that assumption, if I have a DLL list of say 1 million nodes, then do I also have to maintain 1 million pointers (let's call them access pointers) pointing to each of those nodes so that I can delete them in O(1)? So how would I store those 1 millions access pointers? And how do I know which access pointer points to the correct data/node that I want to delete?
Can we have a real world example where we 'have' the pointer to the data that has to be deleted 100% of the time?
And if you know the exact location/pointer/reference of/to the node to be deleted, why to even use a LinkedList? Just use array! That's what arrays are for - direct access to what you want!
By assuming that you have direct access to any node you want in DLL is going against the whole idea of LinkedList as a conceptual Data Structure. So I agree with OP, he's correct. I will stick with this - Doubly LinkedLists cannot have O(1) for deleting any node. You still need to start either from head or tail, which brings it down to O(n).
" If " we have the pointer to the node to be deleted say X, then of course it's O(1) because we have pointers to the next and prev node we can delete X. But that big if is imaginary, not real.
We cannot play with the definition of the sacred Data Structure called LinkedLists for some weird assumptions we may have from time to time.

Resources