How determine which algorithm has a run-time of log n - linked-list

The teacher told us that every time you divide something by 2, the run-time is likely to be log n. For instance, if we divide an array into two, each time we traverse one of the array, the run-time would be log n. However, we may run into a case with LinkedList where we may be easily misled. For instance, we may have an algorithm to find the maximum of the list by starting from either the head or the tail in order to have a run-time of less than n. Logically, we may think that the run time would be log n, but it's not. Why is that? And how do you determine that?

Starting from the front or back (or alternating) does not change the basis of the search for the greatest value. All it does is reorder the search strategy.
If you have a sequential, ordered list and you do a binary search, each comparison reduces the possible locations for a match by 1/2.
If you look at one element of the linked list, each comparison reduces the possible locations for a match by 1 element.
That is a crucial difference.

Related

How to get top K elements from count-min-sketch?

I'm reading how the probabilistic data structure count-min-sketch is used in finding the top k elements in a data stream. But I cannot seem to wrap my head around the step where we maintain a heap to get our final answer.
The problem:
We have a stream of items [B, C, A, B, C, A, C, A, A, ...]. We are asked to find out the top k most frequently appearing
items.
My understanding is that, this can be done using micro-batching, in which we accumulate N items before we start doing some real work.
The hashmap+heap approach is easy enough for me to understand. We traverse the micro-batch and build a frequency map (e.g. {B:34, D: 65, C: 9, A:84, ...}) by counting the elements. Then we maintain a min-heap of size k by traversing the frequency map, adding to and evicting from the heap with each [item]:[freq] as needed. Straightforward enough and nothing fancy.
Now with CMS+heap, instead of a hashmap, we have this probabilistic lossy 2D array, which we build by traversing the micro-batch. The question is: how do we maintain our min-heap of size k given this CMS?
The CMS only contains a bunch of numbers, not the original items. Unless I also keep a set of unique elements from the micro-batch, there is no way for me to know which items I need to build my heap against at the end. But if I do, doesn't that defeat the purpose of using CMS to save memory space?
I also considered building the heap in real-time when we traverse the list. With each item coming in, we can quickly update the CMS and get the cumulative frequency of that item at that point. But the fact that this frequency number is cumulative does not help me much. For example, with the example stream above, we would get [B:1, C:1, A:1, B:2, C:2, A:2, C:3, A:3, A:4, ...]. If we use the same logic to update our min-heap, we would get incorrect answers (with duplicates).
I'm definitely missing something here. Please help me understand.
Keep a hashmap of size k, key is id, value is Item(id, count)
Keep a minheap of size k with Item
As events coming in, update the count-min 2d array, get the min, update Item in the hashmap, bubble up/bubble down the heap to recalculate the order of the Item. If heap size > k, poll min Item out and remove id from hashmap as well
Below explanation comes from a comment from this Youtube video:
We need to store the keys, but only K of them (or a bit more). Not all.
When every key comes, we do the following:
Add it to the count-min sketch.
Get key count from the count-min sketch.
Check if the current key is in the heap. If it presents in the heap, we update its count value there. If it not present in the heap, we check if heap is already full. If not full, we add this key to the heap. If heap is full, we check the minimal heap element and compare its value with the current key count value. At this point we may remove the minimal element and add the current key (if current key count > minimal element value).

Why is splitting a Rust's std::collections::LinkedList O(n)?

The .split_off method on std::collections::LinkedList is described as having a O(n) time complexity. From the (docs):
pub fn split_off(&mut self, at: usize) -> LinkedList<T>
Splits the list into two at the given index. Returns everything after the given index, including the index.
This operation should compute in O(n) time.
Why not O(1)?
I know that linked lists are not trivial in Rust. There are several resources going into the how's and why's like this book and this article among several others, but I haven't got the chance to dive into those or the standard library's source code yet.
Is there a concise explanation about the extra work needed when splitting a linked list in (safe) Rust?
Is this the only way? And if not why was this implementation chosen?
The method LinkedList::split_off(&mut self, at: usize) first has to traverse the list from the start (or the end) to the position at, which takes O(min(at, n - at)) time. The actual split off is a constant time operation (as you said). And since this min() expression is confusing, we just replace it by n which is legal. Thus: O(n).
Why was the method designed like that? The problem goes deeper than this particular method: most of the LinkedList API in the standard library is not really useful.
Due to its cache unfriendliness, a linked list is often a bad choice to store sequential data. But linked lists have a few nice properties which make them the best data structure for a few, rare situations. These nice properties include:
Inserting an element in the middle in O(1), if you already have a pointer to that position
Removing an element from the middle in O(1), if you already have a pointer to that position
Splitting the list into two lists at an arbitrary position in O(1), if you already have a pointer to that position
Notice anything? The linked list is designed for situations where you already have a pointer to the position that you want to do stuff at.
Rust's LinkedList, like many others, just store a pointer to the start and end. To have a pointer to an element inside the linked list, you need something like an Iterator. In our case, that's IterMut. An iterator over a collection can function like a pointer to a specific element and can be advanced carefully (i.e. not with a for loop). And in fact, there is IterMut::insert_next which allows you to insert an element in the middle of the list in O(1). Hurray!
But this method is unstable. And methods to remove the current element or to split the list off at that position are missing. Why? Because of the vicious circle that is:
LinkedList lacks almost all features that make linked lists useful at all
Thus (nearly) everyone recommends not to use it
Thus (nearly) no one uses LinkedList
Thus (nearly) no one cares about improving it
Goto 1
Please note that are a few brave souls occasionally trying to improve the situations. There is the tracking issue about insert_next, where people argue that Iterator might be the wrong concept to perform these O(1) operations and that we want something like a "cursor" instead. And here someone suggested a bunch of methods to be added to IterMut (including cut!).
Now someone just has to write a nice RFC and someone needs to implement it. Maybe then LinkedList won't be nearly useless anymore.
Edit 2018-10-25: someone did write an RFC. Let's hope for the best!
Edit 2019-02-21: the RFC was accepted! Tracking issue.
Maybe I'm misunderstanding your question, but in a linked list, the links of each node have to be followed to proceed to the next node. If you want to get to the third node, you start at the first, follow its link to the second, then finally arrive at the third.
This traversal's complexity is proportional to the target node index n because n nodes are processed/traversed, so it's a linear O(n) operation, not a constant time O(1) operation. The part where the list is "split off" is of course constant time, but the overall split operation's complexity is dominated by the dominant term O(n) incurred by getting to the split-off point node before the split can even be made.
One way in which it could be O(1) would be if a pointer existed to the node after which the list is split off, but that is different from specifying a target node index. Alternatively, an index could be kept mapping the node index to the corresponding node pointer, but it would be extra space and processing overhead in keeping the index updated in sync with list operations.
pub fn split_off(&mut self, at: usize) -> LinkedList<T>
Splits the list into two at the given index. Returns everything after the given index, including the index.
This operation should compute in O(n) time.
The documentation is either:
unclear, if n is supposed to be the index,
pessimistic, if n is supposed to be the length of the list (the usual meaning).
The proper complexity, as can be seen in the implementation, is O(min(at, n - at)) (whichever is smaller). Since at must be smaller than n, the documentation is correct that O(n) is a bound on the complexity (reached for at = n / 2), however such a large bound is unhelpful.
That is, the fact that list.split_off(5) takes the same time if list.len() is 10 or 1,000,000 is quite important!
As to why this complexity, this is an inherent consequence of the structure of doubly-linked list. There is no O(1) indexing operation in a linked-list, after all. The operation implemented in C, C++, C#, D, F#, ... would have the exact same complexity.
Note: I encourage you to write a pseudo-code implementation of a linked-list with the split_off operation; you'll realize this is the best you can get without altering the data-structure to be something else.

Splitting and runtime of log n

Sorry, I made a mistake in my earlier question. Because of that I didn't get the answer I wanted.
The teacher told us that every time you divide something by 2, the run-time is likely to be log n. For instance, if we divide an array into two, each time we traverse one of the array, the run-time would be log n. However, we may run into a case with LinkedList where we may be easily misled. For instance, we may have an algorithm to set the nth element of the list to something else by starting from either the head or the tail in order to have a run-time of less than n. Logically, we may think that the run time would be log n, but it's not. Why is that? And how do you determine that?
Do we need to absolutely have splitting to get a run-time of log n? I don't think it makes any logical sense to say the run-time of n when the maximum run-time of the loop is n/2.
I think some concepts need a bit of refining here, because the time complexity is only related to algorithm, not to the size of the data structure you're operating on.
The teacher told us that every time you divide something by 2, the run-time is likely to be log n. For instance, if we divide an array into two, each time we traverse one of the array, the run-time would be log n.
Now, traversing an array, like
for (int i = 0; i < array.size; i++) {
variable = array[i];
}
runs in O(n): the time needed to perform such an operation varies linearly with the size of the array. You will have O(log n) for operations like a binary search on an array, but you cannot generalize this concept to all array operations, and especially not to those who need to iterate over the array.
Now, this sentence
For instance, we may have an algorithm to set the nth element of the list to something else by starting from either the head or the tail in order to have a run-time of less than n.
leads me to believe that you think that the n as used in big O and what you call the "nth element" are directly related. They aren't. On a linked list your only option to go to element n is to go to the start of the list and follow the links down the element you're looking for (or in the case of a double linked list, go to the start or end depending on the position of the element you're looking for), so this operation has a time complexity of O(n), ie linearly related to the length of the collection.

Interview: System/API design

This question was asked in one of the big software company. I have come up with a simple solution and I want to know what others feel about the solution.
You are supposed to design an API and a backend for a system that can
allot phone numbers to people living in a city. The phone numbers will
start from 111-111-1111 and end at 999-999-9999. The API should enable
the clients (people in the city) to do the following:
When a client requests for a phone number, it allots one of the available numbers to them.
Some clients may want fancy numbers, so they can specifically ask for a number to be alloted to them. If the requested number is
available then the system allots it to them, otherwise the system
allots any available number.
The system need not have to know which number is alloted to which
client. The same client may make successive requests and get multiple
phone numbers for himself, but the system is not bothered. At any
point of time, the system only knows which phone numbers are alloted
and which phone numbers are free.
The numbers from 111-111-1111 to 999-999-9999 roughly corresponds to 8 billion numbers. Assuming that memory is not a constraint, I can think of the following two approaches (which are almost similar).
Maintain a huge boolean array of length 8 billion and have a next pointer that points to an array index (next is initialized to zero). If the value pointed by next is not free, then forward next until a free number is found. When fancy numbers are requested, just check whether the corresponding index position is free and return the number. The downside of this approach is, when allocating numbers in a regular way, if there is a huge chunk (say 1 billion) numbers in the middle that was allocated by fancy allocation, then the next pointer has to be moved 1 billion times.
To overcome the difficulty mentioned in the previos design, we can use some sort of a linked hashmap. We maintain a doubly linked list (this replaces the array in the previous design) and another array of the same length as the list where each element of the array points to a corresponding element in the list. So when allocating numbers in regular method, we advance a pointer in the linked list and mark nodes as and when we allocate (same as the previous method). When allocating fancy numbers, we can directly find the node in the list that corresponds to the special number requested by first indexing into the array and the following the pointer. Once the node is identified, short circuit the previous node and the next node so that we do not have to skip the used numbers one by one (which was the problem with the previous approach) when doing a regular allocation.
Let me know whether I am on the right track. Please enlighten me with any important details that I am missing.
You can do significantly better in the anser to this question.
First you should design you API. The one recommended by Icarus3 is perfectly good:
string acquireNextAvailableNumber();
boolean acquireRequestedNumber(string special);
The second one returns true (and reserves the number) if it is available, otherwise returns false.
The question doesn't specify how you allocate phone numbers, so allocate them to suit yourself. Make the first 'next available' request return "111-111-1111", the next "111-111-1112" etc. This means you can record all the numbers allocated through 'next' by just remembering the last one allocated. (You'll need to ask whether '111-111-1119" is followed by "111-111-1120" or 111-111-1121", but that's the sort of thing you should be asking anyway. In any case, the important thing is you can work out what is the next number knowing the last allocated one.)
Special requests you will need to store individually. A hash table work, but so does a BST or simply an ordered list. It depends on what tradeoffs you want between space and speed, and how often special numbers are likely to be requested. I'll use a BST (ordered by the number) in the rest of this, for reasons I'll come to.
So, how do you code this? For the next allocated number:
Look at the last allocated number, and find the next in sequence.
Check that number hasn't been allocated as a special number. You can do this very quickly with a BST because if it's there, it will be the lowest entry in the BST.
If the number was in the 'special numbers' database, increment the 'allocated numbers' value (to include that number) and remove the entry from the special numbers. Then repeat this process until you get a number that isn't in the special numbers.
Note that this process ensures that all 'special numbers' lower than the last one allocated by 'next' do not appear in the special numbers database. As the 'last normal number allocated' increases, it absorbs any special numbers allocated that were less than that, removing them from the table. This is what ensures that when we ask whether the next number in sequence is in the special numbers database, we only have to look at the lowest entry.
Checking for a special number is easy. If it is lower than the last 'normal' number allocated it isn't available. Otherwise you check to see if it exists in the BST. If it doesn't, you add it to the BST.
You can optimize this process by storing not just single numbers in the BST, but storing ranges of numbers. If the allocated special numbers are dense, then it reduces the amount of space in the tree and the number of accesses to find if one is in there. During the test to find if the 'next' number discovers a rnage of size n, then you can immediately increment the highest normal number by n, instead of having to go round the loop n times.
First, you did not prototype your APIs. For example, if I have to design these APIs I will publish 2 APIs.
string acquireNextAvailableNumber();
string acquireRequestedNumber(string special);
Second, you need to decide how you are going to implement it. code driven or data driven ?
You can maintain hash for all these numbers ( it will consume memory ) and quickly query the availability of the number. Or
you could maintain single list to store only distributed numbers ( less memory ). So, whenever request comes, you start searching 1 to n numbers in that list ( increased time-complexity ). if any first (or requested) number isn't there then you allocate it to client and add that entry in the list.
As, there are billion numbers, you will need to consider the trade-off between space and time.
You could also take the advantage of the database.
To enhance previous answers, any BST may not be good enough as insertions or deletions can make it unbalanced. A balanced BST, e.g. Red-Black Tree, should be a good choice.
So, a Red-Black Tree can be created and filled in the beginning to represent available numbers, and each allocation should remove an element from it.
init(from, to) - can be done in O(n) time, a straightforward implementation would be O(n log n). But that is a one-time initialization on your server's start
acquireNextAvailableNumber() - should remove smallest element, time cost O(1)
acquireRequestedNumber(special) - should make a search and remove element if found, guaranteed time cost O(log n)
In Java, a TreeSet<String> or TreeSet<Integer> could be used since it is implemented with Red-Black Tree.
The next question would probably have been that several request-processing threads would access your API, so since Java's TreeSet is not thread-safe, you should have wrapped it at initialization like so:
TreeSet numbers = init(...);
SortedSet availableNumbers = Collections.synchronizedSortedSet(numbers);

Package for fast determination of similarity between two bit sequences

I need to compare a query bit sequence with a database of up to a million bit sequences. All bit sequences are 100 bits long. I need the lookup to be as fast as possible. Are there any packages out there for fast determination of the similarity between two bit sequences? --Edit-- The bit sequences are position sensitive.
I have seen a possible algorithm on Bit Twiddling Hacks but if there is a ready made package that would be better.
If the database is rather static, you may want to build a tree data structure on it.
Search the tree recursively or in multiple threads and per search keep an actual difference variable. If the actual difference becomes greater than what you would consider 'similar', abort the search.
E.g. Suppose we have the following tree:
root
0 1
0 1 0 1
0 1 0 1 0 1 0 1
If you want to look for patterns similar to 011, and only want to allow 1 different bit at most, search like this (recursively or multi-threaded):
Start at the root
Take the left branch (0), this is similar, so difference is still 0
Take the left branch (0), this is different, so difference becomes 1, which is still acceptable
take the left branch (0), this is different, so difference becomes 2, which is too high. Abort looking in this branch.
take the right branch (1), this is equal, so difference remains 1, continue to search in this branch (not shown here)
Take the right branch (1), this is equal, so difference remains 0, go on
take the left branch (0), this is different, so difference becomes 1, which is still acceptable, go on.
This goes on until you have found your bit patterns.
If your bit patterns are more dynamic and being updated in your application, you will have to update the tree.
If memory is a problem, consider going to 64-bit.
If you want to look up the, let's say 50, most matching patterns, and we can assume that the input data set is rather static (or can be dynamically updated), you can repeat the initial phase of the previous comment, so:
For every bit pattern, count the bits.
Store the bit patterns in a multi_map (if you use STL, Java probably has something similar)
Then, use the following algorithm:
Make 2 collections: one for storing the found patterns, one for storing possibly good patterns (this second collection should probably be map, mapping 'distances' to patterns)
Take your own pattern and count the bits, assume this is N
Look in the multimap at index N, all these patterns will have the same sum, but not necessarily be completely identical
Compare all the patterns at index N. If they are equal store the result in the first collection. If they are not equal, store the result in the second collection/map, using the difference as key.
Look in the multimap at index N-1, all these patterns will have a distance of 1 or more
Compare all the patterns at index N-1. If they have a distance of 1, store them in the first collection. If they have a larger distance, store the result in the second collection/map, using the difference as key.
Repeat for index N+1
Now look in the second collection/map and see if there is something stored with distance 1. If it is, remove them from the second collection/map and store them in the first collection.
Repeat this for distance 2, distance 3, ... until you have enough patterns.
If the number of required patterns is not too big, and the average distance is also not too big, then the number of real compares between patterns is probably only a few %.
Unfortunately, since the patterns will be distributed using a Gaussian curve, there will still be quite some patterns to check. I didn't do a mathematical check on it, but in practice, if you don't want too many patterns out of the millions, and the average distance is not too far, you should be able to find the set of most-close patterns by checking only a few percent of the total bit patterns.
Please keep me updated of your results.
I came up with a second alternative.
For every bit pattern of the million ones count the number of bits and store the bit patterns in an STL multi_map (if you're writing in C++).
Then count the number of bits in your pattern. Suppose you have N bits set in your bit pattern.
If you now want to allow at most D differences, look up all the bit patterns in the multi_map having N-D, N-D+1, ..., N-1, N, N+1, ... N+D-1, N+D bits.
Unfortunately, the division of bit patterns in the multi_map will follow a Gaussian pattern, which means that in practice you will still have to compare quite some bit patterns.
(Originally I thought this could be solved by counting even 0's and uneven 1's but this isn't true.)
Assuming that you want to allow 1 difference, you have to look up 3 slots in the multi_map out of the 100 possible slots, leaving you with 3% of the actual bit patterns to do a full compare.

Resources