What is the time complexity of sorting in the Stream API for the sorted() and thenComparing() methods (sorting by multiple fields/conditions)? - java-stream

I have read that the sorted() method in the Stream API may use merge sort (mergesort).
The time complexity for that kind of sort is:
Ω(n log n) - best case
Θ(n log n) - average case
O(n log n) - worst case
Space complexity: O(n) in the worst case.
And what is the time complexity if we sort by multiple fields of a custom object, using thenComparing() to build a chain of comparisons?
How would you calculate the time complexity in such a case?

While the actual algorithm used in Stream.sorted is intentionally unspecified, there are obvious reasons not to implement another sorting algorithm, but use the existing implementation of Arrays.sort.
The current implementation uses TimSort, a variation of merge sort that can exploit ranges of pre-sorted elements within the input, with a best case of being entirely linear, when the input is already sorted, which also applies to the possible case that the input is sorted backwards. In these cases, no additional memory is needed. The average case is somewhere between that best case and the unchanged worst case of O(n log n).
As explained in this answer, general statements about the algorithms used in Arrays.sort are tricky, because all of them are hybrid sort algorithms and constantly improved.
Normally, a comparison function does not depend on the size of the input (the array or collection to sort), and that does not change when you use Comparator.comparing(…).thenComparing(…): a more expensive comparison function only adds a constant factor, which does not affect the overall time complexity, as long as the comparator itself still does not depend on the input size.
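To make that concrete, here is a minimal sketch (the Person record and the ChainedComparatorDemo class are hypothetical, introduced only for this example): with a chain of k key extractors, each comparison costs O(k), which is a constant with respect to n, so the sort stays O(n log n) overall.

import java.util.Comparator;
import java.util.List;
import java.util.stream.Stream;

public class ChainedComparatorDemo {
    // Hypothetical record, used only to have several fields to compare on.
    record Person(String lastName, String firstName, int age) {}

    public static void main(String[] args) {
        List<Person> sorted = Stream.of(
                new Person("Smith", "Anna", 30),
                new Person("Smith", "Bob", 25),
                new Person("Adams", "Carol", 40))
            // Each comparison inspects at most three fields, so a single
            // comparison is O(1) with respect to n; the sort still performs
            // O(n log n) comparisons, so the chain only changes the constant factor.
            .sorted(Comparator.comparing(Person::lastName)
                              .thenComparing(Person::firstName)
                              .thenComparingInt(Person::age))
            .toList();
        sorted.forEach(System.out::println);
    }
}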

Related

Guarantee Print Order After Parallelism

I have X amount of cores doing unique work in parallel, however, their output needs to be printed in order.
Object {
Data data
int order
}
I've tried putting the objects in a min heap after they're done with their parallel work, however, even that is too much of a bottleneck.
Is there any way I could have work done in parallel and guarantee the print order? Is there a known term for my problem? Have others encountered it before?
Is there any way I could have work done in parallel and guarantee the print order?
Needless to say, we design parallelized routines with a focus on efficiency, not on constraining the order of the calculations. The printing of the results at the end, when everything is done, is what should dictate the ordering. In fact, parallel routines often do their calculations conspicuously out of order (e.g., striding on each thread) to minimize thread and synchronization overhead.
The only question is how you structure the results to allow efficient storage and efficient, ordered retrieval. I often just use a mutable buffer or a pre-populated array. It’s very efficient in terms of both storage and retrieval. Or you can use a dictionary, too. It depends upon the nature of your Data. But I’d avoid the order property pattern in your result Object.
Just make sure you're using an optimized build if you are using the standard Swift collections, as this can have a material impact on performance.
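To make the "pre-populated buffer" idea concrete, here is a minimal sketch in Java (the original answer is about Swift; the OrderedOutputDemo class and compute method are purely illustrative): the work runs in whatever order the scheduler picks, each result is written into the slot matching its logical index, and only the final, sequential read-out dictates the order.

import java.util.stream.IntStream;

public class OrderedOutputDemo {
    public static void main(String[] args) {
        int n = 8;
        String[] results = new String[n];   // pre-sized buffer, one slot per task

        // Do the work in parallel, in whatever order the scheduler chooses,
        // but write each result into the slot matching its logical index.
        IntStream.range(0, n).parallel()
                 .forEach(i -> results[i] = compute(i));

        // Retrieval dictates the order: print sequentially once all work is done.
        for (String r : results) {
            System.out.println(r);
        }
    }

    // Stand-in for the real per-item work.
    private static String compute(int i) {
        return "result " + i;
    }
}

The same shape works with any pre-sized buffer, or with a dictionary keyed by position, as the answer suggests.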
Q : Is there a known term for my problem?
Yes, there is. A contradiction:
Definition of contradiction:
2a: a proposition, statement, or phrase that asserts or implies both the truth and falsity of something ("both parts of a contradiction cannot possibly be true" - Thomas Hobbes)
2b: a statement or phrase whose parts contradict each other ("a round square is a contradiction in terms")
3a: logical incongruity
3b: a situation in which inherent factors, actions, or propositions are inconsistent or contrary to one another
(source: Merriam-Webster)
Computer science, having borrowed the terms { PARALLEL | SERIAL | CONCURRENT } from the theory of systems, respects the distinctive (and never overlapping) properties of each such class of operations:
[PARALLEL] orchestration of units-of-work implies that any and every work-unit a) starts, b) gets executed, and c) finishes at the same time, i.e. all enter and leave the [PARALLEL] section at once and are processed at the very same time, not otherwise.
[SERIAL] orchestration of units-of-work implies that all work-units are processed in one static, known, particular order, each work-unit being started only after the previous one has finished its work, i.e. one after another, not otherwise.
[CONCURRENT] orchestration of units-of-work permits starting more than one unit-of-work, if resources and system conditions permit (scheduler priorities obeyed), resulting in an unknown order of execution and unknown times of completion, as both depend on unknown externalities (system conditions and the (non-)availability of resources that are or will be needed for a particular work-unit's processing).
Whereas there is an a-priori known, inherently embedded sense of ORDER in [SERIAL]-type processing (it is pre-wired into the units-of-work orchestration code), there is no such thing in [CONCURRENT] processing, where opportunistic scheduling makes any wished-for order a nondeterministic outcome of the system state, skewed by the coincidence of all other externalities; and in true [PARALLEL] processing the wished-for order collapses to a single point by definition, since everything starts, executes and finishes at the same time, so all units-of-work executed in [PARALLEL] fashion have no choice but to be both first and last at the same time.
Q : Is there any way I could have work done in parallel and guarantee the print order?
No, unless you intentionally (or unknowingly) violate the [PARALLEL] orchestration rules and inject a re-[SERIAL]-iser logic into the work-units, so as to imperatively enforce the wished-for ordering, which is otherwise unknown to, and unnatural for, the originally [PARALLEL] orchestration of the work-units (as is common practice in Python, where GIL-enforced monopolistic stepping is an example of such a step).
Q : Have others encountered it before?
Yes. Since 2011, this question or similar ones have reappeared here on Stack Overflow every semester, in growing numbers each year.

Why is splitting a Rust's std::collections::LinkedList O(n)?

The .split_off method on std::collections::LinkedList is documented as having O(n) time complexity. From the docs:
pub fn split_off(&mut self, at: usize) -> LinkedList<T>
Splits the list into two at the given index. Returns everything after the given index, including the index.
This operation should compute in O(n) time.
Why not O(1)?
I know that linked lists are not trivial in Rust. There are several resources going into the hows and whys, like this book and this article among several others, but I haven't had the chance to dive into those or the standard library's source code yet.
Is there a concise explanation about the extra work needed when splitting a linked list in (safe) Rust?
Is this the only way? And if not why was this implementation chosen?
The method LinkedList::split_off(&mut self, at: usize) first has to traverse the list from the start (or the end) to the position at, which takes O(min(at, n - at)) time. The actual split-off is a constant-time operation (as you said). And since this min() expression is confusing, we can simply replace it with n, which is a valid (if loose) upper bound. Thus: O(n).
Why was the method designed like that? The problem goes deeper than this particular method: most of the LinkedList API in the standard library is not really useful.
Due to its cache unfriendliness, a linked list is often a bad choice to store sequential data. But linked lists have a few nice properties which make them the best data structure for a few, rare situations. These nice properties include:
Inserting an element in the middle in O(1), if you already have a pointer to that position
Removing an element from the middle in O(1), if you already have a pointer to that position
Splitting the list into two lists at an arbitrary position in O(1), if you already have a pointer to that position
Notice anything? The linked list is designed for situations where you already have a pointer to the position that you want to do stuff at.
Rust's LinkedList, like many others, just stores a pointer to the start and the end. To have a pointer to an element inside the linked list, you need something like an Iterator. In our case, that's IterMut. An iterator over a collection can function like a pointer to a specific element and can be advanced carefully (i.e. not with a for loop). And in fact, there is IterMut::insert_next, which allows you to insert an element in the middle of the list in O(1). Hurray!
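As an aside, the same cursor idea can be illustrated with Java's LinkedList, whose ListIterator acts as a pointer into the middle of the list. The sketch below (Java, not Rust, purely to show the concept; CursorInsertDemo is a made-up name) pays O(n) to position the cursor, but inserts in O(1) once it is there.

import java.util.LinkedList;
import java.util.List;
import java.util.ListIterator;

public class CursorInsertDemo {
    public static void main(String[] args) {
        LinkedList<Integer> list = new LinkedList<>(List.of(1, 2, 4, 5));

        // The iterator is the "pointer into the middle of the list".
        // Positioning it costs O(n)...
        ListIterator<Integer> it = list.listIterator();
        while (it.hasNext()) {
            if (it.next() >= 3) {   // first element that should come after 3
                it.previous();      // step back so the insert lands before it
                break;
            }
        }

        // ...but once the cursor is there, the insertion itself is O(1).
        it.add(3);

        System.out.println(list);   // [1, 2, 3, 4, 5]
    }
}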
But IterMut::insert_next is unstable. And methods to remove the current element or to split the list off at that position are missing. Why? Because of this vicious circle:
LinkedList lacks almost all features that make linked lists useful at all
Thus (nearly) everyone recommends not to use it
Thus (nearly) no one uses LinkedList
Thus (nearly) no one cares about improving it
Goto 1
Please note that there are a few brave souls occasionally trying to improve the situation. There is the tracking issue about insert_next, where people argue that Iterator might be the wrong concept to perform these O(1) operations and that we want something like a "cursor" instead. And here someone suggested a bunch of methods to be added to IterMut (including cut!).
Now someone just has to write a nice RFC and someone needs to implement it. Maybe then LinkedList won't be nearly useless anymore.
Edit 2018-10-25: someone did write an RFC. Let's hope for the best!
Edit 2019-02-21: the RFC was accepted! Tracking issue.
Maybe I'm misunderstanding your question, but in a linked list, the links of each node have to be followed to proceed to the next node. If you want to get to the third node, you start at the first, follow its link to the second, then finally arrive at the third.
This traversal's complexity is proportional to the target node index n, because n nodes have to be traversed, so it's a linear O(n) operation, not a constant-time O(1) operation. The part where the list is actually "split off" is of course constant time, but the overall split operation is dominated by the O(n) cost of getting to the split-off point before the split can even be made.
One way in which it could be O(1) would be if a pointer to the node at which the list is split off already existed, but that is different from specifying a target node index. Alternatively, an index structure mapping node indices to the corresponding node pointers could be kept, but that would cost extra space and processing overhead to keep it in sync with list operations.
pub fn split_off(&mut self, at: usize) -> LinkedList<T>
Splits the list into two at the given index. Returns everything after the given index, including the index.
This operation should compute in O(n) time.
The documentation is either:
unclear, if n is supposed to be the index,
pessimistic, if n is supposed to be the length of the list (the usual meaning).
The proper complexity, as can be seen in the implementation, is O(min(at, n - at)) (whichever is smaller). Since at must be smaller than n, the documentation is correct that O(n) is a bound on the complexity (reached for at = n / 2), however such a large bound is unhelpful.
That is, the fact that list.split_off(5) takes the same time if list.len() is 10 or 1,000,000 is quite important!
As to why this complexity: it is an inherent consequence of the structure of a doubly-linked list. There is no O(1) indexing operation in a linked list, after all. The operation implemented in C, C++, C#, D, F#, ... would have exactly the same complexity.
Note: I encourage you to write a pseudo-code implementation of a linked list with the split_off operation; you'll realize this is the best you can get without altering the data structure into something else.
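Taking up that suggestion, here is a rough sketch of such a list in Java (ToyLinkedList is a made-up class, not anything from a standard library): the traversal to the split point is the unavoidable linear part, and the re-linking afterwards is constant time.

import java.util.ArrayList;
import java.util.List;

class ToyLinkedList<T> {
    private static final class Node<E> {
        E value;
        Node<E> prev, next;
        Node(E value) { this.value = value; }
    }

    private Node<T> head, tail;
    private int len;

    void pushBack(T value) {
        Node<T> node = new Node<>(value);
        if (tail == null) {
            head = tail = node;
        } else {
            tail.next = node;
            node.prev = tail;
            tail = node;
        }
        len++;
    }

    // Splits off and returns everything from index `at` onwards.
    ToyLinkedList<T> splitOff(int at) {
        if (at < 0 || at > len) throw new IndexOutOfBoundsException();
        ToyLinkedList<T> rest = new ToyLinkedList<>();
        if (at == len) return rest;                 // nothing to move
        // The linear part: walk `at` links to reach the split point.
        // (A real implementation would walk from the tail when at > len / 2,
        // giving O(min(at, len - at)), but the idea is the same.)
        Node<T> cur = head;
        for (int i = 0; i < at; i++) cur = cur.next;
        // The O(1) part: re-link a constant number of pointers.
        rest.head = cur;
        rest.tail = tail;
        rest.len = len - at;
        if (cur.prev != null) {
            cur.prev.next = null;
            tail = cur.prev;
        } else {
            head = null;
            tail = null;
        }
        cur.prev = null;
        len = at;
        return rest;
    }

    List<T> toList() {
        List<T> out = new ArrayList<>();
        for (Node<T> n = head; n != null; n = n.next) out.add(n.value);
        return out;
    }
}

Whatever you do with the node structure, the walk to `at` is unavoidable unless the caller already holds a pointer (cursor) to that node, which is exactly the cursor API discussed above.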

find the last item but N in a stream w/o storing n items

Suppose there is a stream of data arriving: D(0), D(1), D(2), .... When D(i) comes, I want to know D(i - N). The most straightforward way is to store the most recent N items and keep updating them upon the arrival of new data. But the problem is that N can be so large that there is not enough memory to store them. Is there any way to achieve this by storing far fewer than N items? A constant M << N amount of space would be preferred. Thanks in advance.
Not as far as I can see, unless there is some regularity in the data that you can exploit. If the data are completely random (such that no element can be inferred from the others), then choosing not to save element k makes it impossible to reproduce that element at iteration k + N.
Instead, consider:
Can you reduce N?
Can you store information on disk or (if you are in an embedded environment) on a slower, cheaper form of memory?
Is there some pattern in the data? If there is, e.g., a repeating pattern, you can utilize that; or if there is some mathematical relationship between the numbers, perhaps some formula can aid in reconstructing one number from others. Even if there is no perceptible pattern, perhaps you could use some compression algorithm to reduce the data size?
Is there some limitation to the data, e.g. every number is between 0 and 255? If so, you could perhaps reduce the storage requirements.
(What is the application of this, by the way?)
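For completeness, the "store the most recent N items" baseline mentioned in the question is just a ring buffer. A minimal sketch in Java (LastNBuffer is a made-up name) uses Theta(N) memory, which is exactly the cost the answer says you cannot avoid for arbitrary data:

class LastNBuffer<T> {
    private final Object[] slots;
    private long count;                      // number of items pushed so far

    LastNBuffer(int n) {
        slots = new Object[n];
    }

    // Pushes D(i) and returns D(i - N), or null if fewer than N items have been seen.
    @SuppressWarnings("unchecked")
    T push(T item) {
        int idx = (int) (count % slots.length);
        T lagged = count >= slots.length ? (T) slots[idx] : null;
        slots[idx] = item;                   // overwrite the slot written N steps ago
        count++;
        return lagged;
    }
}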

how to remove duplicates from an array without sorting

I have an array which might contain duplicate objects.
I wonder if it's possible to find and remove the duplicates in the array:
- without sorting (strict requirement)
- without using a temporary secondary array
- possibly in O(N), with N the number of elements in the array
In my case the array is a Lua array, which contains tables:
t = {
  {a, 1},
  {a, 2},
  {b, 1},
  {b, 3},
  {a, 2}
}
In my case, t[5] is a duplicate of t[2], while t[1] is not.
To summarize, you have the following options:
time: O(n^2), no extra memory - for each element in the array look for an equal one linearly
time: O(n*log n), no extra memory - sort first, walk over the array linearly after
time: O(n), memory: O(n) - use a lookup table (edit: note that in Lua a table can be used as a key, but keys are compared by identity, so two structurally equal tables would not be detected as duplicates; you would need to build a composite key from the table's contents)
Pick one. There's no way to do what you want in O(n) time with no extra memory.
Can't be done in O(n) but ...
what you can do is
Iterate through the array
For each member, search forward for repetitions and remove them.
Worst-case complexity is O(n^2)
Iterate the array, stick every value in a hash table, checking if it exists first. If it does, remove it from the original array (or don't write it to the new one). Not very memory efficient, but only O(n), since you are only iterating the array once.
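A sketch of this hash-based approach in Java (the original question is about Lua, so this only illustrates the shape of the solution; DedupDemo is a made-up name): one pass, a set for membership checks, and the original order preserved.

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DedupDemo {
    public static void main(String[] args) {
        List<List<String>> t = List.of(
            List.of("a", "1"), List.of("a", "2"), List.of("b", "1"),
            List.of("b", "3"), List.of("a", "2"));

        Set<List<String>> seen = new HashSet<>();     // O(1) average lookup per element
        List<List<String>> deduped = new ArrayList<>();
        for (List<String> entry : t) {
            // List.equals/hashCode compare by content, so the second {a, 2}
            // collides with the first; in Lua you would build an equivalent
            // composite key from the table's contents.
            if (seen.add(entry)) {
                deduped.add(entry);
            }
        }
        System.out.println(deduped);   // [[a, 1], [a, 2], [b, 1], [b, 3]]
    }
}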
Yes, depending on how you look at it.
You can override the object insertion to prevent insertion of duplicate items. This is O(n) per object insertion and may feel faster for smaller arrays.
If you provide sorted insertion and deletion, then checking becomes O(log n): you keep the list sorted as you insert and delete, so finding an element is a binary search. The cost is that element retrieval is now O(log n) instead of O(1).
This can also be implemented efficiently using things like red-black trees and multitrees, at the cost of additional memory. Such implementations offer several benefits for certain problems. For example, we can get O(log n)-like behavior even for very, very large tables with a small memory footprint by using nested trees. The top-level tree provides a pared-down overview of the dataset, while subtrees provide more refined access when needed.
For example, suppose we have N elements. We could partition them into n1 groups, partition each of those groups further into n2 groups, and so on. The leaf groups then hold N/(n1*n2*...) elements each.
As you can see, the product of the n's becomes huge very quickly even for small n's. If N = 1 trillion elements and n1 = 1000, n2 = 1000, n3 = 1000, a lookup scans at most 1000 + 1000 + 1000 + 1000 = 4000 elements per access (one group per level plus the leaf group), and the bookkeeping overhead is on the order of 10^9 internal nodes.
Compare this to the roughly 500 billion elements touched on average by a direct linear search: this is over 100 million times faster. Compared to a binary tree it uses about 1000 times less memory, but is about 100 times slower (of course there is some overhead for keeping the trees consistent, but even that can be reduced).
If we were to use a binary tree, it would have a depth of about 40. The problem is that there would be about 1 trillion nodes, which is a huge amount of additional memory. By storing multiple values per node (in the above case each node actually holds partial values and further subtrees), we can significantly reduce the memory footprint and still get impressive performance.
Essentially, linear access prevails at small sizes and trees prevail at large sizes, but trees consume more memory. By using multitrees we can combine the best of both worlds: linear access over small groups, with a larger number of elements per node (compared to a binary tree).
Such trees are not trivial to create, but they essentially follow the same algorithmic nature as standard binary trees, red-black trees, AVL trees, etc.
So if you are dealing with large datasets, it is not a huge issue for performance and memory. Essentially, as you probably know, as one goes up the other goes down; multitrees sort of find the optimal medium (assuming you choose your node sizes correctly).
The size of the leaf groups in such a multitree is N/product(n_k, k=1..m), and the memory footprint is the number of internal nodes, roughly product(n_k, k=1..m) (which can generally be reduced by an order of magnitude, or possibly by n_m).

Speeding up Erlang indexation function

So following on from this question:
Erlang lists:index_of function?
I have the following code which works just fine:
-module(test_index_of).
-compile(export_all).

index_of(Q) ->
    N = length(Q),
    Qs = lists:zip(lists:sort(Q), lists:seq(1, N)),
    IndexFn = fun(X) ->
        {_, {_, I}} = lists:keysearch(X, 1, Qs),
        I
    end,
    [IndexFn(X) || X <- Q].

test() ->
    Q = [random:uniform() || _X <- lists:seq(1, 20)],
    {T1, _} = timer:tc(test_index_of, index_of, [Q]),
    io:format("~p~n", [T1]).
Problem is, I need to run the index_of function a very large number of times [10,000] on lists of 20-30 elements; the index_of function is the performance bottleneck in my code. So although it looks to be implemented reasonably efficiently to me, I'm not convinced it's the fastest solution.
Can anyone out there improve [performance-wise] on the current implementation of index_of? [Zed mentioned gb_trees]
Thanks!
You are optimizing an operation on the wrong data type.
If you are going to make 10,000 lookups on the same list of 20-30 items, then it really pays off to do pre-computation to speed up those lookups. For example, let's build a tuple of {Key, Index} pairs, sorted on the key.
1> Ls = [x,y,z,f,o,o].
[x,y,z,f,o,o]
2> Ls2 = lists:zip(Ls, lists:seq(1, length(Ls))).
[{x,1},{y,2},{z,3},{f,4},{o,5},{o,6}]
3> Ts = list_to_tuple(lists:keysort(1, Ls2)).
{{f,4},{o,5},{o,6},{x,1},{y,2},{z,3}}
A recursive binary search for a key on this tuple will very quickly home in on the right index.
Use proplists:normalize to remove duplicates, that is, if it is wrong to return 6 instead of 5 when looking up 'o'. Or use folding and sets to implement your own filter that removes duplicates.
Try building a dict with dict:from_list/1 and make lookups on that dict instead.
But this still raises the question: why do you want the index into a list of something? Lookups with lists:nth/2 have O(n) complexity.
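For readers who don't speak Erlang, the pre-computation idea above looks roughly like this in Java (IndexOfDemo is a made-up name; a HashMap plays the role of the sorted tuple or dict): build the value-to-rank map once, then every lookup is O(1) on average instead of a linear keysearch.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class IndexOfDemo {
    static int[] indexOf(List<Double> q) {
        List<Double> sorted = q.stream().sorted().toList();
        Map<Double, Integer> rank = new HashMap<>();
        for (int i = 0; i < sorted.size(); i++) {
            rank.putIfAbsent(sorted.get(i), i + 1);   // 1-based, first occurrence wins
        }
        int[] out = new int[q.size()];
        for (int i = 0; i < q.size(); i++) {
            out[i] = rank.get(q.get(i));              // O(1) average per lookup
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(
            indexOf(List.of(0.7, 0.1, 0.4))));        // [3, 1, 2]
    }
}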
Not sure if I understand this completely, but if the above is your actual usecase, then...
First of all, you could generate Q as follows, which already saves you the zipping part.
Q=[{N,random:uniform()} || N <- lists:seq(1, 20)]
Taking this further, you could build a tree indexed by the values from the beginning:
Tree = lists:foldl(
    fun(N, T) -> gb_trees:enter(random:uniform(), N, T) end,
    gb_trees:empty(),
    lists:seq(1, 20)
).
Then looking up your index becomes:
index_of(Item, Tree) ->
    case gb_trees:lookup(Item, Tree) of
        {value, Index} -> Index;
        _ -> not_found
    end.
I think you need a custom sort function that records the permutations it makes to the input list. For example, you can start from the lists:sort source. This should give you O(N log N) performance.
Just one question: WTF are you trying to do?
I just can't figure out the practical purpose of this function; I think you are doing something odd. It seems that you have just improved from O(N*M^2) to O(N*M*log M), but it is still very bad.
EDIT:
When I piece together what the goal is, it seems that you are trying to use the Monte Carlo method to determine the probabilities of teams' finishing positions in the English Premier League, but I'm still not sure. You can determine the most probable position ([1,1,2] -> 1) or a fractional number as some sort of average (1.33); the latter, for example, can be achieved with less effort than the others.
In functional programming languages, data structures are more important than in procedural or OO ones. They are more about work-flow: you will do this, and then this, and then ... In a functional language such as Erlang you should think in the manner: I have this input and I want that output; the required output I can determine from this and this, and so on. It may not be necessary to keep a list of things around, as you are used to doing in procedural approaches.
In procedural approaches you are used to arrays as storage with constant-time random access. A list is no such thing. There are no writable arrays in Erlang (even the array module is a balanced tree in reality); you can use a tuple or a binary as a read-only array, but there is no read-write one. I could write a lot about how no data structure really has constant access time at all (from RAM, through arrays in procedural languages, to hash maps), but there is not enough space to explain it in detail here (from RAM technology, through L1/L2/L3 CPU caches, to the need to increase the hash length as the number of keys grows, and the dependency of key-hash computation on key length).
A list is a data structure with O(N) random access time. It is the best structure for storing data that you want to take one by one, in the same order as stored. For small N it can be an acceptable structure even for random access, when the corresponding constant is small. For example, when N is the number of teams (20) in your problem, it can be faster than O(log N) access to some sort of tree. But you must take care how big your constant is.
Key-value lookups are a common component of algorithms. In the procedural world, arrays can serve as the supporting data structure in some circumstances: the key must be an integer, the space of possible keys must not be too sparse, and so on. A list does not substitute well for this purpose, except for very small N, as here. I have learned that the best way to write functional code is to avoid key-value lookups where they are unnecessary. That often requires rearranging the work-flow or refactoring the data structures. Sometimes it feels like turning the problem's solution inside out like a glove.
If I ignore that your probability model is wrong (from the information you provide, it seems that in your model the teams' season points are independent random events, which of course is not true: it is impossible for all teams to have some high number of points, 82 for example, simply because there is a limit on the total points taken by all teams in one season), and set that aside for now: I would simulate one 'path' (a season) and take the result in the form [{78,'Liverpool'}, {81,'Man Utd'}, ...]; I can then sort it using lists:sort without losing the information of which team is where. I would collect the results by iterating over the paths. For each path I would iterate over the sorted simulation result and collect it in a dictionary with the team as the key (constant and very cheap hash computation from an atom, and constant storage because the key set is fixed; tuples/records are also a possibility, but that seems like premature optimization). The value can be a tuple of size 20, where the position in the tuple is the final position and the value is its count.
Something like:
% Process simulation results.
% Results = [[{Points, Team}]]
process(Results) ->
    lists:foldl(fun process_path/2,
                dict:from_list([{Team, {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}} ||
                                   Team <- ['Liverpool', 'Man Utd', ...]]),
                Results).

% Process one simulation path result.
process_path(R, D) ->
    process_path(lists:reverse(lists:sort(R)), D, 1).

process_path([], D, _) -> D;
process_path([{_, Team}|R], D, Pos) ->
    process_path(R, update_team(Team, Pos, D), Pos + 1).

% Update a team's position count in the dictionary.
update_team(Team, Pos, D) ->
    dict:update(Team, fun(T) -> add_count(T, Pos) end, D).

% Add final position Pos to tuple T of counts.
add_count(T, P) -> setelement(P, T, element(P, T) + 1).
Notice that there is nothing like a lists:index_of or lists:nth call here. The resulting complexity looks like O(N*M) or O(N*M*log M) for a small number M of teams, but the real complexity is O(N*M^2), because setelement/3 in add_count/2 is O(M). For bigger M you should change add_count/2 to something more reasonable.

Resources