How to find the kth smallest element within an interval of an array - segment-tree

Suppose I have an unsorted integer array a[] with length N.
Now I want to find the k-th smallest integer within a given interval a[i]-a[j] (1 <= i <= j <= N).
Ex: I have an array a[10]={10,15,3,8,17,11,9,25,38,29}.
Now I want to find the 3-rd smallest element within a[2]-a[7] interval.
The answer is 9.
I know this can be done by sorting that interval, but that costs O(M log M) time, where M = j - i + 1. I also know it can be done with a segment tree, but I can't work out how to modify one to handle such a query.

You can use Quickselect, a modification of Quicksort, to find the kth smallest/largest value in a list. For simplicity, assume you are doing this for the entire array; a sub-interval works the same way once you restrict the partitioning to it.
Essentially you use quicksort, but make only one recursive call instead of two. Once your pivot is placed, you only need to recurse into one of the partitions, depending on where the pivot landed relative to k. The worst case is O(N^2), but the average case is O(N). With random pivots the expected running time is O(N) regardless of the input order.
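To make the idea concrete for an interval, here is a minimal Python sketch of Quickselect applied to a[i]..a[j]. The function name kth_smallest_in_range and the 1-based index convention are assumptions chosen to match the question; the slice is copied so the caller's array is not reordered, which costs an extra O(M) per query.

```python
import random

def kth_smallest_in_range(a, i, j, k):
    """Return the k-th smallest (1-based) value among a[i]..a[j] (1-based, inclusive)."""
    if not (1 <= i <= j <= len(a)) or not (1 <= k <= j - i + 1):
        raise ValueError("index or k out of range")
    items = a[i - 1:j]                  # work on a copy of the interval
    lo, hi, target = 0, len(items) - 1, k - 1
    while True:
        pivot = items[random.randint(lo, hi)]
        # Three-way partition of items[lo..hi] around the pivot value.
        lt, idx, gt = lo, lo, hi
        while idx <= gt:
            if items[idx] < pivot:
                items[lt], items[idx] = items[idx], items[lt]
                lt += 1
                idx += 1
            elif items[idx] > pivot:
                items[idx], items[gt] = items[gt], items[idx]
                gt -= 1
            else:
                idx += 1
        # items[lo..lt-1] < pivot, items[lt..gt] == pivot, items[gt+1..hi] > pivot
        if target < lt:
            hi = lt - 1                 # continue in the left partition only
        elif target > gt:
            lo = gt + 1                 # continue in the right partition only
        else:
            return pivot

a = [10, 15, 3, 8, 17, 11, 9, 25, 38, 29]
print(kth_smallest_in_range(a, 2, 7, 3))
```

This answers each query independently; if many queries hit the same array, a precomputed structure such as the segment tree mentioned in the question may pay off instead.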

Related

What is the sorting complexity time in the streaming API for the sorted() and thenComparing () methods (sorting by multiple fields (conditions))?

I have read that the sorted() method in the Stream API may use merge sort.
If so, the time complexity for this kind of sort is:
Big Θ (n log n) - best
Big Ω (n log n) - average
Big O (n log n) - worst
space complexity – O(n) - worst
And what is the time complexity if we sort by multiple fields of a custom object, using thenComparing() to build a chain of comparisons?
How would you calculate the time complexity in such a case?
While the actual algorithm used in Stream.sorted is intentionally unspecified, there are obvious reasons not to implement another sorting algorithm, but use the existing implementation of Arrays.sort.
The current implementation uses TimSort, a variation of merge sort that can exploit ranges of pre-sorted elements within the input, with a best case of being entirely linear, when the input is already sorted, which also applies to the possible case that the input is sorted backwards. In these cases, no additional memory is needed. The average case is somewhere between that best case and the unchanged worst case of O(n log n).
As explained in this answer, general statements about the algorithms used in Arrays.sort are tricky, because all of them are hybrid sort algorithms and are constantly being improved.
Normally, the cost of a comparison function does not depend on the size of the input (the array or collection to sort). That does not change when using Comparator.comparing(…).thenComparing(…): a more expensive comparison function only adds a constant factor, which does not affect the overall time complexity as long as the comparator still does not depend on the input size.
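The constant-factor argument can be illustrated by counting comparator calls. The sketch below is in Python rather than Java, and the helper names (count_comparisons, by_last, by_last_then_first) are made up for the illustration. Python's sorted happens to use a TimSort variant as well, but the exact call counts are an implementation detail and should not be read as a statement about Arrays.sort.

```python
from functools import cmp_to_key

def count_comparisons(items, compare):
    """Sort a copy of items; return (sorted_list, number_of_compare_calls)."""
    calls = 0

    def counting(a, b):
        nonlocal calls
        calls += 1
        return compare(a, b)

    return sorted(items, key=cmp_to_key(counting)), calls

def by_last(a, b):
    # Three-way comparison on the first field only.
    return (a[0] > b[0]) - (a[0] < b[0])

def by_last_then_first(a, b):
    # Analogue of comparing(last).thenComparing(first): the second field
    # is consulted only when the first field compares equal.
    return by_last(a, b) or (a[1] > b[1]) - (a[1] < b[1])

people = [("Smith", "Zoe"), ("Jones", "Al"), ("Smith", "Amy"), ("Brown", "Bo")]
ordered, calls = count_comparisons(people, by_last_then_first)
print(ordered, calls)
```

Each call to the chained comparator does a bounded amount of work (at most two field comparisons here), so the total cost is the number of calls, which the sort bounds by O(n log n), times a constant.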

Trying to understand the VITERBI algorithm a bit better

I'm currently trying to implement the viterbi algorithm in python, more specifically the version presented in an online course.
As it stands, the algorithm is presented that way:
given a sentence with K tokens, we have to generate K tags .
We assume that tag K-1 = tag K-2 = '*', then for k going from 0 to K,
we set the tag for the token as follows :
tag(WORD_k) = argmax(p(k-1, tag_k-2, tag_k-1) * e( word_k, tag_k) * q(tag_k, tag_k-1, tag_k-1))
From my understanding this is straightforward because the p parameters are already calculated on each step (we go from 1 forward, and we already know p0), and the max for the e and q params can be calculated by one iteration through the tags (since we can't come up with 2 different tags, we basically have to find the tag T for which the q * e product is maximal, and return that). This saves a lot of time, since we are almost at linear time in terms of big O notation, instead of the exponential complexity we would get if we iterated through all possible word/tag combinations.
Am I getting the core of the algorithm correctly or am I missing something out?
Thanks in advance
"since we can't come up with 2 different tags, we basically have to find the tag T for which the q * e product is maximal, and return that"
Yeah, sounds about right. q is the trigram (transition) probability and e is named the emission probability. As you said, p is unchanged between different paths in each stage, so the max is only dependent on the other two.
Each tag sequence should start with two asterisks at positions -2 and -1. So the first assumption is correct: tag_-2 = tag_-1 = '*'.
If we assume p(k, u, v) to be the maximum probability that the last two tags at position k are u and v, based on what we just said about the beginning asterisks, the base case would be
p(0, *, *) = 1.
You had two errors in the general case though. The emission probability is a conditional, e(word_k | tag_k). Also, in the trigram, tag_k-1 is repeated two times where tag_k-2 is needed, so the formula given is incorrect. With w, u, v standing for the tags at positions k-2, k-1 and k, the general case is:
p(k, u, v) = max over w of [ p(k-1, w, u) * q(v | w, u) * e(word_k | v) ]
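A minimal Python sketch of the trigram recurrence may help to check an implementation against. It makes several assumptions that are not in the thread: q(v, w, u) and e(x, v) are hypothetical callables returning P(v | w, u) and P(x | v), there is no STOP transition at the end of the sentence, and probabilities are multiplied directly (a practical implementation would more likely add log-probabilities). The table is named pi here and plays the role of p above.

```python
from itertools import product

def viterbi(words, tags, q, e):
    """Most likely tag sequence for words under a trigram HMM.

    q(v, w, u): P(tag v | previous two tags w, u)   (assumed callable)
    e(x, v):    P(word x | tag v)                   (assumed callable)
    """
    n = len(words)
    if n == 0:
        return []

    def tagset(k):
        # Positions -1 and 0 hold the start symbol '*'.
        return ["*"] if k <= 0 else tags

    pi = {(0, "*", "*"): 1.0}           # base case
    back = {}                           # back-pointers to the best w
    for k in range(1, n + 1):
        for u, v in product(tagset(k - 1), tagset(k)):
            best_w, best_p = None, -1.0
            for w in tagset(k - 2):
                p = pi[(k - 1, w, u)] * q(v, w, u) * e(words[k - 1], v)
                if p > best_p:
                    best_w, best_p = w, p
            pi[(k, u, v)] = best_p
            back[(k, u, v)] = best_w

    # Best final tag pair, then follow the back-pointers.
    u, v = max(product(tagset(n - 1), tagset(n)),
               key=lambda uv: pi[(n, uv[0], uv[1])])
    seq = [u, v]
    for k in range(n, 2, -1):
        seq.insert(0, back[(k, seq[0], seq[1])])
    return seq[-n:]                     # drops the leading '*' when n == 1
```

With T tags and K tokens the three nested loops do on the order of K * T^3 work, which is linear in the sentence length rather than exponential.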

How to effectively use the Levenshtein algorithm for text auto-completion

I'm using the Levenshtein distance algorithm to filter through some text in order to determine the best matching result for the purpose of text field auto-completion (and top 5 best results).
Currently, I have an array of strings, and apply the algorithm to each one in an attempt to determine how close of a match it is to the text which was typed by the user. The problem is that I'm not too sure how to interpret the values outputted by the algorithm to effectively rank the results as expected.
For example: (Text typed = "nvmb")
Result: "game" ; levenshtein distance = 3 (best match)
Result: "number the stars" ; levenshtein distance = 13 (second best match)
This technically makes sense; the second result needs many more 'edits', because of its length. The problem is that the second result is logically and visually a much closer match than the first one. It's almost as if I should ignore any characters longer than the length of the typed text.
Any ideas on how I could achieve this?
Levenshtein distance by itself is good for correcting a query, not for auto-completion.
I can propose an alternative solution:
First, store your strings in a prefix tree instead of an array, so you don't need to examine all of them.
Second, given the user input, enumerate the strings at a fixed distance from it and suggest the completions of each.
Your example: Text typed = "nvmb"
  Distance 0: no completions
  Distance 1: enumerate strings at distance 1; only "numb" will have some completions
Another example: Text typed = "gamb"
  Distance 0: you have only one completion, "gambling"; make it the first suggestion and continue to get 4 more
  Distance 1: you will get "game" and some completions for it
Of course, this approach sometimes gives more than 5 results, but you can order them by another criterion that does not depend on the current query.
I think it is more efficient because you can typically limit the distance to at most two, i.e. check on the order of 1000*n prefixes, where n is the length of the input, which is usually less than the number of stored strings.
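The prefix-tree approach above could be sketched in Python roughly as follows. All names (build_trie, completions, edits1, suggest) are hypothetical, the trie is a plain nested dict, the alphabet is assumed to be lowercase ASCII plus space, and only distances 0 and 1 are tried.

```python
import string

END = ""   # sentinel key marking the end of a stored word

def build_trie(words):
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = word
    return root

def completions(root, prefix):
    """All stored words that start with prefix, in sorted order."""
    node = root
    for ch in prefix:
        if ch not in node:
            return []
        node = node[ch]
    found, stack = [], [node]
    while stack:
        current = stack.pop()
        for key, child in current.items():
            if key == END:
                found.append(child)
            else:
                stack.append(child)
    return sorted(found)

def edits1(text, alphabet=string.ascii_lowercase + " "):
    """All strings at Levenshtein distance exactly 1 from text."""
    splits = [(text[:i], text[i:]) for i in range(len(text) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    substitutions = {a + c + b[1:] for a, b in splits if b for c in alphabet}
    inserts = {a + c + b for a, b in splits for c in alphabet}
    return (deletes | substitutions | inserts) - {text}

def suggest(root, typed, limit=5):
    results = completions(root, typed)          # distance 0 first
    if len(results) < limit:
        for candidate in sorted(edits1(typed)): # then distance 1
            for word in completions(root, candidate):
                if word not in results:
                    results.append(word)
    return results[:limit]

trie = build_trie(["game", "gambling", "number the stars"])
print(suggest(trie, "nvmb"))
print(suggest(trie, "gamb"))
```

Enumerating the edited strings costs roughly 2 * |alphabet| * n candidates per distance level, each checked with one walk down the trie, so the work depends on the input length rather than on the number of stored strings.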
The Levenshtein distance corresponds to the number of single-character insertions, deletions and substitutions in an optimal global pairwise alignment of two sequences if the gap and mismatch costs are all 1.
The Needleman-Wunsch DP algorithm will find such an alignment, in addition to its score (it's essentially the same DP algorithm as the one used to calculate the Levenshtein distance, but with the option to weight gaps, and mismatches between any given pair of characters, arbitrarily). But there are more general models of alignment that allow reduced penalties for gaps at the start or the end (and reduced penalties for contiguous blocks of gaps, which may also be useful here, although it doesn't directly answer the question). At one extreme, you have local alignment, which is where you pay no penalty at all for gaps at the ends -- this is computed by the Smith-Waterman DP algorithm. I think what you want here is in-between: You want to penalise gaps at the start of both the query and test strings, and gaps in the test string at the end, but not gaps in the query string at the end. That way, trailing mismatches cost nothing, and the costs will look like:
Query: nvmb
Costs: 0100000000000000 = 1 in total
Against: number the stars
Query: nvmb
Costs: 1101 = 3 in total
Against: game
Query: number the stars
Costs: 0100111111111111 = 13 in total
Against: nvmb
Query: ber star
Costs: 1110001111100000 = 8 in total
Against: number the stars
Query: some numbor
Costs: 111110000100000000000 = 6 in total
Against: number the stars
(In fact you might want to give trailing mismatches a small nonzero penalty, so that an exact match is always preferred to a prefix-only match.)
The Algorithm
Suppose the query string A has length n, and the string B that you are testing against has length m. Let d[i][j] be the DP table value at (i, j) -- that is, the cost of an optimal alignment of the length-i prefix of A with the length-j prefix of B. If you go with a zero penalty for trailing mismatches, you only need to modify the NW algorithm in a very simple way: instead of calculating and returning the DP table value d[n][m], you just need to calculate the table as before, and find the minimum of any d[n][j], for 0 <= j <= m. This corresponds to the best match of the query string against any prefix of the test string.
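The modification described above (take the minimum over the last row of the table instead of its last cell) can be sketched in Python as follows. The function name prefix_distance is an assumption; all gap and mismatch costs are 1 and trailing characters of the test string are free, as in the zero-penalty variant.

```python
def prefix_distance(query, candidate):
    """Minimum Levenshtein distance between query and any prefix of candidate."""
    n, m = len(query), len(candidate)
    prev = list(range(m + 1))               # row d[0][j] = j
    for i in range(1, n + 1):
        cur = [i] + [0] * m                 # d[i][0] = i
        for j in range(1, m + 1):
            substitute = prev[j - 1] + (query[i - 1] != candidate[j - 1])
            cur[j] = min(substitute, prev[j] + 1, cur[j - 1] + 1)
        prev = cur
    return min(prev)                        # min over d[n][j], 0 <= j <= m

typed = "nvmb"
candidates = ["game", "number the stars"]
print(sorted(candidates, key=lambda c: prefix_distance(typed, c)))
```

Only two rows of the table are kept at a time, so memory is O(m) while time stays O(n * m) per candidate.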

Splitting and runtime of log n

Sorry, I made a mistake in my earlier question. Because of that I didn't get the answer I wanted.
The teacher told us that every time you divide something by 2, the run-time is likely to be log n. For instance, if we divide an array into two, each time we traverse one of the array, the run-time would be log n. However, we may run into a case with LinkedList where we may be easily misled. For instance, we may have an algorithm to set the nth element of the list to something else by starting from either the head or the tail in order to have a run-time of less than n. Logically, we may think that the run time would be log n, but it's not. Why is that? And how do you determine that?
Do we need to absolutely have splitting to get a run-time of log n? I don't think it makes any logical sense to say the run-time of n when the maximum run-time of the loop is n/2.
I think some concepts need a bit of refining here, because time complexity describes how an algorithm's running time grows with the size of its input, not the exact number of steps it takes on a data structure of one particular size.
The teacher told us that every time you divide something by 2, the run-time is likely to be log n. For instance, if we divide an array into two, each time we traverse one of the array, the run-time would be log n.
Now, traversing an array, like
for (int i = 0; i < array.size; i++) {
    variable = array[i];
}
runs in O(n): the time needed to perform such an operation varies linearly with the size of the array. You will have O(log n) for operations like a binary search on an array, but you cannot generalize this concept to all array operations, and especially not to those that need to iterate over the array.
Now, this sentence
For instance, we may have an algorithm to set the nth element of the list to something else by starting from either the head or the tail in order to have a run-time of less than n.
leads me to believe that you think that the n as used in big O and what you call the "nth element" are directly related. They aren't. On a linked list your only option to go to element n is to go to the start of the list and follow the links down to the element you're looking for (or, in the case of a doubly linked list, go to the start or the end depending on the position of the element you're looking for). Starting from the nearer end takes at most n/2 steps, but big O drops constant factors, so this operation still has a time complexity of O(n), i.e. linearly related to the length of the collection.
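A small Python sketch can make the difference visible by counting steps. Both helper names are made up for the illustration, and the linked list is only modeled by arithmetic on positions rather than built from nodes.

```python
def binary_search_steps(sorted_items, target):
    """Number of probes a binary search makes before it stops."""
    lo, hi, steps = 0, len(sorted_items) - 1, 0
    while lo <= hi:
        steps += 1
        mid = (lo + hi) // 2
        if sorted_items[mid] == target:
            return steps
        if sorted_items[mid] < target:
            lo = mid + 1        # discard the left half
        else:
            hi = mid - 1        # discard the right half
    return steps

def linked_list_steps(length, index):
    """Links followed to reach index in a doubly linked list from the nearer end."""
    return min(index, length - 1 - index)

for n in (1000, 2000):
    items = list(range(n))
    worst_search = max(binary_search_steps(items, t) for t in items)
    worst_walk = max(linked_list_steps(n, i) for i in range(n))
    print(n, worst_search, worst_walk)
```

Doubling the input adds about one probe to the binary search, because each probe halves what is left, while the worst-case walk along the list doubles: halving the distance once is a constant factor, not repeated splitting.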

Speeding up Erlang indexation function

So following on from this question:
Erlang lists:index_of function?
I have the following code which works just fine:
-module(test_index_of).
-compile(export_all).
index_of(Q) ->
    N = length(Q),
    Qs = lists:zip(lists:sort(Q), lists:seq(1, N)),
    IndexFn = fun(X) ->
        {_, {_, I}} = lists:keysearch(X, 1, Qs),
        I
    end,
    [IndexFn(X) || X <- Q].
test() ->
    Q = [random:uniform() || _X <- lists:seq(1, 20)],
    {T1, _} = timer:tc(test_index_of, index_of, [Q]),
    io:format("~p~n", [T1]).
Problem is, I need to run the index_of function a very large number of times [10,000] on lists of length 20-30 characters; the index_of function is the performance bottleneck in my code. So although it looks to be implemented reasonably efficiently to me, I'm not convinced it's the fastest solution.
Can anyone out there improve [performance-wise] on the current implementation of index_of ? [Zed mentioned gb_trees]
Thanks!
You are optimizing an operation on the wrong data type.
If you are going to make 10 000 lookups on the same list of 20-30 items, then it really pays off to do some pre-computation to speed up those lookups. For example, let's build a tuple of {Key, Index} pairs, sorted on the key.
1> Ls = [x,y,z,f,o,o].
[x,y,z,f,o,o]
2> Ls2 = lists:zip(Ls, lists:seq(1, length(Ls))).
[{x,1},{y,2},{z,3},{f,4},{o,5},{o,6}]
3> Ts = list_to_tuple(lists:keysort(1, Ls2)).
{{f,4},{o,5},{o,6},{x,1},{y,2},{z,3}}
A recursive binary search for a key on this tuple will very quickly home in on the right index.
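For illustration, the same precompute-then-search idea is sketched below in Python rather than Erlang, using the standard bisect module instead of a hand-written recursive search. The names build_index and index_of are assumptions, and positions are 1-based to match the shell session above.

```python
from bisect import bisect_left

def build_index(items):
    """Precompute (key, position) pairs sorted by key, plus the bare keys."""
    pairs = sorted((item, pos) for pos, item in enumerate(items, start=1))
    keys = [key for key, _ in pairs]
    return keys, pairs

def index_of(item, index):
    """Binary-search the precomputed index; None if item is absent."""
    keys, pairs = index
    i = bisect_left(keys, item)
    if i < len(keys) and keys[i] == item:
        return pairs[i][1]
    return None

index = build_index(["x", "y", "z", "f", "o", "o"])
print(index_of("o", index), index_of("f", index), index_of("q", index))
```

Because bisect_left lands on the first of several equal keys and ties are sorted by position, a duplicated item resolves to its earliest position without a separate de-duplication pass.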
If it is wrong to return 6 instead of 5 when looking up 'o', use proplists:normalize to remove duplicates, or use a fold and sets to implement your own filter that removes them.
Alternatively, try building a dict with dict:from_list/1 and do the lookups on that dict instead.
But this still begs the question: why do you want the index into a list of something? Lookups with lists:nth/2 have O(n) complexity.
Not sure if I understand this completely, but if the above is your actual usecase, then...
First of all, you could generate Q as the following, and you already save the zipping part.
Q=[{N,random:uniform()} || N <- lists:seq(1, 20)]
Taking this further on, you could generate a tree indexed by the values from the beginning:
% lists:foldl/3 calls the fun as fun(Elem, Acc): N is the index, T the tree so far.
Tree = lists:foldl(
    fun(N, T) -> gb_trees:enter(random:uniform(), N, T) end,
    gb_trees:empty(),
    lists:seq(1, 20)
).
Then looking up your index becomes:
index_of(Item, Tree) ->
    case gb_trees:lookup(Item, Tree) of
        {value, Index} -> Index;
        _ -> not_found
    end.
I think you need a custom sort function which records the permutation it applies to the input list. For example, you can start from the source of lists:sort. This should give you O(N*log N) performance.
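One way to get the same effect without modifying a sort routine is to sort the positions instead of the values and then invert that permutation. The Python sketch below illustrates this; the name rank_of_each is hypothetical, and unlike the keysearch-based index_of in the question it gives equal values distinct consecutive ranks rather than the same rank.

```python
def rank_of_each(values):
    """For each element, its 1-based position in the sorted order of values."""
    # order[r] is the original position of the element with rank r + 1.
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0] * len(values)
    for rank, original_pos in enumerate(order, start=1):
        ranks[original_pos] = rank      # invert the permutation
    return ranks

print(rank_of_each([0.5, 0.1, 0.9]))
```

The sort dominates at O(N log N); inverting the permutation is a single O(N) pass, so no per-element search is needed.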
Just one question: WTF are you trying to do?
I just can't find the practical purpose of this function. I think you are doing something odd. It seems that you have just improved from O(NM^2) to O(NM*logM), but that is still very bad.
EDIT:
Putting together what your goal is, it seems that you are trying to use the Monte Carlo method to determine the probabilities of teams' 'finishing positions' in the English Premier League, but I'm still not sure. You can determine the most probable position ([1,1,2] -> 1) or a fractional number as some sort of average (1.33); the latter, for example, can be achieved with less effort than the others.
In functional programming languages, data structures are more important than in procedural or OO ones. Those are more about work-flow: you do this, and then this, and then ... In a functional language such as Erlang you should think in the manner of: I have this input and I want that output; the required output I can determine from this and this, and so on. It may not be necessary to have a list of things as you are used to in procedural approaches.
In procedural approaches you are used to using arrays for storage with constant-time random access. A list is no such thing. There are no arrays in Erlang that you can write to in place (even the array module is in reality a balanced tree). You can use a tuple or a binary as a read-only array, but not a read-write one. I could write a lot about how no data structure with truly constant access time exists at all (from RAM, through arrays in procedural languages, to hash maps), but there is not enough space to explain it in detail here (from RAM technology, through L{1,2,3} CPU caches, to the need to increase the hash length as the number of keys grows and the dependency of hash computation on key length).
A list is a data structure with O(N) random access time. It is the best structure for storing data that you want to take one by one, in the same order as stored. For small N it can be a capable structure for random access, when the corresponding constant is small. For example, when N is the number of teams (20) in your problem, it can be faster than O(log N) access to some sort of tree. But you must take care how big your constant is.
Key-value lookups are a common component of algorithms. In the procedural world, arrays can be used as the supporting data structure in some circumstances: the key must be an integer, the space of possible keys must not be too sparse, and so on. A list does not serve well as a substitute for this purpose, except for very small N. I have learned that the best way to write functional code is to avoid key-value lookups where they are unnecessary. It often requires rearranging the work-flow or refactoring data structures. Sometimes it looks like turning the solution inside out like a glove.
I will ignore that your probability model is wrong. From the information you provide, it seems that in your model the teams' season points are independent random events, which of course is not true. It is impossible for all teams to have some high number of points, 82 for example, simply because there is a limit on the points taken by all teams in one season. So forget about this for now. I would then simulate one 'path' (a season) and take the result in the form [{78,'Liverpool'}, {81,'Man Utd'}, ...]; then I can sort it using lists:sort without losing the information about which team is where. I would collect results by iterating over the paths. For each path I would iterate over the sorted simulation result and collect it in a dictionary where the team is the key (hash computation from an atom is constant and very cheap, and storage is constant because the key set is fixed; tuples/records could be used instead, but that seems like premature optimization). The value can be a tuple of size 20, where the position in the tuple is the final position and the value is its count.
Something like:
% Process simulation results.
% Results = [[{Points, Team}]]
process(Results) ->
    lists:foldl(fun process_path/2,
        dict:from_list([{Team, {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}} ||
            Team <- ['Liverpool', 'Man Utd', ...]]),
        Results).
% process simulation path result
process_path(R, D) ->
    process_path(lists:reverse(lists:sort(R)), D, 1).
% arguments are (SortedPath, Dict, Pos); return the dictionary when the path is exhausted
process_path([], D, _) -> D;
process_path([{_, Team}|R], D, Pos) ->
    process_path(R, update_team(Team, Pos, D), Pos + 1).
% update team position count in dictionary
update_team(Team, Pos, D) ->
    dict:update(Team, fun(T) -> add_count(T, Pos) end, D).
% Add final position P to tuple T of counts
add_count(T, P) -> setelement(P, T, element(P, T) + 1).
Notice that there is nothing like a lists:index_of or lists:nth function. The resulting complexity will look like O(NM) or O(NM log M) for a small number M of teams, but the real complexity is O(NM^2) because setelement/3 in add_count/2 is O(M). For bigger M you should change add_count/2 to something more reasonable.
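For comparison, the same tally can be sketched in Python, where the per-team counts live in a mutable list, so each increment is O(1) instead of copying a tuple. The function name tally_positions and the sample data are invented for the illustration.

```python
def tally_positions(results, teams):
    """Count how often each team finishes in each position.

    results: list of seasons, each a list of (points, team) pairs.
    Returns {team: counts}, where counts[p] is the number of seasons
    in which the team finished in position p + 1.
    """
    counts = {team: [0] * len(teams) for team in teams}
    for season in results:
        ranked = sorted(season, reverse=True)       # most points first
        for pos, (_points, team) in enumerate(ranked):
            counts[team][pos] += 1
    return counts

seasons = [
    [(78, "A"), (81, "B"), (60, "C")],
    [(90, "A"), (70, "B"), (71, "C")],
]
print(tally_positions(seasons, ["A", "B", "C"]))
```

With N seasons and M teams, sorting each season dominates, giving O(N * M log M) overall.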
