I'm reading about how the probabilistic data structure count-min sketch is used to find the top k elements in a data stream, but I cannot seem to wrap my head around the step where we maintain a heap to get our final answer.
The problem:
We have a stream of items [B, C, A, B, C, A, C, A, A, ...]. We are asked to find out the top k most frequently appearing items.
My understanding is that this can be done using micro-batching, in which we accumulate N items before we start doing some real work.
The hashmap+heap approach is easy enough for me to understand. We traverse the micro-batch and build a frequency map (e.g. {B:34, D: 65, C: 9, A:84, ...}) by counting the elements. Then we maintain a min-heap of size k by traversing the frequency map, adding to and evicting from the heap with each [item]:[freq] as needed. Straightforward enough and nothing fancy.
Now with CMS+heap, instead of a hashmap, we have this probabilistic lossy 2D array, which we build by traversing the micro-batch. The question is: how do we maintain our min-heap of size k given this CMS?
The CMS only contains a bunch of numbers, not the original items. Unless I also keep a set of unique elements from the micro-batch, there is no way for me to know which items I need to build my heap against at the end. But if I do, doesn't that defeat the purpose of using CMS to save memory space?
I also considered building the heap in real-time when we traverse the list. With each item coming in, we can quickly update the CMS and get the cumulative frequency of that item at that point. But the fact that this frequency number is cumulative does not help me much. For example, with the example stream above, we would get [B:1, C:1, A:1, B:2, C:2, A:2, C:3, A:3, A:4, ...]. If we use the same logic to update our min-heap, we would get incorrect answers (with duplicates).
I'm definitely missing something here. Please help me understand.
Keep a hashmap of size k; the key is the id, the value is Item(id, count).
Keep a min-heap of size k holding those Items.
As events come in, update the count-min 2D array, get the min, update the Item in the hashmap, and bubble the Item up/down the heap to recalculate its order. If the heap size > k, poll the min Item out and remove its id from the hashmap as well.
The explanation below comes from a comment on this YouTube video:
We need to store the keys, but only K of them (or a bit more). Not all.
When each key comes in, we do the following:
Add it to the count-min sketch.
Get key count from the count-min sketch.
Check if the current key is in the heap. If it is present in the heap, we update its count value there. If it is not present, we check whether the heap is already full. If it is not full, we add this key to the heap. If it is full, we take the minimal heap element and compare its value with the current key's count; at this point we may remove the minimal element and add the current key (if the current key's count > the minimal element's value).
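To make this bookkeeping concrete, here is a minimal Erlang sketch of the same scheme. Everything in it is my own illustration, not from the question or the video: the module name cms_topk, the sparse map used for the sketch cells, erlang:phash2/2 as the per-row hash, and gb_sets (ordered by {Count, Key}) standing in for the min-heap are all assumptions.

-module(cms_topk).
-export([new/3, add/2, topk/1]).

%% State: w, d = sketch width and depth; k = how many heavy hitters to track;
%% cms = map of {Row, Col} -> Count; counts = Key -> estimate for tracked keys;
%% heap = gb_sets of {Estimate, Key}, standing in for a min-heap.
new(W, D, K) ->
    #{w => W, d => D, k => K, cms => #{}, counts => #{}, heap => gb_sets:empty()}.

add(Key, S = #{w := W, d := D, cms := CMS}) ->
    %% 1. Bump one cell per row for this key.
    Cells = [{Row, erlang:phash2({Row, Key}, W)} || Row <- lists:seq(1, D)],
    CMS1 = lists:foldl(fun(Cell, Acc) ->
                               maps:update_with(Cell, fun(C) -> C + 1 end, 1, Acc)
                       end, CMS, Cells),
    %% 2. The estimated frequency is the minimum over the rows.
    Est = lists:min([maps:get(Cell, CMS1) || Cell <- Cells]),
    %% 3. Offer {Key, Est} to the bounded top-k structure.
    update_topk(Key, Est, S#{cms := CMS1}).

update_topk(Key, Est, S = #{k := K, counts := Counts, heap := Heap}) ->
    case maps:find(Key, Counts) of
        {ok, Old} ->
            %% Already tracked: replace its old heap entry with the new estimate.
            Heap1 = gb_sets:add({Est, Key}, gb_sets:delete({Old, Key}, Heap)),
            S#{counts := Counts#{Key := Est}, heap := Heap1};
        error when map_size(Counts) < K ->
            %% Still room: start tracking this key.
            S#{counts := Counts#{Key => Est}, heap := gb_sets:add({Est, Key}, Heap)};
        error ->
            %% Full: evict the current minimum only if the new estimate beats it.
            {MinEst, MinKey} = gb_sets:smallest(Heap),
            if
                Est > MinEst ->
                    Heap1 = gb_sets:add({Est, Key},
                                        gb_sets:delete({MinEst, MinKey}, Heap)),
                    S#{counts := maps:put(Key, Est, maps:remove(MinKey, Counts)),
                       heap := Heap1};
                true ->
                    S
            end
    end.

%% Return the tracked keys, most frequent first.
topk(#{heap := Heap}) ->
    [{Key, Est} || {Est, Key} <- lists:reverse(gb_sets:to_list(Heap))].

Usage would be something like S = lists:foldl(fun cms_topk:add/2, cms_topk:new(1000, 5, 3), Stream), cms_topk:topk(S). Note that besides the sketch cells, only the (at most) k currently tracked keys are stored, which is why keeping them does not defeat the memory savings of the CMS.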
I read this problem multiple times and still don't quite understand it. I just need some help understanding what's going on here.
So, I understand that there are three types of "species": A, B and C. Are these species the alphabet Σ? Also, in the first DFA listed in the problem, what exactly do the numbers in the 110 state represent? I know they're
xyz, where x, y and z are respectively the numbers of individuals of each breed
But I don't understand what 110 means in the first state. Does it mean A and B have 2 children of their own, or that A and B mated?
The questions from this problem are:
(a) What is the alphabet Σ in the DFA’s associated with this strange planet? Also, describe what are the strings in the language specified by these automata.
(b) Describe the rule(s) that specifies whether a string belongs to the language.
(c) Any DFA can be modified so that we have at most one trap state (by easily modifying the original DFA such that any transition leading to a trap state leads to a single particular trap state). Write the transition matrix of the automaton above.
(d) Draw all other DFA’s for the planet if we know that initially there were exactly two individuals on the planet (one possible automaton is provided in the problem description above; draw the other “two”).
(e) Draw all DFA’s for the planet if there were initially exactly three individuals on the planet. If some of them look exactly like each other except for the “initial” state, you can simply draw it once without specifying which state is the initial state.
(f) We define three types of states as follows: i. must-fail states: those states that will certainly lead to a failed planet; ii. might-fail states: those states that can lead to a failed planet; iii. cannot-fail states: those states that can never lead to a failed planet. List all must-fail, might-fail and cannot-fail states with three individuals.
(g) Draw the automaton with the initial state 121. What type is each of the states in this automaton?
If I could get some help understanding this problem and help with the first 2 questions, I would greatly appreciate it! I'm trying to solve it but I just can't quite understand it. Thank you!
But I don't understand what 110 means in the first state. Does it mean A and B have 2 children of their own, or that A and B mated?
It means the planet has 1 A, 1 B and 0 C. The outgoing transition c indicates that 1 A and 1 B mated and died, producing 2 C. Ergo, the planet then has 0 A, 0 B and 2 C, so the state is called 002.
(a) What is the alphabet Σ in the DFA’s associated with this strange planet? Also, describe what are the strings in the language specified by these automata.
From the diagram, the alphabet is {a, b, c}: a represents a B and a C mating (producing two As), b represents an A and a C mating (producing two Bs), and c represents an A and a B mating (producing two Cs).
(b) Describe the rule(s) that specifies whether a string belongs to the language.
This is quite complicated. The DFA is going to depend on the number of individuals of each species in the starting state. Basically, transitions do arithmetic on the three variables A, B and C: each transition subtracts 1 from two of the variables and adds 2 to the third, so a given transition is only possible when the two variables it subtracts from are both non-zero (in other words, some transition exists only when at least two of the variables are non-zero).
Consider state 111. There are three possible transitions to three possible states: a -> 300, b -> 030 and c -> 003. We consider these "failed" planets, because no more breeding can happen, so these are accepting states.
Consider state 220. Transition c gets us to state 112; from there, a gets us to 301, and then b gets us back to 220. This is a loop! So arbitrarily long strings are possible.
Notice that the sum of the variables A, B and C is constant: every state has the same sum. So the number of states is at most the number of ways to express that sum as three non-negative integer terms.
Continue to explore properties like the above to understand more about the transitions. I'll leave that for you as an exercise.
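If it helps to experiment, here is a tiny Erlang sketch of my own (not part of the exercise) that computes the successor states of a planet state {A, B, C}; a letter is only applicable when both breeds that mate are present:

successors({A, B, C}) ->
    [{a, {A + 2, B - 1, C - 1}} || B > 0, C > 0] ++  % B and C mate, producing two As
    [{b, {A - 1, B + 2, C - 1}} || A > 0, C > 0] ++  % A and C mate, producing two Bs
    [{c, {A - 1, B - 1, C + 2}} || A > 0, B > 0].    % A and B mate, producing two Cs

For example, successors({1,1,1}) gives the three transitions to 300, 030 and 003, and successors({2,2,0}) gives only [{c,{1,1,2}}], matching the 220 -> 112 step above.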
I've read the Efficiency Guide, the erlang-questions mailing list archive, and all of the available Erlang books, but I haven't found a precise description of efficient binary pattern matching. (I haven't read the sources yet, though.) Still, I hope that people who have read them will read this post. Here are my questions.
How many match contexts does an erlang binary have?
a) if we match parts of a binary sequentially and just once
A = <<1,2,3,4>>.
<<A1,A2,A3,A4>> = A.
Do we have just one binary match context (moving from the beginning of A to the end), or four?
b) if we match parts of a binary sequentially from the beginning to the end for the first time, and (sequentially again) from the beginning to the end for the second time
B = <<1,2,3,4>>.
<<B1,B2,B3,B4>> = B.
<<B11,B22,B33,B44>> = B.
Do we have just a single match context, which moves from the beginning of B to the end of B and then moves again from the beginning of B to the end of B;
or 2 match contexts, one moving from the beginning of B to the end of B, and another one again from the beginning of B to the end of B (as the first can't move back to the beginning);
or 8 match contexts?
According to the documentation, if I write:
my_binary_to_list(<<H,T/binary>>) ->
    [H|my_binary_to_list(T)];
my_binary_to_list(<<>>) -> [].
there will be only 1 match context for the whole recursion tree, even though this function isn't tail-recursive.
a) Did I get it right that there would be only 1 match context in this case?
b) Did I get it right that if I match an Erlang binary sequentially (from the beginning to the end), it doesn't matter which recursion type (tail or body) is used, from the binary-matching efficiency point of view?
c) What if I'm going to process an Erlang binary NOT sequentially? Say I'm travelling through a binary: first I match the first byte, then the 1000th, then the 5th, then the 10001st, then the 10th... In this case:
d1) If I used body recursion, how many match contexts for this binary would I have: one or more than one?
d2) If I used tail recursion, how many match contexts for this binary would I have: one or more than one?
If I pass a large binary (say 1 megabyte) via tail recursion, will all of the 1 megabyte of data be copied, or is only some kind of pointer to the beginning of this binary passed between calls?
Does it matter which binary I'm matching, big or small: will a match context be created for a binary of any size, or only for large ones?
I am only a beginner in Erlang, so take this answer with a grain of salt.
How many match contexts does an erlang binary have?
a) Only one context is created, but it is entirely consumed in that instance, since there's nothing left to match, and thus it may not be reused.
b) Likewise, the whole binary is split and there is no context left after matching, though one context has been created for each line: the assignments of B1 up to B4 create one context, and the second set of assignments from B11 to B44 also creates one. So in total we get 2 contexts created and consumed.
According to documentation [...]
This section isn't totally clear to me either, but here is what I could figure out.
a) Yes, there will be only one context allocated for the whole duration of the function's recursive execution.
b) Indeed, no mention is made of distinguishing tail recursion from non-tail recursion. However, the example given is clearly a function which can be transformed (though not trivially) into a tail-recursive one. I suppose that the compiler decides to duplicate a matching context when a clause contains more than one path for the context to follow; in this case, it detects that the function is tail-optimizable and goes without doing the allocation.
c) We see the opposite situation in the example following the one you've reproduced, which contains a case expression: there, the context may follow 2 different paths, so the compiler has to force the allocation at each recursion level.
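For illustration only (my own sketch, not the guide's exact example), a function shaped like the following gives the rest of the binary two different fates inside the case, which is the situation described here:

count_until_zero(<<H, T/binary>>, N) ->
    case H of
        0 -> {N, T};                      % T escapes here, so a real sub binary is needed
        _ -> count_until_zero(T, N + 1)   % T is only matched further here
    end;
count_until_zero(<<>>, N) ->
    {N, <<>>}.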
If I pass a large binary (say 1 megabyte) via tail recursion [...]
From § 4.1:
A sub binary is created by split_binary/2 and when a binary is matched out in a binary pattern. A sub binary is a reference into a part of another binary (refc or heap binary, never into another sub binary). Therefore, matching out a binary is relatively cheap because the actual binary data is never copied.
When dealing with binaries, a buffer is used to store the actual data, and any matching of sub-part is implemented as a structure containing a pointer to the original buffer, plus an offset and a length indicating which sub part is being considered. That's the sub binary type being mentioned in the docs.
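As a small illustration (my own example, not taken from the guide), matching a prefix out of a large binary only creates such a reference; the payload itself is not copied:

Big = binary:copy(<<0>>, 1024 * 1024).    % a 1 MB refc binary
<<Header:16/binary, Rest/binary>> = Big.  % Header and Rest are sub binaries pointing into Big's buffer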
Does it matter which binary I'm matching - big or small - ...
From that same § 4.1:
The binary containers are called refc binaries (short for reference-counted binaries) and heap binaries.
Refc binaries consist of two parts: an object stored on the process heap, called a ProcBin, and the binary object itself stored outside all process heaps.
[...]
Heap binaries are small binaries, up to 64 bytes, that are stored directly on the process heap. They will be copied when the process is garbage collected and when they are sent as a message. They don't require any special handling by the garbage collector.
This indicates that, depending on its size, a binary may be stored as a big buffer outside of the processes and referenced in the processes through a proxy structure, or, if it is 64 bytes or less, it will be stored directly in the memory of the process dealing with it. The first case avoids copying the binary when processes sharing it are running on the same node.
I am trying to follow these slides on Bayesian networks.
Can anybody explain to me what it means that a node in a Bayesian network is "instantiated"?
It means the node is created. Spawned. Brought into existence. If B isn't represented by an instance (roughly: does not exist), then the path is different from when B exists (is instantiated).
You can get evidence either by instantiating a node (in which case its truth value is known) or by arriving at the node from some other node. So either the node is instantiated and you get evidence from its truth value, or it is not and you get evidence from the flow.
Instantiating a node in a Bayesian network is different from object-oriented programming. A node is instantiated when its value is known through observing what it represents. If it is not instantiated, then its value can be updated through Bayesian inference.
In the example A -> B -> C, assuming the nodes are boolean (either true or false), if you instantiate C (e.g. C = true) then the values of B and A will be updated using Bayesian inference. However, if B is also instantiated, then it d-separates A and C, so instantiating C will not update A. The rules of d-separation depend on the type of node configuration, so instantiating a node may either d-separate or d-connect nodes.
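To put the chain example in symbols (my own summary, not from the slides): for A -> B -> C the joint distribution factorizes as P(A) P(B|A) P(C|B), so

P(A | B, C) = P(A) P(B|A) P(C|B) / (sum over A' of P(A') P(B|A') P(C|B)) = P(A | B)

The P(C|B) factor cancels, so once B is instantiated, observing C tells you nothing more about A; that is exactly the d-separation behaviour described above.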
So following on from this question:
Erlang lists:index_of function?
I have the following code which works just fine:
-module(test_index_of).
-compile(export_all).

index_of(Q) ->
    N = length(Q),
    Qs = lists:zip(lists:sort(Q), lists:seq(1, N)),
    IndexFn = fun(X) ->
                  {_, {_, I}} = lists:keysearch(X, 1, Qs),
                  I
              end,
    [IndexFn(X) || X <- Q].

test() ->
    Q = [random:uniform() || _X <- lists:seq(1, 20)],
    {T1, _} = timer:tc(test_index_of, index_of, [Q]),
    io:format("~p~n", [T1]).
Problem is, I need to run the index_of function a very large number of times [10,000] on lists of length 20-30 characters; the index_of function is the performance bottleneck in my code. So although it looks to be implemented reasonably efficiently to me, I'm not convinced it's the fastest solution.
Can anyone out there improve [performance-wise] on the current implementation of index_of ? [Zed mentioned gb_trees]
Thanks!
You are optimizing an operation on the wrong data type.
If you are going to make 10,000 lookups on the same list of 20-30 items, then it really pays off to do some pre-computation to speed up those lookups. For example, let's build a tuple of {Key, Index} pairs, sorted on the key.
1> Ls = [x,y,z,f,o,o].
[x,y,z,f,o,o]
2> Ls2 = lists:zip(Ls, lists:seq(1, length(Ls))).
[{x,1},{y,2},{z,3},{f,4},{o,5},{o,6}]
3> Ts = list_to_tuple(lists:keysort(1, Ls2)).
{{f,4},{o,5},{o,6},{x,1},{y,2},{z,3}}
A recursive binary search for a key on this tuple will very quickly home in on the right index.
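For instance, the search could look like this (a sketch of my own; index_in, the not_found atom and the clause layout are just illustrative choices):

index_in(Key, Ts) -> index_in(Key, Ts, 1, tuple_size(Ts)).

index_in(_Key, _Ts, Lo, Hi) when Lo > Hi -> not_found;
index_in(Key, Ts, Lo, Hi) ->
    Mid = (Lo + Hi) div 2,
    case element(Mid, Ts) of
        {Key, Index}        -> Index;                           % exact key hit
        {K, _} when K < Key -> index_in(Key, Ts, Mid + 1, Hi);  % look in the upper half
        _                   -> index_in(Key, Ts, Lo, Mid - 1)   % look in the lower half
    end.

With the Ts above, index_in(x, Ts) returns 1. With duplicate keys such as o it returns whichever matching slot the search happens to land on, which is the caveat addressed next.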
Use proplists:normalize to remove duplicates, that is, if returning 6 instead of 5 when looking up 'o' would be wrong. Or use folding and sets to implement your own filter that removes duplicates.
Try building a dict with dict:from_list/1 and make lookups on that dict instead.
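For example (a sketch, reusing Ls2 from the shell session above):

D = dict:from_list(Ls2).  % duplicate keys collapse to a single entry
dict:find(x, D).          % -> {ok,1}; dict:find/2 returns {ok, Value} or error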
But this still begs the question: why do you want the index into a list of something? Lookups with lists:nth/2 have O(n) complexity.
Not sure if I understand this completely, but if the above is your actual use case, then...
First of all, you could generate Q as follows, which already saves you the zipping part.
Q=[{N,random:uniform()} || N <- lists:seq(1, 20)]
Taking this further, you could generate a tree indexed by the values from the beginning:
Tree = lists:foldl(
    fun(N, T) -> gb_trees:enter(random:uniform(), N, T) end,
    gb_trees:empty(),
    lists:seq(1, 20)
).
Then looking up your index becomes:
index_of(Item, Tree) ->
    case gb_trees:lookup(Item, Tree) of
        {value, Index} -> Index;
        _ -> not_found
    end.
I think you need a custom sort function which records the permutations it makes to the input list. For example, you can start from the lists:sort source. This should give you O(N log N) performance.
Just one question: WTF are you trying to do?
I just can't figure out the practical purpose of this function; I think you are doing something odd. It seems that you have only improved from O(N*M^2) to O(N*M*log M), which is still very bad.
EDIT:
When I piece together what the goal is, it seems that you are trying to use a Monte Carlo method to determine the probabilities of teams' finishing positions in the English Premier League, but I'm still not sure. You can determine the most probable position ([1,1,2] -> 1) or a fractional number as some sort of average (1.33); this last one, for example, can be achieved with less effort than the others.
In functional programming languages, data structures are more important than in procedural or OO ones. They are more about the work-flow: you will do this, and then this, and then... In a functional language such as Erlang you should think in the manner of "I have this input and I want that output; the required output I can determine from this and this", and so on. It may not be necessary to keep a list of things the way you are used to in procedural approaches.
In procedural approaches you are used to arrays for storage, with constant-time random access. A list is not such a thing. There are no writable arrays in Erlang (even the array module is a balanced tree in reality). You can use a tuple or a binary as a read-only array, but there is no read-write one. I could write a lot about how no data structure really has constant access time at all (from RAM, through arrays in procedural languages, to hash maps), but there is not enough space to explain it in detail here (from RAM technology, through L1/L2/L3 CPU caches, to the need to increase hash length as the number of keys grows, and the dependency of key-hash computation on key length).
A list is a data structure with O(N) random access time. It is the best structure for storing data that you want to take one by one, in the same order as it is stored in the list. For small N it can be a capable structure even for random access, when the corresponding constant is small. For example, when N is the number of teams (20) as in your problem, it can be faster than O(log N) access to some sort of tree. But you must take care how big your constant is.
A common component of algorithms is the key-value lookup. In the procedural world, arrays can be used as the supporting data structure in some circumstances: the key must be an integer, the space of possible keys must not be too sparse, and so on. A list does not substitute well for this purpose, except for very small N. I have learned that the best way to write functional code is to avoid key-value lookups where they are unnecessary. That often requires rearranging the work-flow, refactoring data structures, and so on. Sometimes it feels like turning the solution inside out like a glove.
Let me ignore, for now, the fact that your probability model is wrong. From the information you provide, it seems that in your model each team's season points are independent random events, which of course is not true: it is impossible for all teams to have some high number of points, 82 for example, simply because there is a limit on the total points taken by all teams in one season. So forget about that for now. I would simulate one 'path' (a season) and take the result in the form [{78,'Liverpool'}, {81,'Man Utd'}, ...]; then I can sort it using lists:sort without losing the information of which team is where. I would collect the results by iterating over the paths. For each path I would iterate over the sorted simulation result and collect it in a dictionary keyed by team (constant and very cheap hash computation from an atom, and constant storage because the key set is fixed; tuples/records are also possible but that seems like premature optimization). The value can be a tuple of size 20, where the position in the tuple is the final position and the value stored there is its count.
Something like:
% Process simulation results.
% Results = [[{Points, Team}]]
process(Results) ->
    lists:foldl(fun process_path/2,
                dict:from_list([{Team, {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}} ||
                                   Team <- ['Liverpool', 'Man Utd', ...]]),
                Results).

% Process one simulation path result.
process_path(R, D) ->
    process_path(lists:reverse(lists:sort(R)), D, 1).

process_path([], D, _) -> D;
process_path([{_, Team} | R], D, Pos) ->
    process_path(R, update_team(Team, Pos, D), Pos + 1).

% Update the team's position count in the dictionary.
update_team(Team, Pos, D) ->
    dict:update(Team, fun(T) -> add_count(T, Pos) end, D).

% Add final position Pos to the tuple T of counts.
add_count(T, P) -> setelement(P, T, element(P, T) + 1).
Notice that there is nothing like a lists:index_of or lists:nth call here. The resulting complexity looks like O(N*M) or O(N*M*log M) for a small number M of teams, but the real complexity is O(N*M^2) because setelement/3 in add_count/2 is O(M). For bigger M you should change add_count/2 to something more reasonable.
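For example (my own sketch, not part of the answer above), the per-team counts could live in an inner dict instead of a fixed-size tuple, so that updating a single position no longer touches all M slots; the accumulator in process/1 would then start each team at dict:new() instead of the tuple of zeroes:

% Add final position Pos to the per-team dict D of position counts.
add_count(D, Pos) -> dict:update_counter(Pos, 1, D).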