Lowering memory usage in HashMap in Rust

I'm trying to parse a very long file by using a fixed-size sliding window over it. For each such window I'd like to either insert it into a HashMap as the key, with a custom struct as the value, or modify the existing value for that window. My main problem is memory usage, since this should scale to very large inputs (up to several billion distinct windows), and I want to reuse existing keys.
I would like to append the windows (or more specifically their bytes) to a vector and use the index as the key in the HashMap, but use the window at that index for hash computation and key comparison. Because windows overlap, I will append only the part of the window which is new (if I have the input AAAB and size 3 I would have 2 windows, AAA and AAB, but would only store 4 bytes, AAAB; the windows would have indices 0 and 1 respectively), which is the reason for not keying the HashMap with the window itself.
Here's the simplified pseudo-code, in which I omitted the minimal-input problem:
let file = ...; // content of the file, on which I can call windows()
let vec = Rc::new(RefCell::new(Vec::new())); // RefCell allows me to store the Rc in the HashMap while mutating the underlying vector
let mut hm: HashMap<KeyStruct, ValueStruct> = HashMap::new();
for i in file.windows(FIXED_SIZE) {
    let idx = vec.borrow().len();
    vec.borrow_mut().push(i);
    if hm.contains_key(&KeyStruct::new(idx)) {
        // get the key associated with i
        // modify the value associated with i
        // do something with the key
        vec.borrow_mut().pop(); // window is already in the vector
    } else {
        hm.insert(KeyStruct::new(idx), ValueStruct::new(...));
    }
}
I have come up with two different approaches: either modifying the existing HashMap implementation so that it works as intended, or using a custom struct as the key to the HashMap. Since I would only ever use one vector to store the windows, I could store an Rc to it in the HashMap and use that for lookups.
I could also create a struct which holds both an Rc and an index and use it as the key to the HashMap. The latter solution works with a vanilla HashMap, but stores a lot of redundant Rcs to the same vector. I also thought about storing a static pointer to the Rc and getting the Rc back in unsafe blocks, but I would have to guarantee that the position of the Rc on the stack never changes, and I'm not sure I can guarantee that.
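For illustration, here is a minimal sketch of that second approach, assuming u8 data and a fixed window size; KeyStruct's fields, FIXED_SIZE and the u8 vector are my own placeholders, not code from the question. Hash and Eq read the window bytes through the shared vector, so two keys with different indices but identical windows compare equal:

use std::cell::RefCell;
use std::hash::{Hash, Hasher};
use std::rc::Rc;

const FIXED_SIZE: usize = 2; // window size; 2 matches the example in the EDIT below

struct KeyStruct {
    buf: Rc<RefCell<Vec<u8>>>, // shared vector holding the de-duplicated window bytes
    idx: usize,                // start of the window this key denotes
}

impl Hash for KeyStruct {
    fn hash<H: Hasher>(&self, state: &mut H) {
        let buf = self.buf.borrow();
        // Hash the window bytes, not the index, so equal windows hash equally.
        buf[self.idx..self.idx + FIXED_SIZE].hash(state);
    }
}

impl PartialEq for KeyStruct {
    fn eq(&self, other: &Self) -> bool {
        let a = self.buf.borrow();
        let b = other.buf.borrow();
        a[self.idx..self.idx + FIXED_SIZE] == b[other.idx..other.idx + FIXED_SIZE]
    }
}

impl Eq for KeyStruct {}

This keeps the HashMap contract as long as the bytes behind an inserted key are never modified or truncated away, which the push/pop scheme described above respects.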
I tried to implement the first approach (custom HashMap), but it turns out that Buckets use a lot of features which are gated, and I can't compile the project using the stable compiler.
What's even worse is that I would like to get the key that is already in the HashMap on a successful lookup (because different indices can store the same window, for which the hash/cmp would be identical) and use it inside the value structure. I couldn't find a way to do this with the provided HashMap API: the closest I got is entry(), which can yield an OccupiedEntry, but that has no way to retrieve the key, and I can't reach it with unsafe memory tricks either, because the documentation on repr() says that field order in the default representation is not guaranteed. I could store the key (or only the index) in the value struct, but that adds yet another size_of::<usize>() bytes per entry, only to keep the index/key reachable even though it is already stored with that entry.
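For completeness, the workaround mentioned above would look roughly like this (ValueStruct and its field names are hypothetical; the extra usize per entry is exactly the overhead being complained about):

struct ValueStruct {
    idx: usize, // index of the first occurrence of this window, duplicated from the key
    count: u64, // stand-in for whatever per-window data is actually tracked
}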
My questions are:
Is it possible to compile/reuse the parts of std::collections which are not pub, so that I could modify a few methods of HashMap and still compile the whole project?
Is there any way of getting the key after successful lookup in the HashMap? (I even found out that libs team decided against implementing method over Entry which would allow me to get the key...)
Can you see any alternative to solutions that I mentioned?
EDIT
To clarify the problem, let's consider a simple example: the input ABABCBACBC and a window size of 2. We give an index as the key to the HashMap, and it reads window-size bytes starting at that index as the window: with the vector [A, A, C], index 1 and window size 2, the HashMap should try to find a hash/key for AC.
We get windows like this:
AB -> BA -> AB -> BC -> CB -> BA -> AC -> CB -> BC
First pair is AB, we append it into the empty vector and give it an index of 0.
vec = [A, B]
hm = [(0, val)]
The next pair is BA:
start with vec = [A, B]
using an algorithm not shown here, I know that the last inserted window (AB) and the current window (BA) have a common part, namely B
append part of the window to the existing vector, so we have vec = [A, B, A]
perform a lookup using index 1 as the index of the window
it is not found, so a new (key, val) pair is inserted into the HashMap
vec = [A, B, A]
hm = [(0, val0), (1, val1)]
Next up is window AB:
once again we have a common part - A
append: vec = [A, B, A, B]
lookup using index 2
it is successful, so I delete the newly inserted part of the window and get the index of the existing window inside the vector - in this case 0
modify the value, do something with the key, etc.
vec = [A, B, A]
hm = [(0, val0_modified), (1, val1)]
After looping over this input I should end up with:
vec = [A, B, A, B, C, B, A, C]
and indices for pairs could be represented as: [(AB, 0), (BA, 1), (BC, 3), (CB, 4), (AC, 6)]
I do not want to modify keys. I also don't want to modify the vector with the exception of pushing/popping the window during lookup/insertion.
Sidenote: even though I still have redundant information in this particular example after putting everything into the vector, that won't be the case when working with the original data.
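Putting the pieces together, here is a hedged sketch of the whole loop, continuing the KeyStruct sketch above (FIXED_SIZE and KeyStruct come from there); a plain u64 count stands in for ValueStruct, and overlap() is a hypothetical helper playing the role of the "algorithm not shown here". On the ABABCBACBC example it reproduces the vector and indices listed above:

use std::collections::HashMap;

// How many leading bytes of `win` are already stored as the suffix of `buf`?
fn overlap(buf: &[u8], win: &[u8]) -> usize {
    // k == 0 (the empty prefix) always matches, so unwrap() cannot fail.
    (0..=win.len()).rev().find(|&k| buf.ends_with(&win[..k])).unwrap()
}

fn build_index(file: &[u8]) -> (Rc<RefCell<Vec<u8>>>, HashMap<KeyStruct, u64>) {
    let buf = Rc::new(RefCell::new(Vec::new()));
    let mut hm: HashMap<KeyStruct, u64> = HashMap::new();

    for win in file.windows(FIXED_SIZE) {
        // Append only the part of the window that is not already stored.
        let pushed = {
            let mut b = buf.borrow_mut();
            let k = overlap(b.as_slice(), win);
            b.extend_from_slice(&win[k..]);
            win.len() - k
        };
        // The window now occupies the last FIXED_SIZE bytes of the vector.
        let idx = buf.borrow().len() - FIXED_SIZE;
        let key = KeyStruct { buf: Rc::clone(&buf), idx };

        if let Some(count) = hm.get_mut(&key) {
            // Modify the value; getting at the *stored* key here is still the
            // open problem from question 2 above.
            *count += 1;
            let len = buf.borrow().len();
            buf.borrow_mut().truncate(len - pushed); // window was already in the vector
        } else {
            hm.insert(key, 1);
        }
    }
    (buf, hm)
}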

Related

Dart - Pass by value for int but reference for list?

In Dart, looking at the code below, is the list 'passed by reference' and the integer 'passed by value'? If so, which types of data are passed by reference and which by value? If not, what causes this output?
void main() {
  var foo = ['a', 'b'];
  var bar = foo;
  bar.add('c');
  print(foo); // [a, b, c]
  print(bar); // [a, b, c]

  var a = 3;
  int b = a;
  b += 2;
  print(a); // 3
  print(b); // 5
}
The question you're asking can be answered by looking at the difference between a value and a reference type.
Dart, like almost every other programming language, makes a distinction between the two. The reason for this is that memory is divided into the so-called stack and the heap. The stack is fast but very limited, so it cannot hold that much data. (By the way, if you store too much data on the stack you get a Stack Overflow exception, which is where the name of this site comes from ;) ). The heap, on the other hand, is slower but can hold nearly unlimited data.
This is why you have value and reference types. The value types are all your primitive data types (in Dart, all the data types whose names are written in lowercase, like int, bool, double and so on). Their values are small enough to be stored directly on the stack. All the other data types may potentially be much bigger, so they cannot live on the stack. This is why the other, so-called reference types are stored on the heap, and only an address (a reference) is stored on the stack.
So when you assign foo to bar, you're essentially just copying the storage address from foo to bar. Therefore, if you change the data stored under that reference, it seems like you're changing both values, because both hold the same reference. In contrast, when you write b = a you're not transferring the reference but the actual value, so the original is not affected by any changes you make to the copy.
I really hope I could help answer your question :)
In Dart, all type are reference types. All parameters are passed by value. The "value" of a reference type is its reference. (That's why it's possible to have two variables containing the "same object" - there is only one object, but both variables contain references to that object). You never ever make a copy of an object just by passing the reference around.
Dart does not have "pass by reference" where you pass a variable as an argument (so the called function can change the value bound to the variable, like C#'s ref parameters).
Dart does not have primitive types, at all. However (big caveat), numbers are always (pretending to be) canonicalized, so there is only ever one 1 object in the program. You can't create a different 1 object. In a way it acts similarly to other languages' primitive types, but it isn't one. You can use int as a type argument to List<int>, unlike in Java where you need to do List<Integer>, you can ask about the identity of an int like identical(1, 2), and you can call methods on integers like 1.hashCode.
If you want to clone or copy a list
var foo = ['a', 'b'];
var bar = [...foo];
bar.add('c');
print(bar); // [a, b, c]
print(foo); // [a, b]
var bar_two = []; //or init an empty list
bar_two.addAll([...bar]);
print(bar_two); // [a, b, c]
Reference link
Clone a List, Map or Set in Dart

Dask Delayed ignores name for dependent variables

When creating a graph of calculations using delayed I'm trying to assign names so that if I visualize the graph it's readable. However, for delayed variables that are dependent on functions the name parameter doesn't seem to affect the key. Here's a toy example:
import numpy as np
import pandas as pd
from dask import delayed

def calc_avg(a, b):
    return pd.concat([a, b], axis=1).mean(axis=1)

def calc_ratio(a, b):
    return a / b

a = delayed(pd.Series(np.random.rand(10)), name='a')
b = delayed(pd.Series(np.random.rand(10)), name='b')
c = delayed(pd.Series(np.random.rand(10)), name='c')
x = delayed(calc_avg, name='avg_result')(a, b)
y = delayed(calc_ratio, name='ratio_result')(x, c)
y.visualize()
You can see the visualization here (I can't embed images), but rather than seeing 'avg_result' I see 'calc_avg-#0' and rather than see 'ratio_result' I see 'calc_ratio-#1'. If I look at x.key or y.key they do not match the names that I provided. Is this the expected behavior?
The key of a dask result needs to be unique for every combination of the function that was delayed, and the inputs you give it. What you see above is the expected behaviour: you are naming the function, but a call with different inputs would expect a different output, so the key must be different.
You can specify the key you'd like associated not when you define the delayed function, but when you call it:
x = delayed(calc_avg)(a, b, dask_key_name='avg_result')
y = delayed(calc_ratio)(x, c, dask_key_name='ratio_result')

Generating a random rule for property based test

I am using Triq (Erlang QuickCheck) and I am having trouble generating a set of nice rules for my program.
What I want to generate are things that look like this:
A -> B
where I would like to provide A and the size of B, with the latter not having any duplicates.
For example, if I say "generate me rules with an L.H.S. of [a] and an R.H.S. of size 4" (i.e. A = [a] and size(B) = 4), I would like to get something like this:
{rule, [a], [1,2,4,5]}
{rule, [a], [a,d,c,e]}
{rule, [a], [q,d,3,4]}
Note, I don't want any duplicates in B (this is the part I'm having trouble with). Also, it doesn't really matter what B is made up of - it can be anything as long as it is distinct and without duplicates.
My spec is far too messy to show here, so I'd rather not.
I am not familiar with Triq, but in PropEr and Quviq's QuickCheck you can use ?SUCHTHAT conditions that filter 'bad' instances.
If a generated instance does not satisfy a ?SUCHTHAT constraint it is discarded and not counted as a valid test. You could use this mechanism to generate lists of the specified size (i.e. what PropEr calls 'vectors') and then discard those that have duplicates, but I think that too many instances would then be discarded (see also the link).
It is usually more efficient to tinker with the generator so that all instances are valid, in your case by e.g. generating more elements than you need (?X times as many in the code below), removing duplicates and keeping as many as you need. This can still fail, and it will fail, so you need to guard against it.
Here is a generator for your case, in PropEr, together with a dummy property:
-module(dummy).
-export([rule_prop/0]).
-include_lib("proper/include/proper.hrl").

-define(X, 5).

rule_prop() ->
    ?FORALL(_, rule_gen(integer(), 4, integer()), true).

rule_gen(A, SizeB, TypeB) ->
    ?LET(
       EnoughB,
       ?SUCHTHAT(
          NoDupB,
          ?LET(
             ManyB,
             vector(?X * SizeB, TypeB),
             no_dups(ManyB)
          ),
          length(NoDupB) >= SizeB
       ),
       begin
           B = lists:sublist(EnoughB, SizeB),
           {rule, A, B}
       end).

no_dups([]) ->
    [];
no_dups([A|B]) ->
    [A | no_dups([X || X <- B, X =/= A])].

maps, filters, folds and more? Do we really need these in Erlang?

Maps, filters, folds and more : http://learnyousomeerlang.com/higher-order-functions#maps-filters-folds
The more I read, the more I get confused.
Can anybody help simplify these concepts?
I am not able to understand the significance of these concepts. In what use cases will they be needed?
I think it is mainly because of the syntax that I find it difficult to follow the flow.
The concepts of mapping, filtering and folding prevalent in functional programming actually are simplifications - or stereotypes - of different operations you perform on collections of data. In imperative languages you usually do these operations with loops.
Let's take map for an example. These three loops all take a sequence of elements and return a sequence of squares of the elements:
// C - a lot of bookkeeping
int data[] = {1, 2, 3, 4, 5};
int squares_1_to_5[sizeof(data) / sizeof(data[0])];
for (int i = 0; i < sizeof(data) / sizeof(data[0]); ++i)
    squares_1_to_5[i] = data[i] * data[i];

// C++11 - less bookkeeping, still not obvious
std::vector<int> data{1, 2, 3, 4, 5};
std::vector<int> squares_1_to_5;
for (auto i = begin(data); i < end(data); i++)
    squares_1_to_5.push_back((*i) * (*i));

# Python - quite readable, though still not obvious
data = [1, 2, 3, 4, 5]
squares_1_to_5 = []
for x in data:
    squares_1_to_5.append(x * x)
The property of a map is that it takes a collection of elements and returns the same number of somehow modified elements. No more, no less. Is it obvious at first sight in the above snippets? No, at least not until we read loop bodies. What if there were some ifs inside the loops? Let's take the last example and modify it a bit:
data = [1, 2, 3, 4, 5]
squares_1_to_5 = []
for x in data:
    if x % 2 == 0:
        squares_1_to_5.append(x * x)
This is no longer a map, though it's not obvious before reading the body of the loop. It's not clearly visible that the resulting collection might have less elements (maybe none?) than the input collection.
We filtered the input collection, performing the action only on some elements from the input. This loop is actually a map combined with a filter.
Tackling this in C would be even more noisy due to allocation details (how much space to allocate for the output array?) - the core idea of the operation on data would be drowned in all the bookkeeping.
A fold is the most generic one, where the result doesn't have to contain any of the input elements, but somehow depends on (possibly only some of) them.
Let's rewrite the first Python loop in Erlang:
lists:map(fun (E) -> E * E end, [1,2,3,4,5]).
It's explicit. We see a map, so we know that this call will return a list as long as the input.
And the second one:
lists:map(fun (E) -> E * E end,
          lists:filter(fun (E) when E rem 2 == 0 -> true;
                           (_) -> false end,
                       [1,2,3,4,5])).
Again, filter will return a list at most as long as the input, map will modify each element in some way.
The latter of the Erlang examples also shows another useful property - the ability to compose maps, filters and folds to express more complicated data transformations. It's not possible with imperative loops.
They are used in almost every application, because they abstract different kinds of iteration over lists.
map is used to transform one list into another. Let's say you have a list of key-value tuples and you want just the keys. You could write:
keys([]) -> [];
keys([{Key, _Value} | T]) ->
    [Key | keys(T)].

Then you want to have the values:

values([]) -> [];
values([{_Key, Value} | T]) ->
    [Value | values(T)].

Or a list of only the third element of each tuple:

third([]) -> [];
third([{_First, _Second, Third} | T]) ->
    [Third | third(T)].

Can you see the pattern? The only difference is what you take from the element, so instead of repeating the code, you can simply write what you do for one element and use map:

Third = fun({_First, _Second, Third}) -> Third end,
lists:map(Third, List).
This is much shorter and the shorter your code is, the less bugs it has. Simple as that.
You don't have to think about corner cases (what if the list is empty?) and for an experienced developer it is much easier to read.
filter searches lists. You give it a function that takes an element: if it returns true, the element will be in the returned list; if it returns false, it won't. For example, filtering the logged-in users out of a list.
foldl and foldr are used when you have to do additional bookkeeping while iterating over the list - for example summing all the elements or counting something.
The best explanations I've found of these functions are in books about Lisp: "Structure and Interpretation of Computer Programs" and "On Lisp", Chapter 4.

Find all possible pairs between the subsets of N sets with Erlang

I have a set S. It contains N subsets (which in turn contain some sub-subsets of various lengths):
1. [[a,b],[c,d],[*]]
2. [[c],[d],[e,f],[*]]
3. [[d,e],[f],[f,*]]
N. ...
I also have a list L of 'unique' elements that are contained in the set S:
a, b, c, d, e, f, *
I need to find all possible combinations of sub-subsets, one from each subset, so that each resulting combination has exactly one element from the list L, but any number of occurrences of the element [*] (it is a wildcard element).
So, the result of the needed function working on the above-mentioned set S should be (not 100% accurate):
- [a,b],[c],[d,e],[f];
- [a,b],[c],[*],[d,e],[f];
- [a,b],[c],[d,e],[f],[*];
- [a,b],[c],[d,e],[f,*],[*];
So, basically I need an algorithm that does the following:
take a sub-subset from subset 1;
add one more sub-subset from subset 2, maintaining the list of 'unique' elements acquired so far (the check on the 'unique' list is skipped if the sub-subset contains the * element);
repeat step 2 until subset N is reached.
In other words, I need to generate all possible 'chains' (pairs if N == 2, triples if N == 3), where each 'chain' contains exactly one element from the list L, except the wildcard element *, which can occur many times in each generated chain.
I know how to do this with N == 2 (it is a simple pair generation), but I do not know how to enhance the algorithm to work with arbitrary values for N.
Maybe Stirling numbers of the second kind could help here, but I do not know how to apply them to get the desired result.
Note: The type of data structure to be used here is not important for me.
Note: This question has grown out from my previous similar question.
Here are some pointers (not complete code) that can probably take you in the right direction:
I don't think you will need any advanced data structures here (make use of Erlang list comprehensions). You should also explore the Erlang sets and lists modules. Since you are dealing with sets and lists of sub-sets, they seem like an ideal fit.
Here is how list comprehensions can solve things easily for you: [{X,Y} || X <- [[c],[d],[e,f]], Y <- [[a,b],[c,d]]]. Here I am simply generating a list of {X,Y} 2-tuples, but for your use case you will have to put the real logic here (including your star case).
Further note that with list comprehensions, you can use the output of one generator as the input of a later generator, e.g. [{X,Y} || X1 <- [[c],[d],[e,f]], X <- X1, Y1 <- [[a,b],[c,d]], Y <- Y1].
Also, for removing duplicates from a list of things L = ["a", "b", "a"], you can simply do sets:to_list(sets:from_list(L)).
With the above tools you can easily generate all possible chains and also enforce your logic as these chains are generated.
