RocksDB: too many comparisons when using custom comparator

I am using RocksDB to store and index encrypted data of a fixed size of about 3 KB (key, value = encrypted_data, insertion_sequence_number), and for that purpose I'm providing a custom comparator for RocksDB to use when inserting and searching. It compares keys by decrypting them, so it is time consuming. This works well, but I noticed that the number of comparisons is much larger than I would expect for a given number of values. I also noticed that the number of stored values is larger than I expected.
Here is an example: I insert the values 1, 2, 3, 7, 14, 22 (each value is encrypted on insertion). When I print all the contents I get:
RocksDB contents:
[row000]<[00007f1ebc0079d00d0000000000000068cd01bc1e7f0000]> -> {}
[row001]<[00007f1ebc0019900d0000000000000068cd01bc1e7f0000]> -> {}
[row002]<[c750e38063a871001af286720d7943095111a0dba808d00d]> -> {00000000}
[row003]<[f146398078a31c00de0aa3026f855101c36592eae1324905]> -> {01000000}
[row004]<[d687a251eb43d1081bb2ebaf0dd9f5077029f213b814f909]> -> {02000000}
[row005]<[571050962b08280b9f5b1ca9cf2d8e04959cd6d14a4eeb08]> -> {03000000}
[row006]<[4d4f01d6a6a7000b8800af4b4480d705bd9987efdeccc307]> -> {04000000}
[row007]<[5ee699152ff3cc0a21e3e8da9071740d7559b1b5baacd80a]> -> {05000000}
[row008]<[00007f1ebc0525000d0000000000000068cd01bc1e7f0000]> -> {}
[row009]<[00007f1ebc02ef500d0000000000000068cd01bc1e7f0000]> -> {}
[row010]<[00007f1ebc0191f00d0000000000000068cd01bc1e7f0000]> -> {}
Instead of 6 entries I have 11, so 5 are not coming from me. The number of comparisons for the last insertion is above 100, and if I do a search my comparator function is also called more than 100 times. The keys being compared include both mine and the extra ones, which doesn't make any sense, because invalid encrypted values cannot have a valid decryption.
What could explain such behavior? Perhaps the extra values are internal nodes of a tree used for storage, but what explains the huge number of comparisons?
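For context, wiring a custom comparator into RocksDB's C++ API looks roughly like this (a minimal sketch; DecryptingComparator and decrypt_to_plaintext are illustrative names, and the decryption call stands in for the poster's scheme):

#include <rocksdb/comparator.h>
#include <rocksdb/db.h>
#include <string>

// Hypothetical: decrypts a key to the plaintext it encodes.
std::string decrypt_to_plaintext(const rocksdb::Slice& key);

class DecryptingComparator : public rocksdb::Comparator {
 public:
  const char* Name() const override { return "DecryptingComparator"; }

  int Compare(const rocksdb::Slice& a, const rocksdb::Slice& b) const override {
    // RocksDB invokes this for every ordering decision during inserts,
    // lookups, and iteration, so an expensive body multiplies quickly.
    return decrypt_to_plaintext(a).compare(decrypt_to_plaintext(b));
  }

  // Required overrides; no-op bodies are acceptable for a sketch.
  void FindShortestSeparator(std::string*, const rocksdb::Slice&) const override {}
  void FindShortSuccessor(std::string*) const override {}
};

// Usage:
//   rocksdb::Options options;
//   options.comparator = new DecryptingComparator();  // must outlive the DB
//   rocksdb::DB* db;
//   rocksdb::DB::Open(options, "/path/to/db", &db);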

What is the most memory-efficient array of nullable vectors when most of the second dimension will be empty?

I have a large fixed-size array of variable-sized arrays of u32. Most of the second dimension arrays will be empty (i.e. the first array will be sparsely populated). I think Vec is the most suitable type for both dimensions (Vec<Vec<u32>>). Because my first array might be quite large, I want to find the most space-efficient way to represent this.
I see two options:
I could use a Vec<Option<Vec<u32>>>. I'm guessing that, as Option is a tagged union, this would result in each cell being sizeof(Vec<u32>) rounded up to the next word boundary for the tag.
I could directly use Vec::with_capacity(0) for all cells. Does an empty Vec allocate zero heap until it's used?
Which is the most space-efficient method?
Actually, both Vec<Vec<T>> and Vec<Option<Vec<T>>> have the same space efficiency.
A Vec contains a pointer that will never be null, so the compiler is smart enough to recognize that, in the case of Option<Vec<T>>, it can represent None by putting 0 in the pointer field. See What is the overhead of Rust's Option type? for more information.
What about the backing storage the pointer points to? A Vec doesn't allocate (same link as the first) when you create it with Vec::new or Vec::with_capacity(0); in that case, it uses a special, non-null "empty pointer". Vec only allocates space on the heap when you push something or otherwise force it to allocate. Therefore, the space used both for the Vec itself and for its backing storage is the same.
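Both claims are easy to verify (a minimal sketch; capacity 0 means no heap allocation has happened yet):

fn main() {
    // Niche optimization: Option reuses Vec's never-null pointer for None,
    // so the Option wrapper costs no extra space.
    assert_eq!(
        std::mem::size_of::<Vec<u32>>(),
        std::mem::size_of::<Option<Vec<u32>>>()
    );

    // An empty Vec has not allocated: capacity stays 0 until the first push.
    let v: Vec<u32> = Vec::with_capacity(0);
    assert_eq!(v.capacity(), 0);
}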
Vec<Vec<T>> is a decent starting point. Each entry costs 3 pointers, even if it is empty, and for filled entries there can be additional per-allocation overhead. But depending on which trade-offs you're willing to make, there might be a better solution.
Vec<Box<[T]>>: this reduces the size of an entry from 3 pointers to 2 pointers. The downside is that changing the number of elements in a box is both inconvenient (convert to and from Vec<T>) and more expensive (reallocation).
HashMap<usize, Vec<T>>: this saves a lot of memory if the outer collection is sufficiently sparse. The downsides are higher access cost (hashing, scanning) and a higher per-element memory overhead.
If the collection is only filled once and you never resize the inner collections you could use a split data structure:
This not only reduces the per-entry size to 1 pointer, it also eliminates the per-allocation overhead.
struct Nested<T> {
    data: Vec<T>,
    indices: Vec<usize>, // indices[i] points just past the last element of the i-th slice
}

impl<T> Nested<T> {
    fn get_range(&self, i: usize) -> std::ops::Range<usize> {
        assert!(i < self.indices.len());
        if i > 0 {
            self.indices[i - 1]..self.indices[i]
        } else {
            0..self.indices[i]
        }
    }

    pub fn get(&self, i: usize) -> &[T] {
        let range = self.get_range(i);
        &self.data[range]
    }

    pub fn get_mut(&mut self, i: usize) -> &mut [T] {
        let range = self.get_range(i);
        &mut self.data[range]
    }
}
For additional memory savings you can shrink the indices to u32, limiting you to about 4 billion elements per collection.
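Since this layout assumes the collection is filled once, a hypothetical one-shot constructor (from_nested is my name, not part of the answer) could build it like this:

impl<T> Nested<T> {
    /// Flatten a Vec<Vec<T>> into the split representation.
    pub fn from_nested(outer: Vec<Vec<T>>) -> Self {
        let mut data = Vec::new();
        let mut indices = Vec::with_capacity(outer.len());
        for inner in outer {
            data.extend(inner);
            indices.push(data.len()); // end offset of this slice
        }
        Nested { data, indices }
    }
}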

Restricting number of function iterations

I'm writing code in Erlang which is supposed to generate a random number for a random amount of time and add each number to a list. I managed to write a function which can generate random numbers, and I kind of managed a method for adding them to a list, but my main problem is restricting the number of iterations of the function. I'd like the function to produce several numbers, add them to the list, and then kill that process or something like that.
Here is my code so far:
generator(L1)->
    random:seed(now()),
    A = random:uniform(100),
    L2 = lists:append(L1,A),
    generator(L2).

producer(B,L) ->
    receive
        {last_element} ->
            consumer ! {lists:droplast(B)}
    end.

consumer()->
    timer:send_after(random:uniform(1000),producer,{last_element,self()}),
    receive
        {Answer, Producer_PID} ->
            io:format("the last item is:~w~n",[Answer])
    end,
    consumer().

start() ->
    register(consumer,spawn(lis,consumer,[])),
    register(producer,spawn(lis,producer,[])),
    register(generator,spawn(lis,generator,[random:uniform(10)])).
I know it's a little bit sloppy and incomplete, but that's not the point.
First, you should use the rand module to generate random numbers instead of random; it is an improved replacement.
In addition, when using rand:uniform/1 you won't need to set the seed every time you run your program. From the Erlang documentation:
If a process calls uniform/0 or uniform/1 without setting a seed
first, seed/1 is called automatically with the default algorithm and
creates a non-constant seed.
Finally, in order to create a list of random numbers, take a look at How to create a list of 1000 random number in erlang.
Putting all this together, you can just do:
[rand:uniform(100) || _ <- lists:seq(1, 1000)].
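If you specifically want a recursive function that stops after a given number of iterations rather than a comprehension, the usual pattern is to thread a counter and an accumulator through the recursion (a minimal sketch; generator/2 is my naming):

generator(0, Acc) ->
    lists:reverse(Acc);                            % done: return the collected numbers
generator(N, Acc) when N > 0 ->
    generator(N - 1, [rand:uniform(100) | Acc]).

%% e.g. generator(rand:uniform(10), []) collects between 1 and 10 random numbers.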
There are some issues in your code:
In timer:send_after(random:uniform(1000),producer,{last_element,self()}), you send {last_element,self()} to the producer process, but in producer you only receive {last_element}; these messages do not match.
You can change
producer(B,L) ->
    receive
        {last_element} ->
            consumer ! {lists:droplast(B)}
    end.
to
producer(B,L) ->
    receive
        {last_element, FromPid} ->
            FromPid ! {lists:droplast(B)}
    end.
The same kind of mismatch applies to consumer ! {lists:droplast(B)} and the receive pattern {Answer, Producer_PID} ->.
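Putting both fixes together, a minimal matched pair could look like this (a sketch; the module name demo and the hard-coded list are illustrative):

-module(demo).
-export([start/0, producer/1, consumer/0]).

producer(B) ->
    receive
        {last_element, FromPid} ->
            FromPid ! {answer, lists:droplast(B)}   % reply to whoever asked
    end.

consumer() ->
    producer ! {last_element, self()},
    receive
        {answer, Answer} ->
            io:format("the answer is: ~w~n", [Answer])
    end.

start() ->
    register(producer, spawn(demo, producer, [[1, 2, 3]])),
    spawn(demo, consumer, []).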

Erlang Dialyzer: only accept certain integers?

Say I have a function, foo/1, whose spec is -spec foo(atom()) -> #r{}., where #r{} is a record defined as -record(r, {a :: 1..789}).. However, I have foo(a) -> #r{a = 800}. in my code. When I run Dialyzer against it, it doesn't warn me about this, even though 800 is not a "valid" value for that field. Can I make Dialyzer warn me about this?
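Written out as a self-contained module (file and module name r are just for illustration), the setup is:

-module(r).
-export([foo/1]).

-record(r, {a :: 1..789}).

-spec foo(atom()) -> #r{}.
foo(a) -> #r{a = 800}.   %% 800 is outside 1..789, yet Dialyzer stays silent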
Edit
Learn You Some Erlang says:
Dialyzer reserves the right to expand this range into a bigger one.
But I couldn't find how to disable this.
As of Erlang 18, the handling of integer ranges is done by erl_types:t_from_range/2. As you can see, there are a lot of generalizations happening to get a "safe" overapproximation of a range.
If you were to enable ?USE_UNSAFE_RANGES (see the code), it is likely that your particular error would be caught, but at a terrible cost: native compilation and dialyzing of recursive integer functions would never finish!
The reason is that the type analysis for recursive functions uses a simple fixpoint approach, where the initial types accept the base cases and are repeatedly expanded using the recursive cases to include more values. At some point overapproximations must happen if the process is to terminate. Here is a concrete example:
fact(1) -> 1;
fact(N) -> N * fact(N - 1).
Initially fact/1 is assumed to have type fun(none()) -> none(). Using that to analyse the code, the second clause 'fails' and only the first one is ok, so after the first iteration the new type is fun(1) -> 1. With the new type the second clause can succeed, expanding the type to fun(1|2) -> 1|2, then to fun(1|2|3) -> 1|2|6. This continues until ?SET_LIMIT is reached, at which point t_from_range stops using the individual values and the type becomes fun(1..255) -> pos_integer(). The next iteration expands 1..255 to pos_integer(), and then fun(pos_integer()) -> pos_integer() is a fixpoint!
Incorrect answer follows (explains the first comment below):
You should get a warning for this code if you use the -Woverspecs option. This option is not enabled by default, since Dialyzer operates under the assumption that it is 'ok' to over-approximate the return values of a function. In your particular case, however, you actually want any extra values to produce warnings.

Immutable members on objects

I have an object that can be neatly described by a discriminated union. The tree that it represents has some properties that can be easily updated when the tree is modified (while remaining immutable) but that are relatively expensive to recalculate from scratch.
I would like to store those properties along with the object as cached values but I don't want to put them into each of the discriminated union cases so I figured a member variable would fit here.
The question is then: how do I change the member value (when I modify the tree) without mutating the actual object? I know I could copy the tree and then mutate the copy without ruining purity, but that seems like the wrong way to go about it. It would make sense to me if there were some predefined way to change a property such that the result of the operation is a new object with that property changed.
To clarify, when I say modify I mean doing it in a functional way, like (::) "prepends" to the front of a list. I'm not sure what the correct terminology is here.
F# actually has copy-and-update syntax for records.
The syntax looks like this:
let myRecord3 = { myRecord2 with Y = 100; Z = 2 }
(example from the MSDN records page - http://msdn.microsoft.com/en-us/library/dd233184.aspx).
This allows the record type to be immutable, and for large parts of it to be preserved, whilst only a small part is updated.
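For completeness, the example needs a record type in scope; a minimal self-contained version might be (the record name MyRecord and the int fields are arbitrary choices, not from the MSDN page):

type MyRecord = { X: int; Y: int; Z: int }

let myRecord2 = { X = 1; Y = 2; Z = 3 }
let myRecord3 = { myRecord2 with Y = 100; Z = 2 }   // X is carried over unchanged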
The cleanest way to go about it would really be to carry the 'cached' value attached to the DU (as part of the case) in one way or another. I could think of several ways to implement this, I'll just give you one, where there are separate cases for the cached and non-cached modes:
type Fraction =
    | Frac of int * int
    | CachedFrac of (int * int) * decimal

    member this.AsFrac =
        match this with
        | Frac _ -> this
        | CachedFrac (tup, _) -> Frac tup
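As a usage sketch (mkCached and value are my names, not part of the answer), the cached case can be built once and consulted cheaply afterwards:

// Compute the expensive value once, at construction time.
let mkCached (n, d) = CachedFrac ((n, d), decimal n / decimal d)

let value frac =
    match frac with
    | Frac (n, d) -> decimal n / decimal d   // recompute on every call
    | CachedFrac (_, v) -> v                 // reuse the stored value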
An entirely different option would be to keep the cached values in a separate dictionary; this makes sense if all you want to do is save some time recalculating them.
module FracCache =
    let cache = System.Collections.Generic.Dictionary<Fraction, decimal>()

    let modify (oldFrac: Fraction) (newFrac: Fraction) =
        match cache.TryGetValue oldFrac with
        | true, v -> cache.[newFrac] <- v + 1M   // '+ 1M' stands in for whatever update applies
        | false, _ -> ()                          // nothing cached for oldFrac yet
Basically this is what memoize would give you, plus you have more control over it.

Creating a valid function declaration from a complex tuple/list structure

Is there a generic way, given a complex object in Erlang, to come up with a valid function declaration for it besides eyeballing it? I'm maintaining some code previously written by someone who was a big fan of giant structures, and doing it manually is proving to be error prone.
I don't need to iterate the whole thing, just grab the top level, so to speak.
For example, I'm working on this right now -
[[["SIP",47,"2",46,"0"],32,"407",32,"Proxy Authentication Required","\r\n"],
[{'Via',
[{'via-parm',
{'sent-protocol',"SIP","2.0","UDP"},
{'sent-by',"172.20.10.5","5060"},
[{'via-branch',"z9hG4bKb561e4f03a40c4439ba375b2ac3c9f91.0"}]}]},
{'Via',
[{'via-parm',
{'sent-protocol',"SIP","2.0","UDP"},
{'sent-by',"172.20.10.15","5060"},
[{'via-branch',"12dee0b2f48309f40b7857b9c73be9ac"}]}]},
{'From',
{'from-spec',
{'name-addr',
[[]],
{'SIP-URI',
[{userinfo,{user,"003018CFE4EF"},[]}],
{hostport,"172.20.10.11",[]},
{'uri-parameters',[]},
[]}},
[{tag,"b7226ffa86c46af7bf6e32969ad16940"}]}},
{'To',
{'name-addr',
[[]],
{'SIP-URI',
[{userinfo,{user,"3966"},[]}],
{hostport,"172.20.10.11",[]},
{'uri-parameters',[]},
[]}},
[{tag,"a830c764"}]},
{'Call-ID',"90df0e4968c9a4545a009b1adf268605#172.20.10.15"},
{'CSeq',1358286,"SUBSCRIBE"},
["date",'HCOLON',
["Mon",44,32,["13",32,"Jun",32,"2011"],32,["17",58,"03",58,"55"],32,"GMT"]],
{'Contact',
[[{'name-addr',
[[]],
{'SIP-URI',
[{userinfo,{user,"3ComCallProcessor"},[]}],
{hostport,"172.20.10.11",[]},
{'uri-parameters',[]},
[]}},
[]],
[]]},
["expires",'HCOLON',3600],
["user-agent",'HCOLON',
["3Com",[]],
[['LWS',["VCX",[]]],
['LWS',["7210",[]]],
['LWS',["IP",[]]],
['LWS',["CallProcessor",[['SLASH',"v10.0.8"]]]]]],
["proxy-authenticate",'HCOLON',
["Digest",'LWS',
["realm",'EQUAL',['SWS',34,"3Com",34]],
[['COMMA',["domain",'EQUAL',['SWS',34,"3Com",34]]],
['COMMA',
["nonce",'EQUAL',
['SWS',34,"btbvbsbzbBbAbwbybvbxbCbtbzbubqbubsbqbtbsbqbtbxbCbxbsbybs",
34]]],
['COMMA',["stale",'EQUAL',"FALSE"]],
['COMMA',["algorithm",'EQUAL',"MD5"]]]]],
{'Content-Length',0}],
"\r\n",
["\n"]]
Maybe https://github.com/etrepum/kvc
I noticed your clarifying comment. I'd prefer to add a comment myself, but don't have enough karma. Anyway, the trick I use for that is to experiment in the shell: I'll iterate a pattern against a sample data structure until I've found the simplest form. You can use the _ match-all variable. I use an Erlang shell inside an Emacs shell window.
First, bind a sample to a variable:
A = [{a,b},[{c,d}, {e,f}]].
Now set the original structure against the variable:
[{a,b},[{c,d},{e,f}]] = A.
If you hit enter, you'll see they match. Hit alt-p (I forget what Emacs calls alt, but it's alt on my keyboard) to bring back the previous line. Replace some tuple or list item with an underscore:
[_,[{c,d},{e,f}]] = A.
Hit enter to make sure you did it right and they still match. This example is trivial, but for deeply nested, multiline structures it's trickier, so it's handy to be able to just quickly match to test. Sometimes you'll want to try to guess at whole huge swaths, like using an underscore to match a tuple list inside a tuple that's the third element of a list. If you place it right, you can match the whole thing at once, but it's easy to misread it.
Anyway, repeat to explore the essential shape of the structure and place real variables where you want to pull out values:
[_, [_, _]] = A.
[_, _] = A.
[_, MyTupleList] = A. %% let's grab this tuple list
[{MyAtom,b}, [{c,d}, MyTuple]] = A. %% or maybe we want this atom and tuple
That's how I efficiently dissect and pattern match complex data structures.
However, I don't know what you're doing. I'd be inclined to have a wrapper function that uses KVC to pull out exactly what you need and then distributes to helper functions from there for each type of structure.
If I understand you correctly, you want to pattern match some large data structures of unknown formatting.
Example:
Input: {a, b} {a,b,c,d} {a,[],{},{b,c}}
function({A, B}) -> do_something;
function({A, B, C, D}) when is_atom(B) -> do_something_else;
function({A, B, C, D}) when is_list(B) -> more_doing.
The generic answer is of course that it is undecidable from just data to know how to categorize that data.
First, you should probably be aware of iolists. They are created by functions such as io_lib:format/2 and in many other places in the code.
One example is that
[["SIP",47,"2",46,"0"],32,"407",32,"Proxy Authentication Required","\r\n"]
will print as
SIP/2.0 407 Proxy Authentication Required
So, I'd start with flattening all those lists, using a function such as
flatten_io(List) when is_list(List) ->
    Flat = lists:map(fun flatten_io/1, List),
    maybe_flatten(Flat);
flatten_io(Tuple) when is_tuple(Tuple) ->
    list_to_tuple([flatten_io(Element) || Element <- tuple_to_list(Tuple)]);
flatten_io(Other) -> Other.

maybe_flatten(L) when is_list(L) ->
    case lists:all(fun(Ch) when Ch > 0 andalso Ch < 256 -> true;
                      (List) when is_list(List) ->
                          lists:all(fun(X) -> X > 0 andalso X < 256 end, List);
                      (_) -> false
                   end, L) of
        true -> lists:flatten(L);
        false -> L
    end.
(Caveat: completely untested and quite inefficient. It will also crash on improper lists, but you shouldn't have those in your data structures anyway.)
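If it behaves as intended, applying it to the status line from the question would collapse the nested iolist (a shell sketch, assuming flatten_io/1 is compiled into a module m):

1> m:flatten_io([["SIP",47,"2",46,"0"],32,"407",32,"Proxy Authentication Required","\r\n"]).
"SIP/2.0 407 Proxy Authentication Required\r\n"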
On second thought, I can't help you. Any data structure that uses the atom 'COMMA' for a comma in a string should be taken out and shot.
You should be able to flatten those things as well and start to get a view of what you are looking at.
I know that this is not a complete answer. Hope it helps.
It's hard to recommend something for handling this.
Transforming all the structures into a saner and more minimal format looks like it's worth it; how to do it depends mainly on the similarities between these structures.
Rather than having a special function for each of the 100 structures, there must be some automatic reformatting that can be done, maybe even putting the parts in records.
Once you have records, it's much easier to write functions for them, since you don't need to know the actual number of elements in the record. More importantly, your code won't break when the number of elements changes.
To summarize: put a barrier between your code and the insanity of these structures by sanitizing them with the most generic code possible. It will probably be a mix of generic reformatting with structure-specific stuff.
As an example already visible in this struct: the 'name-addr' tuples look like they have a uniform structure, so you can recurse over your structures (over all elements of tuples and lists), match "things" that share a common structure like 'name-addr', and replace them with nice records.
In order to help you eyeballing you can write yourself helper functions along this example:
eyeball(List) when is_list(List) ->
    io:format("List with length ~b\n", [length(List)]);
eyeball(Tuple) when is_tuple(Tuple) ->
    io:format("Tuple with ~b elements\n", [tuple_size(Tuple)]).
So you would get output like this:
2> eyeball({a,b,c}).
Tuple with 3 elements
ok
3> eyeball([a,b,c]).
List with length 3
ok
Expanding this into a useful tool for your use case is left as an exercise. You could handle multiple levels by recursing over the elements and indenting the output, as sketched below.
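A sketch of that exercise, replacing the two clauses above with a depth-tracking variant (note that strings, being lists, will be descended into character by character):

eyeball(Term) ->
    eyeball(Term, 0).

eyeball(List, Depth) when is_list(List) ->
    indent(Depth),
    io:format("List with length ~b\n", [length(List)]),
    lists:foreach(fun(E) -> eyeball(E, Depth + 1) end, List);
eyeball(Tuple, Depth) when is_tuple(Tuple) ->
    indent(Depth),
    io:format("Tuple with ~b elements\n", [tuple_size(Tuple)]),
    lists:foreach(fun(E) -> eyeball(E, Depth + 1) end, tuple_to_list(Tuple));
eyeball(Other, Depth) ->
    indent(Depth),
    io:format("~p\n", [Other]).

indent(Depth) ->
    io:format("~s", [lists:duplicate(2 * Depth, $\s)]).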
Use pattern matching and functions that work on lists to extract only what you need.
Look at http://www.erlang.org/doc/man/lists.html:
keyfind, keyreplace, L = [H|T], ...
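For example, once the top level is bound, lists:keyfind/3 can pluck a header tuple out by its tag (a sketch against the structure above; Msg, Headers, Seq, and Method are illustrative names):

%% Assume Msg is bound to the parsed message shown in the question.
[_StatusLine, Headers | _Rest] = Msg,
{'CSeq', Seq, Method} = lists:keyfind('CSeq', 1, Headers),
%% Seq is now 1358286 and Method is "SUBSCRIBE".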
