I have a large fixed-size array of variable-sized arrays of u32. Most of the second dimension arrays will be empty (i.e. the first array will be sparsely populated). I think Vec is the most suitable type for both dimensions (Vec<Vec<u32>>). Because my first array might be quite large, I want to find the most space-efficient way to represent this.
I see two options:
I could use a Vec<Option<Vec<u32>>>. I'm guessing that as Option is a tagged union, this would result each cell being sizeof(Vec<u32>) rounded up to the next word boundary for the tag.
I could directly use Vec::with_capacity(0) for all cells. Does an empty Vec allocate zero heap until it's used?
Which is the most space-efficient method?
Actually, both Vec<Vec<T>> and Vec<Option<Vec<T>>> have the same space efficiency.
A Vec contains a pointer that will never be null, so the compiler is smart enough to recognize that in the case of Option<Vec<T>>, it can represent None by putting 0 in the pointer field. What is the overhead of Rust's Option type? contains more information.
What about the backing storage the pointer points to? A Vec doesn't allocate (same link as the first) when you create it with Vec::new or Vec::with_capacity(0); in that case, it uses a special, non-null "empty pointer". Vec only allocates space on the heap when you push something or otherwise force it to allocate. Therefore, the space used both for the Vec itself and for its backing storage are the same.
Vec<Vec<T>> is a decent starting point. Each entry costs 3 pointers, even if it is empty, and for filled entries there can be additional per-allocation overhead. But depending on which trade-offs you're willing to make, there might be a better solution.
Vec<Box<[T]>> This reduces the size of an entry from 3 pointers to 2 pointers. The downside is that changing the number of elements in a box is both inconvenient (convert to and from Vec<T>) and more expensive (reallocation).
HashMap<usize, Vec<T>> This saves a lot of memory if the outer collection is sufficiently sparse. The downsides are higher access cost (hashing, scanning) and a higher per element memory overhead.
If the collection is only filled once and you never resize the inner collections you could use a split data structure:
This not only reduces the per-entry size to 1 pointer, it also eliminates the per-allocation overhead.
struct Nested<T> {
data: Vec<T>,
indices: Vec<usize>,// points after the last element of the i-th slice
}
impl<T> Nested<T> {
fn get_range(&self, i: usize) -> std::ops::Range<usize> {
assert!(i < self.indices.len());
if i > 0 {
self.indices[i-1]..self.indices[i]
} else {
0..self.indices[i]
}
}
pub fn get(&self, i:usize) -> &[T] {
let range = self.get_range(i);
&self.data[range]
}
pub fn get_mut(&mut self, i:usize) -> &mut [T] {
let range = self.get_range(i);
&mut self.data[range]
}
}
For additional memory savings you can reduce the indices to u32 limiting you to 4 billion elements per collection.
I'm writing a code in Erlang which suppose to generate a random number for a random amount of time and add each number to a list. I managed a function which can generate random numbers and i kinda managed a method for adding it to a list,but my main problem is restricting the number of iterations of the function. I like the function to produce several numbers and add them to the list and then kill that process or something like that.
Here is my code so far:
generator(L1)->
random:seed(now()),
A = random:uniform(100),
L2 = lists:append(L1,A),
generator(L2),
producer(B,L) ->
receive
{last_element} ->
consumer ! {lists:droplast(B)}
end
consumer()->
timer:send_after(random:uniform(1000),producer,{last_element,self()}),
receive
{Answer, Producer_PID} ->
io:format("the last item is:~w~n",[Answer])
end,
consumer().
start() ->
register(consumer,spawn(lis,consumer,[])),
register(producer,spawn(lis,producer,[])),
register(generator,spawn(lis,generator,[random:uniform(10)])).
I know it's a little bit sloppy and incomplete but that's not the case.
First, you should use rand to generate random numbers instead of random, it is an improved module.
In addition, when using rand:uniform/1 you won't need to change the seed every time you run your program. From erlang documentation:
If a process calls uniform/0 or uniform/1 without setting a seed
first, seed/1 is called automatically with the default algorithm and
creates a non-constant seed.
Finally, in order to create a list of random numbers, take a look at How to create a list of 1000 random number in erlang.
If I conclude all this, you can just do:
[rand:uniform(100) || _ <- lists:seq(1, 1000)].
There have some issue in your code:
timer:send_after(random:uniform(1000),producer,{last_element,self()}),, you send {last_element,self()} to producer process, but in producer, you just receive {last_element}, these messages are not matched.
you can change
producer(B,L) ->
receive
{last_element} ->
consumer ! {lists:droplast(B)}
end.
to
producer(B,L) ->
receive
{last_element, FromPid} ->
FromPid! {lists:droplast(B)}
end.
the same reason for consumer ! {lists:droplast(B)} and {Answer, Producer_PID} ->.
Is there a generic way, given a complex object in Erlang, to come up with a valid function declaration for it besides eyeballing it? I'm maintaining some code previously written by someone who was a big fan of giant structures, and it's proving to be error prone doing it manually.
I don't need to iterate the whole thing, just grab the top level, per se.
For example, I'm working on this right now -
[[["SIP",47,"2",46,"0"],32,"407",32,"Proxy Authentication Required","\r\n"],
[{'Via',
[{'via-parm',
{'sent-protocol',"SIP","2.0","UDP"},
{'sent-by',"172.20.10.5","5060"},
[{'via-branch',"z9hG4bKb561e4f03a40c4439ba375b2ac3c9f91.0"}]}]},
{'Via',
[{'via-parm',
{'sent-protocol',"SIP","2.0","UDP"},
{'sent-by',"172.20.10.15","5060"},
[{'via-branch',"12dee0b2f48309f40b7857b9c73be9ac"}]}]},
{'From',
{'from-spec',
{'name-addr',
[[]],
{'SIP-URI',
[{userinfo,{user,"003018CFE4EF"},[]}],
{hostport,"172.20.10.11",[]},
{'uri-parameters',[]},
[]}},
[{tag,"b7226ffa86c46af7bf6e32969ad16940"}]}},
{'To',
{'name-addr',
[[]],
{'SIP-URI',
[{userinfo,{user,"3966"},[]}],
{hostport,"172.20.10.11",[]},
{'uri-parameters',[]},
[]}},
[{tag,"a830c764"}]},
{'Call-ID',"90df0e4968c9a4545a009b1adf268605#172.20.10.15"},
{'CSeq',1358286,"SUBSCRIBE"},
["date",'HCOLON',
["Mon",44,32,["13",32,"Jun",32,"2011"],32,["17",58,"03",58,"55"],32,"GMT"]],
{'Contact',
[[{'name-addr',
[[]],
{'SIP-URI',
[{userinfo,{user,"3ComCallProcessor"},[]}],
{hostport,"172.20.10.11",[]},
{'uri-parameters',[]},
[]}},
[]],
[]]},
["expires",'HCOLON',3600],
["user-agent",'HCOLON',
["3Com",[]],
[['LWS',["VCX",[]]],
['LWS',["7210",[]]],
['LWS',["IP",[]]],
['LWS',["CallProcessor",[['SLASH',"v10.0.8"]]]]]],
["proxy-authenticate",'HCOLON',
["Digest",'LWS',
["realm",'EQUAL',['SWS',34,"3Com",34]],
[['COMMA',["domain",'EQUAL',['SWS',34,"3Com",34]]],
['COMMA',
["nonce",'EQUAL',
['SWS',34,"btbvbsbzbBbAbwbybvbxbCbtbzbubqbubsbqbtbsbqbtbxbCbxbsbybs",
34]]],
['COMMA',["stale",'EQUAL',"FALSE"]],
['COMMA',["algorithm",'EQUAL',"MD5"]]]]],
{'Content-Length',0}],
"\r\n",
["\n"]]
Maybe https://github.com/etrepum/kvc
I noticed your clarifying comment. I'd prefer to add a comment myself, but don't have enough karma. Anyway, the trick I use for that is to experiment in the shell. I'll iterate a pattern against a sample data structure until I've found the simplest form. You can use the _ match-all variable. I use an erlang shell inside an emacs shell window.
First, bind a sample to a variable:
A = [{a,b},[{c,d}, {e,f}]].
Now set the original structure against the variable:
[{a,b},[{c,d},{e,f}]] = A.
If you hit enter, you'll see they match. Hit alt-p (forget what emacs calls alt, but it's alt on my keyboard) to bring back the previous line. Replace some tuple or list item with an underscore:
[_,[{c,d},{e,f}]].
Hit enter to make sure you did it right and they still match. This example is trivial, but for deeply nested, multiline structures it's trickier, so it's handy to be able to just quickly match to test. Sometimes you'll want to try to guess at whole huge swaths, like using an underscore to match a tuple list inside a tuple that's the third element of a list. If you place it right, you can match the whole thing at once, but it's easy to misread it.
Anyway, repeat to explore the essential shape of the structure and place real variables where you want to pull out values:
[_, [_, _]] = A.
[_, _] = A.
[_, MyTupleList] = A. %% let's grab this tuple list
[{MyAtom,b}, [{c,d}, MyTuple]] = A. %% or maybe we want this atom and tuple
That's how I efficiently dissect and pattern match complex data structures.
However, I don't know what you're doing. I'd be inclined to have a wrapper function that uses KVC to pull out exactly what you need and then distributes to helper functions from there for each type of structure.
If I understand you correctly you want to pattern match some large datastructures of unknown formatting.
Example:
Input: {a, b} {a,b,c,d} {a,[],{},{b,c}}
function({A, B}) -> do_something;
function({A, B, C, D}) when is_atom(B) -> do_something_else;
function({A, B, C, D}) when is_list(B) -> more_doing.
The generic answer is of course that it is undecidable from just data to know how to categorize that data.
First you should probably be aware of iolists. They are created by functions such as io_lib:format/2 and in many other places in the code.
One example is that
[["SIP",47,"2",46,"0"],32,"407",32,"Proxy Authentication Required","\r\n"]
will print as
SIP/2.0 407 Proxy Authentication Required
So, I'd start with flattening all those lists, using a function such as
flatten_io(List) when is_list(List) ->
Flat = lists:map(fun flatten_io/1, List),
maybe_flatten(Flat);
flatten_io(Tuple) when is_tuple(Tuple) ->
list_to_tuple([flatten_io(Element) || Element <- tuple_to_list(Tuple)];
flatten_io(Other) -> Other.
maybe_flatten(L) when is_list(L) ->
case lists:all(fun(Ch) when Ch > 0 andalso Ch < 256 -> true;
(List) when is_list(List) ->
lists:all(fun(X) -> X > 0 andalso X < 256 end, List);
(_) -> false
end, L) of
true -> lists:flatten(L);
false -> L
end.
(Caveat: completely untested and quite inefficient. Will also crash for inproper lists, but you shouldn't have those in your data structures anyway.)
On second thought, I can't help you. Any data structure that uses the atom 'COMMA' for a comma in a string should be taken out and shot.
You should be able to flatten those things as well and start to get a view of what you are looking at.
I know that this is not a complete answer. Hope it helps.
Its hard to recommend something for handling this.
Transforming all the structures in a more sane and also more minimal format looks like its worth it. This depends mainly on the similarities in these structures.
Rather than having a special function for each of the 100 there must be some automatic reformatting that can be done, maybe even put the parts in records.
Once you have records its much easier to write functions for it since you don't need to know the actual number of elements in the record. More important: your code won't break when the number of elements changes.
To summarize: make a barrier between your code and the insanity of these structures by somehow sanitizing them by the most generic code possible. It will be probably a mix of generic reformatting with structure speicific stuff.
As an example already visible in this struct: the 'name-addr' tuples look like they have a uniform structure. So you can recurse over your structures (over all elements of tuples and lists) and match for "things" that have a common structure like 'name-addr' and replace these with nice records.
In order to help you eyeballing you can write yourself helper functions along this example:
eyeball(List) when is_list(List) ->
io:format("List with length ~b\n", [length(List)]);
eyeball(Tuple) when is_tuple(Tuple) ->
io:format("Tuple with ~b elements\n", [tuple_size(Tuple)]).
So you would get output like this:
2> eyeball({a,b,c}).
Tuple with 3 elements
ok
3> eyeball([a,b,c]).
List with length 3
ok
expansion of this in a useful tool for your use is left as an exercise. You could handle multiple levels by recursing over the elements and indenting the output.
Use pattern matching and functions that work on lists to extract only what you need.
Look at http://www.erlang.org/doc/man/lists.html:
keyfind, keyreplace, L = [H|T], ...