Does String.to_atom("some-known-string") create a new atom in the atom-table each time? - erlang

Does String.to_atom("some-known-string") create a new atom in the atom-table each time?
If NO, then what is the point of String.to_existing_atom/1?
If YES, then why? Since String.to_atom("some-known-string") will always give the same result ... and the atom table is never garbage collected

Assuming you are always using the same string, it will create a new atom at most once, the first time it runs. After that, continued use of the same string will not create new atoms.
The reason there is also to_existing_atom is to help prevent filling the atom table with atoms from uncontrolled input.
iex(1)> String.to_existing_atom("foo")
** (ArgumentError) argument error
:erlang.binary_to_existing_atom("foo", :utf8)
iex(1)> String.to_atom("foo")
:foo
iex(2)> String.to_existing_atom("foo")
:foo
As you can see, when I first call to_existing_atom, it raises an ArgumentError because that atom is not yet in the atom table. However, once I use to_atom to ensure the atom exists, I can call to_existing_atom and it succeeds.

An example use-case:
For process isolation, I need to dynamically generate a series of ets tables by partition number. I will have a fixed number of partitions -- but I can't name ets tables using anything but an atom, so {:my_table, num} is not an option.
Therefore, each process with a partition creates an atom based on a {name, number} combo.
String.to_atom("my_table" <> Integer.to_string(i))
Creating atoms from a source outside your direct control is dangerous, though, since the atom table has a fixed limit and filling it will eventually crash your BEAM. Thus, to_existing_atom is a nice way to sanitize incoming data.
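A minimal sketch of that pattern (the Partitions module and its function names are hypothetical): the fixed set of table-name atoms is created once at startup with to_atom/1, while anything arriving from outside is only resolved with to_existing_atom/1, so unexpected input raises instead of growing the atom table.
defmodule Partitions do
  @partitions 8

  # Called once at startup: creates the fixed set of atoms and the ETS tables.
  def init do
    for i <- 0..(@partitions - 1) do
      name = String.to_atom("my_table" <> Integer.to_string(i))
      :ets.new(name, [:named_table, :public, :set])
    end
  end

  # Called with data from the outside world: only resolves atoms that
  # already exist, so a bogus partition number raises instead of adding
  # to the atom table.
  def table_for(partition) when is_binary(partition) do
    String.to_existing_atom("my_table" <> partition)
  end
end
table_for/1 would be called with the partition number rendered as a string, e.g. Partitions.table_for("3").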

In Elixir, atoms are immutable and the atom table is never garbage collected.
field(q, ^String.to_existing_atom(k))
In this example we use to_existing_atom because the field name comes from data fetched from the DB, and to_existing_atom makes sure the field is one that already exists. It is useful in such scenarios.
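As a rough sketch of that kind of usage (the Listing module, sort_by/2 and the parameter names are made up; the only part taken from the answer above is the field(q, ^...) expression):
defmodule Listing do
  import Ecto.Query

  # sort_field arrives as a string (e.g. from request params). Schema
  # fields are defined at compile time, so their atoms already exist;
  # any other string makes String.to_existing_atom/1 raise instead of
  # creating a new atom.
  def sort_by(query, sort_field) when is_binary(sort_field) do
    field_atom = String.to_existing_atom(sort_field)
    order_by(query, [q], field(q, ^field_atom))
  end
end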

Related

Does it make sense to use `ordered_set` for a `select` statement with `>` and `<=` for lowering time complexity

I use ETS table of type ordered_set, and row looks like {{integer_value, string}} (basically it has no value, only key).
When I perform ets:select(tab, [match_spec]), the match_spec selects all rows where integer_value satisfies greater-than and less-than comparisons.
I wonder: does ETS find the lower and upper bounds in logarithmic time and then return all elements in between, as I would expect from an SQL table, or is such functionality not implemented in ETS, so there is no particular benefit to using ordered_set instead of an ordinary set?
A simple way is to use the timer:tc/3 function to measure the execution time of your own functions or of the ets module's functions.
You can also profile your code with fprof or eprof to understand which functions are called and how much time each takes.
This can help you.
If you are not familiar with the Erlang profilers, I can show a simple example comparing an ets set and an ordered_set.
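For instance, a rough sketch along those lines (the module name, table sizes and bounds are arbitrary), timing the same bounded select against a set and an ordered_set with timer:tc/3:
-module(ets_bench).
-export([compare/0]).

%% Insert the same {{Int, String}} rows into a set and an ordered_set,
%% then time a select that keeps rows with 1000 < Int =< 2000.
compare() ->
    Set = ets:new(bench_set, [set]),
    Ord = ets:new(bench_ordered, [ordered_set]),
    [ets:insert(T, {{N, integer_to_list(N)}})
     || T <- [Set, Ord], N <- lists:seq(1, 100000)],
    MS = [{{{'$1', '_'}},
           [{'>', '$1', 1000}, {'=<', '$1', 2000}],
           ['$_']}],
    {TimeSet, _} = timer:tc(ets, select, [Set, MS]),
    {TimeOrd, _} = timer:tc(ets, select, [Ord, MS]),
    io:format("set: ~p us, ordered_set: ~p us~n", [TimeSet, TimeOrd]).
The same compare/0 function can also be run under fprof or eprof to see where the time actually goes.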

Redis Lua Script Unpack Returning Different Results

Set up by running sadd a b c
When I execute the code below against the set a:
keystoclear1 ends up with a single value, "b", in it.
keystoclear2 ends up with both values in it.
local keystoclear = unpack(redis.call('smembers', KEYS[1]))
redis.call('sadd', 'keystoclear1', keystoclear)
redis.call('sadd', 'keystoclear2', unpack(redis.call('smembers', KEYS[1])))
I am by no means a lua expert, so I could just have some strange behavior here, but I would like to know what is causing it.
I tested this on both the windows and linux version of redis, with redis-cli and the stackexchange.redis client. Same behavior in all cases. This is a trivial example, I actually would like to store the results of the unpack because I need to perform several operations with it.
UPDATE: I understand the issue.
table.unpack() only returns the first element
Lua always adjusts the number of results from a function to the circumstances of the call. When we call a function as a statement, Lua discards all of its results. When we use a call as an expression, Lua keeps only the first result. We get all results only when the call is the last (or the only) expression in a list of expressions.
This case is slightly different from the one you referenced in your update. In this case unpack (may) return several elements, but you only store one and discard the rest. You can get other elements if you use local keytoclear1, keytoclear2 = ..., but it's much easier to store the table itself and unpack it as needed:
local keystoclear = redis.call('smembers', KEYS[1])
redis.call('sadd', 'keystoclear1', unpack(keystoclear))
As long as unpack is the last parameter, you'll get all the elements that are present in the table being unpacked.
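For what it's worth, the adjustment rule is easy to see in plain Lua 5.1 (the version embedded in Redis, where unpack is a global); this snippet just illustrates it with the two members from the question's set:
local t = { "b", "c" }

-- Only one variable on the left, so the extra results are discarded:
local first = unpack(t)
print(first)                  --> b

-- Two variables capture both results:
local x, y = unpack(t)
print(x, y)                   --> b   c

-- As the last argument, all results are passed along:
print("args:", unpack(t))     --> args:   b   c

-- Not the last argument: adjusted to a single result:
print(unpack(t), "z")         --> b   z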

mnesia memory allocation

I was testing the application by inserting about 1000 users, each user having 1000 contacts, into a database table under mnesia. Partway through the insertion I got the following error:
Crash dump was written to: erl_crash.dump
binary_alloc: Cannot allocate 422879872 bytes of memory (of type "binary").
Aborted
I started the Erlang emulator with erl +MBas af (B: binary allocator, af: a fit) and tried again, but the error was the same.
Note: I am using Erlang R12B and the system has 8 GB of RAM, on Ubuntu 10.04.
How can I solve this?
The record definitions are:
%% database
-record(database,{dbid,guid,data}).
%% changelog
-record(changelog,{dbid,timestamp,changelist,type}).
Here data is a vcard (contact info), dbid and type are "contacts", and guid is an integer automatically generated by the server.
The database table contains all the vcard data of all users. If there are 1000 users and each user has 1000 contacts, we will have 10^6 records.
The changelog table records which changes were made to the database table at each timestamp.
The code for creating the tables is:
mnesia:create_table(database, [{type,bag}, {attributes,Record_of_database},
{record_name,database},
{index,guid},
{disc_copies,[node()]}])
mnesia:create_table(changelog, [{type,set}, {attributes,Record_of_changelog},
{record_name,changelog},
{index,timestamp},
{disc_copies,[node()]}])
The code for inserting records into the tables is:
commit_data(DataList = [#database{dbid=DbID}|_]) ->
io:format("commit data called~n"),
[mnesia:dirty_write(database,{database,DbId,Guid,Key})|| {database,DbId,Guid,X}<-DataList].
write_changelist(Username,Dbname,Timestamp,ChangeList) ->
Type="contacts",
mnesia:dirty_write(changelog,{changelog,DbID,Timestamp,ChangeList,Type}).
I suppose that the list DataList is huge and should not be sent at once from a remote node. It should be sent in small pieces: the client can send the items from DataList one by one as it generates them. Also, because this problem occurs during insertion, I think we should parallelise the list comprehension. We could have a parallel map where, for each item in the list, the insertion is done in a separate process. Then, I also think that something is still wrong with the list comprehension: variable Key is unbound and variable X is unused. Otherwise, probably the entire methodology needs a change. Let's see what others think. Thanks
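As a rough sketch of the "send it in small pieces" idea (the chunk size and the function names are made up, batched transactions are used here instead of the question's dirty writes, and the items are assumed to already be well-formed #database{} records):
%% Write the data in small batches, each inside its own transaction,
%% instead of one huge dirty pass over the whole list.
commit_in_chunks(DataList) ->
    lists:foreach(
      fun(Chunk) ->
              {atomic, ok} =
                  mnesia:transaction(
                    fun() ->
                            lists:foreach(
                              fun(Rec) -> mnesia:write(database, Rec, write) end,
                              Chunk)
                    end)
      end,
      chunk(DataList, 500)).

%% Split a list into sublists of at most N elements.
chunk([], _N) -> [];
chunk(L, N) when length(L) =< N -> [L];
chunk(L, N) ->
    {H, T} = lists:split(N, L),
    [H | chunk(T, N)].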
This error normally occurs when the ERTS memory allocator for binaries (binary_alloc) cannot allocate more memory for the binary heap. Check the current binary heap size using erlang:system_info/1, erlang:memory/0 or erlang:memory(binary). If the binary heap is huge, run erlang:garbage_collect() to free all unreferenced binary objects in the binary heap. This will free that memory.
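For example, a small snippet you could paste into the shell to compare the binary heap before and after forcing a collection in every process:
%% How much memory do binaries use right now?
Before = erlang:memory(binary),
%% Force a GC in every process; this releases reference-counted
%% binaries that are no longer referenced by anyone.
[erlang:garbage_collect(Pid) || Pid <- erlang:processes()],
io:format("binary heap: ~p -> ~p bytes~n", [Before, erlang:memory(binary)]).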
In case you use long strings (a string is just a list in Erlang) for the vcard or somewhere else, they consume a lot of memory.
If this is the case, change them to binaries to reduce memory usage (use list_to_binary before inserting into mnesia).
This may not be helpful, because I don't know your data structure (type, length and so on)...
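As a sketch of that suggestion (the function name is made up, and it assumes the vcard currently arrives as a character list):
%% Store the vcard as a binary instead of a character list; a list uses
%% two heap words per character, a binary roughly one byte.
store_vcard(DbId, Guid, VcardString) ->
    Rec = #database{dbid = DbId, guid = Guid, data = list_to_binary(VcardString)},
    mnesia:dirty_write(database, Rec).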

Does erlang implement record copy-and-modify in any clever way?

given:
-record(foo, {a, b, c}).
I do something like this:
Thing = #foo{a={1,2}, b={3,4}, c={5,6}},
Thing1 = Thing#foo{a={7,8}}.
From a semantic view, Thing and Thing1 are unique entities. However, from a language implementation standpoint, making a full copy of Thing to generate Thing1 would be intensely wasteful. For example, if the record were a megabyte in size and I made a thousand "copies", each modifying a couple of bytes, I've just burned a gigabyte. If the internal structure kept track of a representation of the parent structure, and each derivative marked up that parent in a way that indicated its own change but preserved everyone else's versions, the derivatives could be created with a minimum of memory overhead.
My question is this: is Erlang doing anything clever internally to keep the overhead of the usual Erlang scribble:
Thing = #ridiculously_large_record,
Thing1 = make_modified_copy(Thing),
Thing2 = make_modified_copy(Thing1),
Thing3 = make_modified_copy(Thing2),
Thing4 = make_modified_copy(Thing3),
Thing5 = make_modified_copy(Thing4)
...to a minimum?
I ask because there would be a number of changes to the way that I did cross-process communications if this were the case.
The exact workings of the garbage collection and memory allocation is only known to a few. Thankfully, they are very happy to share their knowledge and the following is based on what I have learnt from the erlang-questions mailing list and by discussing with OTP developers.
When messaging between processes, the content is always copied as there is no shared heap between processes. The only exception is binaries bigger than 64 bytes, where only a reference is copied.
When executing code in one process, only parts are updated. Let's analyze tuples, as that is the example you provided.
A tuple is actually a structure that keeps references to the actual data somewhere on the heap (except for small integers and maybe one more data type which I can't remember). When you update a tuple, using for example setelement/3, a new tuple is created with the given element replaced, however for all other elements only the reference is copied. There is one exception which I have never been able to take advantage of.
The garbage collector keeps track of each tuple and understands when it is safe to reclaim any tuple that is no longer used. It might be that the data referenced by the tuple is still in use, in which case the data itself is not collected.
As always, Erlang gives you some tools to understand exactly what is going on. The efficiency guide details how to use erts_debug:size/1 and erts_debug:flat_size/1 to understand the size of the data structure when used internally in a process and when copied. The trace tools also allows you to understand when, what and how much was garbage collected.
The record foo is of arity four (holding four words), but the whole structure is 14 words in size. Any immediate (pids, ports, small integers, atoms, catch and nil) can be stored directly in the tuple's array. Any other term which can't fit into a word, such as another tuple, is not stored directly but referenced by a boxed pointer (a boxed pointer is an Erlang term with a forwarding address to the real eterm ... just internals).
In your case a new tuple of the same arity is created, and the atom foo and all the pointers are copied from the previous tuple, except for index two, a, which now points to the new tuple {7,8}, which takes 3 words. In all, 5 + 3 new words are created on the heap and only 3 words are copied from the old tuple; the other 9 words are not touched.
Excessively large tuples are not recommended. When updating a tuple, the whole tuple, i.e. the array and not the deep content, needs to be copied and then updated in order to preserve a persistent data structure. This also generates more garbage, forcing the garbage collector to work harder, which hurts performance. The dict and array modules avoid using large tuples for this reason and use a shallow tuple tree instead.
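You can check the sharing described above with the erts_debug functions mentioned earlier; the snippet below is just an illustration, and the exact word counts depend on the emulator:
%% In the shell, make the record known first: rd(foo, {a, b, c}).
T  = #foo{a = {1,2}, b = {3,4}, c = {5,6}},
T1 = T#foo{a = {7,8}},
%% flat_size/1 counts every term as if nothing were shared, while
%% size/1 counts shared subterms only once; the difference is the part
%% of T (the b and c tuples) that T1 reuses instead of copying.
io:format("flat: ~p words, shared: ~p words~n",
          [erts_debug:flat_size([T, T1]), erts_debug:size([T, T1])]).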
I can definitely verify what people have already pointed out:
a record is just a tuple with the record name as the first element and the fields as the following tuple elements
when an element of a tuple is changed (updating a field in a record, in your case), only the top-level tuple is new; all the elements are just reused
This works because we have immutable data. So in your example, each time you update a value in a #foo record, none of the data in the elements is copied and only a new 4-element tuple (5 words) is created. Erlang never does a deep copy in this type of operation or when passing arguments in function calls.
In conclusion:
Thing = #foo{a={1,2}, b={3,4}, c={5,6}},
Thing1 = Thing#foo{a={7,8}}.
Here, if Thing is not used again, it will probably be updated in place and copying of the tuple will be avoided, as the Efficiency Guide says. (Tuple and record syntax is compiled into something like setelement, I think.)
Thing = #ridiculously_large_record,
Thing1 = make_modified_copy(Thing),
Thing2 = make_modified_copy(Thing1),
...
Here the tuples are actually copied every time.
I guess it would be theoretically possible to make an interesting optimization here. If the compiler could perform escape analysis on the return value of make_modified_copy and detect that the only reference to it is the one returned, it could save this information about the function. When it encounters a call to that function it would know that it is safe to modify the return value in place.
This would only be possible to do for intra-module (local) calls, because of the code replacement feature.
Maybe one day we will have it.

Using ets:foldl as a poor man's forEach on every record

Short version: is it safe to use ets:foldl to delete every ETS record as one is iterating through them?
Suppose an ETS table is accumulating information and now it's time to process it all. A record is read from the table, used in some way, then deleted. (Also, assume the table is private, so no concurrency issues.)
In another language, with a similar data structure, you might use a for...each loop, processing every record and then deleting it from the hash/dict/map/whatever. However, the ets module does not have foreach as e.g. lists does.
But this might work:
1> ets:new(ex, [named_table]).
ex
2> ets:insert(ex, {alice, "high"}).
true
3> ets:insert(ex, {bob, "medium"}).
true
4> ets:insert(ex, {charlie, "low"}).
true
5> ets:foldl(fun({Name, Adjective}, DontCare) ->
io:format("~p has a ~p opinion of you~n", [Name, Adjective]),
ets:delete(ex, Name),
DontCare
end, notused, ex).
bob has a "medium" opinion of you
alice has a "high" opinion of you
charlie has a "low" opinion of you
notused
6> ets:info(ex).
[...
{size,0},
...]
7> ets:lookup(ex, bob).
[]
Is this the preferred approach? Is it at least correct and bug-free?
I have a general concern about modifying a data structure while processing it, however the ets:foldl documentation implies that ETS is pretty comfortable with you modifying records inside foldl. Since I am essentially wiping the table clean, I want to be sure.
I am using Erlang R14B with a set table however I'd like to know if there are any caveats with any Erlang version, or any type of table as well. Thanks!
Your approach is safe. The reason it is safe is that ets:foldl/3 internally uses ets:first/1, ets:next/2 and ets:safe_fixtable/2. These have the guarantee you want, namely that you can delete elements and still get the full traverse. See the CONCURRENCY section of erl -man ets.
For your removal of all elements from the table, there is a simpler one-liner however:
ets:match_delete(ex, '_').
although it doesn't work if you want to do the IO formatting for each row, in which case your approach with foldl is probably easier.
For cases like this we will alternate between two tables or just create a new table every time we start processing. When we want to start a processing cycle we switch the writers to start using the alternate or new table, then we do our processing and clear or delete the old table.
We do this because there might otherwise be concurrent updates to a tuple that we might miss. We're working with high frequency concurrent counters when we use this technique.
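A rough sketch of that swap (all the names are made up, it assumes a single owner process driving the cycle, and it glosses over the brief hand-off window in which a writer may still hit the old table):
%% A named pointer table holds the id of the table writers should use.
init() ->
    ets:new(acc_pointer, [named_table, public, set]),
    ets:insert(acc_pointer, {current, new_table()}).

new_table() ->
    ets:new(acc, [public, set, {write_concurrency, true}]).

current_table() ->
    [{current, Tab}] = ets:lookup(acc_pointer, current),
    Tab.

%% Writers insert into whatever table is current right now.
record_event(Key, Value) ->
    ets:insert(current_table(), {Key, Value}).

%% Processing cycle: point writers at a fresh table, then read and
%% delete the old one at leisure.
process_cycle(ProcessRow) ->
    Old = current_table(),
    ets:insert(acc_pointer, {current, new_table()}),
    lists:foreach(ProcessRow, ets:tab2list(Old)),
    ets:delete(Old).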
