How to remove lines from string without allocating memory for new strings? - erlang

I want to remove some lines from a large text file, but I want to do this without allocating more memory than is required for holding the original string. So far, I can only manage the following:
s
|> StringIO.open()
|> then(fn {:ok, device} ->
  IO.binstream(device, :line)
end)
|> Stream.reject(&Regex.match?(~r{<date_of_creation>.*</date_of_creation>\n}, &1))
|> Enum.join()
But this ends up doubling the memory required for the original string, because of the join at the end. Is there a better way to do this with just Elixir/Erlang?

Depending on what you want to do with the result, one way to considerably reduce memory usage is to avoid building the string in the first place and use IO data instead.
In your example, this could be achieved by removing the call to Enum.join/1 and replacing Stream.reject/2 with Enum.reject/2:
|> Enum.reject(&Regex.match?(~r{<date_of_creation>.*</date_of_creation>\n}, &1))
This will return an IO list which can be used directly for I/O, and you might not need the string at all. This is how Phoenix is able to render templates/JSON efficiently: avoiding expensive string concatenation in the first place.
Assuming you had to join with a separator (Enum.join/2), Enum.intersperse/2 could be used to build an IO list.
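The same idea can be sketched in plain Erlang. This is only an illustration, not the answerer's code: the module name drop_lines and the function reject_lines/2 are made up for the example. It splits a binary into lines, drops the lines matching a pattern, and returns an IO list with the newline separators put back, rather than one new joined binary.

```erlang
-module(drop_lines).
-export([reject_lines/2]).

%% Hypothetical helper: returns an IO list containing every line of Bin
%% that does NOT match Pattern, with "\n" separators restored.
%% No joined copy of the text is built; the result can be passed
%% directly to functions that accept iodata, such as file:write_file/2.
reject_lines(Bin, Pattern) when is_binary(Bin) ->
    {ok, MP} = re:compile(Pattern),
    Lines = binary:split(Bin, <<"\n">>, [global]),
    Kept = [L || L <- Lines,
                 re:run(L, MP, [{capture, none}]) =:= nomatch],
    %% lists:join/2 intersperses the separator, like Enum.intersperse/2.
    lists:join(<<"\n">>, Kept).
```

The result can be written with file:write_file/2 or sent over a socket as-is; calling iolist_to_binary/1 on it would bring back the extra allocation the question is trying to avoid.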

Related

List Cons-Into Function?

I am often wanting to take one list and cons every element into an existing list.
MyList = [3,2,1],
MyNewElements = [4,5,6],
MyNewList = lists:foldl(fun(A, B) -> [A | B] end, MyList, MyNewElements).
%% [6,5,4,3,2,1]
Assuming MyList has 1M elements and MyNewElements only has a few, I want to do this efficiently.
I couldn't figure out which of these functions, if any, does what I'm trying to do:
https://www.erlang.org/doc/man/lists.html
Adding a short list to the beginning of a long list is cheap - the execution time of the ++ operator is proportional to the length of the first list. The first list is copied, and the second list is added as the tail without modification.
So in your example, that would be:
lists:reverse(MyNewElements) ++ MyList
(The execution time of lists:reverse/1 is also proportional to the length of the argument.)
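As a possible shortcut for the reverse-then-append approach, lists:reverse/2 takes the tail as its second argument and does both steps in one pass. A minimal sketch follows; the module and function names are chosen just for this example.

```erlang
-module(cons_into).
-export([cons_all/2]).

%% Conses each element of NewElements onto Existing, one at a time,
%% so NewElements ends up reversed in front of Existing.
%% Only NewElements is traversed; Existing becomes the shared tail.
cons_all(NewElements, Existing) ->
    lists:reverse(NewElements, Existing).
```

With the lists from the question, cons_into:cons_all([4,5,6], [3,2,1]) gives the same result as the lists:foldl/3 version.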
Another option, aside from those already provided, would be just to have
NewDeepList = [MyList | DeepList]
and modify the reading/traversing to be able to handle [[element()]] instead of [element()].
Because Erlang is a functional language, unlike C or JavaScript, it does not modify data in place; it builds a new value instead. Therefore it is impossible to do better than O(length(A)), where A is the list of newly added elements.

What's the use of cache in csv type provider?

I am a bit confused about Cache and CacheRows.
It seems MyCsvType.Load(path).Take(30000).Cache() doesn't actually read the 30000 rows immediately. (unlike Seq.cache)
Then, why do we need Cache given that we already have CacheRows?
Additionally, if I am only interested in the first 30000 rows, should I use MyCsvType.Load(path).Take(30000) or MyCsvType.Load(path).Rows |> Seq.take 30000?
If you look at the F# Data source code, you can see that Cache, Take and other operators are just calling the corresponding Seq.xyz operations under the covers (this is in CsvRuntime.fs).
The key difference is that when you create a type provider without specifying CacheRows=false, it will actually call Cache by default. So, the trick is to create a type provider using CacheRows=false and then you can use Seq.cache or the Cache method (and other operations) interchangeably.
let stocks = CsvProvider<"sample.csv", CacheRows=false>.GetSample()
stocks.Take(10).Cache() // Using methods is now exactly
stocks |> Seq.take 10 |> Seq.cache // the same as using functions

Can I get a list of all currently-registered atoms?

My project has blown through the max of 1M atoms. We've cranked up the limit, but I need to apply some sanity to the code that people are submitting with regard to list_to_atom and its friends. I'd like to start by getting a list of all the registered atoms so I can see where the largest offenders are. Is there any way to do this? I'll have to be creative about how I do it so I don't end up trying to dump 1-2M lines in a live console.
You can get hold of all atoms by using an undocumented feature of the external term format.
TL;DR: Paste the following line into the Erlang shell of your running node. Read on for explanation and a non-terse version of the code.
(fun F(N)->try binary_to_term(<<131,75,N:24>>) of A->[A]++F(N+1) catch error:badarg->[]end end)(0).
Elixir version by Ivar Vong:
for i <- 0..:erlang.system_info(:atom_count)-1, do: :erlang.binary_to_term(<<131,75,i::24>>)
An Erlang term encoded in the external term format starts with the byte 131, then a byte identifying the type, and then the actual data. I found that EEP-43 mentions all the possible types, including ATOM_INTERNAL_REF3 with type byte 75, which isn't mentioned in the official documentation of the external term format.
For ATOM_INTERNAL_REF3, the data is an index into the atom table, encoded as a 24-bit integer. We can easily create such a binary: <<131,75,N:24>>
For example, in my Erlang VM, false seems to be the zeroth atom in the atom table:
> binary_to_term(<<131,75,0:24>>).
false
There's no simple way to find the number of atoms currently in the atom table*, but we can keep increasing the number until we get a badarg error.
So this little module gives you a list of all atoms:
-module(all_atoms).
-export([all_atoms/0]).
atom_by_number(N) ->
    binary_to_term(<<131,75,N:24>>).
all_atoms() ->
    atoms_starting_at(0).
atoms_starting_at(N) ->
    try atom_by_number(N) of
        Atom ->
            [Atom] ++ atoms_starting_at(N + 1)
    catch
        error:badarg ->
            []
    end.
The output looks like:
> all_atoms:all_atoms().
[false,true,'_',nonode@nohost,'$end_of_table','','fun',
infinity,timeout,normal,call,return,throw,error,exit,
undefined,nocatch,undefined_function,undefined_lambda,
'DOWN','UP','EXIT',aborted,abs_path,absoluteURI,ac,accessor,
active,all|...]
> length(v(-1)).
9821
* As of Erlang/OTP 20.0, you can call erlang:system_info(atom_count):
> length(all_atoms:all_atoms()) == erlang:system_info(atom_count).
true
I'm not sure if there's a way to do it on a live system, but if you can run it in a test environment you should be able to get a list via crash dump. The atom table is near the end of the crash dump format. You can create a crash dump via erlang:halt/1, but that will bring down the whole runtime system.
I dare say that if you use more than 1M atoms, then you are doing something wrong. Atoms are intended to be static as soon as the application runs or at least upper bounded by some small number, 3000 or so for a medium sized application.
Be very careful when an attacker can generate atoms in your VM; calls like list_to_atom/1 are especially dangerous.
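One way to limit the damage, sketched here under the assumption that every legitimate atom is already present in loaded code, is to convert untrusted input with list_to_existing_atom/1, which raises badarg instead of creating a new atom. The module name safe_atom and the return shapes are invented for this example.

```erlang
-module(safe_atom).
-export([to_atom/1]).

%% Hypothetical wrapper: converts a string to an atom only if that atom
%% already exists in the atom table, so untrusted input cannot grow it.
to_atom(String) when is_list(String) ->
    try
        {ok, list_to_existing_atom(String)}
    catch
        error:badarg ->
            {error, unknown_atom}
    end.
```

Replacing list_to_atom/1 with a wrapper like this in code that handles external input keeps the atom count bounded by what the code itself defines.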
EDITED (wrong answer..)
You can adjust number of atoms with +t
http://www.erlang.org/doc/efficiency_guide/advanced.html
..but I know very few use cases when it is necessary.
You can track atom stats with erlang:memory()

Creating a valid function declaration from a complex tuple/list structure

Is there a generic way, given a complex object in Erlang, to come up with a valid function declaration for it besides eyeballing it? I'm maintaining some code previously written by someone who was a big fan of giant structures, and it's proving to be error prone doing it manually.
I don't need to iterate the whole thing, just grab the top level, per se.
For example, I'm working on this right now -
[[["SIP",47,"2",46,"0"],32,"407",32,"Proxy Authentication Required","\r\n"],
[{'Via',
[{'via-parm',
{'sent-protocol',"SIP","2.0","UDP"},
{'sent-by',"172.20.10.5","5060"},
[{'via-branch',"z9hG4bKb561e4f03a40c4439ba375b2ac3c9f91.0"}]}]},
{'Via',
[{'via-parm',
{'sent-protocol',"SIP","2.0","UDP"},
{'sent-by',"172.20.10.15","5060"},
[{'via-branch',"12dee0b2f48309f40b7857b9c73be9ac"}]}]},
{'From',
{'from-spec',
{'name-addr',
[[]],
{'SIP-URI',
[{userinfo,{user,"003018CFE4EF"},[]}],
{hostport,"172.20.10.11",[]},
{'uri-parameters',[]},
[]}},
[{tag,"b7226ffa86c46af7bf6e32969ad16940"}]}},
{'To',
{'name-addr',
[[]],
{'SIP-URI',
[{userinfo,{user,"3966"},[]}],
{hostport,"172.20.10.11",[]},
{'uri-parameters',[]},
[]}},
[{tag,"a830c764"}]},
{'Call-ID',"90df0e4968c9a4545a009b1adf268605@172.20.10.15"},
{'CSeq',1358286,"SUBSCRIBE"},
["date",'HCOLON',
["Mon",44,32,["13",32,"Jun",32,"2011"],32,["17",58,"03",58,"55"],32,"GMT"]],
{'Contact',
[[{'name-addr',
[[]],
{'SIP-URI',
[{userinfo,{user,"3ComCallProcessor"},[]}],
{hostport,"172.20.10.11",[]},
{'uri-parameters',[]},
[]}},
[]],
[]]},
["expires",'HCOLON',3600],
["user-agent",'HCOLON',
["3Com",[]],
[['LWS',["VCX",[]]],
['LWS',["7210",[]]],
['LWS',["IP",[]]],
['LWS',["CallProcessor",[['SLASH',"v10.0.8"]]]]]],
["proxy-authenticate",'HCOLON',
["Digest",'LWS',
["realm",'EQUAL',['SWS',34,"3Com",34]],
[['COMMA',["domain",'EQUAL',['SWS',34,"3Com",34]]],
['COMMA',
["nonce",'EQUAL',
['SWS',34,"btbvbsbzbBbAbwbybvbxbCbtbzbubqbubsbqbtbsbqbtbxbCbxbsbybs",
34]]],
['COMMA',["stale",'EQUAL',"FALSE"]],
['COMMA',["algorithm",'EQUAL',"MD5"]]]]],
{'Content-Length',0}],
"\r\n",
["\n"]]
Maybe https://github.com/etrepum/kvc
I noticed your clarifying comment. I'd prefer to add a comment myself, but don't have enough karma. Anyway, the trick I use for that is to experiment in the shell. I'll iterate a pattern against a sample data structure until I've found the simplest form. You can use the _ match-all variable. I use an erlang shell inside an emacs shell window.
First, bind a sample to a variable:
A = [{a,b},[{c,d}, {e,f}]].
Now set the original structure against the variable:
[{a,b},[{c,d},{e,f}]] = A.
If you hit enter, you'll see they match. Hit alt-p (forget what emacs calls alt, but it's alt on my keyboard) to bring back the previous line. Replace some tuple or list item with an underscore:
[_,[{c,d},{e,f}]] = A.
Hit enter to make sure you did it right and they still match. This example is trivial, but for deeply nested, multiline structures it's trickier, so it's handy to be able to just quickly match to test. Sometimes you'll want to try to guess at whole huge swaths, like using an underscore to match a tuple list inside a tuple that's the third element of a list. If you place it right, you can match the whole thing at once, but it's easy to misread it.
Anyway, repeat to explore the essential shape of the structure and place real variables where you want to pull out values:
[_, [_, _]] = A.
[_, _] = A.
[_, MyTupleList] = A. %% let's grab this tuple list
[{MyAtom,b}, [{c,d}, MyTuple]] = A. %% or maybe we want this atom and tuple
That's how I efficiently dissect and pattern match complex data structures.
However, I don't know what you're doing. I'd be inclined to have a wrapper function that uses KVC to pull out exactly what you need and then distributes to helper functions from there for each type of structure.
If I understand you correctly, you want to pattern match some large data structures of unknown format.
Example:
Input: {a, b} {a,b,c,d} {a,[],{},{b,c}}
function({A, B}) -> do_something;
function({A, B, C, D}) when is_atom(B) -> do_something_else;
function({A, B, C, D}) when is_list(B) -> more_doing.
The generic answer is of course that it is undecidable from just data to know how to categorize that data.
First you should probably be aware of iolists. They are created by functions such as io_lib:format/2 and in many other places in the code.
One example is that
[["SIP",47,"2",46,"0"],32,"407",32,"Proxy Authentication Required","\r\n"]
will print as
SIP/2.0 407 Proxy Authentication Required
So, I'd start with flattening all those lists, using a function such as
%% Recursively flatten nested character lists (iolists) into plain strings,
%% leaving atoms, numbers and mixed lists untouched.
flatten_io(List) when is_list(List) ->
    Flat = lists:map(fun flatten_io/1, List),
    maybe_flatten(Flat);
flatten_io(Tuple) when is_tuple(Tuple) ->
    list_to_tuple([flatten_io(Element) || Element <- tuple_to_list(Tuple)]);
flatten_io(Other) -> Other.
%% Flatten L only if it consists solely of byte-sized characters and strings.
maybe_flatten(L) when is_list(L) ->
    case lists:all(fun(Ch) when is_integer(Ch), Ch > 0, Ch < 256 -> true;
                      (List) when is_list(List) ->
                          lists:all(fun(X) -> is_integer(X) andalso X > 0 andalso X < 256 end, List);
                      (_) -> false
                   end, L) of
        true -> lists:flatten(L);
        false -> L
    end.
(Caveat: completely untested and quite inefficient. It will also crash on improper lists, but you shouldn't have those in your data structures anyway.)
On second thought, I can't help you. Any data structure that uses the atom 'COMMA' for a comma in a string should be taken out and shot.
You should be able to flatten those things as well and start to get a view of what you are looking at.
I know that this is not a complete answer. Hope it helps.
It's hard to recommend something for handling this.
Transforming all the structures into a saner and more minimal format looks like it's worth it. This depends mainly on the similarities between these structures.
Rather than having a special function for each of the 100 there must be some automatic reformatting that can be done, maybe even put the parts in records.
Once you have records its much easier to write functions for it since you don't need to know the actual number of elements in the record. More important: your code won't break when the number of elements changes.
To summarize: put a barrier between your code and the insanity of these structures by sanitizing them with the most generic code possible. It will probably be a mix of generic reformatting and structure-specific handling.
As an example already visible in this struct: the 'name-addr' tuples look like they have a uniform structure. So you can recurse over your structures (over all elements of tuples and lists) and match for "things" that have a common structure like 'name-addr' and replace these with nice records.
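The recursion described above could start out as something like the following sketch. It only collects the tagged tuples; replacing them with records is left out, because the record definitions would be guesses. The module name find_tagged and the function collect/2 are hypothetical.

```erlang
-module(find_tagged).
-export([collect/2]).

%% Collect every tuple whose first element is Tag, at any nesting depth.
%% Assumes proper lists; an improper list will crash the traversal.
collect(Tag, Term) when is_tuple(Term), tuple_size(Term) > 0,
                        element(1, Term) =:= Tag ->
    [Term | collect_children(Tag, tuple_to_list(Term))];
collect(Tag, Term) when is_tuple(Term) ->
    collect_children(Tag, tuple_to_list(Term));
collect(Tag, Term) when is_list(Term) ->
    collect_children(Tag, Term);
collect(_Tag, _Other) ->
    [].

%% Search each element and concatenate the results in traversal order.
collect_children(Tag, Elements) ->
    lists:append([collect(Tag, E) || E <- Elements]).
```

Calling find_tagged:collect('name-addr', Message) on the structure from the question would then return all the 'name-addr' tuples in one list, which makes it easier to check whether they really share a uniform shape.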
To help with the eyeballing, you can write yourself helper functions along the lines of this example:
eyeball(List) when is_list(List) ->
    io:format("List with length ~b\n", [length(List)]);
eyeball(Tuple) when is_tuple(Tuple) ->
    io:format("Tuple with ~b elements\n", [tuple_size(Tuple)]).
So you would get output like this:
2> eyeball({a,b,c}).
Tuple with 3 elements
ok
3> eyeball([a,b,c]).
List with length 3
ok
Expanding this into a useful tool for your purposes is left as an exercise. You could handle multiple levels by recursing over the elements and indenting the output.
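One possible shape for that exercise is sketched below; it is not the answerer's code. The module name shape and the description tuples are invented for this example. describe/1 returns one {Depth, Description} pair per node, and print/1 indents each pair by its depth.

```erlang
-module(shape).
-export([describe/1, print/1]).

%% Returns a flat list of {Depth, Description} pairs, one per node,
%% in depth-first order. Assumes proper lists (length/1 crashes otherwise).
describe(Term) ->
    describe(Term, 0).

describe(Tuple, Depth) when is_tuple(Tuple) ->
    Children = [describe(E, Depth + 1) || E <- tuple_to_list(Tuple)],
    [{Depth, {tuple, tuple_size(Tuple)}} | lists:append(Children)];
describe([], Depth) ->
    [{Depth, {list, 0}}];
describe(List, Depth) when is_list(List) ->
    case io_lib:printable_list(List) of
        true ->
            %% Treat printable character lists as strings, not as lists.
            [{Depth, {string, List}}];
        false ->
            Children = [describe(E, Depth + 1) || E <- List],
            [{Depth, {list, length(List)}} | lists:append(Children)]
    end;
describe(Other, Depth) ->
    [{Depth, {value, Other}}].

%% Prints each description on its own line, indented two spaces per level.
print(Term) ->
    lists:foreach(
      fun({Depth, Desc}) ->
              io:format("~s~p~n", [lists:duplicate(2 * Depth, $\s), Desc])
      end,
      describe(Term)).
```

Keeping describe/1 free of I/O means the result can also be filtered, for example to show only the top two levels, before anything is printed.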
Use pattern matching and functions that work on lists to extract only what you need.
Look at http://www.erlang.org/doc/man/lists.html:
keyfind, keyreplace, L = [H|T], ...
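As a small illustration of that advice, the sketch below pulls the 'CSeq' header out of a message shaped like the one in the question, assuming the header list is always the second element of the top-level list. The module name header_lookup and the return values are hypothetical.

```erlang
-module(header_lookup).
-export([cseq/1]).

%% Matches only the top level of the message, then uses lists:keyfind/3
%% to locate the 'CSeq' tuple among the headers. Elements of the header
%% list that are not tuples (such as ["date", 'HCOLON', ...]) are skipped.
cseq([_StatusLine, Headers | _Rest]) when is_list(Headers) ->
    case lists:keyfind('CSeq', 1, Headers) of
        {'CSeq', Number, Method} -> {ok, Number, Method};
        false -> not_found
    end.
```

The function head never mentions the contents of the status line or the other headers, so it keeps working if those parts of the structure change.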

Can the lock function be used to implement thread-safe enumeration?

I'm working on a thread-safe collection that uses Dictionary as a backing store.
In C# you can do the following:
private IEnumerable<KeyValuePair<K, V>> Enumerate() {
    if (_synchronize) {
        lock (_locker) {
            foreach (var entry in _dict)
                yield return entry;
        }
    } else {
        foreach (var entry in _dict)
            yield return entry;
    }
}
The only way I've found to do this in F# is using Monitor, e.g.:
let enumerate() =
    if synchronize then
        seq {
            System.Threading.Monitor.Enter(locker)
            try for entry in dict -> entry
            finally System.Threading.Monitor.Exit(locker)
        }
    else seq { for entry in dict -> entry }
Can this be done using the lock function? Or, is there a better way to do this in general? I don't think returning a copy of the collection for iteration will work because I need absolute synchronization.
I don't think that you'll be able to do the same thing with the lock function, since you would be trying to yield from within it. Having said that, this looks like a dangerous approach in either language, since it means that the lock can be held for an arbitrary amount of time (e.g. if one thread calls Enumerate() but doesn't enumerate all the way through the resulting IEnumerable<_>, then the lock will continue to be held).
It may make more sense to invert the logic, providing an iter method along the lines of:
let iter f =
    if synchronize then
        lock locker (fun () -> Seq.iter f dict)
    else
        Seq.iter f dict
This brings the iteration back under your control, ensuring that the sequence is fully iterated (assuming that f doesn't block, which seems like a necessary assumption in any case) and that the lock is released immediately thereafter.
EDIT
Here's an example of code that could hold the lock forever.
let cached = enumerate() |> Seq.cache
let firstFive = Seq.take 5 cached |> Seq.toList
We've taken the lock in order to start enumerating through the first 5 items. However, we haven't continued through the rest of the sequence, so the lock won't be released (maybe we would enumerate the rest of the way later based on user feedback or something, in which case the lock would finally be released).
In most cases, correctly written code will ensure that it disposes of the original enumerator, but there's no way to guarantee that in general. Therefore, your sequence expressions should be designed to be robust to only being enumerated part way. If you intend to require your callers to enumerate the collection all at once, then forcing them to pass you the function to apply to each element is better than returning a sequence which they can enumerate as they please.
I agree with kvb that the code is suspicious and that you probably don't want to hold the lock. However, there is a way to write the locking in a more comfortable way using the use keyword. It's worth mentioning it, because it may be useful in other situations.
You can write a function that starts holding a lock and returns IDisposable, which releases the lock when it is disposed:
let makeLock locker =
    System.Threading.Monitor.Enter(locker)
    { new System.IDisposable with
        member x.Dispose() =
            System.Threading.Monitor.Exit(locker) }
Then you can write for example:
let enumerate() = seq {
    if synchronize then
        use l0 = makeLock locker
        for entry in dict do
            yield entry
    else
        for entry in dict do
            yield entry }
This essentially implements a C#-like lock using the use keyword, which has similar properties (it lets you do something when leaving the scope). So, this is much closer to the original C# version of the code.
