Tricky pattern matching of a binary string in Erlang - erlang

I am using Erlang to pass messages between an email server and SpamAssassin.
What I want to achieve is to retrieve the tests run by SA in order to generate a report (I am writing a kind of mail-tester program).
When SpamAssassin answers (over raw TCP), it sends a binary string like this one:
<<"SPAMD/1.1 0 EX_OK\r\nContent-length: 728\r\nSpam: True ; 6.3 / 5.0\r\n\r\nReceived: from localhost by debpub1.cs2cloud.internal\r\n\twith SpamAssassin (version 3.4.2);\r\n\tSat, 04 Jan 2020 18:24:37 +0100\r\nFrom: bibi <bibi#XXXXX.local>\r\nTo: <aZphki8N05#XXXXXXXX>\r\nSubject: i\r\nDate: Sat, 4 Jan 2020 18:24:36 +0100\r\nMessage-Id: <3b68dede-f1c3-4f04-62dc-f0b2de6e980a#PPPPPP.local>\r\nX-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on\r\n\tdebpub1.cs2cloud.internal\r\nX-Spam-Flag: YES\r\nX-Spam-Level: ******\r\nX-Spam-Status: Yes, score=6.3 required=5.0 tests=BODY_SINGLE_WORD,\r\n\tDKIM_ADSP_NXDOMAIN,DOS_RCVD_IP_TWICE_C,HELO_MISC_IP,\r\n\tNO_FM_NAME_IP_HOSTN autolearn=no autolearn_force=no version=3.4.2\r\nMIME-Version: 1.0\r\nContent-Type: multipart/mixed; boundary=\"----------=_5E10CA56.0200B819\"\r\n\r\n">>
The items I want to pick up are:
BODY_SINGLE_WORD
DKIM_ADSP_NXDOMAIN
DOS_RCVD_IP_TWICE_C
HELO_MISC_IP
NO_FM_NAME_IP_HOSTN
I then want to serialize them like this:
[<<"DKIM_ADSP_NXDOMAIN">>,<<"DOS_RCVD_IP_TWICE_C">>,…]
But that's not easy: the terms have no regular delimiter, since some are followed by \r\n or \r\n\t.
I made a start with the expression below (splitting the binary string on ','), but the result is incomplete:
split(BinaryString, ",", all),
case lists:member(<<"HELO_MISC_IP">>, Data3 ) of
true -> ; %push the result in a database
false -> ok
end;
I would like to take another approach and loop through the binary with recursion (because it is a clean and nice way to loop), but it looks pointless to me in this scenario:
split(BinaryString, Idx, Acc) ->
case BinaryString of
<<"tests=",_This:Idx/binary, Char, Tail/binary>> ->
case lists:member(Char, BinaryString ) of
false ->
split(BinaryString, Idx+1, Acc);
true ->
case Tail of
<<Y/binary, _Tail/binary>> ->
%doing something
<<_Yop2/binary>> ->
%doing somethin else
end
end;
The thing is, I don't see how to achieve this in an acceptable and clean way.
If anyone could give me a hand, that would be very much appreciated.
Yours

One solution is to match the parts of the binary you're looking for:
Data = <<"SPAMD/1.1 0 EX_OK\r\nContent-length: 728\r\nSpam: True ; 6.3 / 5.0\r\n\r\nReceived: from localhost by debpub1.cs2cloud.internal\r\n\twith SpamAssassin (version 3.4.2);\r\n\tSat, 04 Jan 2020 18:24:37 +0100\r\nFrom: bibi <bibi#XXXXX.local>\r\nTo: <aZphki8N05#XXXXXXXX>\r\nSubject: i\r\nDate: Sat, 4 Jan 2020 18:24:36 +0100\r\nMessage-Id: <3b68dede-f1c3-4f04-62dc-f0b2de6e980a#PPPPPP.local>\r\nX-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on\r\n\tdebpub1.cs2cloud.internal\r\nX-Spam-Flag: YES\r\nX-Spam-Level: ******\r\nX-Spam-Status: Yes, score=6.3 required=5.0 tests=BODY_SINGLE_WORD,\r\n\tDKIM_ADSP_NXDOMAIN,DOS_RCVD_IP_TWICE_C,HELO_MISC_IP,\r\n\tNO_FM_NAME_IP_HOSTN autolearn=no autolearn_force=no version=3.4.2\r\nMIME-Version: 1.0\r\nContent-Type: multipart/mixed; boundary=\"----------=_5E10CA56.0200B819\"\r\n\r\n">>,
Matches = binary:compile_pattern([<<"BODY_SINGLE_WORD">>,<<"DKIM_ADSP_NXDOMAIN">>,<<"DOS_RCVD_IP_TWICE_C">>,<<"HELO_MISC_IP">>,<<"NO_FM_NAME_IP_HOSTN">>]),
[binary:part(Data, PosLen) || PosLen <- binary:matches(Data, Matches)].
Executing the three lines above in an Erlang shell returns:
[<<"BODY_SINGLE_WORD">>,<<"DKIM_ADSP_NXDOMAIN">>, <<"DOS_RCVD_IP_TWICE_C">>,<<"HELO_MISC_IP">>, <<"NO_FM_NAME_IP_HOSTN">>]
This provides the desired result, but it might not be safe since it doesn't do anything to try to verify whether the input is valid or whether the matches occur on valid boundaries.
A potentially safer approach relies on the fact that the input binary resembles an HTTP response, and so it can be partially parsed with built-in Erlang decoders. The parse/1,2 functions below use erlang:decode_packet/3 to extract information from the input:
parse(Data) ->
{ok, Line, Rest} = erlang:decode_packet(line, Data, []),
parse(Line, Rest).
parse(<<"SPAMD/", _/binary>>, Data) ->
parse(Data, []);
parse(<<>>, Hdrs) ->
Result = [{Key,Value} || {http_header, _, Key, _, Value} <- Hdrs],
process_results(Result);
parse(Data, Hdrs) ->
case erlang:decode_packet(httph, Data, []) of
{ok, http_eoh, Rest} ->
parse(Rest, Hdrs);
{ok, Hdr, Rest} ->
parse(Rest, [Hdr|Hdrs]);
Error ->
Error
end.
The parse/1 function initially decodes the first line of the input using the line decoder, passing the results to parse/2. The first clause of parse/2 matches the "SPAMD/" prefix of the initial line of the input data just to verify we're looking in the right place, then recursively invokes parse/2 passing the remaining Data and an empty accumulator list. The second and third clauses of parse/2 treat the data as HTTP headers. The second clause of parse/2 matches when the input data is exhausted; it maps the accumulated header list to a list of {Key,Value} pairs and passes it to a process_results/1 function, described below, to finish the data extraction. The third clause of parse/2 tries to decode the data via the httph HTTP header decoder, accumulating each matched header and ignoring any http_eoh end-of-headers markers that result from "\r\n" sequences embedded at odd places in the input.
For the input data provided in the question, the parse/1,2 functions ultimately pass the following list of key-value pairs to process_results/1:
[{'Content-Type',"multipart/mixed; boundary=\"----------=_5E10CA56.0200B819\""},{"Mime-Version","1.0"},{"X-Spam-Status","Yes, score=6.3 required=5.0 tests=BODY_SINGLE_WORD,\r\n\tDKIM_ADSP_NXDOMAIN,DOS_RCVD_IP_TWICE_C,HELO_MISC_IP,\r\n\tNO_FM_NAME_IP_HOSTN autolearn=no autolearn_force=no version=3.4.2"},{"X-Spam-Level","******"},{"X-Spam-Flag","YES"},{"X-Spam-Checker-Version","SpamAssassin 3.4.2 (2018-09-13) on\r\n\tdebpub1.cs2cloud.internal"},{"Message-Id","<3b68dede-f1c3-4f04-62dc-f0b2de6e980a#PPPPPP.local>"},{'Date',"Sat, 4 Jan 2020 18:24:36 +0100"},{"Subject","i"},{"To","<aZphki8N05#XXXXXXXX>"},{'From',"bibi <bibi#XXXXX.local>"},{"Received","from localhost by debpub1.cs2cloud.internal\r\n\twith SpamAssassin (version 3.4.2);\r\n\tSat, 04 Jan 2020 18:24:37 +0100"},{"Spam","True ; 6.3 / 5.0"},{'Content-Length',"728"}]
The process_results/1,2 functions first match the key of interest, which is "X-Spam-Status", and then extract the desired data from its value. The three functions below implement process_results/1 to look for that key and process it, or return {error, not_found} if no such key is seen. The second clause matches the desired key, splits its associated value on the space, comma, carriage return, newline, tab, and equal sign characters, and passes the split result along with an empty accumulator to process_results/2:
process_results([]) ->
{error, not_found};
process_results([{"X-Spam-Status", V}|_]) ->
process_results(string:lexemes(V, " ,\r\n\t="), []);
process_results([_|T]) ->
process_results(T).
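For the input data in the question, the list of strings passed to process_results/2 is shown below. The "\r\n" entries survive because string:lexemes/2 treats "\r\n" as a single grapheme cluster, which matches neither the "\r" nor the "\n" separator on its own; they are harmless here because the final clause of process_results/2 skips them: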
["Yes","score","6.3","required","5.0","tests","BODY_SINGLE_WORD","\r\n","DKIM_ADSP_NXDOMAIN","DOS_RCVD_IP_TWICE_C","HELO_MISC_IP","\r\n","NO_FM_NAME_IP_HOSTN","autolearn","no","autolearn_force","no","version","3.4.2"]
The clauses of process_results/2 below recursively walk this list of strings and accumulate the matched results. Each of the second through sixth clauses matches one of the values we seek, and each converts the matched string to a binary before accumulating it.
process_results([], Results) ->
{ok, lists:reverse(Results)};
process_results([V="BODY_SINGLE_WORD"|T], Results) ->
process_results(T, [list_to_binary(V)|Results]);
process_results([V="DKIM_ADSP_NXDOMAIN"|T], Results) ->
process_results(T, [list_to_binary(V)|Results]);
process_results([V="DOS_RCVD_IP_TWICE_C"|T], Results) ->
process_results(T, [list_to_binary(V)|Results]);
process_results([V="HELO_MISC_IP"|T], Results) ->
process_results(T, [list_to_binary(V)|Results]);
process_results([V="NO_FM_NAME_IP_HOSTN"|T], Results) ->
process_results(T, [list_to_binary(V)|Results]);
process_results([_|T], Results) ->
process_results(T, Results).
The final clause ignores unwanted data. The first clause of process_results/2 is invoked when the list of strings is empty, and it just returns the reversed accumulator. For the input data in the question, the final result of process_results/2 is:
{ok, [<<"BODY_SINGLE_WORD">>,<<"DKIM_ADSP_NXDOMAIN">>,<<"DOS_RCVD_IP_TWICE_C">>,<<"HELO_MISC_IP">>,<<"NO_FM_NAME_IP_HOSTN">>]}
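Both approaches above hard-code the five test names. If the set of tests is not known in advance, the list after "tests=" can be pulled out generically. The following is only a minimal sketch, not part of the original answer: the module name spam_tests and the function tests/1 are hypothetical, and it assumes that test names consist solely of uppercase letters, digits and underscores, so that the list ends at the first lowercase token such as "autolearn".

```erlang
-module(spam_tests).
-export([tests/1]).

%% Hypothetical helper: extract every test name listed after "tests=".
%% Assumption: names contain only A-Z, 0-9 and "_"; commas and
%% whitespace (including the folded "\r\n\t") separate them.
tests(Data) when is_binary(Data) ->
    case re:run(Data, "tests=([A-Z0-9_,\\s]+)",
                [{capture, all_but_first, binary}]) of
        {match, [Raw]} ->
            Parts = re:split(Raw, "[,\\s]+", [{return, binary}]),
            %% Drop the empty binary left by trailing whitespace.
            {ok, [P || P <- Parts, P =/= <<>>]};
        nomatch ->
            {error, not_found}
    end.
```

Called on the binary from the question, this should yield the same five names as the answer above, without listing them in the code. A value such as "tests=none" does not match the uppercase assumption and gives {error, not_found}.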

Related

Storing ejabberd packets in Riak

I'm trying to save offline ejabberd messages in Riak. I earlier had problems connecting to Riak but those are resolved now with the help of this forum. But now, with my limited Erlang / ejabberd understanding, I'm failing to get the ejabberd packet saved as a string and then put on Riak. Essentially, when the offline_message_hook is latched, I take the Packet argument and then want to put a backslash for each double quote, so that I can then take this revised string and save as a string value on Riak. However I seem to be struggling with modifying the incoming packet to replace the " chars with \".
Is this the right approach? Or am I missing something here in my design? My application relies on the xml format, so should I instead parse the packet using the p1_xml module and reconstruct the xml using the extracted data elements before storing it on Riak.
Apologies for the very basic and multiple questions but appreciate if someone can throw some light here!
The code that I use to try to replace the " with \" in the incoming packet is below (it doesn't quite work):
NewPacket = re:replace(Packet, "\"", "\\\"", [{return, list}, global]),
So essentially, I would be passing the NewPacket as a value to my Riak calls.
ejabberd supports Riak and already stores packets in it. For example, mod_offline does that.
You can look directly at the ejabberd code to see how it is done. For example, here is how mod_offline stores offline messages:
store_offline_msg(Host, {User, _}, Msgs, Len, MaxOfflineMsgs,
riak) ->
Count = if MaxOfflineMsgs =/= infinity ->
Len + count_offline_messages(User, Host);
true -> 0
end,
if
Count > MaxOfflineMsgs ->
discard_warn_sender(Msgs);
true ->
lists:foreach(
fun(#offline_msg{us = US,
timestamp = TS} = M) ->
ejabberd_riak:put(M, offline_msg_schema(),
[{i, TS}, {'2i', [{<<"us">>, US}]}])
end, Msgs)
end.
The code of ejabberd_riak:put/3 is:
put(Rec, RecSchema, IndexInfo) ->
Key = encode_key(proplists:get_value(i, IndexInfo, element(2, Rec))),
SecIdxs = [encode_index_key(K, V) ||
{K, V} <- proplists:get_value('2i', IndexInfo, [])],
Table = element(1, Rec),
Value = encode_record(Rec, RecSchema),
case put_raw(Table, Key, Value, SecIdxs) of
ok ->
ok;
{error, _} = Error ->
log_error(Error, put, [{record, Rec},
{index_info, IndexInfo}]),
Error
end.
put_raw(Table, Key, Value, Indexes) ->
Bucket = make_bucket(Table),
Obj = riakc_obj:new(Bucket, Key, Value, "application/x-erlang-term"),
Obj1 = if Indexes /= [] ->
MetaData = dict:store(<<"index">>, Indexes, dict:new()),
riakc_obj:update_metadata(Obj, MetaData);
true ->
Obj
end,
catch riakc_pb_socket:put(get_random_pid(), Obj1).
So ejabberd should already provide the API you need for storing packets in Riak.
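If escaping is still wanted, note that re:replace/4 gives the backslash a special meaning in the replacement string, which is easy to get wrong. A minimal sketch that avoids this uses binary:replace/4, whose replacement is taken literally. It assumes the packet has already been serialized to a binary (that step is outside this sketch), and the module and function names are hypothetical.

```erlang
-module(quote_escape).
-export([escape_quotes/1]).

%% Hypothetical helper: put a backslash in front of every double quote.
%% Assumption: Bin is the already serialized packet, as a binary.
%% binary:replace/4 treats the replacement literally, so <<"\\\"">>
%% is exactly the two bytes backslash and double quote.
escape_quotes(Bin) when is_binary(Bin) ->
    binary:replace(Bin, <<"\"">>, <<"\\\"">>, [global]).
```

Since the ejabberd code above stores Erlang terms directly, this kind of escaping is likely unnecessary when going through ejabberd_riak.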

confusion regarding erlang maps, lists and ascii

This code is an excerpt from this book.
count_characters(Str) ->
count_characters(Str, #{}).
count_characters([H|T], #{ H => N }=X) ->
count_characters(T, X#{ H := N+1 });
count_characters([H|T], X) ->
count_characters(T, X#{ H => 1 });
count_characters([], X) ->
X.
So,
1> count_characters("hello").
#{101=>1,104=>1,108=>2,111=>1}
What I understand from this is that count_characters() takes the argument "hello" and passes it to the first function, i.e. count_characters(Str).
What I don't understand is how the string characters are converted into ASCII values without using $, and how they get incremented. I am very new to Erlang, and would really appreciate it if you could help me understand the above. Thank you.
In Erlang the string literal "hello" is just a more convenient way of writing the list [104,101,108,108,111]. The string syntax is syntactic sugar; Erlang has no separate string type internally. A string is stored as a list of integers, one character code per element.
This also becomes confusing when printing lists where the values happen to be within the ascii range:
io:format("~p~n", [[65,66]]).
will print
"AB"
even if you didn't expect a string as a result.
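The equivalence and the printing behaviour can be checked with a small sketch. The module and function names are hypothetical; the only library call used is io_lib:format/2, where ~p pretty-prints a list of printable characters as a string and ~w always prints the raw term.

```erlang
-module(string_repr).
-export([demo/0]).

%% Hypothetical demo: a string literal is the same term as a list of
%% character codes, and only the formatting directive decides how
%% such a list is displayed.
demo() ->
    true = ("hello" =:= [104,101,108,108,111]),
    true = ("hello" =:= [$h,$e,$l,$l,$o]),
    Pretty = lists:flatten(io_lib:format("~p", [[65,66]])),
    Raw = lists:flatten(io_lib:format("~w", [[65,66]])),
    {Pretty, Raw}.
```

Here Pretty is the quoted text "AB" and Raw is the text [65,66], so ~w is the directive to use when the integers themselves should be shown.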
As said previously, there is no string data type in Erlang; a string is represented as a list of integers, so
"hello" == [$h,$e,$l,$l,$o] == [104|[101|[108|[108|[111|[]]]]]]
Each of these is a valid representation of the same integer list.
To count the characters, the function uses a newer Erlang data type: the map (available since R17).
A map is a collection of key/value pairs; in your case the keys are the characters, and the values are the number of occurrences of each character.
The function is called with an empty map: count_characters(Str, #{}).
Then it goes recursively through the list, and for each head H, two cases are possible:
The character H was already found. The current map X then matches the pattern #{ H => N }, telling us that H has already been found N times, so we continue the recursion with the rest of the list and a new map in which the value associated with H is now N+1: count_characters(T, X#{ H := N+1 }).
The character H is found for the first time. We then continue the recursion with the rest of the list and a new map to which the key/value pair H/1 is added: count_characters(T, X#{ H => 1 }).
When the end of the list is reached, the function simply returns the map: count_characters([], X) -> X.
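One caveat: to my knowledge the clause head count_characters([H|T], #{ H => N }=X) from the book is not accepted by released compilers, because map patterns require := and the key H is still unbound when the function head is matched. A minimal sketch of the same two-case logic that should compile on OTP 18 or later moves the map match into a case expression, where H is already bound. The module name is hypothetical.

```erlang
-module(char_count).
-export([count_characters/1]).

count_characters(Str) ->
    count_characters(Str, #{}).

%% H is bound by the list pattern, so it may be used as a map key
%% inside the case patterns below.
count_characters([H|T], Acc) ->
    case Acc of
        #{H := N} ->
            %% H seen N times before: update the existing key.
            count_characters(T, Acc#{H := N + 1});
        _ ->
            %% First occurrence of H: add the key with count 1.
            count_characters(T, Acc#{H => 1})
    end;
count_characters([], Acc) ->
    Acc.
```

The result for "hello" is a map with the same four keys as in the shell session above.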

How to provide value and get a Key back

So I have made 2 databases:
Db1 that contains: [{james,london}]
Db2 that contains: [{james,london},{fredrik,berlin},{fred,berlin}]
I have a match function that looks like this:
match(Element, Db) -> proplists:lookup_all(Element, Db).
When I do: match(berlin, Db2) I get: [ ]
What I am trying to get is a way to input the value and get back the keys in this way: [fredrik,fred]
According to the documentation, proplists:lookup_all/2 works the other way around:
Returns the list of all entries associated with Key in List.
So, you can lookup only by keys:
(kilter#127.0.0.1)1> Db = [{james,london},{fredrik,berlin},{fred,berlin}].
[{james,london},{fredrik,berlin},{fred,berlin}]
(kilter#127.0.0.1)2> proplists:lookup_all(berlin, Db).
[]
(kilter#127.0.0.1)3> proplists:lookup_all(fredrik, Db).
[{fredrik,berlin}]
You can use lists:filter and lists:map instead:
(kilter#127.0.0.1)7> lists:filter(fun ({K, V}) -> V =:= berlin end, Db).
[{fredrik,berlin},{fred,berlin}]
(kilter#127.0.0.1)8> lists:map(fun ({K,V}) -> K end, lists:filter(fun ({K, V}) -> V =:= berlin end, Db)).
[fredrik,fred]
So, finally
match(Element, Db) -> lists:map(
fun ({K,V}) -> K end,
lists:filter(fun ({K, V}) -> V =:= Element end, Db)
).
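The filter-then-map pipeline can also be written as a single list comprehension. This is only an equivalent sketch; the module name kv_lookup and the function name keys_for/2 are hypothetical.

```erlang
-module(kv_lookup).
-export([keys_for/2]).

%% Hypothetical helper: return every key whose value equals Value.
%% The generator pattern {K, V} silently skips elements that are
%% not 2-tuples; the filter keeps only exact matches.
keys_for(Value, Db) ->
    [K || {K, V} <- Db, V =:= Value].
```

For example, keys_for(berlin, Db2) with Db2 from the question gives [fredrik,fred], in the order the entries appear in the list.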
proplists:lookup_all/2 takes as a first argument a key; in your example, berlin is a value and it's not a key therefore an empty list is returned.
Naturally, you can use recursion and find all the elements (meaning that you will use it like an ordinary list and not a proplist).
Another solution is to change the encoding scheme:
[{london,james},{berlin,fredrik},{berlin,fred}]
and then use proplists:lookup_all/2
The correct way to encode it depends on the way you will access the data (what kind of "queries" you will perform most); but unless you manipulate large amounts of data (in which case you might want to use some other datastructure) it isn't really worth analyzing.

Querying mnesia fragmented tables using QLC returns wrong results

I am Josh in Uganda. I created an mnesia fragmented table (64 fragments) and managed to populate it with up to 9948723 records. Each fragment was of type disc_copies, with two replicas.
Using qlc (query list comprehension) was too slow when searching for a record, and it was returning inaccurate results.
I found out that the overhead comes from qlc using mnesia's select function, which traverses the entire table in order to match records. I tried something else, shown below.
-define(ACCESS_MOD,mnesia_frag).
-define(DEFAULT_CONTEXT,transaction).
-define(NULL,'_').
-record(address,{tel,zip_code,email}).
-record(person,{name,sex,age,address = #address{}}).
match()-> Z = fun(Spec) -> mnesia:match_object(Spec) end,Z.
match_object(Pattern)->
Match = match(),
mnesia:activity(?DEFAULT_CONTEXT,Match,[Pattern],?ACCESS_MOD).
Trying this functionality gave me good results. But I found that I have to dynamically build patterns for every search that may be made in my stored procedures.
I decided to go through the trouble of doing this, so I wrote functions which dynamically build wild patterns for my records depending on which parameter is to be searched.
%% This below gives me the default pattern for all searches ::= {person,'_','_','_','_'}
pattern(Record_name)->
N = length(my_record_info(Record_name)) + 1,
erlang:setelement(1,erlang:make_tuple(N,?NULL),Record_name).
%% this finds the position of the provided value and places it in that
%% position while keeping '_' in the other positions.
%% The caller function can use this function recursively until
%% it has built the full search pattern of interest
pattern({Field,Value},Pattern_sofar)->
N = position(Field,my_record_info(element(1,Pattern_sofar))),
case N of
-1 -> Pattern_sofar;
Int when Int >= 1 -> erlang:setelement(N + 1,Pattern_sofar,Value);
_ -> Pattern_sofar
end.
my_record_info(Record_name)->
case Record_name of
staff_dynamic -> record_info(fields,staff_dynamic);
person -> record_info(fields,person);
_ -> []
end.
%% These below,help locate the position of an element in a list
%% returned by "-record_info(fields,person)"
position(_,[]) -> -1;
position(Value,List)->
find(lists:member(Value,List),Value,List,1).
find(false,_,_,_) -> -1;
find(true,V,[V|_],N)-> N;
find(true,V,[_|X],N)->
find(V,X,N + 1).
find(V,[V|_],N)-> N;
find(V,[_|X],N) -> find(V,X,N + 1).
This was working very well, though it was computationally intensive.
It would still work after changing the record definition, since it picks up the new record info at compile time.
The problem is that when I start even 25 processes on a 3.0 GHz Pentium 4 processor running WinXP, it hangs and takes a long time to return results.
If I am to use qlc on these fragments and get accurate results, I have to specify which fragment to search in, like this:
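The pattern-building idea described above can be condensed into a fold over a list of {Field, Value} constraints. The following is only a sketch under simplifying assumptions: the module name, build/2 and index_of/2 are hypothetical, the person record is reduced to four plain fields, and unknown field names are silently ignored, as in the code above.

```erlang
-module(pattern_sketch).
-export([build/2]).

%% Simplified record for illustration (assumption: four plain fields).
-record(person, {name, sex, age, address}).

fields(person) -> record_info(fields, person);
fields(_) -> [].

%% Start from an all-wildcard tuple tagged with the record name, then
%% set each constrained field. Field N of the record is element N+1.
build(RecordName, Constraints) ->
    Fields = fields(RecordName),
    Wild = erlang:make_tuple(length(Fields) + 1, '_'),
    Base = setelement(1, Wild, RecordName),
    lists:foldl(
      fun({Field, Value}, Pattern) ->
              case index_of(Field, Fields) of
                  not_found -> Pattern;
                  I -> setelement(I + 1, Pattern, Value)
              end
      end, Base, Constraints).

%% 1-based position of X in a list, or not_found.
index_of(X, List) -> index_of(X, List, 1).

index_of(_, [], _) -> not_found;
index_of(X, [X|_], N) -> N;
index_of(X, [_|T], N) -> index_of(X, T, N + 1).
```

For instance, build(person, [{sex, male}, {age, 30}]) gives {person,'_',male,30,'_'}, which has the shape mnesia:match_object/1 expects for this record.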
find_person_by_tel(Tel)->
select(qlc:q([ X || X <- mnesia:table(Frag), (X#person.address)#address.tel == Tel])).
select(Q)->
case ?transact(fun() -> qlc:e(Q) end) of
{atomic,Val} -> Val;
{aborted,_} = Error -> report_mnesia_event(Error)
end.
QLC was returning [] when I searched for something, yet when I use match_object/1 I get accurate results. I found that using match expressions can help.
mnesia:table(Tab,Props).
where Props is a data structure that defines the match expression, the chunk size of the return values, etc.
I ran into a problem when I tried building match expressions dynamically.
The functions mnesia:read/1 and mnesia:read/2 require that you have the primary key.
Now I am asking myself: how can I efficiently use QLC to search for records in a large fragmented table? Please help.
I know that using the tuple representation of records makes code hard to upgrade. This is why
I dislike using mnesia:select/1 and mnesia:match_object/1 and want to stick to QLC. QLC is giving me wrong results in my queries on an mnesia table of 64 fragments, even on the same node.
Has anyone ever used QLC to query a fragmented table? Please help.
Do you invoke the qlc in the activity context?
tfn_match(Id) ->
Search = #person{address=#address{tel=Id, _ = '_'}, _ = '_'},
trans(fun() -> mnesia:match_object(Search) end).
tfn_qlc(Id) ->
Q = qlc:q([ X || X <- mnesia:table(person), (X#person.address)#address.tel == Id]),
trans(fun() -> qlc:e(Q) end).
trans(Fun) ->
try Res = mnesia:activity(transaction, Fun, mnesia_frag),
{atomic, Res}
catch exit:Error ->
{aborted, Error}
end.

Exception error in Erlang

So I've been using Erlang for the last eight hours, and I've spent two of those banging my head against the keyboard trying to figure out the exception error my console keeps returning.
I'm writing a dice program to learn erlang. I want it to be able to call from the console through the erlang interpreter. The program accepts a number of dice, and is supposed to generate a list of values. Each value is supposed to be between one and six.
I won't bore you with the dozens of individual micro-changes I made to try and fix the problem (random engineering) but I'll post my code and the error.
The Source:
-module(dice2).
-export([d6/1]).
d6(1) ->
random:uniform(6);
d6(Numdice) ->
Result = [],
d6(Numdice, [Result]).
d6(0, [Finalresult]) ->
{ok, [Finalresult]};
d6(Numdice, [Result]) ->
d6(Numdice - 1, [random:uniform(6) | Result]).
When I run the program from my console like so...
dice2:d6(1).
...I get a random number between one and six like expected.
However when I run the same function with any number higher than one as an argument I get the following exception...
**exception error: no function clause matching dice2:d6(1, [4|3])
... I know I I don't have a function with matching arguments but I don't know how to write a function with variable arguments, and a variable number of arguments.
I tried modifying the function in question like so....
d6(Numdice, [Result]) ->
Newresult = [random:uniform(6) | Result],
d6(Numdice - 1, Newresult).
... but I got essentially the same error. Anyone know what is going on here?
This is basically a type error. When Result is a list, [Result] is a list with one element. E.g., if your function worked, it would always return a list with one element: Finalresult.
This is what happens (using ==> for "reduces to"):
d6(2) ==> %% Result == []
d6(2, [[]]) ==> %% Result == [], let's say random:uniform(6) gives us 3
d6(1, [3]) ==> %% Result == 3, let's say random:uniform(6) gives us 4
d6(0, [4|3]) ==> %% fails, since [Result] can only match one-element lists
Presumably, you don't want [[]] in the first call, and you don't want Result to be 3 in the third call. So this should fix it:
d6(Numdice) -> Result = [], d6(Numdice, Result). %% or just d6(Numdice, []).
d6(0, Finalresult) -> {ok, Finalresult};
d6(Numdice, Result) -> d6(Numdice - 1, [random:uniform(6) | Result]).
Lesson: if a language is dynamically typed, this doesn't mean you can avoid getting the types correct. On the contrary, it means that the compiler won't help you in doing this as much as it could.
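For reference, here is a minimal sketch of the corrected accumulator version as a complete module. It is not the answerer's code: the module name dice3 is hypothetical, it uses rand:uniform/1 because the random module is deprecated in favour of rand in current OTP releases, and it drops the special d6(1) clause so that every call returns the same {ok, List} shape.

```erlang
-module(dice3).
-export([d6/1]).

%% Roll NumDice six-sided dice and return {ok, ListOfRolls}.
d6(NumDice) when is_integer(NumDice), NumDice >= 0 ->
    d6(NumDice, []).

%% The accumulator is a plain list, matched as a whole (Acc), never
%% wrapped in another list as [Acc].
d6(0, Acc) ->
    {ok, Acc};
d6(NumDice, Acc) ->
    d6(NumDice - 1, [rand:uniform(6) | Acc]).
```

Calling dice3:d6(3) returns {ok, Rolls} where Rolls is a list of three integers between 1 and 6.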

Resources