Let a binary string composed of messages separated by one null byte:
<message><null><message><null> ... <message><null>
I would like to split them. Easy, I do:
binary:split(Bin,<<0>>,[global]),
But ...
But one message is composed of two parts:
<length><texte>
length has a 4-bytes fixed size and the length can have null bytes !
Then the split function cannot cut correctly the string.
Does exist a way according to erlang state of art ?
If all messages have a 4 byte length header, I'd recommend using erlang:decode_packet(Type,Bin,Options) where Type is set to 4. This will return {ok, Message, Rest} where Message is your first message and Rest is the rest of the binary. Just rinse and repeate until you reach the end of the binary (you might have to take care of the null bytes yourself inbetween).
If, however, not all messages have a 4 byte length prefix and there's no deterministic way of detecting that header it is probably impossible to reliably parse such a list.
Related
Is there an erlang equivalent of codePointAt from js? One that gets the code point starting at a byte offset, without modifying the underlying string/binary?
You can use bit syntax pattern matching to skip the first N bytes and decode the first character from the remaining bytes as UTF-8:
1> CodePointAt = fun(Binary, Offset) ->
<<_:Offset/binary, Char/utf8, _/binary>> = Binary,
Char
end.
Test:
2> CodePointAt(<<"πr²"/utf8>>, 0).
960
3> CodePointAt(<<"πr²"/utf8>>, 1).
** exception error: no match of right hand side value <<207,128,114,194,178>>
4> CodePointAt(<<"πr²"/utf8>>, 2).
114
5> CodePointAt(<<"πr²"/utf8>>, 3).
178
6> CodePointAt(<<"πr²"/utf8>>, 4).
** exception error: no match of right hand side value <<207,128,114,194,178>>
7> CodePointAt(<<"πr²"/utf8>>, 5).
** exception error: no match of right hand side value <<207,128,114,194,178>>
As you can see, if the offset is not in a valid UTF-8 character boundary, the function will throw an error. You can handle that differently using a case expression if needed.
First, remember that only binary strings are using UTF-8 in Erlang. Plain double-quote strings are already just lists of code points (much like UTF-32). The unicode:chardata() type represents both of these kinds of strings, including mixed lists like ["Hello", $\s, [<<"Filip"/utf8>>, $!]]. You can use unicode:characters_to_list(Chardata) or unicode:characters_to_binary(Chardata) to get a flattened version to work with if needed.
Meanwhile, the JS codePointAt function works on UTF-16 encoded strings, which is what JavaScript uses. Note that the index in this case is not a byte position, but the index of the 16-bit units of the encoding. And UTF-16 is also a variable length encoding: code points that need more than 16 bits use a kind of escape sequence called "surrogate pairs" - for example emojis like 👍 - so if such characters can occur, the index is misleading: in "a👍z" (in JavaScript), the a is at 0, but the z is not at 2 but at 3.
What you want is probably what's called the "grapheme clusters" - those that look like a single thing when printed (see the docs for Erlang's string module: https://www.erlang.org/doc/man/string.html). And you can't really use numerical indexes to dig the grapheme clusters out from a string - you need to iterate over the string from the start, getting them out one at a time. This can be done with string:next_grapheme(Chardata) (see https://www.erlang.org/doc/man/string.html#next_grapheme-1) or if you for some reason really need to index them numerically, you could insert the individual cluster substrings in an array (see https://www.erlang.org/doc/man/array.html). For example: array:from_list(string:to_graphemes(Chardata)).
I want to use Antlr4 to parse a format that stores the length of segments in the serialised form
For example, to parse:
"6,Hello 5,World"
I tried to create a grammar like this
grammar myGrammar;
sequence:
(LEN ',' TEXT)*;
LEN: [0-9]+;
TEXT: // I dont know what to put in here but it should match LEN number of chars
Is this even possible with Antlr?
A real world example of this would be parsing the messagePack binary format which has several types that serialise the length of the data into the serialised form.
For example there is the str8:
str 8 stores a byte array whose length is upto (2^8)-1 bytes:
+--------+--------+========+
| 0xd9 |YYYYYYYY| data |
+--------+--------+========+
And str16 type
str16 stores a byte array whose length is upto (2^16)-1 bytes:
+--------+--------+--------+========+
| 0xda |ZZZZZZZZ|ZZZZZZZZ| data |
+--------+--------+--------+========+
In these examples the first byte identifies the type, then we have 1 byte for str8 and 2 bytes for str16 which contain the length of the data. Then finally there is the data.
I think a rule might look something like this but dont know how to match the right amount of data
str8 : '\u00d9' BYTE DATA ;
str16: '\u00da' BYTE BYTE DATA ;
BYTE : '\u0000'..'\u00FF' ;
DATA : ???
The data format you describe is usually called TLV (tag/type–length–value). TLV cannot be recognised with a regular expression (or even with a context-free grammar) so it's not usually supported by standard tokenisers.
Fortunately, it's easy to tokenise. Standard libraries may exist for particular formats, and some formats even have automated code generators for more efficient parsing. But you should be able to write a simple tokeniser for a particular format in a few lines of code.
Once you have writen the datastream tokeniser, you could use a parser generator like Antlr to build a datastructure from the parse, but it's rarely nevessary. Most TLV-encoded streams are simple sequences of components, although you occasionally run into formats (like Google protobufs or ASN.1) which include nested subsequences. Even with those, the parse is straight-forward (although for both of those examples, standard tools exist).
In any event, using context-free grammar tools like Antlr is rarely the simplest solution, because TLV formats are mostly order-independent. (If the order were fixed, the tags wouldn't be necessary.) Context-free grammars do not have any way of handling a language such as "at most one of A, B, C, D, and E in any order" other than enumerating the alternatives, of which there are an exponential number.
I have been looking around and have read a lot of different answers but none seems to answer my specific request.
I make watchfaces for Wear OS 2 with an app called ''WATCHMAKER'' witch uses LUA as language. I want to make a watch face with a special clock pointing to a number depending on a blood sugar value sent by an transmitter connected to the body.
The string values I want to parse follows this syntax:
<DECIMAL NUMBER> <ARROW> (<TIME>)
One example would be
5,6 -> (1m)
I want to extract the <DECIMAL NUMBER> part of the reading. In the above example, I want the value 5,6.
Every 5 minutes, the transmitter sends another reading, all of those informations change:
5,8 - (30 secondes)
Thank you so much
Say you have a string, in LUA, s="14,11 -> (something)" and you want this first number of the string to be parsed to a float so you can do maths on it.
s='9,6 -> (24m)'
-- Now we use so called regular expressions
-- to parse the string
new_s=string.match(s, '[0-9]+,[0-9]+')
-- news now has the number 9,6. Which is now parsed
-- however it's still a string and to be able to treat
-- it like a number, we have to do more:
-- But we have to switch the comma for a period
new_s=new_s:gsub(",",".")
-- Now s has "9.6" as string
-- now we convert it to a number
number = string.format('%.10g', tonumber(new_s))
print(number)
Now number contains the number 9.6.
I use an ODBC table handler to read data from Excel and CSV files into an AMPL model. But the thing I encountered probably doesn't have much to do with the precise programs and programming language I use.
Among the data are two specific types of strings: three-digit alphabetic and six-digit alphanumeric.
When the three-digit alphabetic type includes a NAN string, AMPL throws an error. As I found out, the reason is that it understands NAN as "NaN" (not a number). It cannot use this as an index.
The six-digit alphanumeric type sometimes include strings like 3E1234. This seems to be a problem because AMPL (or the handler) understands this as a number in scientific notation. So it reads 3*10^1234, which is handled as infinity. So when there is one 3E1234 entry and one 3E1235 entry, it sees them both as infinity.
I understand these two. And although they are annoying, I can work with that. Now I encountered that a string SK1234 is parsed as the number 1234. I have learned a bit of programming in college, but I don't have any idea why this happens. Is the prefix SK anything special?
EDIT: Here is an example that reproduces the error:
The model file:
set INDEX;
param value;
The "run" file:
table Table1 IN "tableproxy" "odbc" "DSN=NDE" "Test.csv": INDEX <- [Index], value ~ Value;
read table Table1;
NDE is a user DSN that uses the Microsoft Text Driver in the appropriate folder.
And the CSV file:
Index,Value
SK1202,1
SK1445,2
SK0124,3
SK7896,4
SK1,5
AB1234,6
After running all this code, I type display INDEX and get
set INDEX := 1202 1445 124 7896 1 Missing;
So the field Index is treated as a numeric field with the first five entries converted to a number. The last entry cannot be converted so it is treated as Missing.
The DSN's setting is that it sets the type according to the first 25 lines. For some reason, it understands the SK... entries as numbers and therefore reads all as numbers.
For the Text ODBC driver to detect column type correctly, values should be quoted:
Index,Value
'SK1202',1
'SK1445',2
'SK0124',3
'SK7896',4
'SK1',5
'AB1234',6
I read through similar stackoverflow questions to understand financial track card data.
I think the issue I am facing might be slightly different or maybe I am really weak in regex.
Now we have a service that returns track data accidentally instead of the guest name.
My goal is every time I receive track data I display "" empty string, else return the guest name.( This is a temp solution until we fix the root cause)
This is what my regular expressions is but looks like it doesn't detect track data.
irb(main):043:0> guestname="%4234242xx12^TEST/GUEST L ^324532635645744646462"
irb(main):044:0> (/[(%[bB])(;)]\d{3,}.{9,}[(^.+^)(=)].+\?.{,2}/.match(guestname)) ? "" : guestname
=> "%4234242xx12^TEST/GUEST L ^324532635645744646462"
(Not real data)
Now, looking at the wiki for track data information I want to cover most cases, if not all:
https://en.wikipedia.org/wiki/Magnetic_stripe_card#Financial_cards
Could some help with my regex. This is what I have:
/[(%[bB])(;)]\d{3,}.{9,}[(^.+^)(=)].+\?.{,2}/
Track 1, Format B:
Start sentinel — one character (generally '%')
Format code="B" — one character (alpha only)
Primary account number (PAN) — up to 19 characters. Usually, but not
always, matches the credit card number printed on the front of the
card.
Field Separator — one character (generally '^')
Name — 2 to 26 characters
Field Separator — one character (generally '^')
Expiration date — four characters in the form YYMM.
Service code — three characters
Discretionary data — may include Pin Verification Key Indicator (PVKI,
1 character), PIN Verification Value (PVV, 4 characters), Card
Verification Value or Card Verification Code (CVV or CVC, 3
characters)
End sentinel — one character (generally '?')
Longitudinal redundancy check (LRC) — it is one character and a
validity character calculated from other data on the track.
Track 2: This format was developed by the banking industry (ABA). This
track is written with a 5-bit scheme (4 data bits + 1 parity), which
allows for sixteen possible characters, which are the numbers 0-9,
plus the six characters : ; < = > ? . The selection of six
punctuation symbols may seem odd, but in fact the sixteen codes simply
map to the ASCII range 0x30 through 0x3f, which defines ten digit
characters plus those six symbols. The data format is as follows:
Start sentinel — one character (generally ';')
Primary account number (PAN) — up to 19 characters. Usually, but not
always, matches the credit card number printed on the front of the
card.
Separator — one char (generally '=')
Expiration date — four characters in the form YYMM.
Service code — three digits. The first digit specifies the interchange
rules, the second specifies authorisation processing and the third
specifies the range of services
Discretionary data — as in track one
End sentinel — one character (generally '?')
Longitudinal redundancy check (LRC) — it is one character and a
validity character calculated from other data on the track. Most
reader devices do not return this value when the card is swiped to the
presentation layer, and use it only to verify the input internally to
the reader.
Your example input string does not contain format code after first sentinel.
You are trying to parse html-encoded version, which is weird.
So, I would start with html decoding. E.g. with Nokogiri:
▶ guestname="%4234242xx12^TEST/GUEST L ^324532635645744646462"
#⇒ "%4234242xx12^TEST/GUEST L ^324532635645744646462"
▶ parsed = Nokogiri::HTML.parse(guestname).text
#⇒ "%4234242xx12^TEST/GUEST L ^324532635645744646462"
OK, now we at least have a leading percent. Now let us ask ourselves: how many users have a guest name starting with a percent sign? I bet none. You might re-check yourself by running a query against your database. Since it is a temporary solution, I would definitely shut the perfectionism up and go with:
▶ parsed =~ /\A%/ ? '' : parsed
Hope it helps.