Which characters are included in the Lua punctuation string pattern (%p)? - lua

I haven't been able to find documentation of which characters make up the punctuation set "%p" in Lua.

The answer is locale-dependent: %p is a direct interface to the C function ispunct.
In fact, whenever a C standard function does something similar to a Lua function, it is near-certain that the Lua function just wraps the C function, warts and all, even without looking at the specific case.
(This is part of the reason file:read() still has trouble reading text with embedded zeroes in 5.2, and may still have in 5.3.)
While Amaden gave a good answer for the "C" locale, and ColonelThirtyTwo gave the right way to check for the current locale, the C standard only says:
ispunct(): The ispunct function tests for any printing character that is one of a locale-specific set of punctuation characters for which neither isspace nor isalnum is true. In the "C" locale, ispunct returns true for every printing character for which neither isspace nor isalnum is true.

A small script to find them:
for i = 0, 255 do
  if string.match(string.char(i), "%p") then
    io.write(string.char(i))
  end
end
io.write("\n")
-- $ luajit test.lua
-- !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

%p is matched by the C function ispunct (C source v 5.2), which matches the following:
041 ‘‘!’’ 042 ‘‘"’’ 043 ‘‘#’’ 044 ‘‘$’’ 045 ‘‘%’’
046 ‘‘&’’ 047 ‘‘'’’ 050 ‘‘(’’ 051 ‘‘)’’ 052 ‘‘*’’
053 ‘‘+’’ 054 ‘‘,’’ 055 ‘‘-’’ 056 ‘‘.’’ 057 ‘‘/’’
072 ‘‘:’’ 073 ‘‘;’’ 074 ‘‘<’’ 075 ‘‘=’’ 076 ‘‘>’’
077 ‘‘?’’ 100 ‘‘@’’ 133 ‘‘[’’ 134 ‘‘\’’ 135 ‘‘]’’
136 ‘‘^’’ 137 ‘‘_’’ 140 ‘‘`’’ 173 ‘‘{’’ 174 ‘‘|’’
175 ‘‘}’’ 176 ‘‘~’’
(From man ispunct)
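As a cross-check, here is a rough sketch in Ruby (not Lua) that rebuilds the "C" locale set from the definition quoted from the C standard: every printing character that is neither whitespace nor alphanumeric. It assumes plain ASCII, where the printing characters are codes 0x20 to 0x7E; the constant names are made up for the example. Other locales may give a different set.

```ruby
# Printing characters in the "C" locale, assuming ASCII: codes 0x20..0x7E.
PRINTING = (0x20..0x7E).map(&:chr)

# Punctuation per the C definition: printing, not whitespace, not alphanumeric.
C_PUNCT = PRINTING.reject { |c| c == " " || c.match?(/[A-Za-z0-9]/) }

puts C_PUNCT.join   # !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
puts C_PUNCT.size   # 32
```

The 32 characters agree with the table from man ispunct above.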

Related

Repetitive regular expression in Lua

I need to find a pattern of 6 pairs of hexadecimal numbers (without 0x), eg.
"00 5a 4f 23 aa 89"
This pattern works for me, but the question is: is there any way to simplify it?
[%da-f][%da-f]%s[%da-f][%da-f]%s[%da-f][%da-f]%s[%da-f][%da-f]%s[%da-f][%da-f]%s[%da-f][%da-f]
Lua patterns do not support limiting quantifiers and many more features that regular expressions support (hence, Lua patterns are not even regular expressions).
You can build the pattern dynamically since you know how many times you need to repeat a part of a pattern:
local text = '00 5a 4f 23 aa 89'
local answer = text:match('[%da-f][%da-f]'..('%s[%da-f][%da-f]'):rep(5) )
print (answer)
-- => 00 5a 4f 23 aa 89
The '[%da-f][%da-f]'..('%s[%da-f][%da-f]'):rep(5) can be further shortened with %x hex char shorthand:
'%x%x'..('%s%x%x'):rep(5)
Lua supports %x for hexadecimal digits, so you can replace every [%da-f] with %x (note that %x also matches uppercase A-F, which [%da-f] does not):
%x%x%s%x%x%s%x%x%s%x%x%s%x%x%s%x%x
Lua doesn't support specific quantifiers {n}. If it did, you could make it quite a lot shorter.
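For comparison only, this is roughly what the pattern looks like in an engine that does have counted repetition. The sketch below uses Ruby's regex syntax, where \h matches one hexadecimal digit; the constant name MAC_RE is made up for the example. It is not something Lua patterns can express directly.

```ruby
# Exactly two hex digits, followed by five more "whitespace + two hex digits" groups.
MAC_RE = /\h{2}(?:\s\h{2}){5}/

text = "Your MAC is: 00 5a 4f 23 aa 89"
puts text[MAC_RE]   # prints: 00 5a 4f 23 aa 89
```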
You can also use the "one or more" quantifier (the plus sign) to shorten it further, at the cost of accepting groups of any length rather than exactly two digits:
print(('Your MAC is: 00 5a 4f 23 aa 89'):match('%x+%s%x+%s%x+%s%x+%s%x+%s%x+'))
-- Tested in Lua 5.1 up to 5.4
It is described under "Pattern Item:" in...
https://www.lua.org/manual/5.4/manual.html#6.4.1
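One thing to keep in mind with the "one or more" variant is that it no longer checks that each group has exactly two digits. A small sketch of the difference, written with Ruby regexes as a stand-in (\h is Ruby's hex-digit class; LOOSE and STRICT are names invented for the example):

```ruby
LOOSE  = /\A\h+(?:\s\h+){5}\z/     # one or more hex digits per group
STRICT = /\A\h\h(?:\s\h\h){5}\z/   # exactly two hex digits per group

samples = ["00 5a 4f 23 aa 89", "0 5a4 4f 23 aa 89"]
samples.each do |s|
  puts "#{s.inspect}: loose=#{LOOSE.match?(s)} strict=#{STRICT.match?(s)}"
end
```

The first sample passes both checks; the second passes only the loose one. Whether that matters depends on how clean the input is.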
Final solution:
local text = '00 5a 4f 23 aa 89'
local pattern = '%x%x'..('%s%x%x'):rep(5)
local answer = text:match(pattern)
print (answer)

how to use seq2seq to decode concatenated string

I am trying to decode a concatenated string like the ones below (input on the left, expected output on the right):
SQCB7A750BATWE SQ CB 7 A 750 B A T WE
PT05A1219PY023 PT 05 A 12 19 P Y 023
PT55A1019PX02 PT 55 A 10 19 P X 02
PT33SE2215SW023 PT 33 SE 22 15 S W 023
PT05A2216PW023(LC) PT 05 A 22 16 P W 023 (LC)
I am looking for a smarter way than hard-coded rules, since the input will vary (in the number of characters and digits). I came across the seq2seq model and would like to know whether it can be used for this kind of problem.
I have already followed some tutorials to get a taste of it, but the results weren't even close.
It also seems there are two approaches, character level and word level, as per this tutorial.
Character level:
Input sentence: SQCACA333BA71A
Decoded sentence: P 9(PDD366AZ2IDD4K )F)F(L)L)1)1)1) 6A
-
Input sentence: SQCAAC152DA71A
Decoded sentence: P 9(PDD366AZ2IDD4K )F)F(L)L)1)1)1) 6A
I am still trying to implement the word-level approach, but I'd like to know whether the problem can be solved with seq2seq at all.

Normalize street addresses in Ruby

I would like to detect house "numbers" (which are not always plain numbers) in a street string, and reformat the string depending on the country's conventions.
Some countries put the number at the beginning of the string, others put it at the end.
Examples with current strings for Italy:
Via Treviso Mare 2 => need to detect 2
8C via Sergio Leone => need to detect 8C
Strada Provinciale 22 C => need to detect 22 C
19-20 Frazione Santa Maria => need to detect 19-20
9 - 11 via Giare => need to detect 9 - 11
Via Cesare Taiti 18-B => need to detect 18-B
What I want to obtain (put all numbers/groups at end in Italy):
Via Treviso Mare 2
via Sergio Leone 8C
Strada Provinciale 22 C
Frazione Santa Maria 19-20
via Giare 9 - 11
Via Cesare Taiti 18-B
This is an example for Italy; for other countries it's the opposite, so I will handle 2 cases.
The problem is to create the regexp that matches all of these possibilities in my string:
2
8C
22 C
19-20
9 - 11
18-B
Thanks for your suggestions.
Here's one regex, tailored for your examples :
/\b\d[\d\- ]*[A-Z]?\b/
It means :
a digit
followed by 0 or more digits, spaces or -
followed by 0 or 1 letter.
It cannot possibly work for every Italian address, though. The formats differ so wildly that no regex could understand them all.
Note that the match might have trailing spaces. You can call strip on it.
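Putting it together, here is a minimal sketch of the "number goes last" case, assuming the regex above is good enough for your data. The names HOUSE_NUMBER and number_last are made up for the example, and the opposite convention (number first) would need a mirrored version.

```ruby
HOUSE_NUMBER = /\b\d[\d\- ]*[A-Z]?\b/

# Moves the detected house number to the end of the street string.
# Returns the input unchanged when no number is found.
def number_last(address)
  match = address[HOUSE_NUMBER]
  return address if match.nil?

  # Cut the number out, tidy the leftover spaces, then append the stripped number.
  street = address.sub(HOUSE_NUMBER, " ").squeeze(" ").strip
  "#{street} #{match.strip}"
end

puts number_last("8C via Sergio Leone")         # via Sergio Leone 8C
puts number_last("19-20 Frazione Santa Maria")  # Frazione Santa Maria 19-20
puts number_last("9 - 11 via Giare")            # via Giare 9 - 11
```

Strings that already end with the number, such as "Strada Provinciale 22 C", come back unchanged.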

Parsing complex files with Parsec

I would like to parse files with several sequences of data (same number of column, same content, ...) with Haskell.
My data sequences will be delimited by keywords before and after.
BEGIN
1 882
2 809
3 435
4 197
5 229
6 425
...
END
BEGIN
1 235 623 684
2 871 699 557
3 918 686 49
4 53 564 906
5 246 344 501
6 929 138 474
...
END
My problem is that after several tests with Parsec, I have the impression that Parsec is made to parse a file line by line rather than the whole file.
Is Parsec the right tool for what I want, or should I consider another tool like Happy or Alex?
Is there a website (or other resource) providing examples of parsing complex text files with Parsec?
Note: The example I give is a very simple one. Things would be trickier in my files, with many more keywords and combinations.
The format as you've described wouldn't be hard at all to handle in parsec.
As for learning how to use it: your first step should be to avoid whatever guide gave you the impression that parsec worked line-by-line. I recommend Chapter 16 of Real World Haskell as a good place to get started, and once you're comfortable with the basics the reference material at http://hackage.haskell.org/package/parsec is actually very clear.

Unexpected behavior of io:fread in Erlang

This is an Erlang question.
I have run into some unexpected behavior by io:fread.
I was wondering if someone could check whether there is something wrong with the way I use io:fread or whether there is a bug in io:fread.
I have a text file which contains a "triangle of numbers" as follows:
59
73 41
52 40 09
26 53 06 34
10 51 87 86 81
61 95 66 57 25 68
90 81 80 38 92 67 73
30 28 51 76 81 18 75 44
...
There is a single space between each pair of numbers and each line ends with a carriage-return new-line pair.
I use the following Erlang program to read this file into a list.
-module(euler67).
-author('Cayle Spandon').
-export([solve/0]).

solve() ->
    {ok, File} = file:open("triangle.txt", [read]),
    Data = read_file(File),
    ok = file:close(File),
    Data.

read_file(File) ->
    read_file(File, []).

read_file(File, Data) ->
    case io:fread(File, "", "~d") of
        {ok, [N]} ->
            read_file(File, [N | Data]);
        eof ->
            lists:reverse(Data)
    end.
The output of this program is:
(erlide@cayle-spandons-computer.local)30> euler67:solve().
[59,73,41,52,40,9,26,53,6,3410,51,87,86,8161,95,66,57,25,
6890,81,80,38,92,67,7330,28,51,76,81|...]
Note how the last number of the fourth line (34) and the first number of the fifth line (10) have been merged into a single number 3410.
When I dump the text file using "od" there is nothing special about those lines; they end with cr-nl just like any other line:
> od -t a triangle.txt
0000000 5 9 cr nl 7 3 sp 4 1 cr nl 5 2 sp 4 0
0000020 sp 0 9 cr nl 2 6 sp 5 3 sp 0 6 sp 3 4
0000040 cr nl 1 0 sp 5 1 sp 8 7 sp 8 6 sp 8 1
0000060 cr nl 6 1 sp 9 5 sp 6 6 sp 5 7 sp 2 5
0000100 sp 6 8 cr nl 9 0 sp 8 1 sp 8 0 sp 3 8
0000120 sp 9 2 sp 6 7 sp 7 3 cr nl 3 0 sp 2 8
0000140 sp 5 1 sp 7 6 sp 8 1 sp 1 8 sp 7 5 sp
0000160 4 4 cr nl 8 4 sp 1 4 sp 9 5 sp 8 7 sp
One interesting observation is that some of the numbers for which the problem occurs happen to be on a 16-byte boundary in the text file (but not all, for example 6890).
I'm going to go with it being a bug in Erlang, too, and a weird one. Changing the format string to "~2s" gives equally weird results:
["59","73","4","15","2","40","0","92","6","53","0","6","34",
"10","5","1","87","8","6","81","61","9","5","66","5","7",
"25","6",
[...]|...]
So it appears that it's counting a newline character as a regular character for the purposes of counting, but not when it comes to producing the output. Loopy as all hell.
A week of Erlang programming, and I'm already delving into the source. That might be a new record for me...
EDIT
A bit more investigation has confirmed for me that this is a bug. Calling one of the internal methods that's used in fread:
> io_lib_fread:fread([], "12 13\n14 15 16\n17 18 19 20\n", "~d").
{done,{ok,"\f"}," 1314 15 16\n17 18 19 20\n"}
Basically, if there are multiple values to be read, then a newline, the first newline gets eaten in the "still to be read" part of the string. Other testing suggests that if you prepend a space it's OK, and if you lead the string with a newline it asks for more.
I'm going to get to the bottom of this, gosh-darn-it... (grin) There's not that much code to go through, and not much of it deals specifically with newlines, so it shouldn't take too long to narrow it down and fix it.
EDIT^2
HA HA! Got the little blighter.
Here's the patch to the stdlib that you want (remember to recompile and drop the new beam file over the top of the old one):
--- ../erlang/erlang-12.b.3-dfsg/lib/stdlib/src/io_lib_fread.erl
+++ ./io_lib_fread.erl
@@ -35,9 +35,9 @@
fread_collect(MoreChars, [], Rest, RestFormat, N, Inputs).
fread_collect([$\r|More], Stack, Rest, RestFormat, N, Inputs) ->
- fread(RestFormat, Rest ++ reverse(Stack), N, Inputs, More);
+ fread(RestFormat, Rest ++ reverse(Stack), N, Inputs, [$\r|More]);
fread_collect([$\n|More], Stack, Rest, RestFormat, N, Inputs) ->
- fread(RestFormat, Rest ++ reverse(Stack), N, Inputs, More);
+ fread(RestFormat, Rest ++ reverse(Stack), N, Inputs, [$\n|More]);
fread_collect([C|More], Stack, Rest, RestFormat, N, Inputs) ->
fread_collect(More, [C|Stack], Rest, RestFormat, N, Inputs);
fread_collect([], Stack, Rest, RestFormat, N, Inputs) ->
@@ -55,8 +55,8 @@
eof ->
fread(RestFormat,eof,N,Inputs,eof);
_ ->
- %% Don't forget to count the newline.
- {more,{More,RestFormat,N+1,Inputs}}
+ %% Don't forget to strip and count the newline.
+ {more,{tl(More),RestFormat,N+1,Inputs}}
end;
Other -> %An error has occurred
{done,Other,More}
Now to submit my patch to erlang-patches, and reap the resulting fame and glory...
Besides the fact that it seems to be a bug in one of the Erlang libs, I think you could (very) easily circumvent the problem.
Given that your file is line-oriented, I think best practice is to process it line by line as well.
Consider the following construction. It works nicely on an unpatched Erlang and because it uses lazy evaluation it can handle files of arbitrary length without having to read all of it into memory first. The module contains an example of a function to apply to each line: turning a line of text representations of integers into a list of integers.
-module(liner).
-author("Harro Verkouter").
-export([liner/2, integerize/0, lazyfile/1]).

% Applies a function to all lines of the file
% before reducing (foldl).
liner(File, Fun) ->
    lists:foldl(fun(X, Acc) -> Acc ++ Fun(X) end, [], lazyfile(File)).

% Reads the lines of a file in a lazy fashion
lazyfile(File) ->
    {ok, Fd} = file:open(File, [read]),
    lazylines(Fd).

% Actually, this one does the lazy read ;)
lazylines(Fd) ->
    case io:get_line(Fd, "") of
        eof -> file:close(Fd), [];
        {error, Reason} ->
            file:close(Fd), exit(Reason);
        L ->
            [L | lazylines(Fd)]
    end.

% Take a line of space separated integers (string) and transform
% them into a list of integers. "\r" is included in the separators
% so that cr-nl line endings do not break list_to_integer/1.
integerize() ->
    fun(X) ->
        lists:map(fun(Y) -> list_to_integer(Y) end,
                  string:tokens(X, " \r\n"))
    end.
Example usage:
Eshell V5.6.5 (abort with ^G)
1> c(liner).
{ok,liner}
2> liner:liner("triangle.txt", liner:integerize()).
[59,73,41,52,40,9,26,53,6,34,10,51,87,86,81,61,95,66,57,25,
68,90,81,80,38,92,67,73,30|...]
And as a bonus, you can easily fold over the lines of any (lineoriented) file w/o running out of memory :)
6> lists:foldl( fun(X, Acc) ->
6> io:format("~.2w: ~s", [Acc,X]), Acc+1
6> end,
6> 1,
6> liner:lazyfile("triangle.txt")).
1: 59
2: 73 41
3: 52 40 09
4: 26 53 06 34
5: 10 51 87 86 81
6: 61 95 66 57 25 68
7: 90 81 80 38 92 67 73
8: 30 28 51 76 81 18 75 44
Cheers,
h.
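The same line-by-line idea, sketched in Ruby purely as an illustration of the approach (it is not a replacement for the Erlang module above): each line is split and converted on its own, so nothing can be fused across a line boundary. The method name integers_by_line is invented for the example, and a StringIO stands in for the real file.

```ruby
require "stringio"

# Parse each line of whitespace-separated integers independently.
# String#split with no argument also discards the trailing "\r\n".
def integers_by_line(io)
  io.each_line.map { |line| line.split.map { |tok| Integer(tok, 10) } }
end

sample = StringIO.new("59\r\n73 41\r\n52 40 09\r\n26 53 06 34\r\n10 51 87 86 81\r\n")
p integers_by_line(sample).flatten
# [59, 73, 41, 52, 40, 9, 26, 53, 6, 34, 10, 51, 87, 86, 81]
```

Base 10 is passed explicitly so that tokens with a leading zero, such as 09, are read as decimal.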
I noticed that there are multiple instances where two numbers are merged, and it appears to be at the line boundaries on every line starting at the fourth line and beyond.
I found that if you add a whitespace character to the beginning of every line starting at the fifth, that is:
59
73 41
52 40 09
26 53 06 34
 10 51 87 86 81
 61 95 66 57 25 68
 90 81 80 38 92 67 73
 30 28 51 76 81 18 75 44
...
The numbers get parsed properly:
39> euler67:solve().
[59,73,41,52,40,9,26,53,6,34,10,51,87,86,81,61,95,66,57,25,
68,90,81,80,38,92,67,73,30|...]
It also works if you add the whitespace to the beginning of the first four lines, as well.
It's more of a workaround than an actual solution, but it works. I'd like to figure out how to set up the format string for io:fread such that we wouldn't have to do this.
UPDATE
Here's a workaround that won't force you to change the file. It assumes that every number in the file is written with exactly two digits (so each value is < 100):
read_file(File, Data) ->
    case io:fread(File, "", "~d") of
        {ok, [N]} ->
            if
                N >= 100 ->
                    % Two two-digit numbers were fused across a newline.
                    First = N div 100,
                    Second = N rem 100,
                    % Data is built in reverse, so the later number goes first.
                    read_file(File, [Second, First | Data]);
                true ->
                    read_file(File, [N | Data])
            end;
        eof ->
            lists:reverse(Data)
    end.
Basically, the code catches any of the numbers which are the concatenation of two across a newline and splits them into two.
Again, it's a kludge that implies a possible bug in io:fread, but that should do it.
UPDATE AGAIN The above will only work for two-digit inputs, but since the example writes every number (even those < 10) in a two-digit format, that will work for this example.
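To make the arithmetic of the split concrete, here is a small Ruby sketch (the name split_merged is invented for the example). It uses >= 100 as the threshold so that a fused "01" and "00" is split as well, and it also shows where the trick cannot help.

```ruby
# Split a value that may be two two-digit numbers fused together (e.g. 3410).
def split_merged(n)
  n >= 100 ? n.divmod(100) : [n]
end

p split_merged(3410)  # [34, 10]
p split_merged(934)   # [9, 34]   ("09" fused with "34")
p split_merged(100)   # [1, 0]    ("01" fused with "00")
p split_merged(59)    # [59]
p split_merged(5)     # [5]       (a fused "00" and "05" would also read as 5)
```

The last case is the limit of the kludge: when the first of the two fused numbers is 00, the result is indistinguishable from a single number.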

Resources