Normalize street addresses in Ruby - ruby-on-rails

I would like to detect house "numbers" (which are not always plain digits) in a string, and reformat the string according to the country's conventions.
Some countries put the number at the beginning of the string, others put it at the end.
Examples with current strings for Italy:
Via Treviso Mare 2 => need to detect 2
8C via Sergio Leone => need to detect 8C
Strada Provinciale 22 C => need to detect 22 C
19-20 Frazione Santa Maria => need to detect 19-20
9 - 11 via Giare => need to detect 9 - 11
Via Cesare Taiti 18-B => need to detect 18-B
What I want to obtain (put all numbers/groups at end in Italy):
Via Treviso Mare 2
via Sergio Leone 8C
Strada Provinciale 22 C
Frazione Santa Maria 19-20
via Giare 9 - 11
Via Cesare Taiti 18-B
This example is for Italy; in other countries it is the opposite, so I will handle two cases.
The problem is creating a regexp that matches all of these possibilities in my string:
2
8C
22 C
19-20
9 - 11
18-B
Thanks for your suggestions.

Here's one regex, tailored to your examples:
/\b\d[\d\- ]*[A-Z]?\b/
It means:
a word boundary, then a digit
followed by 0 or more digits, spaces or -
followed by 0 or 1 uppercase letter, then another word boundary.
It cannot possibly work for every Italian address, though. The formats differ so wildly that no regex could understand them all.
Note that the match might have trailing spaces (for example "19-20 " in "19-20 Frazione Santa Maria"). You can call strip on the match.
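To show how the regex could be used for the reordering itself, here is a minimal sketch. The helper name normalize_address and its position: option are made up for illustration, and it only handles the first number group found, so a street name that itself contains a number (something like "Via 4 Novembre 12") would be mis-detected.

```ruby
# Regex from the answer above: a digit, then digits/spaces/hyphens, then an
# optional uppercase letter, delimited by word boundaries.
HOUSE_NUMBER = /\b\d[\d\- ]*[A-Z]?\b/

# Hypothetical helper: moves the first house-number group to the end
# (position: :last, as in Italy) or to the start (position: :first).
def normalize_address(address, position: :last)
  match = address[HOUSE_NUMBER]
  return address.strip if match.nil?

  number = match.strip # the match may carry a trailing space
  street = address.sub(HOUSE_NUMBER, "").squeeze(" ").strip
  position == :first ? "#{number} #{street}" : "#{street} #{number}"
end

[
  "Via Treviso Mare 2",
  "8C via Sergio Leone",
  "Strada Provinciale 22 C",
  "19-20 Frazione Santa Maria",
  "9 - 11 via Giare",
  "Via Cesare Taiti 18-B"
].each { |address| puts normalize_address(address) }
```

For countries that put the number first, normalize_address(address, position: :first) covers the second case with the same regex.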

Related

Repetitive regular expression in Lua

I need to find a pattern of 6 pairs of hexadecimal numbers (without 0x), eg.
"00 5a 4f 23 aa 89"
This pattern works for me, but is there any way to simplify it?
[%da-f][%da-f]%s[%da-f][%da-f]%s[%da-f][%da-f]%s[%da-f][%da-f]%s[%da-f][%da-f]%s[%da-f][%da-f]
Lua patterns do not support limiting quantifiers and many more features that regular expressions support (hence, Lua patterns are not even regular expressions).
You can build the pattern dynamically since you know how many times you need to repeat a part of a pattern:
local text = '00 5a 4f 23 aa 89'
local answer = text:match('[%da-f][%da-f]'..('%s[%da-f][%da-f]'):rep(5) )
print (answer)
-- => 00 5a 4f 23 aa 89
See the Lua demo.
The '[%da-f][%da-f]'..('%s[%da-f][%da-f]'):rep(5) can be further shortened with %x hex char shorthand:
'%x%x'..('%s%x%x'):rep(5)
Lua supports %x for hexadecimal digits, so you can replace every [%da-f] with %x (note that %x also matches the uppercase digits A-F):
%x%x%s%x%x%s%x%x%s%x%x%s%x%x%s%x%x
Lua doesn't support specific quantifiers {n}. If it did, you could make it quite a lot shorter.
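For comparison only, this is roughly what the pattern looks like in a regex engine that does have counted quantifiers. The sketch uses Ruby rather than Lua (so it is not something you can paste into a Lua script), and the constant and method names are made up for illustration.

```ruby
# \h is Ruby's shorthand for one hexadecimal digit (comparable to Lua's %x).
# {2} and {5} are counted quantifiers, which Lua patterns lack.
HEX_SEXTET = /\h{2}(?:\s\h{2}){5}/

# Returns the first run of six space-separated hex pairs, or nil.
def find_hex_sextet(text)
  text[HEX_SEXTET]
end

puts find_hex_sextet("Your MAC is: 00 5a 4f 23 aa 89") # 00 5a 4f 23 aa 89
p find_hex_sextet("00 5a 4f 23 aa")                    # nil (only five pairs)
```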
Also, you can use "one or more" (the plus sign) to shorten it further, at the cost of strictness: %x+ accepts a run of hex digits of any length, not just pairs.
print(('Your MAC is: 00 5a 4f 23 aa 89'):match('%x+%s%x+%s%x+%s%x+%s%x+%s%x+'))
-- Tested in Lua 5.1 up to 5.4
It is described under "Pattern Item:" in...
https://www.lua.org/manual/5.4/manual.html#6.4.1
final solution:
local text = '00 5a 4f 23 aa 89'
local pattern = '%x%x'..('%s%x%x'):rep(5)
local answer = text:match(pattern)
print (answer)

how to use seq2seq to decode concatenated string

I am trying to decode (split) a concatenated string like the ones below:
SQCB7A750BATWE => SQ CB 7 A 750 B A T WE
PT05A1219PY023 => PT 05 A 12 19 P Y 023
PT55A1019PX02 => PT 55 A 10 19 P X 02
PT33SE2215SW023 => PT 33 SE 22 15 S W 023
PT05A2216PW023(LC) => PT 05 A 22 16 P W 023 (LC)
I am looking for a smarter way than hard-coded rules, since the input will have variations (in the number of characters and digits). I came across the seq2seq model and want to know whether it can be applied to this kind of problem.
I have already followed some tutorials to get a taste of it, but the results weren't even close.
There also seem to be two approaches, character level and word level, as per this tutorial.
Character level:
Input sentence: SQCACA333BA71A
Decoded sentence: P 9(PDD366AZ2IDD4K )F)F(L)L)1)1)1) 6A
-
Input sentence: SQCAAC152DA71A
Decoded sentence: P 9(PDD366AZ2IDD4K )F)F(L)L)1)1)1) 6A
I am still trying to implement the word-level version, but I'd like to know whether this problem can be solved with seq2seq at all.

How to use F# TypeProvider to read PowerBall csv?

The powerball schema and separators are not consistent which makes it an unusual file to read. (http://www.powerball.com/powerball/winnums-text.txt)
Sample:
Draw Date WB1 WB2 WB3 WB4 WB5 PB PP
09/24/2016 15 07 29 41 20 22 2
09/21/2016 63 67 01 69 28 17 4
09/17/2016 51 19 09 62 55 14 4
Any suggestions?
This looks like a "fixed column width" file rather than an ordinary CSV (meaning that the columns are not separated by any single character, but instead have fixed number of characters, with padding spaces).
There is some early work on supporting this in F# Data in the pull request here. We'd welcome any help getting this tested, but you'd need to get the source code and build F# Data from source (which is just a matter of running the build script, though!).
Alternatively, you could probably do some simple pre-processing on the file before reading it as an ordinary CSV file. Looking at the sample file, using a regular expression to replace 1 or more spaces with a comma would produce regular CSV that the CSV provider can consume.
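The pre-processing step could look something like the sketch below. It is written in Ruby purely to illustrate the idea (the same replacement can be done in F# before handing the text to the CSV provider). The sample rows are the ones from the question, so the real file may be padded differently. One thing to watch: the "Draw Date" header contains a space, so a blind replacement would split it into two columns. The sketch sidesteps that by substituting its own header row, whose name DrawDate is an assumption.

```ruby
# Sample rows copied from the question; the real file may pad columns differently.
SAMPLE = <<~TEXT
  Draw Date WB1 WB2 WB3 WB4 WB5 PB PP
  09/24/2016 15 07 29 41 20 22 2
  09/21/2016 63 67 01 69 28 17 4
  09/17/2016 51 19 09 62 55 14 4
TEXT

# Hypothetical header row; "DrawDate" replaces the two-word "Draw Date".
HEADERS = %w[DrawDate WB1 WB2 WB3 WB4 WB5 PB PP].freeze

# Drops the original header line and turns every run of whitespace in the
# data rows into a single comma.
def to_csv_text(text)
  rows = text.lines.drop(1).map(&:strip).reject(&:empty?)
  body = rows.map { |line| line.gsub(/\s+/, ",") }
  ([HEADERS.join(",")] + body).join("\n") + "\n"
end

puts to_csv_text(SAMPLE)
```

The resulting text has one comma-separated row per draw with a consistent number of fields, which is the shape an ordinary CSV reader expects.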

PostScript file - image instead of text

With a PostScript driver (Xerox, Canon, HP, any of them), when I create a PS file, for example by printing the test page from the printer properties, I get:
OK:
The rendered result looks correct (in GSview, for example).
Not OK:
The file is too big, more than 4 MB.
When I open the file in an editor, I see one big image (doNimage). I think this is the reason for the large file size.
The example file: https://drive.google.com/open?id=0B9bet657DEU5alV6WFZZdDFjMmc
I'm on Windows 10, and I have a similar problem with Windows Server 2012 R2.
I left the driver configuration at its defaults.
Does anyone have an idea?
Thanks a lot.
Regards.
I don't understand your problem; the file you posted a link to contains text. Here's an example:
360 4485 M <202530360E0F1102381030100D100B0824152D30103102020C302A1E19181B1E1730132E28301530132D3B02230B2A2E22081308>[46 16 28 70 18 42 44 44 54 32 28 32 36 32 25 39 65 40 40 28 32 44 44 44 18 28 53 45 20 47 38 45
40 28 34 40 40 28 40 28 34 40 18 44 44 25 53 40 16 39 34 0]xS
M is a moveto and xS uses the xshow operator to draw the glyphs represented by the character codes in the hexstring, using the values in the array to modify the width of each glyph.
If you were expecting to see ASCII character codes, you are going to be sadly disappointed: the file uses an incrementally downloaded subset TrueType font, so the character codes are defined as they are encountered. That is, the first glyph used is given character code 1, the second character code 2, and so on.
Even without that, using ASCII would limit the languages that could be supported. Back in the 1980s that maybe didn't seem like a problem, but it's a long time since that was considered acceptable.
If you were expecting to be able to modify the text by editing it in a text editor, forget it. PostScript is a programming language, and the output of a PostScript printer driver is a machine-generated program. It's a lengthy process even for a skilled user of the language to decipher what the program is doing. The program is not amenable to alteration; if there's a fault in the output, correct the original document and recreate the PostScript program from it.
PostScript is not an editable format.
Thanks, all, for your responses. I see I was not very clear in my question.
Here is the situation:
With the PS driver on Windows Server 2008, I get this file:
http://expirebox.com/download/0bb511565377e8b74eead67641fe7f68.html
Inside the file I can see the text "Page de test d\222imprimante".
On Windows Server 2012 R2:
http://expirebox.com/download/60fa957cba97c82bbcd5c0e975825b52.html
I can't see any text. It is also a printer test page.
I need to see the text because I'll print documents with codes inside, which a printer uses to identify the page type (for example, a white page for tray 1, a yellow page for tray 2).
KenS: I understand your point. But why does the same driver produce different files?
I checked whether it's really the same driver. The only difference I see is the OS: one is x86, the other x64.
Thanks.
Regards.

Why K-map has states in sequence of 00,01,11,10 instead of 00,01,10,11?

It's because in the first sequence, each entry differs in only one bit whereas in the second sequence the transition from 01 to 10 changes two bits which produces a race condition. In asynchronous logic, nothing ever happens at the same time, so 01 to 10 is either 01 00 10 or 01 11 10 and that causes problems.
During simplification, when two minterms that differ in only one bit are ORed, that variable is eliminated, because X + X' = 1. For example, A'B + AB = B(A' + A) = B. Keeping neighbouring cells one bit apart is what makes such pairs adjacent on the map.
This is because if we write 00 01 10 11, then between 01 and 10 there is a difference of two bits, and, as smparkes said, asynchronous logic cannot change two values at the same time, so 00 01 11 10 is the only ordering left. The K-map is numbered in Gray code: the Gray code of 00 is 00, of 01 is 01, of 10 is 11, and of 11 is 10.
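The one-bit-difference property is easy to check mechanically. The sketch below (helper names are made up for illustration) generates the Gray code with the standard n XOR (n >> 1) construction and compares neighbour distances against plain binary order, wrapping around from the last entry to the first as a K-map does.

```ruby
# Gray code sequence for the given number of bits: n XOR (n >> 1).
def gray_code(bits)
  (0...(1 << bits)).map { |n| n ^ (n >> 1) }
end

# Number of bit positions in which a and b differ.
def bit_distance(a, b)
  (a ^ b).to_s(2).count("1")
end

# Distance between each entry and the next, wrapping around to the first.
def neighbour_distances(codes)
  codes.each_index.map { |i| bit_distance(codes[i], codes[(i + 1) % codes.size]) }
end

# Zero-padded binary labels, as written along a K-map edge.
def labels(codes, bits)
  codes.map { |c| c.to_s(2).rjust(bits, "0") }
end

gray   = gray_code(2)
binary = (0...4).to_a

puts labels(gray, 2).join(" ")           # 00 01 11 10
puts neighbour_distances(gray).inspect   # [1, 1, 1, 1]
puts neighbour_distances(binary).inspect # [1, 2, 1, 2]
```

Every neighbour in the Gray ordering is one bit away, while plain binary order has two-bit jumps at 01 to 10 and at the wrap from 11 back to 00.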
