Regular expression in Ruby - extracting from Gutenberg - ruby-on-rails

I am fairly new to Ruby and I am struggling with a regular expression to seed a database from this text file: http://www.gutenberg.org/cache/epub/673/pg673.txt.
I want the <h1> tags as the words for the dictionary database, and the <def> tags as the definitions.
I could be quite off base here (I've only ever seeded a db with copy and paste ;):
require 'open-uri'
Dictionary.delete_all
g_text = open('http://www.gutenberg.org/cache/epub/673/pg673.txt')
y = g_text.read(/<h1>(.*?)<\/h1>/)
a = g_text.read(/<def>(.*?)<\/def>/)
Dictionary.create!(:word => y, :definition => a)
As you can see, there is often more than one <def> for each <h1>, which is fine, as I can just add columns to my table for definition1, definition2, etc.
But what would this regular expression look like to be sure that each definition is in the same row as the immediately preceding <h1> tag?
Thanks for any help!
Edit:
Okay, so this is what I am trying now:
doc.scan(Regexp.union(/<h1>(.*?)<\/h1>/, /<def>(.*?)<\/def>/)).map do |m, n|
p [m,n]
end
How do I get rid of all of the nil entries?
It seems like regular expression is the only way of making it through the whole document without stopping part way through when an error is encountered...at least after a couple attempts at other parsers.

Here is what I came up with (using a local extract for sandbox use):
require 'pp' # For SO to pretty print the hash at end
h1regex="h1>(.+)<\/h1" # Define the h1 regex (avoid empty tags)
defregex="def>(.+)<\/def" # Define the def regex (avoid empty tags)
# Initialize vars
defhash={}
key=nil
last=nil
open("./gut.txt") do |f|
  f.each_line do |l|
    newkey=l[/#{h1regex}/i,1] # get the next key (or nil)
    if (newkey != last && newkey != nil) then # if we changed key, update the hash (some redundant h1 entries with other defs)
      key = last = newkey # update current key
      defhash[key] = [] # init the new entry to empty array
    end
    definition=l[/#{defregex}/i,1] # get the def on this line (or nil)
    if key && definition then # skip any def that appears before the first h1
      defhash[key] << definition # we did match a def, add it to the current key array
    end
  end
end
pp defhash # print the result
Which gives this output:
{"A"=>
[" The first letter of the English and of many other alphabets. The capital A of the alphabets of Middle and Western Europe, as also the small letter (a), besides the forms in Italic, black letter, etc., are all descended from the old Latin A, which was borrowed from the Greek <spn>Alpha</spn>, of the same form; and this was made from the first letter (<i>Aleph</i>, and itself from the Egyptian origin. The <i>Aleph</i> was a consonant letter, with a guttural breath sound that was not an element of Greek articulation; and the Greeks took it to represent their vowel <i>Alpha</i> with the \\'84 sound, the Ph\\'d2nician alphabet having no vowel symbols.",
"The name of the sixth tone in the model major scale (that in C), or the first tone of the minor scale, which is named after it the scale in A minor. The second string of the violin is tuned to the A in the treble staff. -- A sharp (A#) is the name of a musical tone intermediate between A and B. -- A flat (A&flat;) is the name of a tone intermediate between A and G.",
"In each; to or for each; <as>as, \"twenty leagues <ex>a</ex> day\", \"a hundred pounds <ex>a</ex> year\", \"a dollar <ex>a</ex> yard\", etc.</as>",
"In; on; at; by.",
"In process of; in the act of; into; to; -- used with verbal substantives in <i>-ing</i> which begin with a consonant. This is a shortened form of the preposition <i>an</i> (which was used before the vowel sound); as in <i>a</i> hunting, <i>a</i> building, <i>a</i> begging. \"Jacob, when he was <i>a</i> dying\" <i>Heb. xi. 21</i>. \"We'll <i>a</i> birding together.\" \" It was <i>a</i> doing.\" <i>Shak.</i> \"He burst out <i>a</i> laughing.\" <i>Macaulay</i>. The hyphen may be used to connect <i>a</i> with the verbal substantive (as, <i>a</i>-hunting, <i>a</i>-building) or the words may be written separately. This form of expression is now for the most part obsolete, the <i>a</i> being omitted and the verbal substantive treated as a participle.",
"Of.",
" A barbarous corruption of <i>have</i>, of <i>he</i>, and sometimes of <i>it</i> and of <i>they</i>."],
"Abalone"=>
["A univalve mollusk of the genus <spn>Haliotis</spn>. The shell is lined with mother-of-pearl, and used for ornamental purposes; the sea-ear. Several large species are found on the coast of California, clinging closely to the rocks."],
"Aband"=>["To abandon.", "To banish; to expel."],
"Abandon"=>
["To cast or drive out; to banish; to expel; to reject.",
"To give up absolutely; to forsake entirely ; to renounce utterly; to relinquish all connection with or concern on; to desert, as a person to whom one owes allegiance or fidelity; to quit; to surrender.",
"Reflexively : To give (one's self) up without attempt at self-control ; to yield (one's self) unrestrainedly ; -- often in a bad sense.",
"To relinquish all claim to; -- used when an insured person gives up to underwriters all claim to the property covered by a policy, which may remain after loss or damage by a peril insured against."]}
Hope it can help.
Late edit: there's probably a better way; I'm not a Ruby expert. I was just giving the usual advice while reviewing, but since no one else seems to have answered, this is how I would do it.
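For the Regexp.union attempt in the question's edit, the nil entries come from the alternation: each match fills only one of the two capture groups. A minimal sketch of a single-pass variant, assuming the whole file fits in memory as one string (the method name group_definitions is made up for illustration), keeps the last seen <h1> and appends each <def> to it:

```ruby
# Hypothetical helper: groups every <def> under the most recent <h1>.
# Assumes the whole text is already loaded into a single string.
def group_definitions(text)
  entries = {}
  current = nil
  # Exactly one of the two captures is non-nil for each match.
  text.scan(/<h1>(.+?)<\/h1>|<def>(.+?)<\/def>/mi) do |word, definition|
    if word
      current = word
      entries[current] ||= [] # keep earlier defs if the headword repeats
    elsif current
      entries[current] << definition
    end
    # a <def> seen before any <h1> is skipped
  end
  entries
end

sample = "<h1>Aband</h1>\n<def>To abandon.</def>\n<def>To banish; to expel.</def>"
p group_definitions(sample)
```

Each key/value pair of the resulting hash could then become one row, with the array of definitions spread over the definition columns or stored in a separate table.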

Related

Rails 5 - set value on self based on matching fields with a Regex

In the app that I am building to learn Ruby and Rails, I have trouble getting below to work.
Desired result
when the content of the field self.extracted_data (here self is object Document) contains the bank account number (bank_account) of a business partner (BusinessPartner), the sender for the document (self.sender_id) needs to equal the BusinessPartner.
What I have so far:
BusinessPartner.active.each do |business_partner|
unless business_partner == self.receiver_id
if self.extracted_data =~ /\s#{Regexp.escape(business_partner.bank_account)}?\s/i # need to fix the RGX
self.sender = business_partner
self.name = "match: " + business_partner.id.to_s + /\s#{Regexp.escape(business_partner.bank_account)}?\s/i.to_s # to see RGX used
else
self.sender = nil
self.name = "NO match: " + business_partner.id.to_s + /\s#{Regexp.escape(business_partner.bank_account)}?\s/i.to_s # to see RGX used
end
end
end
It always gives me NO MATCH even where I do have 100% matching records for business partners. I have been studying the pickaxe book, the Rails docs, etc. for hours now and cannot find the solution. All help / advice welcome.
p.s. I could DRY the regex into a variable yet it is used multiple times only temporarily.
update
sample data for business partners
sample data for extracted_data
could include the bank-account...
enclosed in whitespace eg: ' NL15 INGB 0660 3125 06 '
enclosed in whitespace and a dot (.) eg: ' GB99 RBS1 0469 7788 99.'
enclosed in brackets () eg: (NL15 INGB 0660 3125 06)
although not allowed by the banks, could have special characters; typically dot (.) or dash (-)
or like so: ' 19.83.94.527 ' (very uncommon; no need to cater for it).
Note: bank account should adhere to IBAN formatting rules. These will be applied to the business_partner.bank_account field for data quality; yet what is in the extracted_data depends on what it extracted from the file (pdf) attached to the document record.
You may replace the \s whitespace patterns with word boundaries \b to avoid requiring whitespace around the pattern (word boundaries are zero-width assertions, and they only match locations in a string, so they are safe to use in the extraction scenario, similarly to lookarounds), and since there are whitespace symbols in the original string, you may just remove them with .gsub(/\s+/, '') for the sake of regex checking:
if self.extracted_data.gsub(/\s+/, '') =~ /\b#{Regexp.escape(business_partner.bank_account)}?\b/i
# changed parts: the added .gsub(/\s+/, '') and \b in place of each \s
See more about word boundaries on the Word Boundaries regular-expressions.info Web page.
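One thing to watch with the whitespace-stripping approach: once the spaces are gone, the account number may sit directly against neighbouring letters or digits, and \b no longer matches there. A minimal sketch of an alternative, assuming it is acceptable to compare only letters and digits (the method names normalize_account and mentions_account? are made up for illustration), normalizes both sides and uses a plain substring test:

```ruby
# Hypothetical helpers, not part of Rails.
# Reduce a string to upper-case letters and digits only.
def normalize_account(text)
  text.to_s.upcase.gsub(/[^A-Z0-9]/, '')
end

# True when the normalized account number occurs in the normalized text.
def mentions_account?(extracted_data, bank_account)
  needle = normalize_account(bank_account)
  return false if needle.empty? # a blank account must never match
  normalize_account(extracted_data).include?(needle)
end

p mentions_account?("Pay to (NL15 INGB 0660 3125 06) today", "NL15INGB0660312506")
```

The trade-off is that a substring test can also hit inside a longer run of letters and digits, so it fits best when the account numbers are long and distinctive, as IBANs are.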

ruby/rails detect financial track data and return nil/empty string

I read through similar stackoverflow questions to understand financial track card data.
I think the issue I am facing might be slightly different or maybe I am really weak in regex.
Now we have a service that returns track data accidentally instead of the guest name.
My goal is that every time I receive track data I display "" (an empty string); otherwise I return the guest name. (This is a temporary solution until we fix the root cause.)
This is my regular expression, but it looks like it doesn't detect track data.
irb(main):043:0> guestname="%4234242xx12^TEST/GUEST L ^324532635645744646462"
irb(main):044:0> (/[(%[bB])(;)]\d{3,}.{9,}[(^.+^)(=)].+\?.{,2}/.match(guestname)) ? "" : guestname
=> "%4234242xx12^TEST/GUEST L ^324532635645744646462"
(Not real data)
Now, looking at the wiki for track data information I want to cover most cases, if not all:
https://en.wikipedia.org/wiki/Magnetic_stripe_card#Financial_cards
Could someone help with my regex? This is what I have:
/[(%[bB])(;)]\d{3,}.{9,}[(^.+^)(=)].+\?.{,2}/
Track 1, Format B:
Start sentinel — one character (generally '%')
Format code="B" — one character (alpha only)
Primary account number (PAN) — up to 19 characters. Usually, but not
always, matches the credit card number printed on the front of the
card.
Field Separator — one character (generally '^')
Name — 2 to 26 characters
Field Separator — one character (generally '^')
Expiration date — four characters in the form YYMM.
Service code — three characters
Discretionary data — may include Pin Verification Key Indicator (PVKI,
1 character), PIN Verification Value (PVV, 4 characters), Card
Verification Value or Card Verification Code (CVV or CVC, 3
characters)
End sentinel — one character (generally '?')
Longitudinal redundancy check (LRC) — it is one character and a
validity character calculated from other data on the track.
Track 2: This format was developed by the banking industry (ABA). This
track is written with a 5-bit scheme (4 data bits + 1 parity), which
allows for sixteen possible characters, which are the numbers 0-9,
plus the six characters : ; < = > ? . The selection of six
punctuation symbols may seem odd, but in fact the sixteen codes simply
map to the ASCII range 0x30 through 0x3f, which defines ten digit
characters plus those six symbols. The data format is as follows:
Start sentinel — one character (generally ';')
Primary account number (PAN) — up to 19 characters. Usually, but not
always, matches the credit card number printed on the front of the
card.
Separator — one char (generally '=')
Expiration date — four characters in the form YYMM.
Service code — three digits. The first digit specifies the interchange
rules, the second specifies authorisation processing and the third
specifies the range of services
Discretionary data — as in track one
End sentinel — one character (generally '?')
Longitudinal redundancy check (LRC) — it is one character and a
validity character calculated from other data on the track. Most
reader devices do not return this value when the card is swiped to the
presentation layer, and use it only to verify the input internally to
the reader.
Your example input string does not contain a format code after the first sentinel.
You are trying to parse an HTML-encoded version, which is weird.
So, I would start with HTML decoding, e.g. with Nokogiri:
▶ guestname="%4234242xx12^TEST/GUEST L ^324532635645744646462"
#⇒ "%4234242xx12^TEST/GUEST L ^324532635645744646462"
▶ parsed = Nokogiri::HTML.parse(guestname).text
#⇒ "%4234242xx12^TEST/GUEST L ^324532635645744646462"
OK, now we at least have a leading percent. Now let us ask ourselves: how many users have a guest name starting with a percent sign? I bet none. You might re-check yourself by running a query against your database. Since it is a temporary solution, I would definitely shut the perfectionism up and go with:
▶ parsed =~ /\A%/ ? '' : parsed
Hope it helps.
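If a stricter check is wanted later, the field layout quoted in the question can be turned into two patterns. This is only a sketch under assumptions: the sentinels and separators are the "generally" used characters (%, ;, ^, =, ?), the PAN is digits only, and the names TRACK1, TRACK2, track_data? and safe_guest_name are made up for illustration. The sample strings are invented, not real card data.

```ruby
# Track 1, format B: % + format code + PAN (up to 19 digits) + ^ + name (2-26 chars)
# + ^ + YYMM + 3-digit service code + discretionary data + ? + optional LRC character.
TRACK1 = /%[A-Za-z]\d{1,19}\^[^\^]{2,26}\^\d{4}\d{3}[^?]*\?.?/

# Track 2: ; + PAN (up to 19 digits) + = + YYMM + 3-digit service code
# + discretionary data + ? + optional LRC character.
TRACK2 = /;\d{1,19}=\d{4}\d{3}[^?]*\?.?/

def track_data?(value)
  TRACK1.match?(value) || TRACK2.match?(value)
end

# Returns "" for anything that looks like track data, the value itself otherwise.
def safe_guest_name(value)
  track_data?(value) ? "" : value
end

p safe_guest_name("%B4111111111111111^DOE/JOHN^251210100000?")
p safe_guest_name("TEST/GUEST L")
```

Note that patterns this strict do not match the masked sample from the question, which has no format code, contains xx in place of digits, and has no end sentinel. That is one more reason the leading-percent check is the more practical stopgap.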

Strip out thousands delineator specific to the locale

I have a rails app in which users input numbers in large quantities. They often use the thousands delimiter (e.g. 1,000,000,000) to help keep their large numbers human-readable (I don't want to disallow delimiter because doing so would increase the chance of incorrect data).
ActiveSupport/Rails has the handy method number_with_delimiter so that an int 1234567 is displayed as 1,234,567. Is there a method to do the reverse?
note: I don't want to simply strip out a comma, since commas are used as a decimal point in many locales (e.g. European)
To answer your general question, you can determine the "delimiter" (thousands-separator) and the "separator" (decimal separator) from the Rails localization system directly:
I18n.t('number.format.separator') # <= '.' on a US English system
I18n.t('number.format.delimiter') # <= ',' on a US English system
So you can do this:
better = input_string.gsub(I18n.t('number.format.delimiter'), '')
Or, if you prefer to be more aggressive and remove all non-numerical and non-decimal input:
better = input_string.gsub(/[^\d#{I18n.t('number.format.separator')}]/, '')
Note, though, that the second example will also remove negative signs, if that matters to you.
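ActiveRecord is sometimes said to do this for you, but check it against your Rails version before relying on it, since plain Ruby's "1,234.50".to_f stops at the comma and returns 1.0:
my_model.update_attributes(some_float: "1,234.50") # <= check what some_float ends up as; 1234.5 is not guaranteed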
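Putting the two lookups together, a minimal sketch of the reverse of number_with_delimiter could look like this. It is written without Rails so it runs standalone: the delimiter and separator are plain parameters, which in an app would presumably be filled from the two I18n.t calls above, and the method name parse_localized_number is made up for illustration.

```ruby
# Hypothetical helper: converts a locale-formatted number string to a Float.
# delimiter: thousands separator to drop; separator: decimal mark to turn into "."
def parse_localized_number(input, delimiter: ',', separator: '.')
  normalized = input.strip.gsub(delimiter, '').sub(separator, '.')
  Float(normalized) # raises ArgumentError on anything that is still not a number
end

p parse_localized_number("1,234,567.89")
p parse_localized_number("1.234.567,89", delimiter: '.', separator: ',')
```

Float() is used instead of to_f on purpose: to_f silently returns a partial result for malformed input, while Float() raises, so bad user input can be reported back instead of stored.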

Titleize with roman numerals, dashes, apostrophes, etc. in Ruby on Rails

I'm simply trying to convert uppercased company names into proper names.
Company names can include:
Dashes
Apostrophes
Roman Numerals
Text like LLC, LP, INC which should stay uppercase.
I thought I might be able to use acronyms like this:
ACRONYMS = %W( LP III IV VI VII VIII IX GI)
ActiveSupport::Inflector.inflections(:en) do |inflect|
ACRONYMS.each { |a| inflect.acronym(a) }
end
However, the conversion does not take into account word breaks, so having VI and VII does not work. For example, the conversion of "ADVISORS".titleize is "Ad VI Sors", as the VI becomes a whole word.
Dashes get removed.
It seems like there should be a generic gem for this generic problem, but I didn't find one. Is this problem really not that common? What's the best solution besides completely hacking the current inflection library?
Company names are a little odd, since a lot of times they're Marks (as in Service Mark) more than proper names. That means precise capitalization might actually matter, and trying to titleize might not be worth it.
In any case, here's a pattern that might work. Build your list of tokens to "keep", then manually split the string up and titleize the non-token parts.
# Make sure you put long strings before short (VII before VI)
word_tokens = %w{VII VI IX XI}
# Special characters need to be separate, since they never appear as "part" of another word
special_tokens = %w{-}
# Builds a regex like /(\bVII\b|\bVI\b|\bIX\b|\bXI\b|-)/ that wraps "word tokens" in a word boundary check
token_regex = /(#{word_tokens.map{|t| /\b#{t}\b/}.join("|")}|#{special_tokens.join("|")})/
title = "ADVISORS-XI"
title.split(token_regex).map{|s| s =~ token_regex ? s : s.titleize}.join
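A variation on the same idea that does not need ActiveSupport: instead of splitting on the tokens, walk over each word with gsub and only capitalize the ones that are not on the keep list. This is a sketch, not the approach above; the KEEP list and the method name titleize_company are made up for illustration, and the list would need to be extended for real data.

```ruby
# Words that must stay upper-case (hypothetical list, extend as needed).
KEEP = %w[LLC LP INC II III IV VI VII VIII IX XI].freeze

def titleize_company(name)
  # A word is a run of letters, optionally continued after an apostrophe,
  # so dashes and spaces are left untouched between words.
  name.gsub(/[[:alpha:]]+(?:'[[:alpha:]]+)*/) do |word|
    KEEP.include?(word) ? word : word.capitalize
  end
end

p titleize_company("ADVISORS-XI")
p titleize_company("JOHN'S DINER VII LLC")
```

Because whole words are compared against the list, VI inside ADVISORS is never touched and the order of the list does not matter. Names such as O'REILLY come out as O'reilly with this rule and would need their own handling.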

Break strings into substrings based on delimiters, with empty substrings

I am using Lua to create a table within a table, and am running into an issue. I also need to populate the nil values that appear, but cannot seem to get it right.
String being manipulated:
PatID = '07-26-27~L73F11341687Per^^^SCI^SP~N7N558300000Acc^'
for word in PatID:gmatch("[^\~w]+") do table.insert(PatIDTable,word) end
local _, PatIDCount = string.gsub(PatID,"~","")
PatIDTableB = {}
for i=1, PatIDCount+1 do
PatIDTableB[i] = {}
end
for j=1, #PatIDTable do
for word in PatIDTable[j]:gmatch("[^\^]+") do
table.insert(PatIDTableB[j], word)
end
end
This currently produces this output:
table
[1]=table
[1]='07-26-27'
[2]=table
[1]='L73F11341687Per'
[2]='SCI'
[3]='SP'
[3]=table
[1]='N7N558300000Acc'
But I need it to produce:
table
[1]=table
[1]='07-26-27'
[2]=table
[1]='L73F11341687Per'
[2]=''
[3]=''
[4]='SCI'
[5]='SP'
[3]=table
[1]='N7N558300000Acc'
[2]=''
EDIT:
I think I may have done a bad job explaining what it is I am looking for. It is not necessarily that I want the carets to be considered "NIL" or "empty", but rather that they signify that a new string is to be started.
They are, I guess for lack of a better explanation, position identifiers.
So, for example:
L73F11341687Per^^^SCI^SP
actually translates to:
1. L73F11341687Per
2.
3.
4. SCI
5. SP
If I were to have
L73F11341687Per^12ABC^^SCI^SP
Then the positions are:
1. L73F11341687Per
2. 12ABC
3.
4. SCI
5. SP
And in turn, the table would be:
table
[1]=table
[1]='07-26-27'
[2]=table
[1]='L73F11341687Per'
[2]='12ABC'
[3]=''
[4]='SCI'
[5]='SP'
[3]=table
[1]='N7N558300000Acc'
[2]=''
Hopefully this sheds a little more light on what I'm trying to do.
Now that we've cleared up what the question is about, here's the issue.
Your gmatch pattern will return all of the matching substrings in the given string. However, your gmatch pattern uses "+". That means "one or more", which therefore cannot match an empty string. If it encounters a ^ character, it just skips it.
But, if you just tried :gmatch("[^\^]*"), which allows empty matches, the problem is that it would effectively turn every ^ character into an empty match. Which is not what you want.
What you want is to eat the ^ at the end of a substring. But, if you try :gmatch("([^\^]*)\^"), you'll find that it won't return the last string. That's because the last string doesn't end with ^, so it isn't a valid match.
The closest you can get with gmatch is this pattern: "([^\^]*)\^?". This has the downside of putting an empty string at the end. However, you can just remove that easily enough, since one will always be placed there.
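local s0 = '07-26-27~L73F11341687Per^^^SCI^SP~N7N558300000Acc^'
local tt = {}
-- append a trailing '~' so every record, including the last, ends with the delimiter
for s1 in (s0..'~'):gmatch'(.-)~' do
  local t = {}
  -- same trick for '^': empty fields come back as empty strings
  for s2 in (s1..'^'):gmatch'(.-)^' do
    table.insert(t, s2)
  end
  table.insert(tt, t)
end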
