Parse CSV different patterns - ruby-on-rails

The regular expression I am looking for has to be able to deal with different patterns.
These are the 3 different patterns:
"10.1234/altetric55,Awesome Steel Chair,1011-2513"
"\"Sporer, Kihn and Turner\",2885-6503"
"Bartell-Collins,1167-8230"
I will have to pass this regular expression to a ruby split method.
line.split(/regular_expression/)
The idea is to split the text wherever there is a comma, except (as in the second example) when the comma is part of the quoted text.
thanks

In this case, don't try to split on each comma that is not enclosed in quotes. Instead, match everything that is either not a comma or is quoted content, using this pattern:
"10.1234/altetric55,Awesome Steel Chair,1011-2513".scan(/[^,"]*(?:"[^"\\]*(?:\\.[^"\\]*)*"[^,"]*)*/)
or to avoid empty items:
"10.1234/altetric55,Awesome Steel Chair,1011-2513".scan(/[^,"]+(?:"[^"\\]*(?:\\.[^"\\]*)*"[^,"]*)*|(?:"[^"\\]*(?:\\.[^"\\]*)*")+/)
But you can avoid these complex questions using the CSV class:
require 'csv'
CSV.parse("\"Sporer, Kihn and Turner\",2885-6503")
=> [["Sporer, Kihn and Turner", "2885-6503"]]
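For completeness, here is a small sketch that runs all three sample lines from the question through the CSV class. It assumes each string holds exactly one record, so CSV.parse_line (which returns a single array of fields) is used instead of CSV.parse; the variable names lines and rows are only illustrative.

```ruby
require 'csv'

lines = [
  "10.1234/altetric55,Awesome Steel Chair,1011-2513",
  "\"Sporer, Kihn and Turner\",2885-6503",
  "Bartell-Collins,1167-8230"
]

# CSV.parse_line parses one line and returns one array of fields,
# keeping commas that sit inside double quotes.
rows = lines.map { |line| CSV.parse_line(line) }
rows.each { |row| p row }
```

Note that, unlike the regex approach, the surrounding quotes are removed from the quoted field.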

Here's another way, using recursion:
def split_it(str)
  outside_quotes = true
  pos = str.size.times.find do |i|
    case str[i]
    when '"'
      outside_quotes = !outside_quotes
      false
    when ','
      outside_quotes
    else false
    end
  end
  pos ? [str[0,pos], *split_it(str[pos+1..-1])] : [str]
end
["10.1234/altetric55,Awesome Steel Chair,1011-2513",
"\"Sporer, Kihn and Turner\",2885-6503\",,,3\"",
"Bartell-Collins,1167-8230"].map { |s| split_it(s) }
#=> [["10.1234/altetric55", "Awesome Steel Chair", "1011-2513"],
# ["\"Sporer, Kihn and Turner\"", "2885-6503\",,,3\""],
# ["Bartell-Collins", "1167-8230"]]
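If recursion is a concern for very long lines, the same quote-tracking idea can probably be written as a single pass. This is only a sketch under the same assumptions as split_it (a double quote simply toggles the "inside quotes" state, and the quotes are kept in the output); the method name split_outside_quotes is made up for this example.

```ruby
# Single-pass variant: walk the characters once and start a new field
# at every comma that is seen while outside double quotes.
def split_outside_quotes(str)
  fields = [''.dup]           # dup gives a mutable string to append to
  outside_quotes = true
  str.each_char do |char|
    if char == ',' && outside_quotes
      fields << ''.dup        # comma outside quotes: begin the next field
    else
      outside_quotes = !outside_quotes if char == '"'
      fields.last << char     # everything else belongs to the current field
    end
  end
  fields
end

p split_outside_quotes("\"Sporer, Kihn and Turner\",2885-6503")
```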

Related

Ruby - Find characters inside brackets

I have a string in Ruby. I need to iterate through the brackets in the string.
My String:
(((MILK AND SOYA) OR NUT) AND COCONUT)
First Iteration should return:
((MILK AND SOYA) OR NUT) AND COCONUT
Second Iteration should return the below:
(MILK AND SOYA) OR NUT
Third Iteration should return the following text:
MILK AND SOYA
How to do this in Ruby? Thanks in advance for the help.
Try this solution:
str = "(((MILK AND SOYA) OR NUT) AND COCONUT)"
while not str.nil?
puts str = str.match(/\((.*)\)/).to_a.last
end # =>
((MILK AND SOYA) OR NUT) AND COCONUT
(MILK AND SOYA) OR NUT
MILK AND SOYA
regex /\((.*)\)/ searches for string inside brackets
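If you would rather collect the intermediate strings than print them, the same greedy regex can be wrapped in a small helper. This is a sketch only; the method name peel_brackets is an assumption for illustration, and it relies on the outermost pair being the first "(" and the last ")" in the string.

```ruby
# Repeatedly strip the outermost pair of brackets and remember each result.
def peel_brackets(str)
  layers = []
  while (m = str.match(/\((.*)\)/))  # greedy: first "(" up to the last ")"
    str = m[1]
    layers << str
  end
  layers
end

peel_brackets("(((MILK AND SOYA) OR NUT) AND COCONUT)").each { |layer| puts layer }
```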
gsub and regex
@DmitryCat's solution works fine with your example, but it seems you might be interested in the innermost brackets first.
So you'll need to make sure the characters between brackets aren't brackets:
str = "(((MILK AND SOYA) OR NUT) AND COCONUT)"
while str.gsub!(/\(([^\(\)]+)\)/){ p $1; ''}
end
# "MILK AND SOYA"
# " OR NUT"
# " AND COCONUT"
With "((MILK AND SOYA) OR (MILK AND NUT))"
it outputs :
# "MILK AND SOYA"
# "MILK AND NUT"
# " OR "
Boolean logic to tree
With parser gem
Regexen probably aren't the right tool for this job.
This parser gem would have no problem analysing your expression :
require 'parser/current'
str = "(((MILK AND SOYA) OR NUT) AND COCONUT)"
p Parser::CurrentRuby.parse(str.gsub(/\b(and|or)\b/i){|s| s.downcase})
# s(:begin,
# s(:and,
# s(:begin,
# s(:or,
# s(:begin,
# s(:and,
# s(:const, nil, :MILK),
# s(:const, nil, :SOYA))),
# s(:const, nil, :NUT))),
# s(:const, nil, :COCONUT)))
You now have a tree: a root node with a children method, which you can call recursively to get any information about your expression.
With sxp gem
(Thanks to @Casper for this suggestion)
It looks like the sxp gem might also work, possibly with an easier syntax than parser:
require 'sxp'
p SXP.read("(((MILK AND SOYA) OR (NUT AND SOYA)) AND COCONUT)")
# [[[:MILK, :AND, :SOYA], :OR, [:NUT, :AND, :SOYA]], :AND, :COCONUT]
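If you don't want to pull in a gem, a nested array like the one above can probably be built with a few lines of plain Ruby. The sketch below is not how either gem works internally; parse_nested and read_expression are hypothetical names, and it assumes the input has balanced brackets and whitespace-separated words.

```ruby
# Consume tokens from the front of the array and build nested arrays:
# "(" opens a child list, ")" closes the current one, anything else is a symbol.
def parse_nested(tokens)
  node = []
  until tokens.empty?
    token = tokens.shift
    case token
    when '(' then node << parse_nested(tokens)
    when ')' then return node
    else node << token.to_sym
    end
  end
  node
end

def read_expression(str)
  # Tokens are single brackets or runs of non-space, non-bracket characters.
  parse_nested(str.scan(/[()]|[^\s()]+/)).first
end

p read_expression("(((MILK AND SOYA) OR (NUT AND SOYA)) AND COCONUT)")
```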
Sphinx
As mentioned by @Casper in the comments (thanks again!), you're trying to reinvent the wheel. If you need full text search for Rails with boolean expressions, Sphinx is a great tool. It's fast, good, reliable and there's an adapter for Ruby/Rails: thinkingsphinx.
Use index to find the first opening character '(' and rindex to find the last closing character ')':
s = "(((MILK AND SOYA) OR NUT) AND COCONUT)"
s = s[(s.index('(') + 1)...s.rindex(')')] unless s.index('(').nil? || s.rindex(')').nil?
Each call strips one outer pair of brackets. Just call it in a loop until s no longer contains brackets.
I hope it helps

Determining length of array of strings ignoring commas/periods (letters only)

I'm trying to find the best way to determine the letter count in an array of strings. I'm splitting the string, and then looping every word, then splitting letters and looping those letters.
When I get to the point where I determine the length, the problem I have is that it's counting commas and periods too. Thus, the length in terms of letters only is inaccurate.
I know this may be a lot shorter with regex, but I'm not well versed on that yet. My code is passing most tests, but I'm stuck where it counts commas.
E.g. 'You,' should be string.length = "3"
Sample code:
def abbr(str)
  new_words = []
  str.split.each do |word|
    new_word = []
    word.split("-").each do |w| # it has to be able to handle hyphenated words as well
      letters = w.split('')
      if letters.length >= 4
        first_letter = letters.shift
        last_letter = letters.pop
        new_word << "#{first_letter}#{letters.count}#{last_letter}"
      else
        new_word << w
      end
    end
    new_words << new_word.join('-')
  end
  new_words.join(' ')
end
I tried doing gsub before looping the words, but that wouldn't work because I don't want to completely remove the commas. I just don't need them to be counted.
Any enlightenment is appreciated.
arr = ["Now is the time for y'all Rubiests",
"to come to the aid of your bowling team."]
arr.join.size
#=> 74
Without a regex
def abbr(arr)
  str = arr.join
  str.size - str.delete([*('a'..'z')].join + [*('A'..'Z')].join).size
end
abbr arr
#=> 58
Here and below, arr.join converts the array to a single string.
With a regex
R = /
  [^a-z] # match a character that is not a lower-case letter
/ix # case-insensitive (i) and free-spacing regex definition (x) modes
def abbr(arr)
  arr.join.gsub(R,'').size
end
abbr arr
#=> 58
You could of course write:
arr.join.gsub(/[^a-z]/i,'').size
#=> 58
Try this:
def abbr(str)
  str.gsub(/\b\w+\b/) do |word|
    if word.length >= 4
      "#{word[0]}#{word.length - 2}#{word[-1]}"
    else
      word
    end
  end
end
The regex in the gsub call says "one or more word characters preceded and followed by a word boundary". The block passed to gsub operates on each word, the return from the block is the replacement for the 'word' match in gsub.
You can check, for each character, whether its ASCII value lies in 97-122 or 65-90. When this condition is fulfilled, increment a local variable; at the end it will hold the length of the string without any digits, special characters or white space.
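A minimal sketch of that ASCII check, assuming plain ASCII input; the method name letter_count is made up for this example, and Enumerable#count stands in for the manually incremented variable.

```ruby
# Count only characters whose code point is in A-Z (65-90) or a-z (97-122).
def letter_count(str)
  str.each_char.count do |char|
    code = char.ord
    (65..90).cover?(code) || (97..122).cover?(code)
  end
end

p letter_count('You,') # the comma is not counted
```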
You can use something like this (short version):
a.map { |x| x.chars.reject { |char| [' ', ',', '.'].include? char }.size }
Long version with explanation:
a = ['a, ', 'bbb', 'c c, .']     # Initial array of strings
excluded_chars = [' ', ',', '.'] # Chars you don't want to be counted
a.map do |str|                   # Iterate through all strings in array
  str.chars.reject do |char|     # Split each string to the array of chars
    excluded_chars.include? char # and reject excluded_chars from it
  end.size                       # This returns [["a"], ["b", "b", "b"], ["c", "c"]]
end                              # so, we take #size of each array to get size of the string
# Result: [1, 3, 2]

How to remove substring matching to any element of array

I have:
str ="this is the string "
and I have an array of strings:
array =["this is" ,"second element", "third element"]
I want to process the string so that any substring matching an element of the array is removed and the rest of the string is returned. I want the following output:
output: "the string "
How can I do this?
You don't say whether you want true substring matching, or substring matching at word-boundaries. There's a difference. Here's how to do it honoring word boundaries:
str = "this is the string "
array = ["this is" ,"second element", "third element"]
pattern = /\b(?:#{ Regexp.union(array).source })\b/ # => /\b(?:this\ is|second\ element|third\ element)\b/
str[pattern] # => "this is"
str.gsub(pattern, '').squeeze(' ').strip # => "the string"
Here's what's happening with union and union.source:
Regexp.union(array) # => /this\ is|second\ element|third\ element/
Regexp.union(array).source # => "this\\ is|second\\ element|third\\ element"
source returns the joined array in a form that can be more easily consumed by Regex when creating a pattern, without it injecting holes into the pattern. Consider these differences and what they could do in a pattern match:
/#{ Regexp.union(%w[a . b]) }/ # => /(?-mix:a|\.|b)/
/#{ Regexp.union(%w[a . b]).source }/ # => /a|\.|b/
The first creates a separate pattern, with its own flags for case, multiple-line and white-space honoring, that would be embedded inside the outer pattern. That can be a bug that's really hard to track down and fix, so only do it when you intend to have the sub-pattern.
Also, notice what happens if you try to use:
/#{ %w[a . b].join('|') }/ # => /a|.|b/
The resulting pattern has a wildcard . embedded in it, which would blow apart your pattern, causing it to match anything. Don't go there.
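To make the difference concrete, here is a small check you can run yourself. The variable names naive and escaped are only for illustration.

```ruby
words = %w[a . b]

naive   = /#{words.join('|')}/  # the dot stays a wildcard
escaped = Regexp.union(words)   # the dot is escaped to a literal "."

p 'x'.match?(naive)    # the wildcard matches an unrelated character
p 'x'.match?(escaped)  # only "a", "." or "b" can match here
p '.'.match?(escaped)
```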
If we don't tell the regex engine to honor word boundaries then unexpected/undesirable/terrible things can happen:
str = "this isn't the string "
array = ["this is" ,"second element", "third element"]
pattern = /(?:#{ Regexp.union(array).source })/ # => /(?:this\ is|second\ element|third\ element)/
str[pattern] # => "this is"
str.gsub(pattern, '').squeeze(' ').strip # => "n't the string"
It's important to think in terms of words, when working with substrings containing complete words. The engine doesn't know the difference, so you have to tell it what to do. This is a situation missed all too often by people who haven't had to do text processing.
Here is one way -
array =["this is" ,"second element", "third element"]
str = "this is the string "
str.gsub(Regexp.union(array),'') # => " the string "
To make the match case-insensitive - str.gsub(/#{Regexp.union(array).source}/i,'')
I saw two kinds of solutions, and at first I preferred Brad's. But the two approaches are so different that I expected a performance difference, so I created the file below and ran it.
require 'benchmark/ips'
str = 'this is the string '
array =['this is' ,'second element', 'third element']
def by_loop(str, array)
  array.inject(str) { |result , substring| result.gsub substring, '' }
end
def by_regex(str, array)
  str.gsub(Regexp.union(array),'')
end
def by_loop_large(str, array)
  array = array * 100
  by_loop(str, array)
end
def by_regex_large(str, array)
  array = array * 100
  by_regex(str, array)
end
Benchmark.ips do |x|
  x.report("loop") { by_loop(str, array) }
  x.report("regex") { by_regex(str, array) }
  x.report("loop large") { by_loop_large(str, array) }
  x.report("regex large") { by_regex_large(str, array) }
end
The result:
-------------------------------------------------
loop 16719.0 (±10.4%) i/s - 83888 in 5.073791s
regex 18701.5 (±4.2%) i/s - 94554 in 5.063600s
loop large 182.6 (±0.5%) i/s - 918 in 5.027865s
regex large 330.9 (±0.6%) i/s - 1680 in 5.076771s
The conclusion:
Arup's approach is much more efficient when the array grows large.
As for the Tin Man's concern about a single quote in the text, I think it's very important, but that would be the responsibility of the OP rather than of the current algorithms. And the two approaches produce the same result on that string.

Ruby regex punctuation

I am having trouble writing this so that it will take a sentence as an argument and perform the translation on each word without affecting the punctuation.
I'd also like to continue using the partition method.
It would be nice if I could have it keep a quote together as well, such as:
"I said this", I said.
would be:
"I aidsay histay", I said.
def convert_sentence_pig_latin(sentence)
  p split_sentence = sentence.split(/\W/)
  pig_latin_sentence = []
  split_sentence.each do |word|
    if word.match(/^[^aeiou]+/x)
      pig_latin_sentence << word.partition(/^[^aeiou]+/x)[2] + word.partition(/^[^aeiou]+/x)[1] + "ay"
    else
      pig_latin_sentence << word
    end
  end
  rejoined_pig_sentence = pig_latin_sentence.join(" ").downcase + "."
  p rejoined_pig_sentence.capitalize
end
convert_sentence_pig_latin("Mary had a little lamb.")
Your main problem is that [^aeiou] matches every character outside that range, including spaces, commas, quotation marks, etc.
If I were you, I'd use a positive match for consonants, ie. [b-df-hj-np-tv-z] I would also put that regex in a variable, so you're not having to repeat it three times.
Also, in case you're interested, there's a way to make your convert_sentence_pig_latin method a single gsub and it will do the whole sentence in one pass.
Update
...because you asked...
sentence.gsub( /\b([b-df-hj-np-tv-z])(\w+)/i ) { "#{$2}#{$1}ay" }
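Since the question asked to keep using partition, here is a hedged sketch that combines the two suggestions above: a positive consonant class kept in one constant, and a single gsub over the sentence. The names LEADING_CONSONANTS, pig_latin_word and pig_latin_sentence are made up for this example. It moves the whole leading consonant cluster (so "this" becomes "isthay", as the code in the question would), and it translates every word rather than only the quoted part.

```ruby
# Positive match for one or more leading consonants, defined once.
LEADING_CONSONANTS = /\A[b-df-hj-np-tv-z]+/i

def pig_latin_word(word)
  # partition returns [before, match, after]; with \A the "before" part is empty.
  _, consonants, rest = word.partition(LEADING_CONSONANTS)
  consonants.empty? ? word : "#{rest}#{consonants}ay"
end

def pig_latin_sentence(sentence)
  # Only runs of letters are replaced, so punctuation and spacing stay untouched.
  sentence.gsub(/[a-z]+/i) { |word| pig_latin_word(word) }
end

puts pig_latin_sentence('"I said this", I said.')
```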
# iterate over and replace regexp matches using gsub
def convert_sentence_pig_latin2(sentence)
  r = /^[^aeiou]+/i
  sentence.gsub(/"([^"]*)"/m) {|x| x.gsub(/\w+/) {|y| y =~ r ? "#{y.partition(r)[2]}#{y.partition(r)[1]}ay" : y}}
end
puts convert_sentence_pig_latin2('"I said this", I said.')
# define instance method: String#to_pl
class String
  R = Regexp.new '^[^aeiou]+', true # => /^[^aeiou]+/i
  def to_pl
    self.gsub(/"([^"]*)"/m) {|x| x.gsub(/\w+/) {|y| y =~ R ? "#{y.partition(R)[2]}#{y.partition(R)[1]}ay" : y}}
  end
end
puts '"I said this", I said.'.to_pl
sources:
http://www.ruby-doc.org/core-2.1.0/Regexp.html
http://ruby-doc.org/core-2.0/String.html#method-i-gsub

What's the fastest way to check if a word from one string is in another string?

I have a string of words; let's call them bad:
bad = "foo bar baz"
I can keep this string as a whitespace separated string, or as a list:
bad = bad.split(" ");
If I have another string, like so:
str = "This is my first foo string"
What's the fastest way to check if any word from the bad string is within my comparison string, and what's the fastest way to remove said word if it's found?
# Find if a word is there
bad.split(" ").each do |word|
  found = str.include?(word)
end
# Remove the word
bad.split(" ").each do |word|
  str.gsub!(/#{word}/, "")
end
If the list of bad words gets huge, a hash is a lot faster:
require 'benchmark'
bad = ('aaa'..'zzz').to_a # 17576 words
str= "What's the fasted way to check if any word from the bad string is within my "
str += "comparison string, and what's the fastest way to remove said word if it's "
str += "found"
str *= 10
badex = /\b(#{bad.join('|')})\b/i
bad_hash = {}
bad.each{|w| bad_hash[w] = true}
n = 10
Benchmark.bm(10) do |x|
  x.report('regex:') do
    n.times { str.gsub(badex,'').squeeze(' ') }
  end
  x.report('hash:') do
    n.times { str.gsub(/\b\w+\b/){|word| bad_hash[word] ? '': word}.squeeze(' ') }
  end
end
                 user     system      total        real
regex:      10.485000   0.000000  10.485000 ( 13.312500)
hash:        0.000000   0.000000   0.000000 (  0.000000)
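The same lookup idea can also be written with a Set, which may read a little more clearly than a hash of true values. This is a sketch, not a benchmarked replacement; BAD_WORDS and remove_bad_words are hypothetical names, and it assumes the bad words are stored in lower case so that the match can ignore case.

```ruby
require 'set'

# Illustrative word list; in practice this would be loaded once.
BAD_WORDS = Set.new(%w[foo bar baz])

def remove_bad_words(str, bad_words = BAD_WORDS)
  # Visit each word once; the Set lookup takes constant time on average.
  cleaned = str.gsub(/\b\w+\b/) { |word| bad_words.include?(word.downcase) ? '' : word }
  cleaned.squeeze(' ')
end

puts remove_bad_words("This is my first foo string")
```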
bad = "foo bar baz"
=> "foo bar baz"
str = "This is my first foo string"
=> "This is my first foo string"
(str.split(' ') - bad.split(' ')).join(' ')
=> "This is my first string"
All the solutions have problems with catching the bad words if the case does not match. The regex solution is easiest to fix by adding the ignore-case flag:
badex = /\b(#{bad.split.join('|')})\b/i
In addition, using "String".include?(" String ") will lead to boundary problems with the first and last words in the string or strings where the target words have punctuation or are hyphenated. Testing for those situations will result in a lot of other code being needed. Because of that I think the regex solution is the best one. It is not the fastest but it is going to be more flexible right out of the box, and, if the other algorithms are tweaked to handle case folding and compound-words the regex solution might pull ahead.
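A quick illustration of the boundary and case problems, assuming a single bad word "foo"; the variable names are only for this example.

```ruby
bad_word = 'foo'
pattern  = /\b#{Regexp.escape(bad_word)}\b/i  # whole word, any case

text = 'Seafood is not a bad word.'
p text.include?(bad_word)   # substring hit inside "Seafood"
p text.match?(pattern)      # no whole-word match

shouted = 'FOO!'
p shouted.include?(bad_word) # missed because of case
p shouted.match?(pattern)    # found despite case and punctuation
```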
#!/usr/bin/ruby
require 'benchmark'
bad = 'foo bar baz comparison'
badex = /\b(#{bad.split.join('|')})\b/i
str = "What's the fasted way to check if any word from the bad string is within my comparison string, and what's the fastest way to remove said word if it's found?" * 10
n = 10_000
Benchmark.bm(20) do |x|
  x.report('regex:') do
    n.times { str.gsub(badex,'').gsub('  ',' ') }
  end
  x.report('regex with squeeze:') do
    n.times{ str.gsub(badex,'').squeeze(' ') }
  end
  x.report('array subtraction') do
    n.times { (str.split(' ') - bad.split(' ')).join(' ') }
  end
end
I made the str variable a lot longer, to make the routines work a bit harder.
                           user     system      total        real
regex:                 0.740000   0.010000   0.750000 (  0.752846)
regex with squeeze:    0.570000   0.000000   0.570000 (  0.581304)
array subtraction      1.430000   0.010000   1.440000 (  1.449578)
Doh!, I'm too used to how other languages handle their benchmarks. Now I got it working and looking better!
Just a little comment about what it looks like the OP is trying to do: Black-listed word removal is easy to fool, and a pain to keep maintained. L33t-sp34k makes it trivial to sneak words through. Depending on the application, people will consider it a game to find ways to push offensive words past the filtering. The best solution I found when I was asked to work on this, was to create a generator that would create all the variations on a word and dump them into a database where some process could check as soon as possible, rather than in real time. A million small strings being checked can take a while if you are searching through a long list of offensive words; I'm sure we could come up with quite a list of things that someone would find offensive, but that's an exercise for a different day.
I haven't seen anything similar in Ruby to Perl's Regexp::Assemble, but that was a good way to go after this sort of problem. You can pass an array of words, plus options for case-folding and word-boundaries, and it will spit out a regex pattern that will match all the words, with their commonalities considered to result in the smallest pattern that will match all words in the list. The problem after that is locating which word in the original string matched the hits found by the pattern, so they can be removed. Differences in word case and hits within compound-words makes that replacement more interesting.
And we won't even go into words that are benign or offensive depending on the context.
I added a bit more comprehensive test for the array-subtraction benchmark, to fit how it would need to work in a real piece of code. The if clause is specified in the answer, this now reflects it:
#!/usr/bin/env ruby
require 'benchmark'
bad = 'foo bar baz comparison'
badex = /\b(#{bad.split.join('|')})\b/i
str = "What's the fasted way to check if any word from the bad string is within my comparison string, and what's the fastest way to remove said word if it's found?" * 10
str_split = str.split
bad_split = bad.split
n = 10_000
Benchmark.bm(20) do |x|
  x.report('regex') do
    n.times { str.gsub(badex,'').gsub('  ',' ') }
  end
  x.report('regex with squeeze') do
    n.times{ str.gsub(badex,'').squeeze(' ') }
  end
  x.report('bad.any?') do
    n.times {
      if (bad_split.any? { |bw| str.include?(bw) })
        (str_split - bad_split).join(' ')
      end
    }
  end
  x.report('array subtraction') do
    n.times { (str_split - bad_split).join(' ') }
  end
end
with two test runs:
ruby test.rb
                           user     system      total        real
regex                  1.000000   0.010000   1.010000 (  1.001093)
regex with squeeze     0.870000   0.000000   0.870000 (  0.873224)
bad.any?               1.760000   0.000000   1.760000 (  1.762195)
array subtraction      1.350000   0.000000   1.350000 (  1.346043)
ruby test.rb
                           user     system      total        real
regex                  1.000000   0.010000   1.010000 (  1.004365)
regex with squeeze     0.870000   0.000000   0.870000 (  0.868525)
bad.any?               1.770000   0.000000   1.770000 (  1.775567)
array subtraction      1.360000   0.000000   1.360000 (  1.359100)
I usually make a point of not optimizing without measurements, but here's a wag:
To make it fast, you should iterate through each string once. You want to avoid a loop with bad count * str count inner compares. So, you could build a big regexp and gsub with it.
(adding foo variants to test word boundary works)
str = "This is my first foo fooo ofoo string"
=> "This is my first foo fooo ofoo string"
badex = /\b(#{bad.split.join('|')})\b/
=> /\b(foo|bar|baz)\b/
str.gsub(badex,'').gsub('  ',' ')
=> "This is my first fooo ofoo string"
Of course the huge resulting regexp might be as slow as the implied nested iteration in my other answer. Only way to know is to measure.
bad = %w(foo bar baz)
str = "This is my first foo string"
# find the first word in the list
found = bad.find {|word| str.include?(word)}
# remove it
str[found] = '' ;# str => "This is my first string"
I'd benchmark this:
bad = "foo bar baz".split(' ')
str = "This is my first foo string".split(' ')
# 1. What's the fasted way to check if any word from the bad string is within my comparison string
p bad.any? { |bw| str.include?(bw) }
# 2. What's the fastest way to remove said word if it's found?
p (str - bad).join(' ')
any? stops checking as soon as it finds a match. If you can order your bad words by their probability, you can save some cycles.
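You can see the short-circuit for yourself by recording which bad words were actually tested. This is just a demonstration sketch; the checked array exists only to show the order of evaluation.

```ruby
bad   = %w[foo bar baz]
words = "This is my first foo string".split

checked = []
found = bad.any? do |bw|
  checked << bw          # remember every bad word that was tested
  words.include?(bw)
end

p found    # a match was found
p checked  # only the words tested before any? stopped
```

Here "foo" is first in the list and is present, so "bar" and "baz" are never tested.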
Here's one that will check for words and phrases.
def checkContent(str)
  bad = ["foo", "bar", "this place sucks", "or whatever"]
  # may be best to map and singularize everything as well.
  # maybe add some regex to catch those pesky, "How i make $69 dollars each second online..."
  # maybe apply some comparison stuff to check for weird characters in those pesky, "How i m4ke $69 $ollars an hour"
  bad_hash = {}
  bad_phrase_hash = {}
  bad.map(&:downcase).each do |word|
    words = word.split().map(&:downcase)
    if words.length > 1
      words.each do |inner|
        if bad_hash.key?(inner)
          if bad_hash[inner].is_a?(Hash) && !bad_hash[inner].key?(words.length)
            bad_hash[inner][words.length] = true
          elsif bad_hash[inner] === 1
            bad_hash[inner] = {1=>true,words.length => true}
          end
        else
          bad_hash[inner] = {words.length => true}
        end
      end
      bad_phrase_hash[word] = true
    else
      bad_hash[word] = 1
    end
  end
  string = str.split().map(&:downcase)
  string.each_with_index do |word,index|
    if bad_hash.key?(word)
      if bad_hash[word].is_a?(Hash)
        if bad_hash[word].key?(1)
          return false
        else
          bad_hash[word].keys.sort.each do |length|
            value = string[index...(index + length)].join(" ")
            if bad_phrase_hash.key?(value)
              return false
            end
          end
        end
      else
        return false
      end
    end
  end
  return true
end
The include? method is what you need. The Ruby String documentation says:
str.include?( string ) -> true or false
Returns true if str contains the given string or character.
"hello".include? "lo" -> true
"hello".include? "ol" -> false
"hello".include? ?h -> true
Note that include? runs in O(n), whereas what you proposed is O(n^2).
