Regex with ruby issue - ruby-on-rails

My regular expression (regex) knowledge is still a work in progress, and I'm having the following issue trying to extract some anchor text from a hash where the element is stored.
My hash looks like:
hash["example"]
=> " Project, Area 1"
My Ruby, which tries to extract "Project" and "Area 1":
hash["ITA Area"].scan(/<a href=\"(.*)\">(.*)<\/a>/)
Any help would be much appreciated as always.

Your groups are using greedy matching, so each one is going to grab as much as it can before, say, a < for the second group. Change the (.*) parts to (.*?) to use non-greedy (lazy) matching.
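A minimal sketch of the difference, assuming a string that holds two links. The sample HTML and the variable names are made up for illustration, since the question shows the hash value without its markup:

```ruby
# Hypothetical input: two anchors in one string.
html = '<a href="/projects/1">Project</a>, <a href="/areas/1">Area 1</a>'

# Greedy: each (.*) grabs as much as it can, so the whole string
# collapses into a single match spanning both links.
greedy = html.scan(/<a href="(.*)">(.*)<\/a>/)

# Non-greedy: each (.*?) stops at the first place where the rest of
# the pattern can match, giving one [href, text] pair per link.
lazy = html.scan(/<a href="(.*?)">(.*?)<\/a>/)

p greedy.length
p lazy
```

With the greedy pattern the first group swallows everything up to the last `">` in the string, which is why only the final anchor text comes back.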
There are loads of posts here on why you should not be using regex to parse html. There are many reasons why... such as, what if there is more than one space between the a and href, etc. It would be ideal to use a tool designed for parsing html.

You will have to escape the backslashes for the backslashes, so something like \\\\ instead of just \\. It sounds stupid, but I had a similar problem with it.

I'm not entirely sure what your issue is, but the regexp should match. Double quotes " need not be escaped. As mentioned in Dan Breen's answer, you need to use non-greedy matchers if the string is expected to contain more than one possible match.

The canonical SO reason to use a real HTML parser is calmly explained right here.
However, regexen can parse simple snippets without too much trouble.
Update: Aha, the anchor text. That's actually pretty easy:
> s.scan /([^<>]*)<\/a>/
=> [["Project"], ["Area 1"]]

Related

Lua: getting rid of part of a path (sub, gsub,gmatch?)

So i have this variable:
a = [[C:\aaa\aaa\aa\bbb\ccc\ddd]]
And i need to end up here:
a = [[ccc\ddd]]
Note that the path (the aaa, ccc and ddd folders) might differ from time to time, but the word "bbb" is always going to be there, and that's what I'd like to use to start chopping the text (from the end of the word, not from the beginning).
I've been reading some string tutorials and everything I tried just doesn't work (pretty new to scripting here). I think the "\" character messes things up.
What's the best way to deal with this? Thanks!
This is a good time to make use of patterns.
Information on that here: understanding lua patterns
With a pattern you could use string.match to flexibly capture the part of the string you want
a ="C:\\aaa\\aaa\\aa\\bbb\\ccc\\ddd"
print(string.match(a, "bbb\\(.*)"))

Does rails have an opposite of `parameterize` for strings?

I used parameterize method. I want to de-parameterize it. Is there a method to do the opposite of parameterize?
No, there is not. parameterize is a lossy conversion, you can't convert it back.
Here's an example. When you convert
My Awesome Pizza
into
my-awesome-pizza
you have no idea if the original string was
My Awesome Pizza
MY AWESOME PIZZA
etc. This is a simple example. However, as you can see from the source code, certain characters are stripped or converted into a separator (e.g. commas) and you will not be able to recover them.
If you just want an approximate conversion, then simply convert the dashes into spaces, trim multiple spaces and apply an appropriate case conversion.
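A rough sketch of both points in plain Ruby, without Rails. `simple_slug` is a simplified stand-in written for this example, not ActiveSupport's `parameterize`, and `approximate_deparameterize` is a hypothetical helper name:

```ruby
# Simplified stand-in for parameterize (NOT ActiveSupport's implementation):
# lowercase, turn every run of other characters into a dash, trim edge dashes.
def simple_slug(text)
  text.downcase.gsub(/[^a-z0-9]+/, '-').gsub(/\A-+|-+\z/, '')
end

# Approximate reverse: dashes to spaces, collapse repeated spaces,
# then capitalize each word.
def approximate_deparameterize(slug)
  slug.tr('-', ' ').squeeze(' ').strip.split(' ').map(&:capitalize).join(' ')
end

originals = ['My Awesome Pizza', 'MY AWESOME PIZZA', 'My, Awesome Pizza!']
slugs = originals.map { |s| simple_slug(s) }

p slugs.uniq                                # all three collapse to one slug
p approximate_deparameterize(slugs.first)   # only one of the originals comes back
```

Three different inputs end up as the same slug, so any reverse function can only guess at casing and punctuation.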
In Rails there is titleize (source):
"this-is-my-parameterized-string".titleize
=> "This Is My Parameterized String"
"hello-world foo bar".titleize
=> "Hello World Foo Bar"
As mentioned above, this isn't going to revert the string to its pre-parameterized form, but if that's not a concern, this might help!
I'm with Simone on this one but you can always go with
def deparametrize(str)
  str.split("-").join(" ").humanize
end
:)

Is there any advantage to _ever_ using a single quote around a string in Ruby/Rails?

I understand the functional difference between single and double quotes in Ruby, but I'm wondering what concrete reasons people have for varying between the two. In my mind it seems like you should just always use a double quote, and not think about it.
A couple rationales that I've read in researching the topic...
Use a single quote unless a double quote is required.
There's a very, very minor performance advantage to single quotes.
Any other interesting thoughts out there? (Or maybe this is a case of the freedom of Ruby leaving the door open for no One Right Way to do something...)
I usually follow the following rule:
never use double quotes (or %Q or %W) if you don't interpolate
The reason for this is that if you're trying to track down an error or a security bug, you immediately know when looking at the beginning of the string that there cannot possibly be any code inside it, therefore the bug cannot be in there.
However, I also follow the following exception to the rule:
use double quotes if they make the code more readable
I.e. I prefer
"It's time"
over
'It\'s time'
%q{It's time}
It is technically true that single quoted strings are infinitesimally faster to parse than double quoted strings, but that's irrelevant because
the program only gets parsed once, during startup, so there is no difference in runtime performance
the performance difference really is extremely small
the time taken to parse strings is irrelevant compared to the time taken to parse some of Ruby's crazier syntax
So, the answer to your question is: Yes, there is an advantage, namely that you can spot right away whether or not a string may contain code.
I can think of three reasons to use single quoted strings:
They look cleaner (the reason I use them)
They make it easier to create a string you'd otherwise have to escape ('he said "yes"' vs "he said \"yes\"")
They are slightly more performant.
I would assume using a single-quoted string is faster, since double quotes allow string interpolation, and single-quoted strings do not.
That's the only difference I know of. For that reason, it's probably best to use a single-quoted string unless you need string interpolation:
num = 59
"I ate #{num} pineapples"
Well, there is a lot of fuss about the "performance gain" of single-quoted strings vs. double-quoted strings.
The fact is that it doesn't really matter if you don't interpolate. There are a lot of benchmarks around the web that corroborate that assertion. (Some here at stackoverflow)
Personally, I use double for strings that have interpolation just for the sake of readability. I prefer to see the double quotes when I need them. But in fact there are methods in ruby for interpolating strings other than "double quoting" them:
%q{#{this} doesn't get interpolated}
%Q{#{this} is interpolated}
1.9.2-p290 :004 > x = 3
=> 3
1.9.2-p290 :005 > "#{x}"
=> "3"
1.9.2-p290 :006 > '#{x}'
=> "\#{x}"
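The same behaviour can be checked in a small script rather than in irb. This is only a sketch; the variable `this` is a hypothetical value chosen so the two literals above become runnable:

```ruby
this = 'value'  # hypothetical variable for the demonstration

# %q behaves like single quotes: the #{...} stays as literal text.
literal = %q{#{this} doesn't get interpolated}

# %Q behaves like double quotes: the #{...} is evaluated.
interpolated = %Q{#{this} is interpolated}

puts literal
puts interpolated
```

A side benefit of the %q and %Q forms is that neither kind of quote character inside the string needs escaping.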
In any other case, I prefer single quotes, because they are easier to type and make the code look less bloated to my eyes.
Since asking this question I've discovered this unofficial Ruby Style Guide that addresses this, and many many more styling questions I've had floating around in my head. I'd highly recommend checking it out.
I found that when putting variables in a string using #{} did not work in single quotes, but did work in double quotes as below.
comp_filnam and num (integer) are the variables I used to create the file name in the file path:
file_path_1 = "C:/CompanyData/Components/#{comp_filnam}#{num.to_s}.skp"

What is the proper Lua pattern for quoted text?

I've been playing with this for an hour or two and have found myself at a roadblock with the Lua pattern matching utilities. I am attempting to match all quoted text in a string and replace it if needed.
The pattern I have come up with so far is: (\?[\"\'])(.-)%1
This works in some cases but, not all cases:
Working: "This \"is a\" string of \"text to\" test with"
Not Working: "T\\\"his \"is\' a\" string\" of\' text\" to \"test\" wit\\\"h"
In the not working example I would like it to match to (I made a function that gets the matches I desire, I'm just looking for a pattern to use with gsub and curious if a lua pattern can do this):
string
a" string" of
is' a" string" of' text
test
his "is' a" string" of' text" to "test" wit
I'm going to continue to use my function instead for the time being, but am curious if there is a pattern I could/should be using and i'm just missing something with patterns.
(a few edits b/c I forgot about Stack Overflow's formatting)
(another edit to make a non-html example since it was leading to assumptions that I was attempting to parse html)
Trying to match escaped, quoted text using regular expressions is like trying to remove the daisies (and only the daisies) from a field using a lawnmower.
I made a function that gets the matches I desire
This is the correct move.
I'm curious if a lua pattern can do this
From a practical point of view, even if a pattern can do this, you don't want to. From a theoretical point of view, you are trying to find a double quote that is preceded by an even number of backslashes. This is definitely a regular language, and the regular expression you want would be something like the following (Lua quoting conventions)
[[[^\](\\)*"(.-[^\](\\)*)"]]
And the quoted string would be result #2. But Lua patterns are not full regular expressions; in particular, you cannot put a * after a parenthesized pattern.
So my guess is that this problem cannot be solved using Lua patterns, but since Lua patterns are not a standard thing in automata theory, I'm not aware of any body of proof technique that you could use to prove it.
The issue with escaped quotes is that, in general, if there's an odd number of backslashes before the quote, then it's escaped, and if there's an even number, it's not. I do not believe that Lua pattern-matching is powerful enough to represent this condition, so if you need to parse text like this, then you should seek another way. Perhaps you can iterate through the string and parse it, or you could find each quote in turn and read backwards, counting the backslashes until you find a non-backslash character (or the beginning of the string).
If you absolutely must use patterns for some reason, you could try doing this in a multi-step process. First, gsub for all occurrences of two backslashes in a row, and replace them with some sentinel value. This must be a value that does not already occur in the string. You could try something like "\001" if you know this string doesn't contain non-printable characters. Anyway, once you've replaced all sequences of two backslashes in a row, any backslashes left are escaping the following character. Now you can apply your original pattern, and then finally you can replace all instances of your sentinel value with two backslashes again.
Lua's pattern language is adequate for many simple cases. And it has at least one trick you don't find in a typical regular expression package: a way to match balanced parentheses. But it has its limits as well.
When those limits are exceeded, I reach for LPeg. LPeg is an implementation of a Parsing Expression Grammar for Lua, and was implemented by one of Lua's original authors, so the adaptation to Lua is done quite well. A PEG allows anything from simple patterns through complete language grammars to be specified. LPeg compiles the grammar to bytecode and executes it extremely efficiently.
You should NOT be trying to parse HTML with regular expressions. HTML and XML are NOT regular languages and cannot be successfully manipulated with regular expressions. You should use a dedicated HTML parser. Here are lots of explanations why.

better alternative in letters substitution

Is there any better alternative to this?
name.gsub('è','e').gsub('à','a').gsub('ò','o').gsub('ì','i').gsub('ù','u')
thanks
Use tr.
Maybe like string.tr('èàòìù', 'eaoiu').
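A quick sketch of the tr approach, assuming the source file and the string are both UTF-8 (the sample text is made up):

```ruby
# Hypothetical sample string containing several of the accented vowels.
name = 'città è più blu'

# tr maps each character in the first argument to the character
# at the same position in the second argument.
plain = name.tr('èàòìù', 'eaoiu')

puts plain
```

Because tr works position by position, both arguments need to list the characters in the same order.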
substitutes = {'è'=>'e', 'à'=>'a', 'ò'=>'o', 'ì'=>'i', 'ù'=>'u'}
substitutes.each do |old, new|
  name.gsub!(old, new)
end
Or you could use an extension of String such as this one to do it for you.
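The hash-based loop above can also be collapsed into a single pass, since String#gsub accepts a hash as its replacement argument. A minimal sketch, where the constant and method names are invented for this example:

```ruby
# Same substitution table as in the loop version.
SUBSTITUTES = { 'è' => 'e', 'à' => 'a', 'ò' => 'o', 'ì' => 'i', 'ù' => 'u' }.freeze

# One regex that matches any key of the table.
ACCENTED = Regexp.union(SUBSTITUTES.keys)

# Each matched character is looked up in the hash and replaced by its value.
def replace_accents(name)
  name.gsub(ACCENTED, SUBSTITUTES)
end

puts replace_accents('però così')
```

Characters that are not keys in the table never match the regex, so they pass through unchanged.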
If you really want a full solution, try pulling the tables from Perl's Unidecode module. After translating those tables to Ruby, you'll want to loop over each character of the input, substituting the table's value for that character.
Taking a wild stab in the dark, but if you're trying to remove the accented characters because you're using a legacy text encoding format you should look at Iconv.
An introduction which is great on the subject: http://blog.grayproductions.net/articles/encoding_conversion_with_iconv
In case you are wondering, the technical terms for what you want to do are Case Folding and possibly Unicode Normalization (and sometimes collation).
Here is a case folding configuration for ThinkingSphinx to give you an idea of how many characters you need to worry about.
If JRuby is an option, see the answer to my question:
How do I detect unicode characters in a Java string?
It deals with removing accents from letters, using a Normalizer. You could access that class from JRuby.
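Outside JRuby, a comparable normalize-then-strip approach can be sketched with String#unicode_normalize. This assumes Ruby 2.2 or later and UTF-8 strings; the method name `strip_combining_marks` is made up for this example:

```ruby
# Decompose each accented letter into base letter + combining mark (NFD),
# then delete the combining marks (Unicode category Mn).
def strip_combining_marks(text)
  text.unicode_normalize(:nfd).gsub(/\p{Mn}/, '')
end

puts strip_combining_marks('èàòìù perché')
```

Unlike a fixed substitution table, this handles any letter that decomposes into a base character plus marks, but it leaves characters without such a decomposition (for example ß or ø) untouched.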
