Is there any better alternative to this?
name.gsub('è','e').gsub('à','a').gsub('ò','o').gsub('ì','i').gsub('ù','u')
thanks
Use tr. It replaces each character in the first argument with the character at the same position in the second, so something like:
name.tr('èàòìù', 'eaoiu')
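A quick sanity check (on Ruby 1.9+, where strings are encoding-aware):
"così è andò".tr('èàòìù', 'eaoiu')
# => "cosi e ando"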
substitutes = { 'è' => 'e', 'à' => 'a', 'ò' => 'o', 'ì' => 'i', 'ù' => 'u' }
substitutes.each do |from, to|
  name.gsub!(from, to)
end
Or you could use an extension of String such as this one to do it for you.
If you really want a full solution, try pulling the tables from Perl's Unidecode module. After translating those tables to Ruby, you'll want to loop over each character of the input, substituting the table's value for that character.
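A minimal sketch of that table-driven approach, with a toy table standing in for Unidecode's real ones (the names here are made up, and the real tables cover vastly more characters):
# Toy stand-in for Unidecode's transliteration tables - illustrative only.
TRANSLITERATIONS = {
  'è' => 'e', 'é' => 'e', 'à' => 'a', 'ò' => 'o', 'ì' => 'i', 'ù' => 'u',
  'ä' => 'a', 'ö' => 'o', 'ü' => 'u', 'ß' => 'ss'
}.freeze

# Substitute the table's value for each character, leaving unknown ones alone.
def transliterate(str)
  str.each_char.map { |c| TRANSLITERATIONS.fetch(c, c) }.join
end

transliterate('più così') # => "piu cosi"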
Taking a wild stab in the dark, but if you're trying to remove the accented characters because you're targeting a legacy text encoding, you should look at Iconv.
A great introduction to the subject: http://blog.grayproductions.net/articles/encoding_conversion_with_iconv
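For example (Iconv ships with Ruby 1.8/1.9; how well '//TRANSLIT' works depends on the platform's iconv and locale):
require 'iconv'

# '//TRANSLIT' asks iconv to approximate characters that don't exist in ASCII.
Iconv.conv('ASCII//TRANSLIT', 'UTF-8', 'àèìòù') # => "aeiou" on glibc systems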
In case you are wondering, the technical terms for what you want to do are case folding and possibly Unicode normalization (and sometimes collation).
Here is a case-folding configuration for Thinking Sphinx to give you an idea of how many characters you need to worry about.
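On newer Rubies (2.2+) the normalization step is in the standard library: decompose to NFD, then strip the combining marks. A sketch:
# NFD splits "è" into "e" plus a combining grave accent; \p{Mn} matches combining marks.
def strip_accents(str)
  str.unicode_normalize(:nfd).gsub(/\p{Mn}/, '')
end

strip_accents('àèìòù') # => "aeiou"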
If JRuby is an option, see the answer to my question:
How do I detect unicode characters in a Java string?
It deals with removing accents from letters, using a Normalizer. You could access that class from JRuby.
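A rough JRuby sketch of that idea (assuming java.text.Normalizer is what you end up calling):
require 'java'

# Decompose with Java's Normalizer, then strip the combining marks back in Ruby.
def strip_accents(str)
  decomposed = java.text.Normalizer.normalize(str, java.text.Normalizer::Form::NFD)
  decomposed.to_s.gsub(/\p{Mn}/, '')
end

strip_accents('àèìòù') # => "aeiou"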
I used the parameterize method. I want to de-parameterize the result. Is there a method that does the opposite of parameterize?
No, there is not. parameterize is a lossy conversion; you can't convert it back.
Here's an example. When you convert
My Awesome Pizza
into
my-awesome-pizza
you have no idea if the original string was
My Awesome Pizza
MY AWESOME PIZZA
etc. This is a simple example. However, as you can see from the source code, certain characters are stripped or converted into a separator (commas, for example), and you will not be able to recover them.
If you just want an approximate conversion, then simply convert the dashes into spaces, collapse multiple spaces, and apply an appropriate case conversion.
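In plain Ruby (no Rails required), that approximate reverse could look like:
# Dashes back to spaces, collapse whitespace runs, then capitalize each word.
"my-awesome-pizza".tr('-', ' ').split.map(&:capitalize).join(' ')
# => "My Awesome Pizza"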
In Rails there is titleize (source):
"this-is-my-parameterized-string".titleize
=> "This Is My Parameterized String"
"hello-world foo bar".titleize
=> "Hello World Foo Bar"
As mentioned above, this isn't going to revert the string to its pre-parameterized form, but if that's not a concern, this might help!
I'm with Simone on this one, but you can always go with

def deparametrize(str)
  # humanize capitalizes only the first word; use titleize to capitalize each word
  str.split("-").join(" ").humanize
end
:)
I need to match Unicode letters, similarly to PCRE's \p{L}.
Now, since Dart's RegExp class is based on ECMAScript's, it doesn't have the concept of \p{L}, sadly.
I'm looking into perhaps constructing a big character class that matches all Unicode letters, but I'm not sure where to start.
So, I want to match letters like:
foobar
מכון ראות
But the ® symbol shouldn't be matched:
BlackBerry®
Neither should any ASCII control characters or punctuation marks, etc. Essentially, every letter in every language Unicode supports, whether it's å, ä, φ or ת, should match if it is an actual letter.
I know this is an old question. But RegExp now supports unicode categories (since Dart 2.4) so you can do something like this:
RegExp alpha = RegExp(r'\p{Letter}', unicode: true);
print(alpha.hasMatch("f")); // true
print(alpha.hasMatch("ת")); // true
print(alpha.hasMatch("®")); // false
I don't think that complete information about classification of Unicode characters as letters or non-letters is anywhere in the Dart libraries. You might be able to put something together that would mostly work using things in the Intl library, particularly Bidi. I'm thinking that, for example,
isLetter(oneCharacterString) => Bidi.endsWithLtr(oneCharacterString) || Bidi.endsWithRtl(oneCharacterString);
might do a plausible job. At least it seems to have a number of ranges for valid characters in there. Or you could put together your own RegExp based on the information in _LTR_CHARS and _RTL_CHARS. It explicitly says it's not 100% accurate, but good for most practical purposes.
Looks like you're going to have to iterate through the runes in the string and then check each integer value against a table of Unicode ranges.
Go has some code to generate these tables directly from the Unicode source. See maketables.go and some of the other files in the Go unicode package.
Or take the lazy option, and file a Dart bug, and wait for the Dart team to implement it ;)
There's no support for this yet in Dart or JS.
The XRegExp JS library has support for generating fairly large character-class regexps for something like this. You may be able to generate the regexp, print it, and paste it into your app.
Is there a library where I can simply call a method on a string to find out if it is non-English? I'm trying to save only English strings, and the incoming stream has plenty of non-English ones in it.
You can try to use linguo.
"your string".lang
# will return "en" for english strings
Disclaimer: I'm the creator of this gem.
You can use the Google Translate API with the Rails bridge for it: http://code.google.com/apis/gdata/articles/gdata_on_rails.html
Not that I'm aware of... but you could load this word list into an array (http://www.langmaker.com/wordlist/basiclex.htm) and then match the string's words against it. Decide on some percentage as a pass, and go from there.
You could even use a Bayesian classifier here to mark those words as "good" and learn from there, but that might be overkill.
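A rough sketch of the word-list idea (basiclex.txt is a hypothetical local copy of that list, one word per line):
require 'set'

# Hypothetical local copy of the word list, one word per line.
ENGLISH_WORDS = File.readlines('basiclex.txt').map { |w| w.strip.downcase }.to_set

# Fraction of the string's words found in the list; tune the threshold to taste.
def english?(str, threshold = 0.5)
  words = str.downcase.scan(/[a-z']+/)
  return false if words.empty?
  words.count { |w| ENGLISH_WORDS.include?(w) }.fdiv(words.size) >= threshold
end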
I was looking for some good options for fuzzy comparison in Rails.
Essentially, I have a set of strings that I'd like to compare against some strings in my database, and I'd like to get the closest one if applicable. In this particular case, I'm not so interested in detecting letters out of order or misspellings, but rather in the ability to ignore extraneous words (extra information, punctuation, words like "the", "and", "it", etc.) and pick out the best match. These strings will usually be somewhere between 2 and 7 words long.
What would you suggest is the best gem/method of doing that? I've looked at amatch (http://flori.github.com/amatch/doc/index.html) but I was wondering what else was out there.
Thanks!
Have a look and a play with Thinking Sphinx: http://freelancing-god.github.com/ts/en/
I can heartily recommend it.
There is also a superb Railscast on how to use it: http://railscasts.com/episodes/120-thinking-sphinx
Otherwise, use Arel - but you are going to have to implement the fuzzy logic yourself (not something I'd recommend).
Have a look at the FuzzyMatch gem.
It may help you.
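If you'd rather not pull in a gem at all, a crude token-overlap score that ignores a small stop-word list goes surprisingly far for 2-7 word strings. A sketch (the stop-word list and method names here are my own):
require 'set'

STOP_WORDS = %w[the and it a an of to in].to_set

# Lowercase words, minus stop words, as a set.
def tokens(str)
  str.downcase.scan(/[[:alnum:]]+/).reject { |w| STOP_WORDS.include?(w) }.to_set
end

# Jaccard similarity on the remaining tokens: 1.0 means identical word sets.
def similarity(a, b)
  ta, tb = tokens(a), tokens(b)
  return 0.0 if ta.empty? || tb.empty?
  (ta & tb).size.fdiv((ta | tb).size)
end

# Usage (candidates and query are placeholders for your own data):
# best = candidates.max_by { |c| similarity(query, c) }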
My regular expression (regex) is still a work in progress, and I'm having the following issue trying to extract some anchor text from an HTML string stored in a hash.
My hash looks like:
hash["example"]
=> " Project, Area 1"
The Ruby I'm using to try to extract "Project" and "Area 1":
hash["example"].scan(/<a href=\"(.*)\">(.*)<\/a>/)
Any help would be much appreciated as always.
Your groups are using greedy matching, so each one grabs as much as it can before, say, the final < for the second group. Change the (.*) parts to (.*?) to use lazy (non-greedy) matching.
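To see the difference (the hrefs here are made up):
s = '<a href="/p">Project</a>, <a href="/a">Area 1</a>'

s.scan(/<a href="(.*)">(.*)<\/a>/)
# greedy: one giant match running to the last "> and </a>
# => [["/p\">Project</a>, <a href=\"/a", "Area 1"]]

s.scan(/<a href="(.*?)">(.*?)<\/a>/)
# lazy: each group stops as early as possible
# => [["/p", "Project"], ["/a", "Area 1"]]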
There are loads of posts here on why you should not use regex to parse HTML, for many reasons - such as: what if there is more than one space between the a and the href? It would be ideal to use a tool designed for parsing HTML.
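With Nokogiri, for instance, pulling out the anchor text is a one-liner:
require 'nokogiri'

html = '<a href="/p">Project</a>, <a href="/a">Area 1</a>'
Nokogiri::HTML.fragment(html).css('a').map(&:text)
# => ["Project", "Area 1"]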
You will have to escape the backslashes, so something like \\\\ instead of \\. It sounds stupid, but I had a similar problem with it.
I'm not entirely sure what your issue is, but the regexp should match. Double quotes " need not be escaped. As mentioned in Dan Breen's answer, you need to use non-greedy matchers if the string is expected to contain more than one possible match.
The canonical SO reason to use a real HTML parser is calmly explained right here.
However, regexen can parse simple snippets without too much trouble.
Update: Aha, the anchor text. That's actually pretty easy:
> s.scan /([^<>]*)<\/a>/
=> [["Project"], ["Area 1"]]