Determine if a string is English - ruby-on-rails

Is there a library where I can simply call a method on a string to find out if it is non-English? I'm trying to save only English strings, and the incoming stream contains plenty of non-English ones.

You can try the linguo gem:
"your string".lang
# => "en" for English strings
Disclaimer: I'm the creator of this gem.

You can use the Google Translate API via the GData on Rails bridge: http://code.google.com/apis/gdata/articles/gdata_on_rails.html

Not that I'm aware of... but you could get this word list into an array (http://www.langmaker.com/wordlist/basiclex.htm) and then match the string's words against it. Decide on some percentage as good enough, and go from there.
You could even use a Bayesian classifier here to mark those words as "good" and learn from there, but that might be overkill.
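A rough sketch of that word-list approach; the file name and the 60% threshold are made up, so tune both against your own data:
require 'set'

# Load the word list once; assumes one lowercase word per line.
ENGLISH_WORDS = Set.new(File.readlines('basic_english_wordlist.txt').map { |w| w.strip.downcase })

def probably_english?(string, threshold = 0.6)
  words = string.downcase.scan(/[a-z']+/)
  return false if words.empty?
  hits = words.count { |w| ENGLISH_WORDS.include?(w) }
  hits.to_f / words.size >= threshold
end

probably_english?("the quick brown fox jumps") # => true if most words are in the list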

Related

Elixir/Erlang - Split paragraph into sentences based on the language

In Java there is a class called BreakIterator which lets me pass in a paragraph of text in any language (the language it is written in is known) and splits the text into separate sentences. The magic is that it takes the locale of the language the text is written in as an argument and splits the text according to that language's rules (if you look into it, this is actually a very complex problem even in English - it is certainly not a case of 'split on full stops/periods').
Does anybody know how I would do this in Elixir? I can't find anything in a Google search.
I am almost at the point of deploying a very thin public API that does only this one task so that I can call into it from Elixir - but that is really not desirable.
Any help would be really appreciated.
The Erlang i18n library should be usable for this. Going just from the examples provided, since I have no experience using it, something like the following should work (:en is the locale code):
str = :i18n_string.from("some string")      # wrap the raw string
iter = :i18n_iterator.open(:en, :sentence)  # sentence iterator for the :en locale
sentences = :i18n_string.split(iter, str)   # list of sentences
There's also Cldr, which implements a lot of locale-dependent Unicode algorithms directly in Elixir, but it doesn't seem to include iteration in particular at the moment (you may want to raise an issue there).

What should I name my variables in a translation project?

I'm abstracting strings from views and I don't want to name my variables after the strings... what should I name them? I also don't want to number them in case I insert a new string into a view at some point.
I want short names that are easy to reference, not hard to put into my brain's short term memory, and not confusing to my translators.
The current version is in English; future versions will be in Chinese, Spanish, Vietnamese, and Tagalog, in addition to English.
Use long, descriptive names that are similar to the original string, based on this Mozilla article:
https://developer.mozilla.org/en-US/docs/Mozilla/Localization/Localization_content_best_practices
Thanks to @davejagoda for the advice.
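To make that concrete, here is a sketch of what such keys might look like with Rails' I18n; the key names are invented for illustration:
# Descriptive, hierarchical keys tell you and your translators where the
# string lives and what it does, without numbering or copying the English text:
I18n.t('products.index.out_of_stock_notice')
I18n.t('checkout.shipping_form.submit_button')
# versus names you would have to look up every time:
I18n.t('string_42')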

Apple's pinyin ranking algorithm

I'm currently developing an English-to-Chinese dictionary app to learn iOS development, and I'm stuck on how to rank the more commonly used Chinese characters when the user searches in pinyin.
My question is:
Is there some way I can use Apple's ranking algorithm for the Chinese characters that come up when pinyin is typed (as they do a pretty good job of producing the right characters)? Or is there some other way I can achieve this?
If you want to convert Chinese characters to pinyin, you can use CFStringTransform or PinYin4Objc.
If you want just the first letter of the pinyin, you can use pinyinFirstLetter.
If you just want to sort in pinyin alphabetical order, you can use:
sortedArray = [array sortedArrayUsingSelector:@selector(localizedCaseInsensitiveCompare:)];
Note: polyphonic characters and place names may not come out right.
Edit:
It sounds like what you want is autocomplete:
How to create an efficient auto-complete?
Implementing Autocomplete in iOS
Hope this helps.

Is there a faster way to parse hashtags than using Regular Expressions?

I am curious, is there a faster/better way to parse hashtags in a string, other than using Regular Expressions (mainly in Ruby)?
Edit
For example, I want to parse the string This is a #hashtag, and this is #another one! and get the words #hashtag and #another. I am using #\S+ as my regex.
You don't show any code (which you should have) so we're guessing how you are using your regex.
#\S+ is as good a pattern as you'll need, but scan is probably the best way to retrieve all occurrences in the string:
'This is a #hashtag, and this is #another one!'.scan(/#\S+/)
=> ["#hashtag,", "#another"]
It should be /\B#\w+/ if you don't want to capture the commas.
Yes, I agree. /\B#\w+/ makes more sense.
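With that pattern, the stray comma from the example above goes away:
'This is a #hashtag, and this is #another one!'.scan(/\B#\w+/)
=> ["#hashtag", "#another"]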
Maybe. Hmm, ideas:
You could try s.split('#'), and then apply a regex only to the actual hashtag candidates:
s.split('#').drop(1).map { |x| x[/\w+/] } # may or may not be any faster, but it is clearly uglier (and note it drops the leading '#')
You could write a C extension that extracts hashtags
You could profile your program and see if it really needs any optimization for this case.
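If you do profile, a quick-and-dirty Benchmark comparison will settle it for your data; this is a sketch, with an arbitrary iteration count:
require 'benchmark'

s = 'This is a #hashtag, and this is #another one!'
n = 100_000

Benchmark.bm(7) do |x|
  x.report('scan')  { n.times { s.scan(/\B#\w+/) } }
  x.report('split') { n.times { s.split('#').drop(1).map { |w| w[/\w+/] } } }
end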

Better alternative for letter substitution

Is there any better alternative to this?
name.gsub('è','e').gsub('à','a').gsub('ò','o').gsub('ì','i').gsub('ù','u')
Thanks.
Use tr.
Maybe like string.tr('èàòìù', 'eaoiu').
substitutes = { 'è' => 'e', 'à' => 'a', 'ò' => 'o', 'ì' => 'i', 'ù' => 'u' }
substitutes.each do |old, new|
  name.gsub!(old, new)
end
Or you could use an extension of String such as this one to do it for you.
If you really want a full solution, try pulling the tables from Perl's Unidecode module. After translating those tables to Ruby, you'll want to loop over each character of the input, substituting the table's value for that character.
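A sketch of that table-driven approach; TABLE here is a tiny stand-in for the much larger data you would translate from Unidecode:
# Tiny stand-in for Unidecode's transliteration tables.
TABLE = { 'è' => 'e', 'à' => 'a', 'ò' => 'o', 'ì' => 'i', 'ù' => 'u', 'ß' => 'ss' }

def unidecode(str)
  str.each_char.map { |c| TABLE.fetch(c, c) }.join
end

unidecode('perchè no?') # => "perche no?" (unknown characters pass through unchanged)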
Taking a wild stab in the dark, but if you're trying to remove the accented characters because you're using a legacy text encoding format, you should look at Iconv.
An introduction which is great on the subject: http://blog.grayproductions.net/articles/encoding_conversion_with_iconv
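For instance, Iconv's transliteration mode (available in the Ruby standard library of that era; the exact output depends on the underlying iconv implementation):
require 'iconv'

Iconv.conv('ASCII//TRANSLIT', 'UTF-8', 'èàòìù') # => "eaoiu" on most glibc systems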
In case you are wondering, the technical terms for what you want to do are case folding and possibly Unicode normalization (and sometimes collation).
Here is a case folding configuration for ThinkingSphinx to give you an idea of how many characters you need to worry about.
If JRuby is an option, see the answer to my question:
How do I detect unicode characters in a Java string?
It deals with removing accents from letters using java.text.Normalizer. You could access that class from JRuby.
