Convert string to two byte hex unicode in Ruby on Rails - ruby-on-rails

I'm translating messages using Bing Translator and sending the results via SMS text message. The resulting strings often contain non-English characters, e.g. Korean, Japanese, Greek.
I am using the Clickatell SMS Gateway and according to the Clickatell spec here: http://www.clickatell.com/downloads/http/Clickatell_HTTP.pdf
...I think I should encode my strings in two-byte Unicode, hex-encoded.
For example, the Greek characters:
ΩΨΘ
After conversion should become:
03A903A80398
Which is then added to the querystring in my HTTP get request.
My issue, however, is finding the syntax to do this succinctly in my Ruby on Rails app.

I like fun ones like these. :) Try this:
input.codepoints.map { |c| "%04X" % c }.join
What I see:
[1] pry(main)> x = "\u03A9\u03A8\u0398"
=> "ΩΨΘ"
[2] pry(main)> x.codepoints.map { |c| "%04X" % c }.join
=> "03A903A80398"

There's no need to split; use each_char to iterate instead.
Call map (or collect) directly on each_char, and use a capital "X" in the format string rather than a lowercase "x" followed by upcase:
input.each_char.map{|c| "%04X" % c.ord}.join
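One caveat with formatting code points directly: a character outside the Basic Multilingual Plane (an emoji, for instance) has a code point above U+FFFF, so "%04X" produces five hex digits for it rather than two-byte units. Below is a minimal sketch that hex-encodes UTF-16 big-endian code units instead. It assumes the gateway accepts UTF-16BE (worth checking against the Clickatell spec), and the helper name to_ucs2_hex is made up for illustration.

```ruby
# Hypothetical helper: hex-encode a string as UTF-16BE code units.
# Assumes the input is a valid UTF-8 string.
def to_ucs2_hex(input)
  # "H*" unpacks the raw bytes as one hex string, high nibble first.
  input.encode("UTF-16BE").unpack("H*").first.upcase
end

puts to_ucs2_hex("\u03A9\u03A8\u0398") # prints 03A903A80398
```

For text that stays within the Basic Multilingual Plane, this gives the same result as the codepoints version above.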

Related

How to remove from string before __

I am building a Rails 5.2 app.
In this app I got outputs from different suppliers (I am building a webshop).
The name of the shipping provider is in this format:
dhl_freight__233433
It could also be in this format:
postal__US-320202
How can I remove everything before (and including) the __ so that only the part after the __ remains, for example 233433?
Perhaps some sort of RegEx.
A very simple approach would be to use String#split and then pick the last part, which is the second part in this example:
"dhl_freight__233433".split('__').last
#=> "233433"
"postal__US-320202".split('__').last
#=> "US-320202"
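Note that split('__').last returns the whole string when there is no __ in it. If that case should be distinguishable, String#partition is one option. A small sketch follows; the helper name after_double_underscore is hypothetical, and returning nil for a missing separator is an assumption about the desired behaviour.

```ruby
# Hypothetical helper: return the text after the first "__",
# or nil when the string contains no "__" at all.
def after_double_underscore(name)
  _head, separator, tail = name.partition("__")
  separator.empty? ? nil : tail
end

p after_double_underscore("dhl_freight__233433")
p after_double_underscore("no_separator_here")
```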
You can use a very simple Regexp and ask the resulting MatchData for the post_match part:
p "dhl_freight__233433".match(/__/).post_match
# another (magic) way to access the post_match part:
p $'
Postscript: Learnt something from this question myself: you don't even have to use a RegExp for this to work. Just "asddfg__qwer".match("__").post_match does the trick (it does the conversion to regexp for you)
r = /[^_]+\z/
"dhl_freight__233433"[r] #=> "233433"
"postal__US-320202"[r] #=> "US-320202"
The regular expression matches one or more characters other than an underscore, followed by the end of the string (\z). The ^ at the beginning of the character class reads, "other than any of the characters that follow".
See String#[].
This assumes that the last underscore is preceded by an underscore. If it might not be, and there should be no match in that case, add a positive lookbehind:
r = /(?<=__)[^_]+\z/
This requires the match to be preceded by two underscores.
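To see how the lookbehind version behaves on inputs without a double underscore, here is a short sketch. The constant and method names are invented for illustration.

```ruby
# Matches a trailing run of non-underscore characters,
# but only when it is immediately preceded by "__".
SUFFIX_AFTER_DOUBLE_UNDERSCORE = /(?<=__)[^_]+\z/

# Hypothetical helper: String#[] returns nil when the regexp does not match.
def provider_suffix(name)
  name[SUFFIX_AFTER_DOUBLE_UNDERSCORE]
end

p provider_suffix("dhl_freight__233433")
p provider_suffix("dhl_233433") # single underscore, so no match
```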
There are many ways to extract numbers from a string in Ruby, assuming that is what you are trying to do. Here are some of them.
Ref- http://www.ruby-forum.com/topic/125709
line.delete("^0-9")
line.scan(/\d/).join('')
line.tr("^0-9", '')
Of the above, delete is the fastest way to strip everything but the digits from a string.
All of the above extract the digits from the string and join them. For a string like "String-with-67829___numbers-09764", the output would be "6782909764".
If you want the numbers split up instead, like ["67829", "09764"]:
line.split(/[^\d]/).reject { |c| c.empty? }
Hope these answers help you! Happy coding :-)
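As an alternative to split plus reject, scanning for runs of digits yields the grouped numbers directly. A minimal sketch using the sample string from above:

```ruby
line = "String-with-67829___numbers-09764"

# \d+ matches each maximal run of digits, so no empty strings are produced.
groups = line.scan(/\d+/)
p groups # prints ["67829", "09764"]
```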

Are Iconv.convert return values in wrong order?

I have a phoenix/elixir app and need to only have ASCII characters in my String.
From what I tried and found here, this can only be done properly by Iconv.
:iconv.convert "utf-8", "ascii//translit", "árboles más grandes"
# arboles mas grandes
but when I run it on my mac it says:
# 'arboles m'as grandes
It seems it returns multiple letters for any character that had more than one byte in size and the order is turned around.
for example:
ä will turn to \"a
á will turn to 'a
ß will turn to ss
ñ will turn to ~n
I'm running it with IEx 1.2.5 on Mac.
Is there any way around this, or generally a better way to achieve the same functionality as rails transliterate?
EDIT:
So here is the updated Rails-like behaviour, based on the accepted answer from Henkik N. It does the same thing as Rails' parameterize (it turns any string into something that you can use as part of a URL).
defmodule RailsLikeHelpers do
  require Inflex
  # Replace accented chars with their ASCII equivalents.
  # Decompose first (:nfd) so that iconv drops the combining accents.
  def transliterate_string(abc) do
    :iconv.convert("utf-8", "ascii//translit", String.normalize(abc, :nfd))
  end
  def parameterize_string(abc) do
    parameterize_string(abc, "_")
  end
  def parameterize_string(abc, separator) do
    abc
    |> String.strip
    |> transliterate_string
    |> Inflex.parameterize(separator) # turns "Your Momma" into "your_momma"
    |> String.replace(~r[#{Regex.escape(separator)}{2,}], separator) # no more than one separator in a row
  end
end
Running it through Unicode decomposition (as people kind of mentioned in the forum thread you linked to) seems to do it on my OS X:
iex> :iconv.convert "utf-8", "ascii//translit", String.normalize("árboles más grandes", :nfd)
"arboles mas grandes"
Decomposition means it will be normalized so that e.g. "á" is represented as two Unicode codepoints ("a" and a combining accent) as opposed to a composed form where it's a single Unicode codepoint. So I guess iconv's ASCII transliteration removes standalone accents/diacritics, but converts composed characters to things like 'a.
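For comparison, the same decompose-then-drop-accents idea can be sketched in Ruby (2.2 or later, where String#unicode_normalize is available). This is not the iconv approach from the answer; the helper name strip_accents is hypothetical, and characters without a canonical decomposition, such as ß, pass through unchanged.

```ruby
# Hypothetical helper: decompose to NFD, then remove combining marks.
def strip_accents(text)
  # \p{Mn} matches nonspacing marks such as the combining acute accent.
  text.unicode_normalize(:nfd).gsub(/\p{Mn}/, "")
end

puts strip_accents("\u00E1rboles m\u00E1s grandes") # prints arboles mas grandes
```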

Splitting strings using Ruby ignoring certain characters

I'm trying to split a string and count the number of words using Ruby, but I want to ignore special characters.
For example, in this string "Hello, my name is Hugo ..." I'm splitting it by spaces, but the last ... shouldn't count because it isn't a word.
I'm using string.inner_text.split(' ').length. How can I specify that special characters (such as ... ? ! etc.) when separated from the text by spaces are not counted?
Thank you to everyone,
Kind Regards,
Hugo
"Hello, my name is não ...".scan /[^*!#%\^\s\.]+/
# => ["Hello,", "my", "name", "is", "não"]
/[^*!#%\^\s\.]+/ matches runs of characters other than *, !, #, %, ^, whitespace and periods. You can add more characters to this list to exclude them as well.
This is part answer, part response to @Neo's answer: why not use the proper tools for the job?
http://www.ruby-doc.org/core-1.9.3/Regexp.html says:
POSIX bracket expressions are also similar to character classes. They provide a portable alternative to the above, with the added benefit that they encompass non-ASCII characters. For instance, /\d/ matches only the ASCII decimal digits (0-9); whereas /[[:digit:]]/ matches any character in the Unicode Nd category.
/[[:alnum:]]/ - Alphabetic and numeric character
/[[:alpha:]]/ - Alphabetic character
...
Ruby also supports the following non-POSIX character classes:
/[[:word:]]/ - A character in one of the following Unicode general categories Letter, Mark, Number, Connector_Punctuation
If you want words, use str.scan /[[:word:]]+/
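Applied to the word-counting question, that could look like the sketch below. The method name word_count is made up, and note that an apostrophe is not a word character, so "don't" would count as two words.

```ruby
# Hypothetical helper: count runs of Unicode word characters.
def word_count(text)
  text.scan(/[[:word:]]+/).size
end

puts word_count("Hello, my name is Hugo ...") # prints 5
```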

Rails 3 encode non ascii?

I'm wondering if there's a helper in Rails 3 or a simple way of converting all non-ASCII characters to their HTML entities, such as à to &agrave; and ® to &reg;.
The purpose of this is to replace any such characters before exporting to a CSV format, since viewing the characters in Excel doesn't turn out too well. Worst case scenario I'll just use gsub for each instance, but I'd rather avoid that if possible.
If you can't find anything for Rails then you could check out HTMLEntities:
http://htmlentities.rubyforge.org/
require 'htmlentities'
coder = HTMLEntities.new
string = "<élan>"
coder.encode(string, :named) # => "&lt;&eacute;lan&gt;"
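If adding a gem is not an option, numeric character references can be produced with plain gsub. This is only a sketch: it emits decimal references such as &#224; rather than named entities like &agrave;, and the helper name escape_non_ascii is hypothetical.

```ruby
# Hypothetical helper: replace each non-ASCII character with a
# decimal numeric character reference, leaving ASCII untouched.
def escape_non_ascii(text)
  text.gsub(/[^[:ascii:]]/) { |char| format("&#%d;", char.ord) }
end

puts escape_non_ascii("caf\u00E9 \u00AE") # prints caf&#233; &#174;
```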

Length of a unicode string

In my Rails (2.3, Ruby 1.8.7) application, I need to truncate a string to a certain length. The string is Unicode, and when running tests in the console, such as 'א'.length, I realized that double the expected length is returned. I would like an encoding-agnostic length, so that the same truncation would be done for a Unicode string or a Latin-1 encoded string.
I've gone over most of the unicode material for Ruby, but am still a little in the dark. How should this problem be tackled?
Rails has an mb_chars method which returns multibyte characters. Try unicode_string.mb_chars.slice(0,50)
"ア".size # 3 in 1.8, 1 in 1.9
puts "ア".scan(/./mu).size # 1 in both 1.8 and 1.9
chars and mb_chars don't give you text elements, which is what you seem to be looking for.
For text elements you'll want the unicode gem.
mb_chars:
>> 'กุ'.mb_chars.size
=> 2
>> 'กุ'.mb_chars.first.to_s
=> "ก"
text_elements:
>> Unicode.text_elements('กุ').size
=> 1
>> Unicode.text_elements('กุ').first
=> "กุ"
You can use something like str.chars.slice(0, 50).join to get the first 50 characters of a string, no matter how many bytes it uses per character.
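On a much newer Ruby than the 1.8.7 in the question (2.5 or later, which is an assumption here), text elements are available in core as grapheme clusters, so truncation can keep a base character and its combining marks together. A minimal sketch with a hypothetical helper name:

```ruby
# Hypothetical helper: keep at most `limit` grapheme clusters,
# so a base character is never separated from its combining marks.
def truncate_graphemes(text, limit)
  text.each_grapheme_cluster.first(limit).join
end

thai = "\u0E01\u0E38" # Thai ko kai followed by the combining vowel sara u
puts thai.chars.size             # prints 2
puts thai.grapheme_clusters.size # prints 1
```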
