I have a weird behaviour in my params, which are passed as UTF-8, but the special characters are not handled correctly.
Instead of 1 special character, I get 2 characters: the base letter + the combining accent.
Parameters: {"name"=>"Mylène.png", "_cardbiz_session"=>"be1d5b7a2f27c7c4979ac4c16fe8fc82", "authenticity_token"=>"9vmJ02DjgKYCpoBNUcWwUlpxDXA8ddcoALHXyT6wrnM=", "asset"=>{"file"=>#<ActionDispatch::Http::UploadedFile:0x007f94d38d37d0 @original_filename="Mylène.png", @content_type="image/png", @headers="Content-Disposition: form-data; name=\"asset[file]\"; filename=\"Myle\xCC\x80ne.png\"\r\nContent-Type: image/png\r\n", @tempfile=#<File:/var/folders/q5/yvy_v9bn5wl_s5ccy_35qsmw0000gn/T/RackMultipart20130805-51100-1eh07dp>>}, "id"=>"copie-de-sm"}
I log this:
logger.debug file_name
logger.debug file_name.chars.map(&:to_s).inspect
Each time, same result:
Mylène
["M", "y", "l", "e", "̀", "n", "e"]
As I try to use the filename to match against already existing names properly encoded as UTF-8, you can see my problem ;)
Encodings are UTF-8 everywhere.
Working under Ruby 1.9.3 and Rails 3.2.14.
Added # encoding: utf-8 at the top of every file involved.
If anyone has an idea, take it!
I also published an issue here: https://github.com/carrierwaveuploader/carrierwave/issues/1185 but I'm not sure if it's a carrierwave issue or me missing something...
Seems to be linked to Mac OS X.
https://www.ruby-forum.com/topic/4407424 explains it and refers to https://bugs.ruby-lang.org/issues/7267 for more details and discussion.
Mac OS X decomposes special characters into UTF8-MAC instead of UTF-8...
Since you can't know the encoding of a file name, you just have to presuppose it.
Thanks to our Linux guy, on whose machine it works properly. ;)
file_name.encode!('utf-8', 'utf-8-mac').chars.map(&:to_s)
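A minimal sketch of the conversion, using hypothetical filenames ('utf-8-mac' is Ruby's name for the decomposed form that Mac OS X produces):
#encoding: utf-8
mac_name = "Myle\u0300ne.png" # decomposed, as received from Mac OS X
db_name  = "Myl\u00E8ne.png"  # composed, as already stored
p mac_name == db_name                              #=> false
p mac_name.encode('utf-8', 'utf-8-mac') == db_name #=> true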
Perhaps you have a combining character and a problem with Unicode equivalence.
When I check the codepoints with:
#encoding: utf-8
Parameters = {"name"=>"Mylène.png",}
p Parameters['name'].codepoints.to_a
I get Myl\u00E8ne.png, but I think that's a conversion problem from when I copied the text. It would be helpful if you could provide a file with the raw data.
I expect you have a combining grave accent and an e.
The solution would be Unicode normalization. (Sorry, I don't know how to do it with Ruby. Perhaps somebody else has an answer for it.)
You found your problem, so this is no longer needed for you.
But in the meantime I found a mechanism to normalize Unicode strings:
#encoding: utf-8
text = "Myl\u00E8ne.png" #"Mylène.png"
text2 = "Myle\u0300ne.png" #"Mylène.png"
puts text #Mylène.png
puts text2 #Mylène.png
p text == text2 #false
#http://apidock.com/rails/ActiveSupport/Multibyte/Unicode/normalize
require 'active_support'
p text #"Myl\u00E8ne.png"
p ActiveSupport::Multibyte::Unicode.normalize(text, :d) #"Myle\u0300ne.png"
p text2 #"Myle\u0300ne.png"
p ActiveSupport::Multibyte::Unicode.normalize(text2, :c)#"Myl\u00E8ne.png"
Maybe there is an easier way, but so far I have found none.
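Update for newer Rubies: since 2.2, String#unicode_normalize is built in, so no ActiveSupport is needed:
text  = "Myl\u00E8ne.png"
text2 = "Myle\u0300ne.png"
p text.unicode_normalize(:nfd) == text2 #=> true
p text2.unicode_normalize(:nfc) == text #=> true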
Related
A user copy-pastes and sends data in the following format: "𝕛𝕠𝕧𝕪 𝕕𝕖𝕓𝕓𝕚𝕖"
I need to convert it into plain text (we can say ASCII chars) like 'jovy debbie'
It comes in different fonts and formats:
ex:
'𝑱𝒆𝒏𝒊𝒄𝒂 𝑫𝒖𝒈𝒐𝒔'
'𝙶𝚎𝚟𝚒𝚎𝚕𝚢𝚗 𝙽𝚒𝚌𝚘𝚕𝚎 𝙻𝚞𝚖𝚋𝚊𝚐'
Any help will be appreciated; I already referred to other Stack Overflow questions but no luck :(
Those letters are from the Mathematical Alphanumeric Symbols block.
Since they have a fixed offset to their ASCII counterparts, you could use tr to map them, e.g.:
"𝕛𝕠𝕧𝕪 𝕕𝕖𝕓𝕓𝕚𝕖".tr("𝕒-𝕫", "a-z")
#=> "jovy debbie"
The same approach can be used for the other styles, e.g.
"𝑱𝒆𝒏𝒊𝒄𝒂 𝑫𝒖𝒈𝒐𝒔".tr("𝒂-𝒛𝑨-𝒁", "a-zA-Z")
#=> "Jenica Dugos"
This gives you full control over the character mapping.
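The same trick works for styled digits, e.g. the double-struck ones (assuming your input actually uses that block):
"𝟙𝟚𝟛".tr("𝟘-𝟡", "0-9")
#=> "123"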
Alternatively, you could try Unicode normalization. The NFKC / NFKD forms should remove most formatting and seem to work for your examples:
"𝕛𝕠𝕧𝕪 𝕕𝕖𝕓𝕓𝕚𝕖".unicode_normalize(:nfkc)
#=> "jovy debbie"
"𝑱𝒆𝒏𝒊𝒄𝒂 𝑫𝒖𝒈𝒐𝒔".unicode_normalize(:nfkc)
#=> "Jenica Dugos"
In Ruby, JavaScript, and Java (others I didn't try), the Cyrillic chars Я̆ Я̄ Я̈ have length 2. When I try to check the length of a string with these chars inside, I get a bad output value.
"Я̈".mb_chars.length
#=> 2 #should be 1 (ruby on rails)
"Я̆".length
#=> 2 #should be 1 (ruby, javascript)
"Ӭ".length
#=> 1 #correct (ruby, javascript)
Please note that the strings are encoded in UTF-8 and each char behaves as a single character.
My question is: why is there such behaviour, and how can I correctly get the length of a string with these chars inside?
The underlying problem is that Я̈ is actually two code points: the Я and the umlaut are separate:
'Я̈'.chars
#=> ["Я", "̈"]
Normally you'd solve this sort of problem through unicode normalization but that alone won't help you here as there is no single code point for Я̈ or Я̆ (but there is for Ӭ).
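You can check that with String#unicode_normalize (Ruby 2.2+): composing to NFC only helps where a precomposed codepoint exists:
"Я\u0308".unicode_normalize(:nfc).length #=> 2 (no precomposed Я̈ exists)
"Э\u0308".unicode_normalize(:nfc)        #=> "Ӭ"
"Э\u0308".unicode_normalize(:nfc).length #=> 1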
You could strip off the diacritics before checking the length:
'Я̆'.gsub(/\p{Diacritic}/, '')
#=> "Я"
'Я̆'.gsub(/\p{Diacritic}/, '').length
#=> 1
You'll get the desired length but the strings won't be quite the same. This also works on things like Ӭ which can be represented by a single code point:
'Ӭ'.length
#=> 1
'Ӭ'.gsub(/\p{Diacritic}/, '')
#=> "Ӭ"
'Ӭ'.gsub(/\p{Diacritic}/, '').length
#=> 1
Unicode is wonderful and awesome and solves many problems that used to plague us. Unfortunately, Unicode is also horrible and complicated because human languages and glyphs weren't exactly designed.
Ruby 2.5 adds String#each_grapheme_cluster:
'Я̆Я̄Я̈'.each_grapheme_cluster.to_a #=> ["Я̆", "Я̄", "Я̈"]
'Я̆Я̄Я̈'.each_grapheme_cluster.count #=> 3
Note that you can't use each_grapheme_cluster.size, which is (incorrectly) equivalent to each_char.size; both would return 6 in the above example. (That looks like a bug; I've just filed a bug report.)
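On Rubies before 2.5, the \X grapheme-cluster regex gives the same grouping:
'Я̆Я̄Я̈'.scan(/\X/)        #=> ["Я̆", "Я̄", "Я̈"]
'Я̆Я̄Я̈'.scan(/\X/).length #=> 3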
Try unicode-display_width which is built to give an exact answer to this question:
require "unicode/display_width"
Unicode::DisplayWidth.of "Я̈" #=> 1
I have a Phoenix/Elixir app and need to have only ASCII characters in my String.
From what I tried and found here, this can only be done properly by Iconv.
:iconv.convert "utf-8", "ascii//translit", "árboles más grandes"
# arboles mas grandes
but when I run it on my Mac it says:
# 'arboles m'as grandes
It seems it returns multiple characters for any character that was more than one byte in size, with the accent mark coming out before the letter.
for example:
ä will turn to \"a
á will turn to 'a
ß will turn to ss
ñ will turn to ~n
I'm running it with IEx 1.2.5 on Mac.
Is there any way around this, or generally a better way to achieve the same functionality as Rails' transliterate?
EDIT:
So here is the updated Rails-like behaviour, based on the accepted answer from Henkik N. It does the same thing as Rails' parameterize (turns whatever string into something you can use as part of a URL):
defmodule RailsLikeHelpers do
  require Inflex

  # Replace accented chars with their ASCII equivalents.
  # Decompose to NFD first so iconv's //translit only has to
  # deal with base letters plus combining marks.
  def transliterate_string(abc) do
    :iconv.convert("utf-8", "ascii//translit", String.normalize(abc, :nfd))
  end

  def parameterize_string(abc) do
    parameterize_string(abc, "_")
  end

  def parameterize_string(abc, separator) do
    abc
    |> String.strip
    |> transliterate_string
    |> Inflex.parameterize(separator)                                # turns "Your Momma" into "your_momma"
    |> String.replace(~r[#{Regex.escape(separator)}{2,}], separator) # no more than one separator in a row
  end
end
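A hypothetical usage (assuming Inflex's parameterize behaves like its Rails counterpart):
iex> RailsLikeHelpers.parameterize_string("Árboles Más Grandes")
"arboles_mas_grandes"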
Running it through Unicode decomposition (as people kind of mentioned in the forum thread you linked to) seems to do it on my OS X:
iex> :iconv.convert "utf-8", "ascii//translit", String.normalize("árboles más grandes", :nfd)
"arboles mas grandes"
Decomposition means it will be normalized so that e.g. "á" is represented as two Unicode codepoints ("a" and a combining accent) as opposed to a composed form where it's a single Unicode codepoint. So I guess iconv's ASCII transliteration removes standalone accents/diacritics, but converts composed characters to things like 'a.
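To see the two forms side by side in iex (codepoint escapes used so the difference is visible):
iex> composed = "\u00E1"    # "á" as a single codepoint
iex> decomposed = "a\u0301" # "a" plus a combining acute accent
iex> composed == decomposed
false
iex> String.normalize(composed, :nfd) == decomposed
true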
I have a string which contains Swedish characters, and I want to convert it to basic English letters.
name = "LänödmåtnÖng ÅjädårbÄn"
These characters should be converted as follows:
Å use A
å use a
Ä use A
ä use a
Ö use O
ö use o
Is there a simple way to do it? If I try:
ascii_to_string = name.unpack("U*").map{|s|s.chr}.join
It returns L\xE4n\xF6dm\xE5tn\xD6ng \xC5j\xE4d\xE5rb\xC4n as single-byte characters, but I want to convert it to plain English letters.
Using OP's conversion table as input for the tr method:
#encoding: utf-8
name = "LänödmåtnÖng ÅjädårbÄn"
p name.tr("ÅåÄäÖö", "AaAaOo") #=> "LanodmatnOng AjadarbAn"
Try this:
string.mb_chars.normalize(:kd).gsub(/[^\x00-\x7F]/n,'').downcase.to_s
As found in this post.
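Without Rails, plain Ruby (2.2+) can do the same: decompose with unicode_normalize, then drop everything outside ASCII:
name = "LänödmåtnÖng ÅjädårbÄn"
name.unicode_normalize(:nfkd).gsub(/[^\x00-\x7F]/, '')
#=> "LanodmatnOng AjadarbAn"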
You already got a decent answer; however, there is a way that is easier to remember (no magical regular expressions):
name.parameterize
It changes whitespace to dashes, so you need to handle that somehow, for example by processing each word separately:
name.split.map { |s| s.parameterize }.join ' '
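For the example string this gives (assuming ActiveSupport's default transliteration rules; note that parameterize also downcases):
name = "LänödmåtnÖng ÅjädårbÄn"
name.parameterize                              #=> "lanodmatnong-ajadarban"
name.split.map { |s| s.parameterize }.join ' ' #=> "lanodmatnong ajadarban"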
I've parsed an HTML page with mochiweb_html and want to parse the following text fragment:
0 – 1
Basically I want to split the string on the spaces and the dash character and extract the numbers.
Now the string above is represented as the following Erlang list
[48,32,226,128,147,32,49]
I'm trying to split it using the following regex:
{ok, P}=re:compile("\\xD2\\x80\\x93"), %% characters 226, 128, 147
re:split([48,32,226,128,147,32,49], P, [{return, list}])
But this doesn't work; it seems the \xD2 character is the problem [if I remove it from the regex, the split occurs]
Could someone possibly explain:
what I'm doing wrong here?
why the '–' character seemingly requires three integers for its representation [226, 128, 147]?
Thanks.
226,128,147 is E2,80,93 in hex; your regex says \xD2 where the first byte is actually \xE2:
> {ok, P} = re:compile("\xE2\x80\x93").
...
> re:split([48,32,226,128,147,32,49], P, [{return, list}]).
["0 "," 1"]
As to your second question, about why a dash takes 3 bytes to encode: it's because the dash in your input isn't an ASCII hyphen (hex 2D), but a Unicode en-dash (hex 2013). Your code is receiving this in UTF-8 encoding, rather than the more obvious UCS-2 encoding. Hex 2013 comes out to hex E28093 in UTF-8 encoding.
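You can confirm the byte sequence in the Erlang shell (16#2013 is the en-dash codepoint):
> binary_to_list(unicode:characters_to_binary([16#2013])).
[226,128,147]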
If your next question is "why UTF-8", it's because it's far easier to retrofit an old system using 8-bit characters and null-terminated C style strings to use Unicode via UTF-8 than to widen everything to UCS-2 or UCS-4. UTF-8 remains compatible with ASCII and C strings, so the conversion can be done piecemeal over the course of years, or decades if need be. Wide characters require a "Big Bang" one-time conversion effort, where everything has to move to the new system at once. UTF-8 is therefore far more popular on systems with legacies dating back to before the early 90s, when Unicode was created.