Ruby Parsing: Error when I trying to put Cyrillic symbols into URL request - ruby-on-rails

I wrote a Bot for Telegram, where users can receive images for their requests. But there was one problem, which I could not solve.
Some example with parsing on Ruby:
json_object = JSON.parse(open("https://api.site.com/search/photos?query=" + message.text + "&per_page=10&client_id=42324d2lkedi234fs342dfse2c038fdfsdfs").read)
message.text - It's a field with request from users.
Everything works fine with latin literals, but when I send Cyrillic(API also supports Cyrillic alphabet) symbols I get the below error:
/Users/me/.rvm/rubies/ruby-2.4.0/lib/ruby/2.4.0/uri/rfc3986_parser.rb:21:in
`split': URI must be ascii only
"https://api.site.com/search/photos?query=\u0432\u0430\u0432\u0430&per_page=10&client_id=42324d2lkedi234fs342dfse2c038fdfsdfs"
(URI::InvalidURIError)
I used Encoding with utf-8 and win-1252, but nothing helped. How should this be fixed?

You should encode your cyrillic string:
URI.encode('http://google.com?1=АБВ') # => "%D0%90%D0%91%D0%92"
So, use it like this (or encode whole url):
URI.encode(message.text)

Try with
"anything".parameterize.underscore.humanize.downcase

Related

Problem with attachments' character encoding using gmail gem in ruby/rails

What I am doing:
I am using the gmail gem in a Rails 4 app to get email attachments from a specific account at regular intervals. Here is an extract from the core part (here for simplicity only considering the first email and its first attachment):
require 'gmail'
Gmail.connect(#user_email,#user_password) do |gmail|
if gmail.logged_in?
emails = gmail.inbox.emails(:from => #sender_email)
email = emails[0]
attachment = email.message.attachments[0]
File.open("~/temp.csv", 'w') do |file|
file.write(
StringIO.new(attachment.decoded.to_s[2..-2].force_encoding("ISO-8859-15").encode!('UTF-8')).read
)
end
end
end
The encoding of the attached file can vary. The particular one that I am currently having issues with is in Finnish. It contains Finnish characters and a superscripted 3 character.
This is what I expect to get when I run the above code. (This is what I get when I download the attachment manually through gmail user interface):
What the problem is:
However, I am getting the following odd results.
From cat temp.csv (Looks good to me):
With nano temp.csv (Here I have no idea what I am looking at):
This is what temp.csv looks like opened in Sublime Text (directly via winscp). First line and small parts look ok but then Chinese/Japanese characters:
This is what temp.csv looks like in Notepad (after download via winscp). Looks ok except a blank space has been inserted between each character and the new lines seems to be missing:
What I have tried:
I have without success tried:
.force_encoding(...) with all the different "ISO-8859-x" character sets
putting the force_encoding("ISO-8859-15").encode!('UTF-8') outside the .read (works but doesn't solve the problem)
encode to UTF-8 without first forcing another encoding but this leads to Encoding::UndefinedConversionError: "\xC4" from ASCII-8BIT to UTF-8
writing as binary with 'wb' and 'w+b' in the File.open() (which oddly doesn't seem to make a difference to the outcome).
searching stackoverflow and the web for other ideas.
Any ideas would be much appreciated!
Not beautiful, but it will work for me now.
After re-encoding, I convert the string to a char array, then remove the chars I do not want and then join the remaining array elements to form a string.
decoded_att = attachment.decoded
data = decoded_att.encode("UTF-8", "ISO-8859-1", invalid: :replace, undef: :replace).gsub("\r\n", "\n")
data_as_array = data.chars
data_as_array = data_as_array.delete_if {|i| i == "\u0000" || i == "ÿ" || i == "þ"}
data = data_as_array.join('').to_s
File.write("~/temp.csv", data.to_s)
This will work for me now. However, I have no idea how these characters have ended up in the attachment ("ÿ" and "þ" in the start of the document and "\u0000" between all remaining characters).
It seems like you need to do attachment.body.decoded instead of attachment.decoded

Ruby How to convert back binary string from smsc

my app work with SMSC, and i need to get involve in sms before it send,
i try to send from the mobile that string
"hello this is test"
And when I check the smsc I got this as binary string of my text:
userData = "c8329bfd06d1d1e939283d07d1cb733a"
the encoding of this string is:
<Encoding:ASCII-8BIT>
I know that probably this userData is in GSM encoding in binary-string
so how can i get from userData back the clear text string ?
this question is for english lang, because in Hebrew I can get back the
string with this code:
[userData].pack('H*').force_encoding('utf-16be').encode('utf-8')
but in english i got error:
Encoding::InvalidByteSequenceError: "\xDA\xF3" followed by "u" on UTF-16BE
What I was try is to detect the binary string with ICU, and I got:
"ISO-8859-1" and the language that detected is: 'PT', that very strange cause my languages is English or Hebrew.
anyway i got lost with encoding stuff, so i try to encode to each name of list from Encoding.list
but without luck until now
thanks in advance
Shmulik
OK,
For who that also have this issue, i got the solution, thanks to someone from #ruby irc community (i missed his nickname)
The solution is:
for ascii chars that interpolate to binary:
You need that:
"c8329bfd06d1d1e939283d07d1cb733a".scan(/../).reverse_each.map { |h| h.to_i(16) }.pack('C*').unpack('B*')[0][2..-1].scan(/.{7}/).map.with_object("") { |x, s| s << x.to_i(2) }.reverse
Remember I sent this words in sms:
"hello this is test"
And that it has become in binary to:
"c8329bfd06d1d1e939283d07d1cb733a"
The reason that i got garbage in any encoding is, because the ascii chars is 7bits GSM, so only first 7bits represents the data but each another encoding uses at least 8bits, so that what the code actually do.
But this is just for ascii char set.
In another language like I use Hebrew, the SMS send as ucs2
So this code work for me:
[your_binary_string].pack('H*').force_encoding('utf-16be').encode('utf-8')
Very important to put the binary string in array
So that all for now.
If anybody want to translate and explain what exactly happen in the code for ascii char set, be my guest and welcome.
Shmulik

Rails/Ruby invalid byte sequence in UTF-8 even after force_encoding

I'm trying to iterate over a remote nginx log file (compressed .gz file) in Rails and I'm getting this error at some point in the file:
TTPArgumentError: invalid byte sequence in UTF-8
I tried forcing the encoding too although it seems the encoding was already UTF8:
logfile = logfile.force_encoding("UTF-8")
The method that I'm using:
def remote_update
uri = "http://" + self.url + "/localhost.access.log.2.gz"
source = open(uri)
gz = Zlib::GzipReader.new(source)
logfile = gz.read
# prints UTF-8
print logfile.encoding.name
logfile = logfile.force_encoding("UTF-8")
# prints UTF-8
print logfile.encoding.name
logfile.each_line do |line|
print line[/\/someregex\/1\/(.*)\//,1]
end
end
Really trying to understand why this is happening (tried to look in other SO threads with no success). What's wrong here?
Update:
Added exception's trace:
HTTPArgumentError: invalid byte sequence in UTF-8
from /Users/T/workspace/sample_app/app/models/server.rb:25:in `[]'
from /Users/T/workspace/sample_app/app/models/server.rb:25:in `block in remote_update'
from /Users/T/workspace/sample_app/app/models/server.rb:24:in `each_line'
from /Users/T/workspace/sample_app/app/models/server.rb:24:in `remote_update'
from (irb):2
from /Users/T/.rbenv/versions/2.2.3/lib/ruby/gems/2.2.0/gems/railties-4.2.5/lib/rails/commands/console.rb:110:in `start'
from /Users/T/.rbenv/versions/2.2.3/lib/ruby/gems/2.2.0/gems/railties-4.2.5/lib/rails/commands/console.rb:9:in `start'
force_encoding doesn't change the actual string data: it just changes the variable that says what encoding to use when interpreting the bytes.
If the data is not in fact utf-8 or contains invalid utf-8 sequences then force encoding won't help. Force encoding is basically only useful when you get some raw data from somewhere and you know what encoding it is in and you want to tell ruby what that encoding is.
The first thing to do would be to determine what is the actual encoding used. The charlock_holmes gem can detect encodings. A more tricky case would be if the file was a mish-mash of encodings but hopefully that isn't the case (if it was, then perhaps trying to handle each line separately might work).
If you want to take a string, which has the correct encoding, and transcode it to valid UTF-8 and clean up invalid characters you can use something like:
str.encode!('UTF-8', invalid: :replace, undef: :replace, replace: '?')
If you have a UTF-8 encoded string which has invalid UTF-8 characters in it, you can clean that up by using the 'binary' encoding source:
str.encode!('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '?')
Both will give you a UTF-8 string with any invalid characters replaced by question marks which should pass. You can also pass in replace: '' to strip the bad characters, or leave the option off and you'll get the \uFFFD unicode character.
My guess is that the source file before gzipping had some binary data/corruption/invalid UTF-8 that got logged into it?
This question has also been asked and answered before on StackOverflow. See the following blog post for good information:
https://robots.thoughtbot.com/fight-back-utf-8-invalid-byte-sequences
And here's a prior example of a SO answer:
https://stackoverflow.com/a/18454435/506908

SSRS Goto URL decoding and encoding

I am facing problem when passing value for url through data field.
I am passing value in goto url like this
="javascript:void(window.open('file:" &Fields!url.Value &"','_blank'))"
url value = /servername/foldername/FormuláriodeCalibração.xls
After report deployed and opened in internet explorer and clicked on the url. It is changing the url like this
/servername/foldername/FormuláriodeCalibração.xls
because of which I am unable to open the file.
Please help me in this.
Finally we come up with a solution of replacing non ASCII characters of Portuguese with HTML ASCII Codes.
For e.g. this is the actual file name of the attachment
TE-5180FormuláriodeCalibração(modelo)1271440308393(2)1338379011084.xls
We replaced the Portuguese characters with HTML ASCII Codes.
TE-5180FormuláriodeCalibração(modelo)1271440308393(2)1338379011084.xls
After these changes the above modified URL is passed in the place of actual URL and when it hits the server it was decoded properly and worked as expected.

HTML decoding in C/C++

I'm using libcurl for getting HTML pages.
I have some problems with Hebrew characters.
for example this: סלקום
gets gibberish.
How do I get Hebrew characters and not gibberish?
Do I need some HTML decoder?
Does libcurl support such operation?
Does libiconv support such operation?
I appreciate any help.
Thanks
Edit: Ok, so what you’re seeing is UTF-8 data being decoded as Windows-1252 (so the numeric character references were a red herring). Here’s a demonstration in Python:
>>> u = ''.join(map(unichr, [1505, 1500, 1511, 1493, 1501]))
>>> s = u.encode('utf-8')
>>> print s.decode('cp1255', 'replace')
׳¡׳�׳§׳•׳�
The solution to this problem depends on the environment in which the output is displayed. Merely outputting the bytes received and expecting them to be interpreted as characters leads to problems like this.
An HTML document typically contains a header tag like <meta charset=utf-8> to indicate to the browser what its encoding should be. A document served by a web server contains an HTTP header like Content-Type: text/html; charset=utf-8.
You should ask libcurl for the Content-Type HTTP header to know the encoding of the document, and then convert it to the system encoding using iconv. While in your case that would be codepage 1255, it depends on the user’s system and so you should look up the appropriate functions to detect that.
(Read Unicode and Character Sets and the character-encoding tag on this site for a wealth of further information.)

Resources