rails4 / emails and special characters - ruby-on-rails

On some app' setup
Everything works fine within the email, in the body
There is a problem with the subject of this email, containing special characters,
which would output the following for exemple in the mailboxes :
a été validée !
Searching around, I ended up with the following that works :
s = ("...a été validée !...").encode!("ISO-8859-15")
m = mail(to: email, subject: s)
But I guess that it is just a setup thing that would make everything works fine
Has anyone any experience about it ?

Can you try this in your application.rb?
config.action_mailer.default_charset = "iso-8859-15"
I'm using the default (utf-8) and accents work just fine, so you may want to see if there's something else that's botching up your data.
UPDATE: since this is only an issue with your subject, I think you may be able to fix this by adding this to the top of your source file that contains that subject line:
# encoding: UTF-8
Please try that out and let me know.

Related

regex to extract URLs from text - Ruby

I am trying to detect the urls from a text and replace them by wrapping in quotes like below:
original text: Hey, it is a url here www.example.com
required text: Hey, it is a url here "www.example.com"
original text show my input value and required text represents the required output. I searched a lot on web but could not find any possible solution. I already have tried URL.extract feature but that doesn't seem to detect URLs without http or https. Below are the examples of some of urls I want to deal with. Kindly let me know if you know the solution.
ANQUETIL-DUPERRON Abraham-Hyacinthe, KIEFFER Jean-Luc, www.hominides.net/html/actualites/outils-preuve-presence-hominides-asie-0422.php,Les Belles lettres, 2001.
https://www.ancient-code.com/indian-archeologists-stumbleacross-ruins-great-forgotten-civilization-mizoram/
www.jstor.org/stable/24084454
www.biorespire.com/2016/03/22/une-nouvelle-villeantique-d%C3%A9couverte-en-inde/
insu.cnrs.fr/terre-solide/terre-et-vie/de-nouvellesdatations-repoussent-l-age-de-l-apparition-d-outils-surle-so
www.cerege.fr/spip.php?page=pageperso&id_user=94
Find words who look like urls:
str = "ANQUETIL-DUPERRON Abraham-Hyacinthe, KIEFFER Jean-Luc, www.hominides.net/html/actualites/outils-preuve-presence-hominides-asie-0422.php,Les Belles lettres, 2001.\n\nhttps://www.ancient-code.com/indian-archeologists-stumbleacross-ruins-great-forgotten-civilization-mizoram/\n\nwww.jstor.org/stable/24084454\n\nwww.biorespire.com/2016/03/22/une-nouvelle-villeantique-d%C3%A9couverte-en-inde/\n\ninsu.cnrs.fr/terre-solide/terre-et-vie/de-nouvellesdatations-repoussent-l-age-de-l-apparition-d-outils-surle-so\n\nwww.cerege.fr/spip.php?page=pageperso&id_user=94"
str.split.select{|w| w[/(\b+\.\w+)/]}
This will give you an array of words which have no spaces and include a one or more . characters which MIGHT work for your use case.
puts str.split.select{|w| w[/(\b+\.\w+)/]}
www.hominides.net/html/actualites/outils-preuve-presence-hominides-asie-0422.php,
https://www.ancient-code.com/indian-archeologists-stumbleacross-ruins-great-forgotten-civilization-mizoram/
www.jstor.org/stable/24084454
www.biorespire.com/2016/03/22/une-nouvelle-villeantique-d%C3%A9couverte-en-inde/
insu.cnrs.fr/terre-solide/terre-et-vie/de-nouvellesdatations-repoussent-l-age-de-l-apparition-d-outils-surle-so
www.cerege.fr/spip.php?page=pageperso&id_user=94
Updated
Complete solution to modify your string:
str_with_quote = str.clone # make a clone for the `gsub!`
str.split.select{|w| w[/(\b+\.\w+)/]}
.each{|url| str_with_quote.gsub!(url, '"' + url + '"')}
Now your cloned object wraps urls inside double quotes
puts str_with_quote
Will give you this output
ANQUETIL-DUPERRON Abraham-Hyacinthe, KIEFFER Jean-Luc, "www.hominides.net/html/actualites/outils-preuve-presence-hominides-asie-0422.php,Les" Belles lettres, 2001.
"https://www.ancient-code.com/indian-archeologists-stumbleacross-ruins-great-forgotten-civilization-mizoram/"
"www.jstor.org/stable/24084454"
"www.biorespire.com/2016/03/22/une-nouvelle-villeantique-d%C3%A9couverte-en-inde/"
"insu.cnrs.fr/terre-solide/terre-et-vie/de-nouvellesdatations-repoussent-l-age-de-l-apparition-d-outils-surle-so"
"www.cerege.fr/spip.php?page=pageperso&id_user=94"

Problem with attachments' character encoding using gmail gem in ruby/rails

What I am doing:
I am using the gmail gem in a Rails 4 app to get email attachments from a specific account at regular intervals. Here is an extract from the core part (here for simplicity only considering the first email and its first attachment):
require 'gmail'
Gmail.connect(#user_email,#user_password) do |gmail|
if gmail.logged_in?
emails = gmail.inbox.emails(:from => #sender_email)
email = emails[0]
attachment = email.message.attachments[0]
File.open("~/temp.csv", 'w') do |file|
file.write(
StringIO.new(attachment.decoded.to_s[2..-2].force_encoding("ISO-8859-15").encode!('UTF-8')).read
)
end
end
end
The encoding of the attached file can vary. The particular one that I am currently having issues with is in Finnish. It contains Finnish characters and a superscripted 3 character.
This is what I expect to get when I run the above code. (This is what I get when I download the attachment manually through gmail user interface):
What the problem is:
However, I am getting the following odd results.
From cat temp.csv (Looks good to me):
With nano temp.csv (Here I have no idea what I am looking at):
This is what temp.csv looks like opened in Sublime Text (directly via winscp). First line and small parts look ok but then Chinese/Japanese characters:
This is what temp.csv looks like in Notepad (after download via winscp). Looks ok except a blank space has been inserted between each character and the new lines seems to be missing:
What I have tried:
I have without success tried:
.force_encoding(...) with all the different "ISO-8859-x" character sets
putting the force_encoding("ISO-8859-15").encode!('UTF-8') outside the .read (works but doesn't solve the problem)
encode to UTF-8 without first forcing another encoding but this leads to Encoding::UndefinedConversionError: "\xC4" from ASCII-8BIT to UTF-8
writing as binary with 'wb' and 'w+b' in the File.open() (which oddly doesn't seem to make a difference to the outcome).
searching stackoverflow and the web for other ideas.
Any ideas would be much appreciated!
Not beautiful, but it will work for me now.
After re-encoding, I convert the string to a char array, then remove the chars I do not want and then join the remaining array elements to form a string.
decoded_att = attachment.decoded
data = decoded_att.encode("UTF-8", "ISO-8859-1", invalid: :replace, undef: :replace).gsub("\r\n", "\n")
data_as_array = data.chars
data_as_array = data_as_array.delete_if {|i| i == "\u0000" || i == "ÿ" || i == "þ"}
data = data_as_array.join('').to_s
File.write("~/temp.csv", data.to_s)
This will work for me now. However, I have no idea how these characters have ended up in the attachment ("ÿ" and "þ" in the start of the document and "\u0000" between all remaining characters).
It seems like you need to do attachment.body.decoded instead of attachment.decoded

How to correctly handle character encoding when using Postgresql's copy_data function?

In my Rails app, I managed to stream large CSV files directly from Postgres based on solutions mentioned in this SO post. My working code looks somewhat like so:
query = <A Long SQL Query String>
response.headers["Cache-Control"] = "no-cache"
response.headers["Content-Type"] = "text/csv; charset=utf-8"
response.headers["Content-Disposition"] =
%(attachment; filename="#{csv_filename}")
response.headers["Last-Modified"] = Time.now.ctime.to_s
conn = ActiveRecord::Base.connection.raw_connection
conn.copy_data("COPY (#{query}) TO STDOUT WITH (FORMAT CSV, HEADER TRUE, FORCE_QUOTE *, ESCAPE E'\\\\');") do
while row = conn.get_copy_data
response.stream.write row
end
end
response.stream.close
end
Some of the columns (VARCHAR) being queried have values as either English or Chinese strings. The CSV file resulting from the above code doesn’t show the Chinese characters as is. Instead, I get something like this:
大大 文文
Am I supposed to change the way I’m using the copy_data function, or is there something I could do to the CSV file to solve this? I’ve tried saving the file as UTF-8 .txt file, as well as trying the convert_to function mentioned in the copy_data documentation, but to no avail.
This depends of the original encoding included in the CSV file.
Do this on Linux :
file -i you_file
Are you sure it's not UTF-16 or GB 18030 ?
And also in what kind of encoding is setup your database ?
do a \l in psql to see this.
So it boiled down to my MS Excel not being able to render the Chinese chars correctly. On MacOS, opening the same .csv file using the Numbers app (or even Atom, for that matter) resolved this issue for me.

Ruby How to convert back binary string from smsc

my app work with SMSC, and i need to get involve in sms before it send,
i try to send from the mobile that string
"hello this is test"
And when I check the smsc I got this as binary string of my text:
userData = "c8329bfd06d1d1e939283d07d1cb733a"
the encoding of this string is:
<Encoding:ASCII-8BIT>
I know that probably this userData is in GSM encoding in binary-string
so how can i get from userData back the clear text string ?
this question is for english lang, because in Hebrew I can get back the
string with this code:
[userData].pack('H*').force_encoding('utf-16be').encode('utf-8')
but in english i got error:
Encoding::InvalidByteSequenceError: "\xDA\xF3" followed by "u" on UTF-16BE
What I was try is to detect the binary string with ICU, and I got:
"ISO-8859-1" and the language that detected is: 'PT', that very strange cause my languages is English or Hebrew.
anyway i got lost with encoding stuff, so i try to encode to each name of list from Encoding.list
but without luck until now
thanks in advance
Shmulik
OK,
For who that also have this issue, i got the solution, thanks to someone from #ruby irc community (i missed his nickname)
The solution is:
for ascii chars that interpolate to binary:
You need that:
"c8329bfd06d1d1e939283d07d1cb733a".scan(/../).reverse_each.map { |h| h.to_i(16) }.pack('C*').unpack('B*')[0][2..-1].scan(/.{7}/).map.with_object("") { |x, s| s << x.to_i(2) }.reverse
Remember I sent this words in sms:
"hello this is test"
And that it has become in binary to:
"c8329bfd06d1d1e939283d07d1cb733a"
The reason that i got garbage in any encoding is, because the ascii chars is 7bits GSM, so only first 7bits represents the data but each another encoding uses at least 8bits, so that what the code actually do.
But this is just for ascii char set.
In another language like I use Hebrew, the SMS send as ucs2
So this code work for me:
[your_binary_string].pack('H*').force_encoding('utf-16be').encode('utf-8')
Very important to put the binary string in array
So that all for now.
If anybody want to translate and explain what exactly happen in the code for ascii char set, be my guest and welcome.
Shmulik

How to parse og meta tags using httparty for rails 3

I am trying to parse og meta tags using the HTTParty gem using this code:
link = http://www.usatoday.com/story/gameon/2013/01/08/nfl-jets-tony-sparano-fired/1817037/
# link = http://news.yahoo.com/chicago-lottery-winners-death-ruled-homicide-181627271.html
resp = HTTParty.get(link)
ret_body = resp.body
# title
og_title = ret_body.match(/\<[Mm][Ee][Tt][Aa] property\=\"og:title\"\ content\=\"(.*?)\"\/\>/)
og_title = og_title[1].to_s
The problem is that it worked on some sites (yahoo!) but not others (usa today)
Don't parse HTML with regular expressions, because they're too fragile for anything but the simplest problems. A tiny change to the HTML can break the pattern, causing you to begin a slow battle of maintaining an ever expanding pattern. It's a war you won't win.
Instead, use a HTML parser. Ruby has Nokogiri, which is excellent. Here's how I'd do what you want:
require 'nokogiri'
require 'httparty'
%w[
http://www.usatoday.com/story/gameon/2013/01/08/nfl-jets-tony-sparano-fired/1817037/
http://news.yahoo.com/chicago-lottery-winners-death-ruled-homicide-181627271.html
].each do |link|
resp = HTTParty.get(link)
doc = Nokogiri::HTML(resp.body)
puts doc.at('meta[property="og:title"]')['content']
end
Which outputs:
Jets fire offensive coordinator Tony Sparano
Chicago lottery winner's death ruled a homicide
Perhaps I can offer an easier solution? Check out the OpenGraph gem.
It's a simple library for parsing Open Graph protocol information from web sites and should solve your problem.
Solution:
og_title = ret_body.match(/\<[Mm][Ee][Tt][Aa] property\=\"og:title\"\ content\=\"(.*?)\"[\s\/\>|\/\>]/)
og_title = og_title[1].to_s
Trailing whitespace messed up the parsing so make sure to check for that. I added an OR clause to the regex to allow for both trailing and non trailing whitespace.

Resources