Ruby Net::IMAP - Attachment Filenames with special characters/umlauts [duplicate]

I have the following header:
From: =?iso-8859-1?Q?Marta_Falc=E3o?= <marta.falcao@example.com.br>
I can easily split out the stuff before the <, which leaves me with
"=?iso-8859-1?Q?Marta_Falc=E3o?="
What can I use to turn this into "Marta Falcão"?

Using the newer Mail gem:
Mail::Encodings.value_decode(str) or
Mail::Encodings.unquote_and_convert_to(str, to_encoding)

Thanks to Roland Illig for his comment, which led me to two options:
install rfc2047-ruby and call Rfc2047.decode(header)
install TMail and call TMail::Unquoter.unquote_and_convert_to(header, 'utf-8') or better yet TMail::Address.parse(header).friendly, the latter of which strips out the <email address> part

Implementing RFC 2047 in Ruby isn't hard:
module Rfc2047
  TOKEN = /[\041\043-\047\052\053\055\060-\071\101-\132\134\136\137\141-\176]+/.freeze
  ENCODED_TEXT = /[\041-\076\100-\176]+/.freeze
  ENCODED_WORD = /=\?(?<charset>#{TOKEN})\?(?<encoding>[QB])\?(?<encoded_text>#{ENCODED_TEXT})\?=/i.freeze

  class << self
    # Wraps the input in a single Base64 ("B") encoded word.
    def encode(input)
      "=?#{input.encoding}?B?#{[input].pack('m0')}?="
    end

    # Decodes the first encoded word found in the input and returns it as UTF-8.
    def decode(input)
      match_data = ENCODED_WORD.match(input)
      raise ArgumentError if match_data.nil?

      charset, encoding, encoded_text = match_data.captures
      decoded =
        case encoding
        # In Q encoding, "_" stands for a space (RFC 2047, section 4.2).
        when 'Q', 'q' then encoded_text.tr('_', ' ').unpack1('M')
        when 'B', 'b' then encoded_text.unpack1('m')
        end
      decoded.force_encoding(charset).encode('UTF-8')
    end
  end
end
Rfc2047.decode '=?iso-8859-1?Q?Marta_Falc=E3o?=' # => "Marta Falcão"
Update
mikel/mail currently has an encoding issue, so it might not decode the string correctly.
If that really bothers you, you can try new_rfc_2047:
$ gem install new_rfc_2047
$ ruby -rrfc_2047 -e 'puts Rfc2047.decode "From: =?iso-8859-1?Q?Marta_Falc=E3o?= <marta.falcao@example.com.br>"'
From: Marta Falcão <marta.falcao@example.com.br>
Since the source code of mikel/mail is a little too complicated for me to do the modification, I just made my own gem for this.
Gem source is here: https://github.com/tonytonyjan/rfc_2047/
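A header value can mix plain text with more than one encoded word, so decoding only the first match is often not enough. Here is a minimal sketch that replaces every encoded word in place using only the standard library. The names `ENCODED_WORD_RE` and `decode_header` are made up for this example and do not come from any gem. The pattern is deliberately looser than the RFC 2047 grammar, and the sketch does not drop the whitespace between adjacent encoded words as the RFC asks.

```ruby
# Hypothetical pattern: charset, encoding letter, encoded text.
ENCODED_WORD_RE = /=\?([^?\s]+)\?([QqBb])\?([^?\s]*)\?=/

# Hypothetical helper: decodes every encoded word in a header value to UTF-8.
def decode_header(value)
  value.gsub(ENCODED_WORD_RE) do
    charset, encoding, text = Regexp.last_match.captures
    bytes =
      if encoding.casecmp?('Q')
        # In Q encoding "_" stands for a space; the rest is quoted-printable.
        text.tr('_', ' ').unpack1('M')
      else
        # B encoding is plain Base64.
        text.unpack1('m')
      end
    # Label the raw bytes with the declared charset, then transcode.
    bytes.force_encoding(charset).encode('UTF-8')
  end
end

puts decode_header('From: =?iso-8859-1?Q?Marta_Falc=E3o?= <marta.falcao@example.com.br>')
# prints: From: Marta Falcão <marta.falcao@example.com.br>
```

An unknown charset name makes `force_encoding` raise ArgumentError, so real mail probably needs a rescue around that line.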

Related

Replacement for URI.escape that avoids Lint/UriEscapeUnescape warnings?

Not sure how to fix Rubocop's Lint/UriEscapeUnescape warning
Tried replacing URI with CGI thinking that was the "drop in" replacement but that blew up the test suite.
Here's the error followed by the line of code where URI is being used:
app/models/media_file.rb:76:5: W: Lint/UriEscapeUnescape: URI.escape method is obsolete and should not be used. Instead, use CGI.escape, URI.encode_www_form or URI.encode_www_form_component depending on your specific use case.
URI ...
^^^
# app/models/media_file.rb
...
def cdn_url(format: nil)
  if format.nil?
    "#{s3_config.cloudfront_endpoint}/#{escape_url(key)}"
  elsif converted_urls.with_indifferent_access[format.to_s]
    filename = converted_urls.with_indifferent_access[format.to_s]
    if URI.parse(escape_url(filename)).host
      filename
    else
      "#{s3_config.cloudfront_endpoint}/#{escape_url(filename)}"
    end
  else
    converted(url)
  end
end
...
private
def escape_url(url)
  URI
    .escape(url)
    .gsub(/\(/, '%28')
    .gsub(/\)/, '%29')
    .gsub(/\[/, '%5B')
    .gsub(/\]/, '%5D')
end
EDIT: Adding sample output of strings escaped with URI and CGI:
url: images/medium/test-image.jpg
URI.escape(url): images/medium/test-image.jpg
CGI.escape(url): images%2Fmedium%2Ftest-image.jpg
It would appear CGI is not a drop-in replacement for URI, as the linting error might have you believe. Thoughts?
I ran into the same problem and fixed it by using the addressable library.
escaped_query = URI.escape(search,
Regexp.new("[^#{URI::PATTERN::UNRESERVED}]"))
#W: Lint/UriEscapeUnescape: URI.escape method is obsolete and should not be used. Instead, use CGI.escape, URI.encode_www_form or URI.encode_www_form_component depending on your specific use case.
Solved by:
Add the addressable gem to the gemspec or Gemfile:
gem 'addressable', '~> 2.7'
Require it where appropriate:
require 'addressable/uri'
Then replace the URI.escape call:
escaped_query = Addressable::URI.encode_component(search, Addressable::URI::CharacterClasses::QUERY)
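If adding a gem is not an option, one standard-library approach is to escape each path segment on its own and leave the slashes alone. This is only a sketch: `escape_path` is a hypothetical name, and it assumes the input is a relative key or path such as the S3 keys in the question, not a full URL (the colon in `https:` would be escaped too).

```ruby
require 'erb'

# Hypothetical helper: percent-encodes each path segment but keeps the "/"
# separators. The -1 limit preserves a trailing slash.
def escape_path(path)
  path.split('/', -1).map { |segment| ERB::Util.url_encode(segment) }.join('/')
end

escape_path('images/medium/test-image.jpg') # => "images/medium/test-image.jpg"
escape_path('images/my file (1).jpg')       # => "images/my%20file%20%281%29.jpg"
```

ERB::Util.url_encode already escapes parentheses and square brackets, so the extra gsub calls from the question would not be needed with this approach.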

Gsub raises "invalid byte sequence in UTF-8"

I have the following method call:
Formatting.git_log_to_html(`git log --no-merges master --pretty=full #{interval}`)
The value of interval is something like release-20130325-01..release-20130327-04.
The git_log_to_html Ruby method is the following (I am only pasting the line that raises the error):
module Formatting
  def self.git_log_to_html(git_log)
    ...
    git_log.gsub(/^commit /, "COMMIT_STARTcommit").split("COMMIT_STARTcommit").each do |commit|
      ...
    end
  end
end
This used to work, but I have confirmed that gsub is now raising an "invalid byte sequence in UTF-8" error.
Could you help me understand why, and how I can fix it? :/
Here is the output of git_log:
https://dl.dropbox.com/u/42306424/output.txt
For some reason, this command:
git log --no-merges master --pretty=full #{interval}
is giving you a result that is not encoded in UTF-8. It may be that your computer is working with a different charset. Try the following:
module Formatting
  def self.git_log_to_html(git_log)
    ...
    git_log.force_encoding("UTF-8").gsub(/^commit /, "COMMIT_STARTcommit").split("COMMIT_STARTcommit").each do |commit|
      ...
    end
  end
end
I'm not sure if that will work, but you can try.
If that doesn't work, you can look at Ruby's iconv library to convert the string to UTF-8 once you know its charset: http://www.ruby-doc.org/stdlib-2.0/libdoc/iconv/rdoc/
Based on the file you linked in the comment, I ran:
require 'open-uri'
content = URI.open('https://dl.dropbox.com/u/42306424/output.txt').read
content.gsub(/^commit /, "COMMIT_STARTcommit").split("COMMIT_STARTcommit")
and it worked without any trouble.
btw, you can try:
require 'iconv'
module Formatting
  def self.git_log_to_html(git_log)
    ...
    git_log = Iconv.conv 'UTF-8', 'iso8859-1', git_log
    git_log.gsub(/^commit /, "COMMIT_STARTcommit").split("COMMIT_STARTcommit").each do |commit|
      ...
    end
  end
end
but you should really detect the charset of the string before attempting a conversion to utf-8.
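Since iconv is no longer part of the standard library (it was removed in Ruby 2.0), the same idea can be sketched with String#valid_encoding? and String#encode. The helper name `to_utf8` and the ISO-8859-1 fallback are assumptions made for this example; the real log may be in some other charset. String#scrub (Ruby 2.1+) is shown as a lossy alternative when the source charset is unknown.

```ruby
# Hypothetical helper: returns a valid UTF-8 copy of +bytes+.
# Assumes the input is either UTF-8 already or in the +fallback+ encoding.
def to_utf8(bytes, fallback: Encoding::ISO_8859_1)
  utf8 = bytes.dup.force_encoding(Encoding::UTF_8)
  return utf8 if utf8.valid_encoding?

  # Not valid UTF-8: relabel as the fallback charset and transcode.
  bytes.dup.force_encoding(fallback).encode(Encoding::UTF_8)
end

# "\xE3" is "ã" in ISO-8859-1 but an invalid byte sequence in UTF-8.
git_log = "commit 1a2b3c\nAuthor: Marta Falc\xE3o\n".b

commits = to_utf8(git_log).split(/^commit /).reject(&:empty?)
# => ["1a2b3c\nAuthor: Marta Falcão\n"]

# Lossy alternative: keep the UTF-8 label and replace the bad bytes.
lossy = git_log.dup.force_encoding(Encoding::UTF_8).scrub('?')
# => "commit 1a2b3c\nAuthor: Marta Falc?o\n"
```

Either result can be passed to gsub or split without raising "invalid byte sequence in UTF-8".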

Force strings to UTF-8 from any encoding

In my rails app I'm working with RSS feeds from all around the world, and some feeds have links that are not in UTF-8. The original feed links are out of my control, and in order to use them in other parts of the app, they need to be in UTF-8.
How can I detect encoding and convert to UTF-8?
Ruby 1.9
"Forcing" an encoding is easy; however, it won't convert the characters, it just changes the encoding label:
str = str.force_encoding('UTF-8')
str.encoding.name # => 'UTF-8'
If you want to perform a conversion, use encode:
begin
str.encode("UTF-8")
rescue Encoding::UndefinedConversionError
# ...
end
I would definitely read the following post for more information:
http://graysoftinc.com/character-encodings/ruby-19s-string
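To make the difference between the two calls concrete, here is a small sketch that looks at the bytes. The sample string is an assumption chosen for illustration: "café" stored as ISO-8859-1.

```ruby
# "café" in ISO-8859-1: the byte 0xE9 is "é".
latin1 = "caf\xE9".b.force_encoding(Encoding::ISO_8859_1)

relabelled = latin1.dup.force_encoding(Encoding::UTF_8) # same bytes, new label
converted  = latin1.encode(Encoding::UTF_8)             # new bytes, same characters

relabelled.bytes == latin1.bytes # => true
relabelled.valid_encoding?       # => false
converted.bytes                  # => [99, 97, 102, 195, 169]
converted.valid_encoding?        # => true
```

So force_encoding is only right when the bytes already are UTF-8 and merely carry the wrong label.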
This ensures you always end up with a valid UTF-8 string and never raises an error, because any invalid or undefined character is replaced with an empty string:
str.encode(Encoding.find('UTF-8'), invalid: :replace, undef: :replace, replace: '')
Iconv
require 'iconv'
i = Iconv.new('UTF-8','LATIN1')
a_with_hat = i.iconv("\xc2")
Summary: the iconv gem does all the work of converting encodings. Make sure it's installed with:
gem install iconv
Now, you need to know what encoding your string is currently in, as Ruby 1.8 treats Strings as an array of bytes (with no intrinsic encoding). For example, say your string was in latin1 and you wanted to convert it to utf-8:
require 'iconv'
string_in_utf8_encoding = Iconv.conv("UTF8", "LATIN1", string_in_latin1_encoding)
Only this solution worked for me:
string.encode('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '')
Note the binary argument.
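It may help to see what the binary source encoding does in practice. With 'binary' as the source, every byte above 0x7F has no defined conversion, so it gets replaced, and with replace: '' that means all non-ASCII characters are dropped, including ones that were valid UTF-8. A small sketch, with sample strings chosen for illustration:

```ruby
valid_utf8 = 'café'

stripped = valid_utf8.encode('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '')
stripped # => "caf"

# For comparison, String#scrub (Ruby 2.1+) only removes bytes that are
# actually invalid and leaves valid characters alone.
valid_utf8.scrub('')    # => "café"
"caf\xE9 ok".scrub('')  # => "caf ok"
```

That trade-off is fine when ASCII-only output is acceptable, but it is lossy for feeds with accented or non-Latin text.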

How to URL encode a string in Ruby

How do I URI::encode a string like:
\x12\x34\x56\x78\x9a\xbc\xde\xf1\x23\x45\x67\x89\xab\xcd\xef\x12\x34\x56\x78\x9a
to get it in a format like:
%124Vx%9A%BC%DE%F1%23Eg%89%AB%CD%EF%124Vx%9A
as per RFC 1738?
Here's what I tried:
irb(main):123:0> URI::encode "\x12\x34\x56\x78\x9a\xbc\xde\xf1\x23\x45\x67\x89\xab\xcd\xef\x12\x34\x56\x78\x9a"
ArgumentError: invalid byte sequence in UTF-8
from /usr/local/lib/ruby/1.9.1/uri/common.rb:219:in `gsub'
from /usr/local/lib/ruby/1.9.1/uri/common.rb:219:in `escape'
from /usr/local/lib/ruby/1.9.1/uri/common.rb:505:in `escape'
from (irb):123
from /usr/local/bin/irb:12:in `<main>'
Also:
irb(main):126:0> CGI::escape "\x12\x34\x56\x78\x9a\xbc\xde\xf1\x23\x45\x67\x89\xab\xcd\xef\x12\x34\x56\x78\x9a"
ArgumentError: invalid byte sequence in UTF-8
from /usr/local/lib/ruby/1.9.1/cgi/util.rb:7:in `gsub'
from /usr/local/lib/ruby/1.9.1/cgi/util.rb:7:in `escape'
from (irb):126
from /usr/local/bin/irb:12:in `<main>'
I looked all about the internet and haven't found a way to do this, although I am almost positive that the other day I did this without any trouble at all.
require 'cgi'
str = "\x12\x34\x56\x78\x9a\xbc\xde\xf1\x23\x45\x67\x89\xab\xcd\xef\x12\x34\x56\x78\x9a".force_encoding('ASCII-8BIT')
puts CGI.escape str
=> "%124Vx%9A%BC%DE%F1%23Eg%89%AB%CD%EF%124Vx%9A"
Nowadays, you should use ERB::Util.url_encode or CGI.escape. The primary difference between them is their handling of spaces:
>> ERB::Util.url_encode("foo/bar? baz&")
=> "foo%2Fbar%3F%20baz%26"
>> CGI.escape("foo/bar? baz&")
=> "foo%2Fbar%3F+baz%26"
CGI.escape follows the CGI/HTML forms spec and gives you an application/x-www-form-urlencoded string, which requires spaces be escaped to +, whereas ERB::Util.url_encode follows RFC 3986, which requires them to be encoded as %20.
See "What's the difference between URI.escape and CGI.escape?" for more discussion.
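For the raw bytes in the question, it can help to see the percent-encoding rule written out by hand. This is a sketch, not how either library is implemented: `percent_encode` is a hypothetical helper that treats the string as bytes and escapes everything except the RFC 3986 unreserved characters.

```ruby
# Hypothetical helper: percent-encodes every byte that is not an
# RFC 3986 unreserved character (letters, digits, "-", ".", "_", "~").
def percent_encode(str)
  # String#b returns a binary copy, so invalid UTF-8 cannot raise here.
  str.b.gsub(/[^A-Za-z0-9\-._~]/) { |byte| format('%%%02X', byte.ord) }
end

info_hash = "\x12\x34\x56\x78\x9a\xbc\xde\xf1\x23\x45\x67\x89\xab\xcd\xef\x12\x34\x56\x78\x9a"
percent_encode(info_hash)
# => "%124Vx%9A%BC%DE%F1%23Eg%89%AB%CD%EF%124Vx%9A"
percent_encode('foo/bar? baz&')
# => "foo%2Fbar%3F%20baz%26"
```

Working on a binary copy is what avoids the "invalid byte sequence in UTF-8" error from the question.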
str = "\x12\x34\x56\x78\x9a\xbc\xde\xf1\x23\x45\x67\x89\xab\xcd\xef\x12\x34\x56\x78\x9a"
require 'cgi'
CGI.escape(str)
# => "%124Vx%9A%BC%DE%F1%23Eg%89%AB%CD%EF%124Vx%9A"
Taken from @J-Rou's comment
I was originally trying to escape special characters in a file name only, not on the path, from a full URL string.
ERB::Util.url_encode didn't work for my use:
helper.send(:url_encode, "http://example.com/?a=\11\15")
# => "http%3A%2F%2Fexample.com%2F%3Fa%3D%09%0D"
Based on two answers in "Why is URI.escape() marked as obsolete and where is this REGEXP::UNSAFE constant?", it looks like URI::RFC2396_Parser#escape is better than using URI::Escape#escape. However, they both are behaving the same to me:
URI.escape("http://example.com/?a=\11\15")
# => "http://example.com/?a=%09%0D"
URI::Parser.new.escape("http://example.com/?a=\11\15")
# => "http://example.com/?a=%09%0D"
You can use the addressable gem for that:
require 'addressable/uri'
string = '\x12\x34\x56\x78\x9a\xbc\xde\xf1\x23\x45\x67\x89\xab\xcd\xef\x12\x34\x56\x78\x9a'
Addressable::URI.encode_component(string, Addressable::URI::CharacterClasses::QUERY)
# "%5Cx12%5Cx34%5Cx56%5Cx78%5Cx9a%5Cxbc%5Cxde%5Cxf1%5Cx23%5Cx45%5Cx67%5Cx89%5Cxab%5Cxcd%5Cxef%5Cx12%5Cx34%5Cx56%5Cx78%5Cx9a"
It uses a more modern format than CGI.escape; for example, it properly encodes a space as %20 rather than as a + sign. You can read more in "The application/x-www-form-urlencoded type" on Wikipedia.
2.1.2 :008 > CGI.escape('Hello, this is me')
=> "Hello%2C+this+is+me"
2.1.2 :009 > Addressable::URI.encode_component('Hello, this is me', Addressable::URI::CharacterClasses::QUERY)
=> "Hello,%20this%20is%20me"
Code:
str = "http://localhost/with spaces and spaces"
encoded = URI::encode(str)
puts encoded
Result:
http://localhost/with%20spaces%20and%20spaces
I created a gem to make URI encoding stuff cleaner to use in your code. It takes care of binary encoding for you.
Run gem install uri-handler, then use:
require 'uri-handler'
str = "\x12\x34\x56\x78\x9a\xbc\xde\xf1\x23\x45\x67\x89\xab\xcd\xef\x12\x34\x56\x78\x9a".to_uri
# => "%124Vx%9A%BC%DE%F1%23Eg%89%AB%CD%EF%124Vx%9A"
It adds the URI conversion functionality to the String class. You can also pass it an optional argument with the encoding you would like to use. By default it falls back to 'binary' encoding if the straight UTF-8 encoding fails.
If you want to "encode" a full URL without having to think about manually splitting it into its different parts, I found the following worked in the same way that I used to use URI.encode:
URI.parse(my_url).to_s

Reading the first line of a file in Ruby

I want to read only the first line of a file using Ruby in the fastest, simplest, most idiomatic way possible. What's the best approach?
(Specifically: I want to read the git commit UUID out of the REVISION file in my latest Capistrano-deployed Rails directory, and then output that to my tag. This will let me see at an http-glance what version is deployed to my server. If there's an entirely different & better way to do this, please let me know.)
This will read exactly one line and ensure that the file is properly closed immediately after.
strVar = File.open('somefile.txt') {|f| f.readline}
# or, in Ruby 1.8.7 and above: #
strVar = File.open('somefile.txt', &:readline)
puts strVar
Here's a concise idiomatic way to do it that properly opens the file for reading and closes it afterwards.
File.open('path.txt', &:gets)
If you want an empty file to cause an exception use this instead.
File.open('path.txt', &:readline)
Also, here's a quick & dirty implementation of head that would work for your purposes and in many other instances where you want to read a few more lines.
# Reads a set number of lines from the top.
# Usage: File.head('path.txt')
class File
  def self.head(path, n = 1)
    open(path) do |f|
      lines = []
      n.times do
        line = f.gets || break
        lines << line
      end
      lines
    end
  end
end
You can try this:
File.foreach('path_to_file').first
How to read the first line in a ruby file:
commit_hash = File.open("filename.txt").first
Alternatively you could just do a git-log from inside your application:
commit_hash = `git log -1 --pretty=format:"%H"`
The %H tells the format to print the full commit hash. There are also modules which allow you to access your local git repo from inside a Rails app in a more ruby-ish manner although I have never used them.
first_line = open("filename").gets
I think the jkupferman suggestion of investigating the git --pretty options makes the most sense, however yet another approach would be the head command e.g.
ruby -e 'puts `head -n 1 filename`' #(backtick before `head` and after `filename`)
Improving on the answer posted by @Chuck, I think it might be worthwhile to point out that if the file you are reading is empty, an EOFError exception will be thrown. Catch and ignore the exception:
def readit(filename)
  text = ""
  begin
    text = File.open(filename, &:readline)
  rescue EOFError
  end
  text
end
first_line = File.readlines('file_path').first.chomp
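Note that File.readlines loads the whole file and that `.first.chomp` raises NoMethodError on an empty file. Here is a sketch that pulls the points from the answers above together: it reads a single line, closes the file, strips the newline, and returns nil for an empty file. The helper name `first_line` and the sample REVISION contents are made up for this example.

```ruby
require 'tempfile'

# Hypothetical helper: returns the first line without its line ending,
# or nil when the file is empty. The block form closes the file.
def first_line(path)
  File.open(path) { |file| file.gets&.chomp }
end

# Try it on a throwaway file standing in for Capistrano's REVISION file.
Tempfile.create('REVISION') do |file|
  file.write("1a2b3c4d\nsecond line\n")
  file.flush
  puts first_line(file.path) # prints "1a2b3c4d"
end
```

Swapping `gets` for `readline` would raise EOFError on an empty file instead of returning nil.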
