I'm using the DocSplit gem for Ruby 1.9.3 to create Unicode UTF-8 versions of word documents. To my surprise today while I was running a test on a particular piece of one of these documents I started running into character encoding inconstencies.
I have tried a number of different methods to resolve the issue which I will list below, but the best success I've had so far is to remove all non-ASCII characters. This is far from ideal, as I don't think the character's are really going to be all that problematic in the DB.
gsub(/[^[:ascii:]]/, "")
This is a sample of what my output looks like vs. what I'm expecting:
My CODES'S APOSTROPHE
My CODES’S APOSTROPHE
The second apostrophe should look squiggly. If you paste it into irb, you get the following: \U+FFE2
I tried Regexing specifically for this character and it appears to work in Rubular. As soon as I put it in my model however, I got a syntax error.
syntax error, unexpected $end, expecting ')'
raw_title = raw_title.gsub(/’/, "")
I also tried forcing the encoding to UTF-8, but everything is already in UTF-8 and this does not appear to have an effect. I tried forcing the output to US-ASCII, but I get a byte sequence error.
I also tried a few of the encoding options found in Ruby library. These basically did the same thing as the Regex.
This all comes down to that I'm trying to match output for testing purposes. Should I even be concerned about these special characters? Is there a better way to match these characters without blindly removing them?
Try adding:
# encoding: utf-8
at the top of the failing rspec file. This should ensure things like:
raw_title = raw_title.gsub(/’/, "")
in your spec work.
I tried using the above example. but even after that it kept failing. So I used iconv to convert that specfic character. THis is what I used
Iconv.conv('ASCII//IGNORE', 'UTF8', text_to_be_converted)
I tried what was given in the following link - How to get rid of non-ascii characters in ruby
Related
I'm working on a project with some supplied CSV files that I need to parse and do some manipulation on. One is throwing this error when I try to load it into a file using CSV.read('path/file.csv')
CSV::MalformedCSVError: Unquoted fields do not allow \r or \n (line 7911).
Now when looking at the file, the last line is just blank. It's a \n character. I feel like this should not break the CSV read but it is. Now, I could just check the end of the CSV documents and strip any access return carriages/new lines since that seems like it'll work but it doesn't seem like the correct way. Anybody have some advice?
Edit: Using Ruby 2.0.0 and Rails 4.0.5
I have a rails app where my users can manually set up products via a web form. This works fine and accepts foreign characters well, words like 'Svölk' for example.
I now have a need to bulk import products and am using FasterCSV to do so. Generally this works without issue, but when the CSV contains foreign characters it stalls at that point.
Am I correct to believe the file needs to be UTF-8 in the first instance?
Also, I'm running Ruby 1.8.7 so is ICONV my only solution for converting the file? This could be an issue as the format of the original file won't be known.
Have others encountered this issue and if so, how did you overcome it?
You have two alternatives:
Use ensure_encoding gem to find the actual encoding of the strings.
Use Ruby to determine the file encoding using:
File.open(source_file).read.encoding
I prefer the first approach as it tries to detected the encoding based on Strings, and tries to convert to your desired encoding (UTF-8) and then you can set the encoding on FasterCSV options.
I have a Rails application where I use regex-based rules to categorize transactions. In my seeds.rb, I create some categories and rules, then import transactions from a CSV file (also utf8-encoded) and allow them to be categorized. This process works fine on my development machine, but when I run it on Heroku, I get:
incompatible encoding regexp match (ASCII-8BIT regexp with UTF-8 string)
I am running the Cedar Stack, Rails 2.3.15. I have put
# encoding: utf-8
at the top of all my source files and I've set the encoding to utf-8 in my app config, so I'm not sure what else could be causing this problem. I'm wondering if has something to do with the Heroku configuration.
The issue could be caused by invisible characters that are ignored by your local operating system, ensuring proper encoding takes place whereas on Heroku, the characters mess up the magic number declaration at the top of the file and you end up with both ASCII-8BIT and UTF-8.
Since the file that is having issues contains the regex, it's probably your model class instead of seeds.rb.
There are many ways to view invisible characters in your file. In vi, just set the option :set list
<a class="close" href="#">×</a>
I get an error regarding the use of ×.
It's used in error messages on twitter's bootstrap framework, I get an invalid byte sequence in UTF-8 error when I try to use it. Is there any work-around? Apart from using a normal x or X.
I have:
# Configure the default encoding used in templates for Ruby 1.9.
config.encoding = "utf-8"
In my application.rb
This seems almost too simple, but why aren't you using ×?
You need to set the encoding at the top of the file where that character is used. You can do this with:
# coding: utf-8
class MyClass
end
I haven't tried it in an erb file, but I don't see why that would be any different. I think you can use the word "encoding" too instead of just "coding" if that feels better. All that is required is at minimum "coding".
What editor are you using?
I suspect that you are saving the source file using an encoding other than UTF-8 (such as Latin-1 or ANSI on Windows), which is then causing ruby to fail to interpret the file correctly.
I've tried adding the times symbol to one of my views (using HAML) and it worked correctly. I'm using VIM as my editor and saving in UTF-8 without any BOM.
#encoding: utf-8
class ClassiClass
end
everything works fine!
i'm having problem to deal with charset in ruby on rails app, specificially in my templates. Code that comes from my database, works fine, but codes like ç ~ that are located in my views are not working. I added the following codes to my code
I added a function like that, but that still not working i have ç ~ codes in my application.rhtml that are not working.
before_filter :configure_charsets
# Configuring charset to UTF-8 def configure_charsets
headers["Content-Type"] = "text/html; charset=UTF-8"
end
I added as well meta http-equiv html to utf-8 and a .htaccess parameter AddDefaultCharset UTF-8
That's still not working, any other tip?
Put this piece of code in your config (environment.rb)
Rails::Initializer.run do |config|
config.action_controller.default_charset = "iso-8859-1"
end
This will do it.
Also, remove the default charset line if any in layouts/application.html
Is the text editor you're using to put the special characters into the file (either source or views) treating those characters as UTF-8? For example, if you're using TextMate, you can deliberately save a file as UTF-8. If for some reason you used a different encoding earlier (a default, perhaps), those UTF-8 characters might be getting transcoded at the code editing stage, so even if the rendering process is using UTF-8 throughout, it'll still not work.
Further, if you're using something from a shell, like vi, or whatever, is your terminal set up to accept UTF-8 as default? If you had it set to ISO-8859-1 or whatever, you'd get the same issue.
Is your application.rhtml file written in the correct character set? Make sure it's UTF-8, and not ISO-8859-1.
So if the contents of your file are UTF-8, and the output is being interpreted as UTF-8, something in between is changing the data. Can give give us the the hex interpretation of the input bytes (anything non-ASCII will be at least two bytes in UTF-8) for one of your special characters, and the hex interpretation of the output byte or bytes? Perhaps we can figure out what the change is, and work back from there.