Delete special characters in malformed text

I am encountering some malformed text and can't find a generalized way to remove the special characters.
This is the text as seen on the website: Technological�\x00 Sciences. Calling String#force_encoding('UTF-8') on it yields Technological\u0000 Sciences, which still causes Nokogiri to terminate early.
I could do a quick and dirty gsub, "Technological\u0000 Sciences".gsub(/\u0000/,''), but I was wondering whether there is a more generalized solution, or a configuration option in Nokogiri or Ruby, that would also work.

You can try this:
"Technological�\x00 Sciences".gsub(/[^[:alnum:][:space:][:punct:]]/, '')

You could do:
[29] pry(main)> str
=> "Technological�\u0000 Sciences"
[30] pry(main)> str.scan(/[a-zA-Z]{2,}/).join(' ')
=> "Technological Sciences"
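A middle ground between the two answers above is to delete only the characters that cause trouble instead of whitelisting what may stay. The following is a minimal sketch, assuming the input is already valid UTF-8; the names UNWANTED_CHARS and strip_unwanted are hypothetical, and the choice to keep tab, newline, and carriage return is an assumption about what the page text needs.

```ruby
# Hypothetical helper (not from the answers above): removes the Unicode
# replacement character (U+FFFD) and ASCII control characters, but keeps
# tab (U+0009), newline (U+000A), and carriage return (U+000D).
UNWANTED_CHARS = /[\u0000-\u0008\u000B\u000C\u000E-\u001F\u007F\uFFFD]/

def strip_unwanted(text)
  # Assumes text is valid UTF-8; gsub returns a new string.
  text.gsub(UNWANTED_CHARS, '')
end

cleaned = strip_unwanted("Technological\uFFFD\u0000 Sciences")
puts cleaned # prints "Technological Sciences"
```

Unlike the scan approach, this keeps digits, punctuation, and non-English letters intact.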

Related

invalid byte sequence in UTF-8 for single quote in Ruby

I'm using the following code to show description in template:
json.description resource.description if resource.description.present?
It raises an "invalid byte sequence in UTF-8" error. I dug into this a little and found that the description contains a curly single quote (’) instead of a straight one ('). What is the best way to fix this encoding issue? The description is entered by users, so I have no control over it. Another odd thing: I have multiple test environments, all with the same Ruby and Rails versions and running the same code, but only one of them shows this error.

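def to_utf8(str)
  # Work on a copy so the caller's string is not re-tagged in place
  str = str.dup.force_encoding("UTF-8")
  return str if str.valid_encoding?
  # Not valid UTF-8: treat the bytes as binary and replace whatever cannot be converted
  str = str.force_encoding("BINARY")
  str.encode("UTF-8", invalid: :replace, undef: :replace)
end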
ref: https://stackoverflow.com/a/17028706/455770
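The helper above replaces every byte it cannot interpret, so a curly quote that arrived in a legacy encoding is lost. If the source encoding can be guessed, transcoding preserves the character. The sketch below assumes the stray bytes are Windows-1252, which is only a guess and is not confirmed by the question; to_utf8_with_fallback and its fallback parameter are hypothetical names.

```ruby
# Hypothetical variant of to_utf8: ASSUMES invalid input is Windows-1252.
def to_utf8_with_fallback(str, fallback: Encoding::Windows_1252)
  utf8 = str.dup.force_encoding(Encoding::UTF_8)
  return utf8 if utf8.valid_encoding?

  # Re-tag the raw bytes with the assumed encoding, then transcode to UTF-8.
  str.dup.force_encoding(fallback)
     .encode(Encoding::UTF_8, invalid: :replace, undef: :replace)
end

# Byte 0x92 is the right single quotation mark in Windows-1252.
puts to_utf8_with_fallback("It\x92s".b) # prints "It’s"
```

If the guess is wrong the text will contain the wrong characters rather than raise, so this is only appropriate when the origin of the data is known.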

Rails: mongodb error when using $lt

This code returns an error:
db[:zips].find(city: {$lt: 'd'}).limit(2).to_a.each{|r| pp r}
syntax error, unexpected '}', expecting end-of-input
However, this code works:
db[:zips].find(city: {:$lt=> 'd'}).limit(2).to_a.each{|r| pp r}
Why can't I write $lt: as in the first version?

You can't use the JSON-like {key: value} syntax in this case, because the key starts with $. Either use the older hash-rocket syntax or, since Ruby 2.2, quote the key:
{'$lt': 'd'}
I couldn't find a reference for when quoting is required (emojis are OK for example) - I suspect you would have to delve into the ruby source for this.
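A quick way to see that the two working spellings build the same hash is to compare them directly. This is a small illustrative sketch; the variable names are made up for the example.

```ruby
# Both literals below use the Symbol :$lt as the key.
rocket_style = { :$lt => 'd' }  # older hash-rocket syntax
quoted_style = { '$lt': 'd' }   # quoted-label syntax, Ruby 2.2+

puts rocket_style == quoted_style       # prints "true"
puts quoted_style.keys.first.inspect    # prints ":$lt"

# Note: a hash rocket with a String on the left gives a String key instead,
# so this hash is NOT equal to the two above.
string_key = { '$lt' => 'd' }
puts string_key == quoted_style         # prints "false"
```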

Unicode issue with strftime on Windows jRuby

I have a date/time format:
format = '%Y年%b%d日 %H:%M'
Calling Time#strftime(format) (e.g. Time.now.strftime(format)) produces:
> Time.now.strftime(format)
=> "2013?Jun20? 11:56"
I'm using jruby 1.7.2 (1.9.3p327) on Windows. Is there a way to make strftime Unicode-compatible?
Update
The Windows console hasn't been very accommodating of Unicode. When I output just the format, I get:
> I18n.t :'time.formats.long'
=> "%YÕ╣┤%b%dµùÑ %H:%M"
but at least it's something. It's trying to show Unicode characters, whereas strftime just ignores it:
> I18n.t(:'time.formats.long').encoding
=> #<Encoding:UTF-8>
> Time.now.strftime("").encoding
=> #<Encoding:Windows-1252>

It's an issue with the readline library shipped with JRuby. A simple fix is to start irb with the --noreadline option (or add IRB.conf[:USE_READLINE] = false to your ~/.irbrc).
C:\ConEmu>jirb
irb(main):001:0> format = '%Y年%b%d日 %H:%M'
=> "%Y?b%d?%H:%M" # Readline cannot handle GBK input here
irb(main):002:0> exit
C:\ConEmu>jirb --noreadline
irb(main):001:0> format = '%Y年%b%d日 %H:%M'
=> "%Y年%b%d日 %H:%M" # without Readline, it works
irb(main):002:0> format.encoding
=> #<Encoding:GBK>
irb(main):003:0> Time.now.strftime(format)
=> "2013??Jun20?? 23:20" # strftime cannot process GBK input here
strftime doesn't handle a GBK-encoded string well, so encode the argument to UTF-8 before passing it to strftime. Incidentally, it is very strange that strftime returns a GBK-encoded string regardless of Encoding.default_internal.
C:\ConEmu>jirb --noreadline
irb(main):001:0> format = '%Y年%b%d日 %H:%M'
=> "%Y年%b%d日 %H:%M"
irb(main):002:0> Time.now.strftime(format.encode('utf-8'))
=> "2013年Jun20日 23:32"
irb(main):003:0> Time.now.strftime(format.encode('utf-8')).encoding
=> #<Encoding:GBK>
irb(main):004:0> Encoding.default_internal = Encoding::UTF_8
=> #<Encoding:UTF-8>
irb(main):005:0> Time.now.strftime(format.encode('utf-8')).encoding
=> #<Encoding:GBK>
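The encode-before-strftime step from the session above can be wrapped in a small helper so callers cannot forget it. This is a sketch only: strftime_utf8 is a hypothetical name, and the printed result is what standard (MRI) Ruby produces, since the JRuby 1.7.2 behaviour on Windows is exactly what is in question here.

```ruby
# Hypothetical helper: transcode the format string to UTF-8 before
# handing it to strftime, whatever encoding it arrived in.
def strftime_utf8(time, format)
  time.strftime(format.encode(Encoding::UTF_8))
end

time = Time.utc(2013, 6, 20, 23, 32)
gbk_format = '%Y年%b%d日 %H:%M'.encode('GBK') # simulate console input in GBK

puts strftime_utf8(time, gbk_format) # prints "2013年Jun20日 23:32" on MRI
```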
I don't have Rails environment under JRuby, so I cannot help with the I18n encoding issue.
Readline is provided as JVM bytecode class files, so there is no easy way to patch the library. The same goes for strftime.
This was my first taste of JRuby (I was interested in the encoding problem in Ruby), but I don't think I will pick it up again.
If you are looking for a programming language on the JVM, take a look at Scala. It's more consistent, productive and creative, and (most importantly, compared to JRuby) its libraries are less bug-prone.
Or, if you are interested in Ruby, try RailsInstaller on Windows, or install RVM under Linux in a virtual machine. I'm sure you will run into less trouble than with JRuby, or at least less encoding trouble.

iconv deprecation warning with ruby 1.9.3

I'm getting this warning when I run rspec:
/gems/activesupport-3.1.0/lib/active_support/dependencies.rb:240:in `block in require': iconv will be deprecated in the future, use String#encode instead.
I get the same warning with rails 3.1.0, 3.1.1, 3.1.2.rc2 versions. Seems it's related to sqlite3 gem, but I'm not sure. There are no warnings with ruby 1.9.2
Any suggestions how to deal with it?

You are getting this deprecation notice because a library somewhere is requiring iconv.
iconv is a gem created by Matz that can be used to convert strings from one encoding to another.
For example, this is often used:
Iconv.iconv('UTF-8//IGNORE', 'UTF-8', content)
This little bit of magic takes a UTF-8 string that may contain invalid characters and converts it to a proper UTF-8 string.
It has been decided that from Ruby 1.9.3 on we should no longer use iconv and should instead use the built-in String#encode. encode is more powerful and gives you more flexibility.
The theory is that the above example could be replaced with:
string.encode("UTF-8", :invalid => :replace, :undef => :replace, :replace => "?")
In practice it seems this is imperfect.
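On newer Rubies there is a more direct replacement for the //IGNORE idiom: String#scrub, which was added in Ruby 2.1 and is therefore not available on the 1.9.3 discussed here. A minimal sketch, assuming the content is meant to be UTF-8; drop_invalid_utf8 is a hypothetical name.

```ruby
# Hypothetical helper: tag the bytes as UTF-8 and drop every invalid
# byte sequence. Requires Ruby 2.1+ for String#scrub.
def drop_invalid_utf8(content)
  content.dup.force_encoding(Encoding::UTF_8).scrub('')
end

puts drop_invalid_utf8("valid\xFFtext".b) # prints "validtext"
```

Passing a replacement such as '?' instead of '' marks the damaged spots rather than silently removing them.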
This also leads to a less-than-easy story for gem authors who wish to support 1.8:
# Iconv.iconv returns an array of converted strings, hence the join
content = RUBY_VERSION.to_f < 1.9 ?
  Iconv.iconv('UTF-8//IGNORE', 'UTF-8', content).join :
  content.to_s.encode(Encoding::UTF_8, :invalid => :replace, :undef => :replace, :replace => '')
So, you have a gem somewhere that is requiring iconv. To find it:
Assume your error message points at /gems/activesupport-3.1.0/lib/active_support/dependencies.rb:240.
Open /gems/activesupport-3.1.0/lib/active_support/dependencies.rb at line 240.
Add the line:
p caller if file =~ /iconv/
(just after: load_dependency(file) { result = super })
You will get a big fat stack trace:
rake --tasks
/home/sam/.rvm/gems/ruby-1.9.3-p125/gems/activesupport-3.2.6/lib/active_support/dependencies.rb:251:in `block in require': iconv will be deprecated in the future, use String#encode instead.
["/home/sam/.rvm/gems/ruby-1.9.3-p125/gems/calais-0.0.13/lib/calais.rb:5:in `'",
.. more omitted ..
This tells me it is the calais gem. Looking through its pull requests, I am not the first to hit this. The pull request has not been merged.
Depending on the gem, there may be an upgraded version that does not have this problem, so I would recommend upgrading your gems first. If you are unlucky, you may be stuck with the unfortunate task of forking a gem to get rid of the warning (for example, if your pull request to fix it languishes).

If you're seeing this, it's very probably not Rails. If you look at the method surrounding the line being referred to in the error you posted, you'll see the following:
def require(file, *)
  result = false
  load_dependency(file) { result = super }
  result
end
I'm not saying it's your code, necessarily, but I'm certain that it's not actually the line in question where iconv is being called. In my case, I found that my project's code actually contained a reference to iconv.
If you want to check your code for such a reference, try grep -ir iconv ./ in your project directory.
When iconv is actually in a library it can be harder to find. By temporarily changing the above method to:
def require(file, *)
  result = false
  puts
  puts caller.reverse
  load_dependency(file) { result = super }
  result
end
You can then easily run your code and grep out the relevant lines of the backtrace to find the root cause of the warning.
ruby your/code.rb 2>&1 | grep -B 5 iconv
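If editing the installed ActiveSupport source is undesirable, a similar trace can be collected from your own code. The sketch below is an alternative under stated assumptions, not the method described above: RequireTracer, PATTERN, and LOG are hypothetical names, it needs Ruby 2.0+ for Module#prepend, and it only sees require calls made with an implicit receiver (the usual form).

```ruby
# Hypothetical tracer: records the call site of every `require` whose
# argument matches PATTERN, without touching any gem's source files.
module RequireTracer
  PATTERN = /iconv/ # adjust to the library you are hunting for
  LOG = []

  private

  def require(name)
    # Remember the name and the first few frames of the caller's backtrace.
    LOG << [name, caller.first(5)] if name.to_s =~ PATTERN
    super
  end
end

# Load this before the rest of the application so later requires are seen.
Object.prepend(RequireTracer)
```

After the application has loaded, printing RequireTracer::LOG shows which files asked for iconv.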

Add this to the start of your program:
oldverb = $VERBOSE; $VERBOSE = nil
require 'iconv'
$VERBOSE = oldverb
and curse the people who think this is a professional way to handle deprecation.

You can pin down the exact location of the warning by generating exceptions for ActiveSupport::Deprecation, instead of just printing to the log. At the top of application.rb:
ActiveSupport::Deprecation.behavior = Proc.new do |message, backtrace|
  raise message
end
Once you've figured out where the warning is coming from (by inspecting the full backtrace), remove this again.

To remove this warning...
go to your .rvm directory and find iconv.c (mine was at ~/.rvm/src/ruby-1.9.3-p125/ext/iconv/iconv.c)
edit that file and remove or comment out the call to warn_deprecated() (should be near the bottom)
from that file's directory, run ruby extconf.rb
then make
then make install
Should do the trick

ActiveSupport::JSON.decode does not properly handle literal line breaks

Is this the expected behavior? Note how the line break character gets lost.
ruby-1.9.2-p136 :001 > ActiveSupport::JSON.decode("{\"content\": \"active\n\nsupport\"}")
=> {"content"=>"active\nsupport"}
The same happens with unicode-escaped line breaks:
ruby-1.9.2-p136 :002 > ActiveSupport::JSON.decode("{\"content\": \"active\u000a\u000asupport\"}")
=> {"content"=>"active\nsupport"}
I'm using rails 3.0.3.

I eventually came across this ticket: https://rails.lighthouseapp.com/projects/8994/tickets/3479-activesupportjson-fails-to-decode-unicode-escaped-newline-and-literal-newlines
It seems this is a bug in ActiveSupport that will be fixed in Rails 3.0.5. For now I have patched activesupport and things are working as expected.
ruby-1.9.2-p136 :001 > ActiveSupport::JSON.decode("{\"content\": \"active\n\nsupport\"}")
=> {"content"=>"active\n\nsupport"}

To represent a newline in JSON data using double quotes, you must escape the newline:
ActiveSupport::JSON.decode("{\"content\": \"active\\n\\nsupport\"}")
Otherwise, you are inserting a newline into the JSON source and not the JSON data. Note that this would also work:
ActiveSupport::JSON.decode('{"content": "active\n\nsupport"}')
By using single quotes, you are no longer inserting a literal newline into the source of the JSON.
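The difference between the two quoting styles can be checked with Ruby's standard json library, which is stricter than the backend discussed in the question. This is an illustrative sketch and does not go through ActiveSupport; the variable names are made up for the example.

```ruby
require 'json'

# Single quotes: the JSON source contains backslash + n (an escape sequence).
escaped = '{"content": "active\n\nsupport"}'
# Double quotes: Ruby turns \n into real newlines inside the JSON string.
literal = "{\"content\": \"active\n\nsupport\"}"

content = JSON.parse(escaped)['content']
puts content == "active\n\nsupport" # prints "true"

# Raw control characters are not allowed inside a JSON string,
# so the strict parser rejects the second document.
literal_rejected =
  begin
    JSON.parse(literal)
    false
  rescue JSON::ParserError
    true
  end
puts literal_rejected # prints "true"
```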
It is interesting to note the way ActiveSupport handles this by default (the default JSON backend is ActiveSupport::JSON::Backends::Yaml). By installing the json gem and changing the JSON backend to it (ActiveSupport::JSON.backend = 'JSONGem') and attempting to decode the same text (ActiveSupport::JSON.decode("{\"content\": \"active\\n\\nsupport\"}")) you get the following:
JSON::ParserError: 737: unexpected token at '{"content": "active
support"}'
