Unicodes in file get escaped - ruby-on-rails

I have a model Person. When I save a new Person through the console with a Unicode escape, e.g.:
p = Person.new
p.name = "Hans"
p.street = "Jo\u00DFstreet"
p.save
then p.street returns "Joßstreet", which is correct. But when I try to build records from a file in my seeds:
file.txt
Hans;Jo\u00DFstreet
Jospeh;Baiuvarenstreet
and run this in my seed:
File.readlines('file.txt').each do |line|
  f = line.split(';')
  p = Person.new
  p.name = f[0]
  p.street = f[1]
  p.save
end
Now when I call, e.g., p = Person.last, I get p.street => "Jo\\u00DFstreet"
I don't understand why the \u00DF gets escaped! What can I do to fix this problem? Thanks

It's because escape sequences such as \u00DF are handled only in source code string literals, as part of Ruby syntax.
When you read a file (or receive data from somewhere else), Ruby doesn't try to handle escape sequences; you have to do that in your own code.
To unescape the string you can use the approaches described here or there, but you might be better off storing unescaped text in your file in the first place.
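For example, here is a minimal sketch of unescaping \uXXXX sequences while seeding; the gsub-based conversion is just one possible approach, and the file layout and Person model come from the question:
File.readlines('file.txt').each do |line|
  name, street = line.chomp.split(';')
  # Turn literal "\u00DF" sequences into the corresponding characters.
  street = street.gsub(/\\u([0-9a-fA-F]{4})/) { [$1.hex].pack('U') }
  Person.create!(name: name, street: street)
end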

Related

Ruby switch words for hyperlinks without changing the chars

I have two different models: Post (which has content) and Keyword (which has a word and a link). I am trying to write a function that replaces words in the post content with hyperlinks built from matching keywords. For example, if there is a keyword 'Hello' with a link and the word 'hello' appears in post.content, I want 'hello' to become a hyperlink using the link from 'Hello'.
Here is my function:
def execute
  @post = Post.find(params[:post_id])
  all_keys = Keyword.all.pluck(:key, :link)
  all_keys = all_keys.map.to_h
  all_keys = all_keys.transform_keys(&:downcase)
  new_content = @post.content.to_s
  new_content_downcase = new_content.downcase
  all_keys.map { |key, link| new_content_downcase.gsub!(key, "<a href='#{link}'>#{key}</a>") }
  @post.content = new_content_downcase
  @post.save!
end
The function is simple: I build a hash { key => link } and take @post.content, then I downcase both the hash keys and @post.content and replace the words in the post content with the key from the hash wrapped in its link (so it looks like a hyperlink).
Everything works fine, but the problem is that it switches words in @post.content to lowercase (Hello --> hello). Is there any way to compare new_content and new_content_downcase, keep the original word, AND put the hyperlink on it?
Just don't downcase the post's content, that's it :) You could use gsub! with a block to make things concise, something like the following:
def execute
  @post = Post.find(params[:post_id])
  keys = Keyword.pluck(:key, :link).to_h.transform_keys(&:downcase)
  @post.content.gsub!(/\w+/) do |word|
    # We downcase each word when we check for the link's presence...
    url = keys[word.downcase]
    # ...but not when we do the replacement.
    url ? "<a href='#{url}'>#{word}</a>" : word
  end
  @post.save!
end
So, your output is all lowercase because you've applied #downcase to both your list of keywords and your content. And I assume you did that because you're doing a literal match between the keyword and the content string in your gsub.
One solution is to use a case-insensitive regex instead:
all_keys.map { |key, link|
  @post.content.gsub!(/(#{key})/i, "<a href='#{link}'>\1</a>")
}
Here, I've ignored the downcase and just used @post.content directly (I assume that it's a string so the to_s is redundant).
Then, in the gsub, I replaced the direct key match with a regex. It uses brackets to capture the matched term for use in the replacement, so that you retain the capitalisation of the source rather than that of the stored keyword. The \1 in the replacement string is how that captured result from the regex gets used.
Fingers crossed that gets you working!
===Edit===
Here's an attempt at doing this properly, updating the entire method. (I'd also not escaped the \1 above, which it needs because it's in double quotes. Sorry about that!)
def execute
  @post = Post.find(params[:post_id])
  _content = @post.content
  Keyword.pluck(:key, :link).to_h.each { |_key, _link|
    _content.gsub!(/(#{_key})/i, "<a href='#{_link}'>\\1</a>")
  }
  @post.update(content: _content)
end
Don't add key after \1, as you mention in a comment - the \1 should automatically be replaced with whatever was found by the regex (i.e. the value of key regardless of case).
Also, you shouldn't need to downcase your Keyword entries in any case: the time to do that is when they're created, so you only have to do it once.
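If you go that route, a hedged sketch (assuming the Keyword model and :key column from the question) is to normalize the key in a callback, so it's downcased once at write time:
class Keyword < ApplicationRecord
  # Store keys downcased so lookups never have to transform them again.
  before_save { self.key = key.downcase }
end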

How to fix slow Nokogiri parsing

I have a Rake task in my Rails app which looks in a folder for XML files, parses them, and saves the data to a database. The code works OK, but I have about 2100 files totaling 1.5GB, and processing is very slow: about 400 files in 7 hours. There are approximately 600-650 contracts in each XML file, and each contract can have 0 to n attachments. I did not paste all the values, but each contract has 25 of them.
To speed up the process I use the activerecord-import gem, so I build up an array per file and, when the whole file is parsed, do a mass import to Postgres. Only if a record already exists is it updated directly and/or a new attachment inserted, but that happens for maybe 1 out of 100000 records. This helps a little compared to creating a new record per contract, but now I see that the slow part is the XML parsing. Can you please check whether I am doing something wrong in my parsing?
When I tried printing the arrays I am building, the slow part was waiting until the whole file was loaded/parsed before it started printing array by array. That's why I assume the speed problem is in the parsing, as Nokogiri loads the whole XML before it starts.
require 'nokogiri'
require 'pp'
require "activerecord-import/base"
ActiveRecord::Import.require_adapter('postgresql')

namespace :loadcrz2 do
  desc "this task load contracts from crz xml files to DB"
  task contracts: :environment do
    actual_dir = File.dirname(__FILE__).to_s
    Dir.foreach(actual_dir+'/../../crzfiles') do |xmlfile|
      next if xmlfile == '.' or xmlfile == '..' or xmlfile == 'archive'
      page = Nokogiri::XML(open(actual_dir+"/../../crzfiles/"+xmlfile))
      puts xmlfile
      cons = page.xpath('//contracts/*')
      contractsarr = []
      @c = []
      cons.each do |contract|
        name = contract.xpath("name").text
        crzid = contract.xpath("ID").text
        procname = contract.xpath("procname").text
        conname = contract.xpath("contractorname").text
        subject = contract.xpath("subject").text
        dateeff = contract.xpath("dateefficient").text
        valuecontract = contract.xpath("value").text
        attachments = contract.xpath('attachments/*')
        attacharray = []
        attachments.each do |attachment|
          attachid = attachment.xpath("ID").text
          attachname = attachment.xpath("name").text
          doc = attachment.xpath("document").text
          size = attachment.xpath("size").text
          arr = [attachid, attachname, doc, size]
          attacharray.push arr
        end
        @con = Crzcontract.find_by_crzid(crzid)
        if @con.nil?
          @c = Crzcontract.new(:crzname => name, :crzid => crzid, :crzprocname => procname, :crzconname => conname, :crzsubject => subject, :dateeff => dateeff, :valuecontract => valuecontract)
        else
          @con.crzname = name
          @con.crzid = crzid
          @con.crzprocname = procname
          @con.crzconname = conname
          @con.crzsubject = subject
          @con.dateeff = dateeff
          @con.valuecontract = valuecontract
          @con.save!
        end
        attacharray.each do |attar|
          attachid = attar[0]
          attachname = attar[1]
          doc = attar[2]
          size = attar[3]
          @at = Crzattachment.find_by_attachid(attachid)
          if @at.nil?
            if @con.nil?
              @c.crzattachments.build(:attachid => attachid, :attachname => attachname, :doc => doc, :size => size)
            else
              @a = Crzattachment.new
              @a.attachid = attachid
              @a.attachname = attachname
              @a.doc = doc
              @a.size = size
              @a.crzcontract_id = @con.id
              @a.save!
            end
          end
        end
        if @c.present?
          contractsarr.push @c
        end
        # p @c
      end
      # p contractsarr
      puts "done"
      if contractsarr.present?
        Crzcontract.import contractsarr, recursive: true
      end
      FileUtils.mv(actual_dir+"/../../crzfiles/"+xmlfile, actual_dir+"/../../crzfiles/archive/"+xmlfile)
    end
  end
end
There are a number of problems with the code. Here are some ways to improve it:
actual_dir = File.dirname(__FILE__).to_s
Don't use to_s; dirname already returns a string.
actual_dir+'/../../crzfiles', with and without a trailing path delimiter, is used repeatedly. Don't make Ruby rebuild the concatenated string over and over. Instead define it once, and take advantage of Ruby's ability to build the full path:
File.absolute_path('../../bar', '/path/to/foo') # => "/path/bar"
So use:
actual_dir = File.absolute_path('../../crzfiles', __FILE__)
and then refer to actual_dir only:
Dir.foreach(actual_dir)
This is unwieldy:
next if xmlfile == '.' or xmlfile == '..' or xmlfile == 'archive'
I'd do:
next if (xmlfile[0] == '.' || xmlfile == 'archive')
or even:
next if xmlfile[/^(?:\.|archive)/]
Compare these:
'.hidden'[/^(?:\.|archive)/] # => "."
'.'[/^(?:\.|archive)/] # => "."
'..'[/^(?:\.|archive)/] # => "."
'archive'[/^(?:\.|archive)/] # => "archive"
'notarchive'[/^(?:\.|archive)/] # => nil
'foo.xml'[/^(?:\.|archive)/] # => nil
The pattern will return a truthy value if the name starts with '.' or with 'archive'. It's not as readable, but it's compact. I'd recommend the compound conditional test though.
In some places, you're concatenating xmlfile, so again let Ruby do it once:
xml_filepath = File.join(actual_dir, xmlfile)
which will honor the file path delimiter for whatever OS you're running on. Then use xml_filepath instead of concatenating the name:
xml_filepath = File.join(actual_dir, xmlfile)
page = Nokogiri::XML(open(xml_filepath))
[...]
FileUtils.mv(xml_filepath, File.join(actual_dir, "archive", xmlfile))
join is a good tool so take advantage of it. It's not just another name for concatenating strings, because it's also aware of the correct delimiter to use for the OS the code is running on.
You use a lot of instances of:
xpath("some_selector").text
Don't do that. xpath, along with css and search, returns a NodeSet, and text, when used on a NodeSet, can be evil in a way that will send you down a very steep and slippery slope. Consider this:
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<root>
<node>
<data>foo</data>
</node>
<node>
<data>bar</data>
</node>
</root>
EOT
doc.search('//node/data').class # => Nokogiri::XML::NodeSet
doc.search('//node/data').text # => "foobar"
The concatenation of the text into 'foobar' can't be split easily and it's a problem we see here in questions too often.
Do this if you expect to get a NodeSet back because you're using search, xpath or css:
doc.search('//node/data').map(&:text) # => ["foo", "bar"]
It's better to use at, at_xpath or at_css if you're after a specific node because then text will work as you'd expect.
See "How to avoid joining all text from Nodes when scraping" also.
There's a lot of replication that could be DRY'd. Instead of this:
name = contract.xpath("name").text
crzid = contract.xpath("ID").text
procname = contract.xpath("procname").text
You could do something like:
name, crzid, procname = [
'name', 'ID', 'procname'
].map { |s| contract.at(s).text }
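The same idea could be applied to the attachment loop in the question. A rough sketch, using at so that text returns a single node's value, and the ID/name/document/size fields from the original code:
attacharray = contract.xpath('attachments/*').map do |attachment|
  %w[ID name document size].map { |field| attachment.at(field).text }
end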

Ruby (Rails) unescape a string -- undo Array.to_s

I have been hacking together a couple of libraries, and had an issue where a string was getting 'double escaped'.
For example:
Fixed example
> x = ['a']
=> ["a"]
> x.to_s
=> "[\"a\"]"
>
Then, escaped again, it becomes:
\"\[\\\"s\\\"\]\"
This was happening while dealing with HTTP headers. I have a header which will be an array, but the HTTP library is doing its own character escaping on the array.to_s value.
The workaround I found was to convert the array to a string myself, and then 'undo' the to_s. Like so:
formatted_value = value.to_s
if value.instance_of?(Array)
  formatted_value = formatted_value.gsub(/\\/,"") # remove backslashes
  formatted_value = formatted_value.gsub(/"/,"")  # remove quotes
  formatted_value = formatted_value.gsub(/\[/,"") # remove [
  formatted_value = formatted_value.gsub(/\]/,"") # remove ]
end
value = formatted_value
... There's gotta be a better way ... (without needing to monkey-patch the gems I'm using). (Yeah, this breaks if my string actually contains those characters.)
Suggestions?
** UPDATE 2 **
Okay. Still having trouble in this neighborhood, but now I think I've figured out the core issue. It's serializing my array to JSON after a to_s call. At least, that seems to reproduce what I'm seeing.
['a'].to_s.to_json
I'm calling a method in a gem that returns the result of a to_s, and then I'm calling to_json on it.
I've edited my answer due to your edited question:
I still can't duplicate your results!
>> x = ['a']
=> ["a"]
>> x.to_s
=> "a"
But when I change the last call to this:
>> x.inspect
=> "[\"a\"]"
So I'll assume that's what you're doing?
It's not necessarily escaping the values, per se. It's storing the string like this:
%{["a"]}
or rather:
'["a"]'
In any case, this should work to un-stringify it:
>> x = ['a']
=> ["a"]
>> y = x.inspect
=> "[\"a\"]"
>> z = Array.class_eval(y)
=> ["a"]
>> x == z
=> true
I'm skeptical about the safeness of using class_eval though; be wary of user input, because it may produce unintended side effects (and by that I mean code injection attacks) unless you're very sure you know where the original data came from, or what was allowed through to it.
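If the data really is as simple as the example (an array of plain strings), one hedged alternative that avoids eval entirely is to lean on the fact that the inspect output above happens to be valid JSON; this only holds for such simple arrays:
require 'json'

y = ['a'].inspect  # => "[\"a\"]"
JSON.parse(y)      # => ["a"]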

Converting filesize string to kilobyte equivalent in Rails

My objective is to take form input like "100 megabytes" or "1 gigabyte" and convert it to a file size in kilobytes that I can store in the database. Currently, I have this:
def quota_convert
  @regex = /([0-9]+) (.*)s/
  @sizes = %w{kilobyte megabyte gigabyte}
  m = self.quota.match(@regex)
  if @sizes.include? m[2]
    eval("self.quota = #{m[1]}.#{m[2]}")
  end
end
This works, but only if the input is plural ("gigabytes", but not "gigabyte"), and it seems insanely unsafe due to the use of eval. So: functional, but I won't sleep well tonight.
Any guidance?
EDIT: ------
All right. For some reason, the regex with (.*?) isn't working correctly on my setup, but I've worked around it with Rails stuff. Also, I've realized that bytes would work better for me.
def quota_convert
  @regex = /^([0-9]+\.?[0-9]*?) (.*)/
  @sizes = { 'kilobyte' => 1024, 'megabyte' => 1048576, 'gigabyte' => 1073741824 }
  m = self.quota.match(@regex)
  if @sizes.include? m[2].singularize
    self.quota = m[1].to_f * @sizes[m[2].singularize]
  end
end
This catches "1 megabyte", "1.5 megabytes", and most other things (I hope). It then uses the singular version regardless, does the multiplication, and spits out magic answers.
Is this legit?
EDIT AGAIN: See answer below. Much cleaner than my nonsense.
You can use the Rails ActionView helper number_to_human_size.
def quota_convert
  @regex = /([0-9]+) (.*)s?/
  @sizes = "kilobytes megabytes gigabytes"
  m = self.quota.match(@regex)
  if @sizes.include? m[2]
    m[1].to_f.send(m[2])
  end
end
Added ? for an optional plural in the regex.
Changed @sizes to a string of plurals.
Converted m[1] (the number) to a float.
Sent the message m[2] directly.
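For context, the send call relies on ActiveSupport's numeric extensions, which return byte counts, roughly like this:
100.0.megabytes # => 104857600.0
1.0.gigabytes   # => 1073741824.0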
Why don't you simply create a hash that contains the various spellings of the multiplier as keys and the numerical values as values? No eval necessary, and no regexes either!
First of all, changing your regex to @regex = /([0-9]+) (.*?)s?/ will fix the plural issue. The ? after the 's' matches either 0 or 1 of the 's', and the ? in (.*?) makes .* match in a non-greedy manner (as few characters as possible).
As for the size, you could have a hash like this:
@hash = { 'kilobyte' => 1, 'megabyte' => 1024, 'gigabyte' => 1024*1024 }
and then your calculation is just self.quota = m[1].to_i * @hash[m[2]]
EDIT: Changed values to base 2
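Putting the two suggestions together, a minimal sketch of the hash-based version; it assumes quota holds a string like "1.5 megabytes" and that you want to store kilobytes, as in this answer's hash:
def quota_convert
  regex = /\A([0-9]+(?:\.[0-9]+)?) (.*?)s?\z/
  sizes = { 'kilobyte' => 1, 'megabyte' => 1024, 'gigabyte' => 1024 * 1024 }
  m = self.quota.to_s.match(regex)
  self.quota = (m[1].to_f * sizes[m[2]]).round if m && sizes.key?(m[2])
end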

Get the value of a model field as it is in the database

Suppose I do
>> a = Annotation.first
>> a.body
=> "?"
>> a.body = "hello"
=> "hello"
Now, I haven't saved a yet, so in the database its body is still ?. How can I find out what a's body was before I changed it?
I guess I could do Annotation.find(a.id).body, but I wonder if there's a cleaner way (e.g., one that doesn't do a DB query).
a.body_was
You can also check to see if it's dirty with a.changed? and/or a.body_changed?
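A quick console sketch of how the dirty-tracking methods behave, continuing the example from the question:
>> a = Annotation.first
>> a.body
=> "?"
>> a.body = "hello"
=> "hello"
>> a.body_was
=> "?"
>> a.body_changed?
=> true
>> a.changed?
=> true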
