Trying to navigate XML file using nokogiri and xpath - ruby-on-rails

I have an XML file downloaded from: https://www.ecb.europa.eu/stats/eurofxref/eurofxref-daily.xml
What I'm trying to do is navigate through the currencies, so that I can save them in my database.
I have:
require 'open-uri'

open('app/assets/forex/eurofxref-daily.xml', 'wb') do |file|
  file << open('https://www.ecb.europa.eu/stats/eurofxref/eurofxref-daily.xml').read
end
then
doc = File.open("app/assets/forex/eurofxref-daily.xml") { |f| Nokogiri::XML(f) }
I am having a hard time accessing the nodes I'm interested in to extract currencies and values.

I'm not familiar with Nokogiri, but from this tutorial, it looks like you can apply the XPath /*/e:Cube/e:Cube/e:Cube to select all of the innermost Cube elements (the root Envelope contains a wrapper Cube, then a dated Cube, then one Cube per currency).
From there, you can iterate over each of the Cube elements and select their @currency and @rate attributes:
doc = Nokogiri::XML(open("https://www.ecb.europa.eu/stats/eurofxref/eurofxref-daily.xml"))
doc.xpath('/*/e:Cube/e:Cube/e:Cube', 'e' => 'http://www.ecb.int/vocabulary/2002-08-01/eurofxref').each do |node|
  currency = node.attr('currency')
  rate = node.attr('rate')
  # do stuff with currency and rate
end
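Putting it together, here is a minimal end-to-end sketch; ExchangeRate is a hypothetical model with currency and rate columns, so adjust it to whatever table you're saving into:

require 'open-uri'
require 'nokogiri'

ECB_NS  = 'http://www.ecb.int/vocabulary/2002-08-01/eurofxref'
ECB_URL = 'https://www.ecb.europa.eu/stats/eurofxref/eurofxref-daily.xml'

# URI.open is the modern open-uri entry point; plain open() also works on older Rubies
doc = Nokogiri::XML(URI.open(ECB_URL))
doc.xpath('/*/e:Cube/e:Cube/e:Cube', 'e' => ECB_NS).each do |node|
  # ExchangeRate is a hypothetical model; adapt to your schema
  ExchangeRate.create!(currency: node.attr('currency'),
                       rate: node.attr('rate').to_f)
end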

Related

Slow XML generation from a bunch of model objects

class GenericFormatter < Formatter
  attr_accessor :tag_name, :objects

  def generate_xml
    builder = Nokogiri::XML::Builder.new do |xml|
      xml.send(tag_name.pluralize) {
        objects.each do |obj|
          xml.send(tag_name.singularize) {
            self.generate_obj_row obj, xml
          }
        end
      }
    end
    builder.to_xml
  end

  def initialize(tag_name, objects)
    self.tag_name = tag_name
    self.objects = objects
  end

  def generate_obj_row(obj, xml)
    obj.attributes.except("updated_at").map do |key, value|
      xml.send(key, value)
    end
    xml.updated_at obj.updated_at.try(:strftime, "%m/%d/%Y %H:%M:%S") if obj.attributes.key?('updated_at')
  end
end
In the above code, I have implemented a formatter that uses the Nokogiri XML Builder to generate XML from the objects passed in. It generates the XML quickly when the data set is small, but with larger data (more than 10,000 records, say) it slows down and takes at least 50-60 seconds.
Problem: Is there any way to generate the XML faster? I have tried XML builders in the view as well, but that didn't work. How can I generate the XML faster? The solution should run on Rails 3; suggestions to optimize the above code are welcome.
Your main problem is processing everything in one go instead of splitting your data into batches. That requires a lot of memory: first to build all those ActiveRecord models, and then to build an in-memory representation of the whole XML document. Metaprogramming is also quite expensive (I mean those send methods).
Take a look at this code:
class XmlGenerator
  attr_accessor :tag_name, :ar_relation

  def initialize(tag_name, ar_relation)
    @ar_relation = ar_relation
    @tag_name = tag_name
  end

  def generate_xml
    singular_tag_name = tag_name.singularize
    plural_tag_name = tag_name.pluralize
    xml = ""
    xml << "<#{plural_tag_name}>"
    ar_relation.find_in_batches(batch_size: 1000) do |batch|
      batch.each do |obj|
        xml << "<#{singular_tag_name}>"
        obj.attributes.except("updated_at").each do |key, value|
          xml << "<#{key}>#{value}</#{key}>"
        end
        if obj.attributes.key?("updated_at")
          xml << "<updated_at>#{obj.updated_at.strftime('%m/%d/%Y %H:%M:%S')}</updated_at>"
        end
        xml << "</#{singular_tag_name}>"
      end
    end
    xml << "</#{plural_tag_name}>"
    xml
  end
end
# example usage
XmlGenerator.new("user", User.where("age < 21")).generate_xml
Major improvements are:
fetching data from the database in batches; you need to pass an ActiveRecord relation instead of an array of ActiveRecord models
generating XML by constructing strings; this carries a risk of producing invalid XML (values are not escaped), but it is much faster than using Builder
I tested it on over 60k records, and it took around 40 seconds to generate the XML document.
There is much more that can be done to improve this even further, but it all depends on your application.
Here are some ideas:
do not use ActiveRecord to fetch the data; instead use a lighter library or a plain database driver
fetch only the data that you need
tweak the batch size
write the generated XML directly to a file (if that is your use case) to save memory; see the sketch below
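For that last idea, here is a minimal sketch of streaming each batch straight to disk; the file name is hypothetical, and CGI.escapeHTML is added to guard against the invalid-XML risk mentioned above:

require 'cgi'

# Stream batches to a file instead of accumulating one big string in memory.
File.open('users.xml', 'w') do |f|
  f << '<users>'
  User.where('age < 21').find_in_batches(batch_size: 1000) do |batch|
    batch.each do |obj|
      f << '<user>'
      obj.attributes.each do |key, value|
        # escape values so characters like & and < don't break the XML
        f << "<#{key}>#{CGI.escapeHTML(value.to_s)}</#{key}>"
      end
      f << '</user>'
    end
  end
  f << '</users>'
end

This keeps memory flat, since nothing larger than a single batch is ever held at once.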
The Nokogiri gem has a nice interface for creating XML from scratch. Nokogiri is a wrapper around libxml2.
Add it to your Gemfile:
gem 'nokogiri'
To generate XML, simply use the Nokogiri XML Builder like this:
xml = Nokogiri::XML::Builder.new { |xml|
  xml.body do
    xml.test1 "some string"
    xml.test2 890
    xml.test3 do
      xml.test3_1 "some string"
    end
    xml.test4 "with attributes", :attribute => "some attribute"
    xml.closing
  end
}.to_xml
Output:
<?xml version="1.0"?>
<body>
  <test1>some string</test1>
  <test2>890</test2>
  <test3>
    <test3_1>some string</test3_1>
  </test3>
  <test4 attribute="some attribute">with attributes</test4>
  <closing/>
</body>
Demo: http://www.jakobbeyer.de/xml-with-nokogiri

How to fix slow Nokogiri parsing

I have a Rake task in my Rails app which looks in a folder for an XML file, parses it, and saves it to a database. The code works OK, but I have about 2,100 files totaling 1.5 GB, and processing is very slow: about 400 files in 7 hours. There are approximately 600-650 contracts in each XML file, and each contract can have 0 to n attachments. I did not paste all the values, but each contract has 25 values.
To speed up the process I use the activerecord-import gem, so I build an array per file and, when the whole file is parsed, do a mass import to Postgres. Only if a record is already found is it directly updated and/or a new attachment inserted, but that happens for maybe 1 out of 100,000 records. This helps a little compared to creating a new record per contract, but now I see that the slow part is the XML parsing. Can you please check whether I am doing something wrong in my parsing?
When I tried to print the arrays I am building, the slow part lasted until the whole file was loaded/parsed, and only then did it start printing array by array. That's why I assume the speed problem is in the parsing, since Nokogiri loads the whole XML before it starts.
require 'nokogiri'
require 'pp'
require 'fileutils'
require "activerecord-import/base"
ActiveRecord::Import.require_adapter('postgresql')

namespace :loadcrz2 do
  desc "this task loads contracts from CRZ XML files into the DB"
  task contracts: :environment do
    actual_dir = File.dirname(__FILE__).to_s
    Dir.foreach(actual_dir+'/../../crzfiles') do |xmlfile|
      next if xmlfile == '.' or xmlfile == '..' or xmlfile == 'archive'
      page = Nokogiri::XML(open(actual_dir+"/../../crzfiles/"+xmlfile))
      puts xmlfile
      cons = page.xpath('//contracts/*')
      contractsarr = []
      @c = []
      cons.each do |contract|
        name = contract.xpath("name").text
        crzid = contract.xpath("ID").text
        procname = contract.xpath("procname").text
        conname = contract.xpath("contractorname").text
        subject = contract.xpath("subject").text
        dateeff = contract.xpath("dateefficient").text
        valuecontract = contract.xpath("value").text
        attachments = contract.xpath('attachments/*')
        attacharray = []
        attachments.each do |attachment|
          attachid = attachment.xpath("ID").text
          attachname = attachment.xpath("name").text
          doc = attachment.xpath("document").text
          size = attachment.xpath("size").text
          arr = [attachid, attachname, doc, size]
          attacharray.push arr
        end
        @con = Crzcontract.find_by_crzid(crzid)
        if @con.nil?
          @c = Crzcontract.new(:crzname => name, :crzid => crzid, :crzprocname => procname, :crzconname => conname, :crzsubject => subject, :dateeff => dateeff, :valuecontract => valuecontract)
        else
          @con.crzname = name
          @con.crzid = crzid
          @con.crzprocname = procname
          @con.crzconname = conname
          @con.crzsubject = subject
          @con.dateeff = dateeff
          @con.valuecontract = valuecontract
          @con.save!
        end
        attacharray.each do |attar|
          attachid = attar[0]
          attachname = attar[1]
          doc = attar[2]
          size = attar[3]
          @at = Crzattachment.find_by_attachid(attachid)
          if @at.nil?
            if @con.nil?
              @c.crzattachments.build(:attachid => attachid, :attachname => attachname, :doc => doc, :size => size)
            else
              @a = Crzattachment.new
              @a.attachid = attachid
              @a.attachname = attachname
              @a.doc = doc
              @a.size = size
              @a.crzcontract_id = @con.id
              @a.save!
            end
          end
        end
        if @c.present?
          contractsarr.push @c
        end
        #p @c
      end
      #p contractsarr
      puts "done"
      if contractsarr.present?
        Crzcontract.import contractsarr, recursive: true
      end
      FileUtils.mv(actual_dir+"/../../crzfiles/"+xmlfile, actual_dir+"/../../crzfiles/archive/"+xmlfile)
    end
  end
end
There are a number of problems with the code. Here are some ways to improve it:
actual_dir = File.dirname(__FILE__).to_s
Don't use to_s; dirname already returns a string.
actual_dir + '/../../crzfiles', with and without a trailing path delimiter, is used repeatedly. Don't make Ruby rebuild the concatenated string over and over. Instead, define it once, taking advantage of Ruby's ability to build the full path:
File.absolute_path('../../bar', '/path/to/foo') # => "/path/bar"
So use:
actual_dir = File.absolute_path('../../crzfiles', __FILE__)
and then refer to actual_dir only:
Dir.foreach(actual_dir)
This is unwieldy:
next if xmlfile == '.' or xmlfile == '..' or xmlfile == 'archive'
I'd do:
next if (xmlfile[0] == '.' || xmlfile == 'archive')
or even:
next if xmlfile[/^(?:\.|archive)/]
Compare these:
'.hidden'[/^(?:\.|archive)/] # => "."
'.'[/^(?:\.|archive)/] # => "."
'..'[/^(?:\.|archive)/] # => "."
'archive'[/^(?:\.|archive)/] # => "archive"
'notarchive'[/^(?:\.|archive)/] # => nil
'foo.xml'[/^(?:\.|archive)/] # => nil
The pattern will return a truthy value if it starts with '.' or is equal to 'archive'. It's not as readable but it's compact. I'd recommend the compound conditional test though.
In some places, you're concatenating xmlfile, so again let Ruby do it once:
xml_filepath = File.join(actual_dir, xmlfile)
which will honor the file path delimiter for whatever OS you're running on. Then use xml_filepath instead of concatenating the name:
xml_filepath = File.join(actual_dir, xmlfile)
page = Nokogiri::XML(open(xml_filepath))
[...]
FileUtils.mv(xml_filepath, File.join(actual_dir, "archive", xmlfile))
join is a good tool so take advantage of it. It's not just another name for concatenating strings, because it's also aware of the correct delimiter to use for the OS the code is running on.
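A quick illustration of what join produces:

File.join('app', 'assets', 'forex') # => "app/assets/forex"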
You use a lot of instances of:
xpath("some_selector").text
Don't do that. xpath, along with css and search, returns a NodeSet, and text, when used on a NodeSet, can be evil in a way that'll hurtle you down a very steep and slippery slope. Consider this:
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<root>
  <node>
    <data>foo</data>
  </node>
  <node>
    <data>bar</data>
  </node>
</root>
EOT
doc.search('//node/data').class # => Nokogiri::XML::NodeSet
doc.search('//node/data').text # => "foobar"
The concatenation of the text into 'foobar' can't be split apart easily, and it's a problem we see in questions here too often.
Do this if you expect getting a NodeSet back because of using search, xpath or css:
doc.search('//node/data').map(&:text) # => ["foo", "bar"]
It's better to use at, at_xpath or at_css if you're after a specific node because then text will work as you'd expect.
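For example, with the document defined above:

doc.at('//node/data').text      # => "foo"
doc.at('//node[2]/data').text   # => "bar"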
See "How to avoid joining all text from Nodes when scraping" also.
There's a lot of replication that could be DRYed up. Instead of this:
name = contract.xpath("name").text
crzid = contract.xpath("ID").text
procname = contract.xpath("procname").text
You could do something like:
name, crzid, procname = [
  'name', 'ID', 'procname'
].map { |s| contract.at(s).text }
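One more thought on the speed complaint itself: since Nokogiri's DOM parser really does load the whole file before you can touch it, a streaming parse can help at this file size. A rough sketch using Nokogiri::XML::Reader, assuming the child elements of <contracts> are named <contract> (adjust the name to match your files):

require 'nokogiri'

reader = Nokogiri::XML::Reader(File.open(xml_filepath))
reader.each do |node|
  next unless node.name == 'contract' &&
              node.node_type == Nokogiri::XML::Reader::TYPE_ELEMENT
  # parse just this one contract subtree as a small document
  contract = Nokogiri::XML(node.outer_xml).root
  name  = contract.at('name').text
  crzid = contract.at('ID').text
  # ...build the row and push it onto contractsarr as before
end

This way only one contract's worth of XML is materialized at a time, instead of the whole 700 KB+ document.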

Finding correct row in data extracted from PDF to text

I am trying to get data out of PDF files, convert them to CSV, and then organize everything into one table.
A sample pdf can be found here
https://www.ttb.gov/statistics/2011/201101wine.pdf
It's data on US wine production. So far, I have been able to get the PDF files and convert them into CSV.
Here is the CSV file that has been converted from PDF:
https://github.com/jjangmes/wine_data/blob/master/csv_data/201101wine.txt
However, when I try to find data by row, it's not working.
require 'csv'
csv_in = CSV.read('201001wine.txt', :row_sep => :auto, :col_sep => ";")
csv_in.each do |line|
  puts line
end
When I puts line[0], I get the entire dataset printed, so it looks like all of the data is shoved into row[0].
line prints all the data.
line[0] prints all the data with a space between lines.
line[1] gives the error "/bin/bash: shell_session_update: command not found"
How can I correctly divide up the data so I can parse it by row?
This is really messy data with no heading or ID, so I think the best approach is to get the data into CSV and find the data I want by looking up the right row number.
Though not all files have the same number of rows, most do, so I thought that would be the best way for now.
Thanks!
Edit 1:
Here is the code that I have to scrape and get the data.
require 'mechanize'
require 'docsplit'
require 'byebug'
require 'csv'
def pdf_to_text(pdf_filename)
  extracted_text = Docsplit.extract_text([pdf_filename], ocr: false, col_sep: ";", output: 'csv_data')
  extracted_text
end

def save_by_year(starting, ending)
  agent = Mechanize.new { |a| a.ssl_version, a.verify_mode = 'TLSv1', OpenSSL::SSL::VERIFY_NONE }
  agent.get('https://www.ttb.gov')
  (starting..ending).each do |year|
    year_page = agent.get("/statistics/#{year}winestats.shtml")
    (1..12).each do |month|
      month_trail = '%02d' % month
      link = year_page.links_with(:href => "20#{year}/20#{year}#{month_trail}wine.pdf").first
      page = agent.click(link)
      File.open(page.filename.gsub(" /", "_"), 'w+b') do |file|
        file << page.body.strip
      end
      pdf_to_text("20#{year}#{month_trail}wine.pdf")
    end
  end
end
After converting, I am trying to access the data by opening the text file and then reading it row by row.
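If every parsed row collapses into line[0], the likely culprit is that :col_sep doesn't match what Docsplit actually wrote out. A quick diagnostic sketch (the path is hypothetical; point it at wherever your converted file lives):

# peek at the first raw line to see which delimiter the file really uses
first_line = File.open('csv_data/201101wine.txt', &:readline)
puts first_line.inspect # look for ";", "," or runs of spaces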

How to open, parse and process XML file with Ox gem like with Nokogiri gem?

I want to open an external XML file, parse it, and store the data in my database. This is quite easy with Nokogiri:
file = '...external.xml'
xml = Nokogiri::XML(open(file))
xml.xpath('//Element').each do |element|
  # process elements and save to the database, e.g.:
  @data = Model.new(:attr => element.at('foo').text)
  @data.save
end
Now I want to try the (maybe faster) Ox gem (https://github.com/ohler55/ox), but I can't work out from the documentation how to open and process a file.
Any equivalent code examples for the above code would be awesome! Thank you!
You can't use XPath to locate nodes in Ox, but Ox does provide a locate method. You can use it like so:
xml = Ox.parse(%Q{
  <root>
    <Element>
      <foo>ex1</foo>
    </Element>
    <Element>
      <foo>ex2</foo>
    </Element>
  </root>
}.strip)

xml.locate('Element/foo/^Text').each do |t|
  @data = Model.new(:attr => t)
  @data.save
end

# or if you need to do other stuff with the element first
xml.locate('Element').each do |elem|
  # do stuff
  @data = Model.new(:attr => elem.locate('foo/^Text').first)
  @data.save
end
If your query doesn't find any matches, it will return an empty array. For a brief description of the locate query parameter, see the source code at element.rb.
From the documentation:
doc2 = Ox.parse(xml)
To read the contents of a file in Ruby you can use xml = IO.read('filename.xml') (among others). So:
doc = Ox.parse(IO.read(filename))
If your XML file is UTF-8 encoded, then alternatively:
doc = Ox.parse( File.open(filename,"r:UTF-8",&:read) )

Array to XML -- Rails

I have a multi-dimensional array that I'd like to use for building an xml output.
The array stores a CSV import, where people[0][...] holds the column names that will become the XML tags, and people[n][...] for n > 0 holds the values.
For instance, array contains:
people[0][0] => first-name
people[0][1] => last-name
people[1][0] => Bob
people[1][1] => Dylan
people[2][0] => Sam
people[2][1] => Shepard
XML needs to be:
<person>
  <first-name>Bob</first-name>
  <last-name>Dylan</last-name>
</person>
<person>
  <first-name>Sam</first-name>
  <last-name>Shepard</last-name>
</person>
Any help is appreciated.
I suggest using FasterCSV to import your data and to convert it into an array of hashes. That way to_xml should give you what you want:
people = []
FasterCSV.foreach("yourfile.csv", :headers => true) do |row|
  people << row.to_hash
end
people.to_xml
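By default Array#to_xml wraps everything in <objects>/<object> tags; if I remember the Rails behavior correctly, passing a :root makes the child tags its singular form, which gives the <person> elements from the question (worth double-checking on your Rails version):

people.to_xml(:root => 'people', :skip_types => true)
# => <people><person><first-name>Bob</first-name>...</person>...</people>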
There are two main ways I can think of to achieve this: one using an XML serializer, the other by pushing out the raw string.
Here's an example of the second:
xml = ''
1.upto(people.size - 1) do |row_idx|
  xml << "<person>\n"
  people[0].each_with_index do |column, col_idx|
    xml << "  <#{column}>#{people[row_idx][col_idx]}</#{column}>\n"
  end
  xml << "</person>\n"
end
Another way:
hash = {}
hash['person'] = []
1.upto(people.size - 1) do |row_idx|
  row = {}
  people[0].each_with_index do |column, col_idx|
    row[column] = people[row_idx][col_idx]
  end
  hash['person'] << row
end
hash.to_xml
Leaving this answer here in case someone needs to convert an array like this that didn't come from a CSV file (or if they can't use FasterCSV).
Using Hash#to_xml is a good idea, since it's supported by Rails core. It's probably the simplest way to export hash-like data to simple XML, and it covers most simple cases; more complex cases require more complex tools.
Thanks to everyone that posted. Below is the solution that seems to work best for my needs. Hopefully others may find this useful.
This solution grabs a remote url csv file, stores it in a multi-dimensional array, then exports it as xml:
require 'rio'
require 'fastercsv'
url = 'http://remote-url.com/file.csv'
people = FasterCSV.parse(rio(url).read)
xml = ''
1.upto(people.size - 1) do |row_idx|
  xml << "  <record>\n"
  people[0].each_with_index do |column, col_idx|
    xml << "    <#{column.parameterize}>#{people[row_idx][col_idx]}</#{column.parameterize}>\n"
  end
  xml << "  </record>\n"
end
There are better solutions out there; using hash.to_xml would have been great, except that I needed to run the CSV header values through parameterize to make them usable as XML tags. But this code works, so I'm happy.
