Rails Microsoft Word, XML databinding, repeat rows - ruby-on-rails

Those willing to jump straight to my questions can go to the paragraph "Please help with". There you will find my beginning of an implementation, along with short XML samples.
The story
The famous problem of inserting repeating content, like table rows, into a word template, using the rails framework.
I decided to implement a 'cleaner' solution for replacing some variables in a Word document with rails, using XML databinding. This solution works very well for non-repetitive content, but for repetitive content, a little extra dirty work must be done and I need help with it.
No C#, no Visual Studio, just plain olde Ruby on Rails & XML
The databinded document
I have a Word document with some content controls, tagged with "human-readable" text, so my users know what should be inside.
I have used Word 2007 Content Control Toolkit to add some custom XML to a .docx file. Therefore in each .docx I have some customXml/itemsx.xml that contains my custom XML.
I have manually databound this XML to the text content controls I have in my Word template, using drag & drop with the Word 2007 Content Control Toolkit.
The replacing process with nokogiri
Basically I already have some code that replaces every XML node by the corresponding value from a hash. For example if I provide this hash to my function :
variables = {
  "some_xml-node" => "some_value"
}
It will properly replace the XML in customXml/itemsx.xml of the .docx file:
<root> <some> <xml-node>some_value</xml-node></some> </root>
So this is taken care of !
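For reference, a minimal sketch of what such a replacement function might look like with rubyzip and Nokogiri. The customXml/item1.xml path and the assumption that the hash maps an XPath to its replacement value are mine; the real key format depends on how the hash keys encode the node path.
require 'nokogiri'

# Sketch only: assumes the bound data lives in customXml/item1.xml and that
# 'variables' maps an XPath string to the replacement text.
def replace_custom_xml(zip, variables)
  entry = zip.find_entry("customXml/item1.xml")
  xml = Nokogiri::XML(entry.get_input_stream)
  variables.each do |xpath, value|
    node = xml.at_xpath(xpath)
    node.content = value if node
  end
  xml.to_xml
end

# replace_custom_xml(zip, { "/root/some/xml-node" => "some_value" })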
The repetitive content
Now as I said, this works perfectly for non-repetitive content. For repetitive content (in my case I want to repeat some <w:tr> in a document), the solution I'd like to go with, is
Manually insert some tags in word/document.xml of .docx file (this is dirty, but hell I can't think of anything else) before every <tr> that needs to be duplicated
In rails, parse the XML and locate the <tr> that needs duplicating using Nokogiri
Copy the tr as many times as I need
Look at some text inside this <tr>, find the databinding (which looks like <w:dataBinding w:xpath="/root[1]/movies[1]/movie[1]/name[1]"/>)
Replace movie[1] by movie[index]
Repeat for every table that needs <tr> duplication
With this solution I therefore ensure 100% compatibility with my existing system! It's some kind of preprocessing...
Please help with
1. Finding an XML comment containing a custom string, and selecting the node just below it (using Nokogiri)
2. Changing attributes in many sub-nodes of the node found in 1.
XML/Hash samples that could be used (my beginning of implementation after that):
Sample of .docx word/document.xml
<w:document>
  <!-- My_Custom_Tag_ID -->
  <w:tr someparam="something">
    <w:td></w:td>
    <w:td><w:sthelse></w:sthelse><w:dataBinding w:xpath="/root[1]/movies[1]/movie[1]/name[1]"/><w:sth>Value</w:sth></w:td>
    <w:td></w:td>
  </w:tr>
</w:document>
Sample of input parameter repeat_tag hash
repeat_tags_sample = [
  {
    "tag" => "My_Custom_Tag_ID",
    "repeatable-content" => "movie"
  },
  {
    "tag" => "My_Custom_Tag_ID_2",
    "repeatable-content" => "cartoons"
  }
]
Sample of input parameter contents hash
contents_sample =
{
  "movies" => [{ "name" => "X-Men",
                 "year" => 1998,
                 "property-xxx" => 42
               },
               { "name" => "X-Men-4",
                 "year" => 2007,
                 "property-xxx" => 42
               }],
  "cartoons" => [{ "name" => "Tom_Jerry",
                   "year" => 1995,
                   "property-yyy" => "cat"
                 },
                 { "name" => "Random_name",
                   "year" => 2008,
                   "property-yyy" => 42
                 }]
}
My beginning of implementation :
def dynamic_table_content(zip, repeat_tags, contents)
  doc = zip.find_entry("word/document.xml")
  xml = Nokogiri::XML.parse(doc.get_input_stream)
  # repeat_tags_sample = [ {
  #   "tag" => "My_Custom_Tag_ID",
  #   "repeatable-content" => "movie"},
  #   ...]
  repeat_tags.each do |rpt|
    content = contents[rpt["repeatable-content"]]
    # content now looks like [
    #   {"name" => "X-Men",
    #    "year" => 1998,
    #    "property-xxx" => 42, ...},
    #   ...]
    content_name = rpt["repeatable-content"].to_s
    # the 'movie' of '/root[1]/movies[1]/movie[1]/name[1]' (see below)
    puts "Processing #{rpt['tag']}, adding #{content_name}s"
    # Word document.xml sample code looks like this :
    # <!-- My_Custom_Tag_ID_inserted_manually -->
    # <w:tr ...>
    #   ...
    #   <w:dataBinding w:xpath="/root[1]/movies[1]/movie[1]/name[1]"/>
    #   ...
    # </w:tr>

    # 1. Find a comment containing a custom string, and select the node just below
    # Find the starting <w:tr> tag located after <!-- rpt["tag"] -->
    base_tr_node = nil # TODO: find the node after the comment (this is question 1.)

    # Duplicate it as many times as we want.
    content.each_with_index do |item, index|
      puts "Adding #{content_name} : #{item}"
      new_tr_node = base_tr_node.add_next_sibling(base_tr_node.dup)
      # inside this new node there are many
      # <w:dataBinding w:xpath="/root[1]/movies[1]/movie[1]/name[1]"/>
      # <w:dataBinding w:xpath="/root[1]/movies[1]/movie[1]/year[1]"/>
      # ..../movie[1]/property-xxx[1]
      # GOAL : replace every movie[1] by movie[index]

      # 2. Change attributes in many sub-nodes of the node found in 1. (this is question 2.)
      # new_tr_node.change_attributes ... (see GOAL in previous comments)
      # Maybe, it would be something like
      #   new_tr_node.gsub("(#{content_name})\[([1-9]+)\]", "\1\[#{index}\]")
      # ... but new_tr_node is a Nokogiri element, so .gsub doesn't exist
    end
  end
  #replace["word/document.xml"] = xml.serialize :save_zip_with => 0
end

I have looked at the DoPE extension for Word documents. It looks great! But alas I had already done a lot of work, and by now I have (almost) finished building my own preprocessor.
What I needed was more complicated than what I originally asked. But nevertheless, the answers would be :
EDIT : fixed bad regex/xpath
# 1. Find a comment containing a custom string, and select the node just below
comment_nodes = doc.xpath("//comment()")
# Loop like comment_nodes.each do |comment|
base_tr_node = comment.next_sibling.next_sibling
# next_sibling has to be applied twice, most likely because the first sibling is
# the whitespace text node sitting between the comment and the <w:tr> node
# 2. Change attributes in many sub-nodes of the node found in 1.
matches = base_tr_node.search(".//*[name()='w:dataBinding']")
matches.each do |databinding_node|
  # replace '.*<comment text>[1].*' by '.*<comment text>[index].*'
  databinding_node['w:xpath'] = databinding_node['w:xpath'].gsub("#{comment.text}[1]", "#{comment.text}[#{index}]")
end
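Putting the pieces together, a rough sketch of the whole preprocessing pass could look like the following. Assumptions (mine, not from the original code): the comment text contains the tag from repeat_tags, the "repeatable-content" value is both the contents key and the element name used in the dataBinding XPaths (in the samples above "movie" vs "movies" differ, so adjust the lookup to your real data), and the original template row is removed once the copies have been added.
def preprocess_repeating_rows(xml, repeat_tags, contents)
  xml.xpath("//comment()").each do |comment|
    rpt = repeat_tags.find { |r| comment.text.include?(r["tag"]) }
    next unless rpt

    name = rpt["repeatable-content"]
    rows = contents[name] || []                  # adjust if your contents key differs (e.g. "movies")
    base_tr = comment.next_sibling.next_sibling  # skip the whitespace text node after the comment

    rows.each_with_index do |_row, index|
      new_tr = base_tr.add_previous_sibling(base_tr.dup)
      new_tr.xpath(".//*[name()='w:dataBinding']").each do |db_node|
        db_node["w:xpath"] = db_node["w:xpath"].gsub("#{name}[1]", "#{name}[#{index + 1}]")
      end
    end
    base_tr.remove  # drop the original template row
  end
end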

Related

Read multiple concatenated json objects in Ruby

I have a file that contains multiple JSON objects that are not separated by commas:
{
"field" : "value",
"another_field": "another_value"
} // no comma
{
"field" : "value"
}
Each of the objects standalone is a valid json object.
Is there a way that I can process this file easily?
I know this is NOT valid JSON, but unfortunately this file is being generated by a 3rd party tool. I have no option of changing the way the output looks.
I can't open a text editor and smart-insert commas / square brackets before the run, since this is an automated process (I also really don't want to write code that opens the file and manipulates it).
In .NET there's a library that has this exact feature :
https://stackoverflow.com/a/29480032/2970729
https://www.newtonsoft.com/json/help/html/P_Newtonsoft_Json_JsonReader_SupportMultipleContent.htm
Is there anything equivalent in Ruby?
As long as your file is that simple you might want to do something like this:
# content = File.read(filename)
content =<<-EOF
{
"field" : "value",
"another_field": "another_value"
} // no comma
{
"field" : "value"
}
EOF
require 'json'
JSON.parse("[#{content.gsub(/\}.*?\{/m, '},{')}]")
#=> [{"field"=>"value", "another_field"=>"another_value"}, {"field"=>"value"}]
The yajl-ruby gem enables processing concatenated JSON in Ruby. The parser can read from a String or an IO. Each complete object is yielded to a block.
require 'yajl'
File.open 'file.json' do |f|
  Yajl.load f do |object|
    # do something with object
  end
end
See the documentation for other options (buffer size, symbolized keys, etc).

Rails Roo gem .xlsx output contains the object not the output of method

I am using the Roo gem to output a spreadsheet from a Rails app. One of my columns is a hash (Postgres DB). I would like to format the cell contents into something more readable. I am using a method to return a human readable cell.
The column data looks like this:
Inspection.first.results
=> {"soiled"=>"oil on back",
"assigned_to"=>"Warehouse#firedatasolutions.com",
"contaminated"=>"blood on left cuff",
"inspection_date"=>"01/01/2017",
"physical_damage_seam_integrity"=>"",
"physical_damage_thermal_damage"=>"",
"physical_damage_reflective_trim"=>"",
"physical_damage_rips_tears_cuts"=>"small tear on right sleeve",
"correct_assembly_size_compatibility_of_shell_liner_and_drd"=>"",
"physical_damage_damaged_or_missing_hardware_or_closure_systems"=>""}
In my Inspections model I defined the following method:
def print_results
  self.results.each do |k, v|
    puts "#{k.titleize}:#{v.humanize}\r\n"
  end
end
So in the console I get this:
Inspection.first.print_results
Soiled:Oil on back
Assigned To:Warehouse
Contaminated:Blood on left cuff
Inspection Date:01/01/2017
Physical Damage Seam Integrity:
Physical Damage Thermal Damage:
Physical Damage Reflective Trim:
Physical Damage Rips Tears Cuts:Small tear on right sleeve
Correct Assembly Size Compatibility Of Shell Liner And Drd:
Physical Damage Damaged Or Missing Hardware Or Closure Systems:
=> {"soiled"=>"oil on back",
"assigned_to"=>"Warehouse",
"contaminated"=>"blood on left cuff",
"inspection_date"=>"01/01/2017",
"physical_damage_seam_integrity"=>"",
"physical_damage_thermal_damage"=>"",
"physical_damage_reflective_trim"=>"",
"physical_damage_rips_tears_cuts"=>"small tear on right sleeve",
"correct_assembly_size_compatibility_of_shell_liner_and_drd"=>"",
"physical_damage_damaged_or_missing_hardware_or_closure_systems"=>""}
But when I put this in the index.xlsx.axlsx file
wb = xlsx_package.workbook
wb.add_worksheet(name: "Inspections") do |sheet|
  sheet.add_row ['Serial Number', 'Category', 'Inspection Type', 'Date',
                 'Pass/Fail', 'Assigned To', 'Inspected By', 'Inspection Details']
  @inspections.each do |inspection|
    sheet.add_row [inspection.ppe.serial, inspection.ppe.category,
                   inspection.advanced? ? 'Advanced' : 'Routine',
                   inspection.results['inspection_date'],
                   inspection.passed? ? 'Pass' : 'Fail',
                   inspection.ppe.user.last_first_name,
                   inspection.user.last_first_name,
                   inspection.print_results]
  end
end
The output in the spreadsheet is the original hash, not the results of the print statement.
{"soiled"=>"oil on back",
"assigned_to"=>"Warehouse",
"contaminated"=>"blood on left cuff", "inspection_date"=>"01/01/2017",
"physical_damage_seam_integrity"=>"",
"physical_damage_thermal_damage"=>"",
"physical_damage_reflective_trim"=>"",
"physical_damage_rips_tears_cuts"=>"small tear on right sleeve",
"correct_assembly_size_compatibility_of_shell_liner_and_drd"=>"",
"physical_damage_damaged_or_missing_hardware_or_closure_systems"=>""}
Is it possible to get the output of the method into the cell rather than the hash object?
The problem is that your print_results method prints out what you want to stdout (that is, the console), but still returns the original hash. The return value of the method is all that matters to Roo.
What you want to do is rewrite print_results to return the formatted string:
def print_results
  self.results.map do |k, v|
    "#{k.titleize}:#{v.humanize}\r\n"
  end.join
end
This will return a string (note the use of .join to combine the array of strings returned by .map) that you can throw into Roo and get your desired output.
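With the sample hash from the question, the rewritten method would then return a single string along these lines (truncated here), which is what ends up in the spreadsheet cell:
Inspection.first.print_results
#=> "Soiled:Oil on back\r\nAssigned To:Warehouse\r\nContaminated:Blood on left cuff\r\n..."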

When reading CSV files the output is strange

I have a CSV file which looks like this:
1ttAAAttAnaattFrench PolynesiattPFttAustralia and Oceaniatt-17.352606tt-145.509956
2ttAAEttAnnabattAlgeriattDZttAfricatt36.822225tt7.809167
3ttAAFttApalachicolattUnited StatesttUSttNorth Americatt29.7276066tt-85.0274416
4ttAAGtt\NttBrazilttBRttSouth Americatt\Ntt\N
I use this gem to fetch data: https://github.com/tilo/smarter_csv
This is the code I use to show data in terminal console:
filename = 'db/csv/airports_codes.csv'
options = {
  :col_sep => 'tt',
}
records = SmarterCSV.process(filename, options)
puts records
I put these lines in the seeds.rb file because I will modify this code later to seed my database with data. The last line is there so I can see what the data looks like. So I run rake db:seed
And the output is obviously huge because there are around ~5k lines. Now the first problem is that I can't see all of the data in my terminal. When I scroll to the top this is the first item (note that ID is 4674 which means it displayed last ~250 items):
{:"1"=>4674, :aaa=>"YPJ", :anaa=>"Aupaluk", :french_polynesia=>"Canada", :pf=>"CA", :australia_and_oceania=>"North America", :"_17.352606"=>59.2967, :"_145.509956"=>-69.5997}
How do I see others items?
The second problem is that key names are really weird. How do I rename them, or even better, how do I use arrays instead of hashes?
If you set the option
:headers_in_file => false
in options, that should sort the problem out.
i.e.
filename = 'db/csv/airports_codes.csv'
options = {
  :col_sep => 'tt',
  :headers_in_file => false
}
records = SmarterCSV.process(filename, options)
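If you would rather work with plain arrays instead of hashes, one alternative is Ruby's standard CSV library instead of SmarterCSV (a sketch; it assumes the separator really is the literal two-character string 'tt', as in the question):
require 'csv'

rows = CSV.read('db/csv/airports_codes.csv', col_sep: 'tt')
rows.first
#=> ["1", "AAA", "Anaa", "French Polynesia", "PF", "Australia and Oceania", "-17.352606", "-145.509956"]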

In need of an explanation of Web scraping with Nokogiri in Rails

I am utterly confused and lost with Nokogiri and web scraping in Rails. I need someone to explain to me how I can get article titles from a web site to list in a view in my Rails application. I can manage to retrieve the data in irb however I have no clue how I can get that same data to be displayed in a view I made.
I have watched a number of tutorials and read documentation, and one thing they do that confuses me the most is when they require nokogiri or open-uri in their example Ruby file: what directory is that Ruby file supposed to be placed in? Also, is that file associated with any controller for it to be displayed in the particular view that I made?
I hope I am explaining my issue as clearly as possible without any confusion, as I am not trying to confuse myself any more than I already am in my explanation.
See, what I am trying to do is create an application where the user can register and sign in, after they are signed in they are redirected to a page with 3 links. Those links being Audi, BMW and Mercedes-Benz and depending on which link is clicked the user will be then directed to another page where they are returned back a list of articles that mention their desired choice.
I hope this explanation was helpful and I really hope someone can offer to help or give me some kind of documentation that will benefit me.
Thank you!
This is what I did in irb:
2.1.1 :001 > require 'rubygems'
=> false
2.1.1 :002 > require 'nokogiri'
=> true
2.1.1 :003 > require 'open-uri'
=> true
2.1.1 :004 > page = Nokogiri::HTML(open("http://www.dtm.com/de/News/Archiv/index.html"))
I then got this returned:
=> #<Nokogiri::HTML::Document:0x814e3b40 name="document" children=[#<Nokogiri::XML::DTD:0x814e37f8 name="HTML">, #<Nokogiri::XML::Element:0x814e358c name="html" children=[#<Nokogiri::XML::Text:0x814e3384 "\r\n">, #<Nokogiri::XML::Element:0x814e32d0 name="head" children=[#<Nokogiri::XML::Text:0x814e30f0 "\r\n">, #<Nokogiri::XML::Element:0x814e3028 name="title" children=[#<Nokogiri::XML::Text:0x814e2e48 "DTM | Newsarchiv">]>, #<Nokogiri::XML::Text:0x814e2c90 "\r\n">, #<Nokogiri::XML::Element:0x814e2bc8 name="meta" attributes=[#<Nokogiri::XML::Attr:0x814e2b64 name="charset" value="utf-8">]>, #<Nokogiri::XML::Text:0x814e2718 "\r\n">, #<Nokogiri::XML::Element:0x814e2664 name="meta" ...
(I got more but just put up a few lines of what was returned) I am assuming this is the raw data from the page.
I then put:
2.1.1 :008 > puts page
Which returned the raw HTML content.
Finally I entered:
2.1.1 :014 > page.css("a")
Which returned all the links on the page.
I am hoping to help you with a real-world example. Let's get some data from Reuters, for example.
In your console try this:
# require your tools make sure you have gem install nokogiri
pry(main)> require 'nokogiri'
pry(main)> require 'open-uri'
# set the url
pry(main)> url = "http://www.reuters.com/finance/stocks/overview?symbol=0005.HK"
# load and assign to a variable
pry(main)> doc = Nokogiri::HTML(open(url))
# take a piece of the site that has an element style .sectionQuote you can use ids also
pry(main)> quote = doc.css(".sectionQuote")
Now if you have a look inside quote you will see that you have Nokogiri elements. Let's have a look inside:
pry(main)> quote.size
=> 6
pry(main)> quote.first
=> #(Element:0x43ff468 {
name = "div",
attributes = [ #(Attr:0x43ff404 { name = "class", value = "sectionQuote nasdaqChange" })],
children = [
#(Text "\n\t\t\t"),
#(Element:0x43fef18 {
name = "div",
attributes = [ #(Attr:0x43feeb4 { name = "class", value = "sectionQuoteDetail" })],
children = [
#(Text "\n\t\t\t\t"),
#(Element:0x43fe9c8 { name = "span", attributes = [ #(Attr:0x43fe964 { name = "class", value = "nasdaqChangeHeader" })], children = [ #(Text "0005.HK on Hong Kong Stock")] }),
.....
}),
#(Text "\n\t\t")]
})
You can see that nokogiri has essentially encapsulated each DOM element, so that you can search and access it quickly.
if you want to just simply display this div element you can:
pry(main)> quote.first.to_html
=> "<div class=\"sectionQuote nasdaqChange\">\n\t\t\t<div class=\"sectionQuoteDetail\">\n\t\t\t\t<span class=\"nasdaqChangeHeader\">0005.HK on Hong Kong Stock</span>\n\t\t\t\t<br class=\"clear\"><br class=\"clear\">\n\t\t\t\t<span style=\"font-size: 23px;\">\n\t\t\t\t82.85</span><span>HKD</span><br>\n\t\t\t\t<span class=\"nasdaqChangeTime\">14 Aug 2014</span>\n\t\t\t</div>\n\t\t</div>"
and it is possible to use it directly in the view of a rails application.
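For example, in an ERB view (assuming the quote variable was assigned to @quote in a controller action):
<%= raw @quote.first.to_html %>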
If you want to be more specific and take individual components, traversing one level down by looping over the quote variable, in this instance you can:
pry(main)> quote.each{|p| puts p.inspect}
Or be very specific and get the value of an element, i.e. the name of the stock in our example:
pry(main)> quote.at_css(".nasdaqChangeHeader").content
=> "0005.HK on Hong Kong Stock"
This is a very useful link: http://nokogiri.org/tutorials/searching_a_xml_html_document.html
Really hope this helps
PS: A tip for looking inside objects
(http://ruby-doc.org/core-2.1.1/Object.html#method-i-inspect)
puts quote.inspect
First, you can put nokogiri and open-uri in the Gemfile of your Rails app; with that in place you don't need to require these libraries in your code.
Your flow to scrape the sites should be:
# put this code in your controller
web_site = params[:web_site] # could be http://www.bmw.com/com/en/
@doc = Nokogiri::HTML(open(web_site))

# then you can iterate over the document in your view
<% @doc.css('.standardTeaser').each do |teaser_bmw| %>
  <p><%= teaser_bmw.css('.headline').text %></p>
  <%# other content of the teaser you can search for here %>
<% end %>
So, to scrape the web site you need to fetch the HTML from the site and find the content you want to grab.
If you know some basics of CSS selectors it will be very easy to do. My example doesn't take into account whether you want to save the data in a database... but if you want to, you just need to create a table with the fields you need and then create a record after parsing the HTML.
Does that make sense to you?
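To make the flow more concrete, here is a slightly fuller sketch of how this might be wired up. CarsController, the URL and the CSS selectors are all hypothetical; adapt them to the site you actually scrape.
# app/controllers/cars_controller.rb
require 'open-uri'

class CarsController < ApplicationController
  def articles
    brand = params[:brand]                        # e.g. "bmw", set by the link the user clicked
    url = "http://www.example.com/news/#{brand}"  # hypothetical articles page
    doc = Nokogiri::HTML(open(url))
    @titles = doc.css('.article .headline').map(&:text)
  end
end
and the matching view:
# app/views/cars/articles.html.erb
<ul>
  <% @titles.each do |title| %>
    <li><%= title %></li>
  <% end %>
</ul>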

Generate a link_to on the fly if a URL is found inside the contents of a db text field?

I have an automated report tool (corp intranet) where the admins have a few text area boxes to enter some text for different parts of the email body.
What I'd like to do is parse the contents of the text area and wrap any hyperlinks found with link tags (so when the report goes out there are links instead of text urls).
Is there a simple way to do something like this without figuring out a way of parsing the text to add link tags around anything found from ['http:', 'https:', 'ftp:'] to the first SPACE after it?
Thank You!
Ruby 1.87, Rails 2.3.5
Make a helper :
def make_urls(text)
  urls = %r{(?:https?|ftp|mailto)://\S+}i
  html_text = text.gsub(urls, '<a href="\0">\0</a>')
  html_text
end
On the view just call this function and you will get the expected output, like:
irb(main):001:0> string = 'here is a link: http://google.com'
=> "here is a link: http://google.com"
irb(main):002:0> urls = %r{(?:https?|ftp|mailto)://\S+}i
=> /(?:https?|ftp|mailto):\/\/\S+/i
irb(main):003:0> html = string.gsub urls, '<a href="\0">\0</a>'
=> "here is a link: <a href=\"http://google.com\">http://google.com</a>"
There are many ways to accomplish your goal. One way would be to use Regex. If you have never heard of regex, this wikipedia entry should bring you up to speed.
For example:
content_string = "Blah ablal blabla lbal blah blaha http://www.google.com/ adsf dasd dadf dfasdf dadf sdfasdf dadf dfaksjdf kjdfasdf http://www.apple.com/ blah blah blah."
content_string.split(/\s+/).find_all { |u| u =~ /^https?:/ }
Which will return: ["http://www.google.com/", "http://www.apple.com/"]
Now, for the second half of the problem, you will use the array returned above to subsititue the text links for hyperlinks.
links = ["http://www.google.com/", "http://www.apple.com/"]
links.each do |l|
  content_string.gsub!(l, "<a href='#{l}'>#{l}</a>")
end
content_string will now be updated to contain HTML hyperlinks for all http/https URLs.
As I mentioned earlier, there are numerous ways to tackle this problem - to find the URLs you could also do something like:
require 'uri'
URI.extract(content_string, ['http', 'https'])
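Combining URI.extract with the gsub loop shown earlier gives a compact helper (a sketch; the helper name link_urls is made up here):
require 'uri'

def link_urls(text)
  URI.extract(text, ['http', 'https', 'ftp']).uniq.each do |url|
    text = text.gsub(url, "<a href='#{url}'>#{url}</a>")
  end
  text
end

link_urls("see http://www.google.com/ for details")
#=> "see <a href='http://www.google.com/'>http://www.google.com/</a> for details"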
I hope this helps you.
