How to edit docx with nokogiri and rubyzip - ruby-on-rails

I'm using a combination of rubyzip and nokogiri to edit a .docx file. I'm using rubyzip to unzip the .docx file and then using nokogiri to parse and change the body of the word/document.xml file but ever time I close rubyzip at the end it corrupts the file and I can't open it or repair it. I unzip the .docx file on desktop and check the word/document.xml file and the content is updated to what I changed it to but all the other files are messed up. Could someone help me with this issue? Here is my code:
require 'rubygems'
require 'zip/zip'
require 'nokogiri'
zip = Zip::ZipFile.open("test.docx")
doc = zip.find_entry("word/document.xml")
xml = Nokogiri::XML.parse(doc.get_input_stream)
wt = xml.root.xpath("//w:t", {"w" => "http://schemas.openxmlformats.org/wordprocessingml/2006/main"}).first
wt.content = "New Text"
zip.get_output_stream("word/document.xml") {|f| f << xml.to_s}
zip.close

I ran into the same corruption problem with rubyzip last night. I solved it by copying everything to a new zip file, replacing files as necessary.
Here's my working proof of concept:
#!/usr/bin/env ruby
require 'rubygems'
require 'zip/zip' # rubyzip gem
require 'nokogiri'
class WordXmlFile
def self.open(path, &block)
self.new(path, &block)
end
def initialize(path, &block)
#replace = {}
if block_given?
#zip = Zip::ZipFile.open(path)
yield(self)
#zip.close
else
#zip = Zip::ZipFile.open(path)
end
end
def merge(rec)
xml = #zip.read("word/document.xml")
doc = Nokogiri::XML(xml) {|x| x.noent}
(doc/"//w:fldSimple").each do |field|
if field.attributes['instr'].value =~ /MERGEFIELD (\S+)/
text_node = (field/".//w:t").first
if text_node
text_node.inner_html = rec[$1].to_s
else
puts "No text node for #{$1}"
end
end
end
#replace["word/document.xml"] = doc.serialize :save_with => 0
end
def save(path)
Zip::ZipFile.open(path, Zip::ZipFile::CREATE) do |out|
#zip.each do |entry|
out.get_output_stream(entry.name) do |o|
if #replace[entry.name]
o.write(#replace[entry.name])
else
o.write(#zip.read(entry.name))
end
end
end
end
#zip.close
end
end
if __FILE__ == $0
file = ARGV[0]
out_file = ARGV[1] || file.sub(/\.docx/, ' Merged.docx')
w = WordXmlFile.open(file)
w.force_settings
w.merge('First_Name' => 'Eric', 'Last_Name' => 'Mason')
w.save(out_file)
end

I stumbled accross the post and know nothing about ruby or nokogiri but ...
It looks like you are reziping the new content incorrectly.
I don't know about rubyzip, but you need a way to tell it to update the entry word/document.xml
and then resave/rezip the file.
It looks like you are just overwriting the entry with new data wich of course is going to be a different size and totally screw up the rest of the zip file.
I give an example for excel in this post Parse text file and create an excel report
which may be of use even though i am using a different zip library and VB (Im still doing exactly what you are trying to do, my code is about half way down)
here is the part that applies
Using z As ZipFile = ZipFile.Read(xlStream.BaseStream)
'Grab Sheet 1 out of the file parts and read it into a string.
Dim myEntry As ZipEntry = z("xl/worksheets/sheet1.xml")
Dim msSheet1 As New MemoryStream
myEntry.Extract(msSheet1)
msSheet1.Position = 0
Dim sr As New StreamReader(msSheet1)
Dim strXMLData As String = sr.ReadToEnd
'Grab the data in the empty sheet and swap out the data that I want
Dim str2 As XElement = CreateSheetData(tbl)
Dim strReplace As String = strXMLData.Replace("<sheetData/>", str2.ToString)
z.UpdateEntry("xl/worksheets/sheet1.xml", strReplace)
'This just rezips the file with the new data it doesnt save to disk
z.Save(fiRet.FullName)
End Using

According to the official Github documentation, you should Use write_buffer instead open. There's also a code example at the link.

Related

extracting path within a string using ruby

I have written a ruby script where I iterate through folders, and search for file names ending with ".xyz" . Within these files I search then for lines which have the following structure:
<ClCompile Include="..\..\..\Projects\Project_A\Applications\Modules\Sources\myfile.c"/>
This works so far with the script:
def parse_xyz_files
files = Dir["./**/*.xyz"]
files.each do |file_name|
puts file_name
File.open(file_name) do |f|
f.each_line { |line|
if line =~ /<ClCompile Include=/
puts "Found #{line}"
end
}
end
end
end
Now I would like to extract only the string between double quotes, in this example:
..\..\..\Projects\Project_A\Applications\Modules\Sources\myfile.c
I'm trying to do it with something like this (with match method):
def parse_xyz_files
files = Dir["./**/*.xyz"]
files.each do |file_name|
puts file_name
File.open(file_name) do |f|
f.each_line { |line|
if line =~ /<ClCompile Include=/.match(/"([^"]*)"/)
puts "Found #{line}"
end
}
end
end
end
The regular expression is so far ok (checked with rubular). Any idea how to do it in a simple way? I'm relative new to ruby.
You can use the String#scan method:
line = '<ClCompile Include="..\..\..\Projects\Project_A\Applications\Modules\Sources\myfile.c"/>'
path = line.scan(/".*"/).first
or in the case if your <CICompile> tag can have some other attributes then:
path = line.scan(/Include="(.*)"/).first.first
But using an XML parser is definitely a much better idea.
Use Nokogiri to parse the XML and not regex.
require 'nokogiri'
xml = '<foo><bar><ClCompile Include="..\..\..\Projects\Project_A\Applications\Modules\Sources\myfile.c"/></bar></foo>'
document = Nokogiri::XML xml
d.xpath('//ClCompile/#Include').text

How do I automate running a Ruby script?

I have written a ruby script (code below) to scrape from Deliveroo.co.uk.
Right now I run it manually by going to terminal and typing in 'ruby ....rb'.
How do I automate things so that this script runs automatically every hour?
Also, how do I save the output from each run without overwriting the previous output?
Code is below.. thank you.
require 'open-uri'
require 'nokogiri'
require 'csv'
# Store URL to be scraped
url = "https://deliveroo.co.uk/restaurants/london/maida-vale?postcode=W92DE"
# Parse the page with Nokogiri
page = Nokogiri::HTML(open(url))
# Display output onto the screen
name =[]
page.css('span.list-item-title.restaurant-name').each do |line|
name << line.text.strip
end
category = []
page.css('span.restaurant-detail.detail-cat').each do |line|
category << line.text.strip
end
delivery_time = []
page.css('span.restaurant-detail.detail-time').each do |line|
delivery_time << line.text.strip
end
distance = []
page.css('span.restaurant-detail.detail-distance').each do |line|
distance << line.text.strip
end
status = []
page.css('li.restaurant--details').each do |line|
if line.attr("class").include? "unavailable"
sts = "closed"
else
sts = "open"
end
status << sts
end
# Write data to CSV file
CSV.open("deliveroo.csv", "w") do |file|
file << ["Name", "Category", "Delivery Time", "Distance", "Status"]
name.length.times do |i|
file << [name[i], category[i], delivery_time[i], distance[i], status[i]]
end
end
There's two questions, I'll try to answer them below.
How to run periodically:
What you are looking for is a cronjob, there are many resources out there for creating one.
Look into cron or gems like whenever / clockwork.
Save output between multiple runs: In order to save the output you could just write to a file directly in ruby, very similar to what you are doing right now.
The way you're saving it right now is:
CSV.open("deliveroo.csv", "w") do |file|
The "w" opens the file and overwrites any content present in it, try "a" (append) instead.
CSV.open("deliveroo.csv", "a") do |file|
Read more here about opening files in different modes: File opening mode in Ruby

How to handle a file_as_string (generated by Prawn) so that it is accepted by Carrierwave?

I'm using Prawn to generate a PDF from the controller of a Rails app,
...
respond_to do |format|
format.pdf do
pdf = GenerateReportPdf.new(#object, view_context)
send_data pdf.render, filename: "Report", type: "application/pdf", disposition: "inline"
end
end
This works fine, but I now want to move GenerateReportPdf into a background task, and pass the resulting object to Carrierwave to upload directly to S3.
The worker looks like this
def perform
pdf = GenerateReportPdf.new(#object)
fileString = ???????
document = Document.new(
object_id: #object.id,
file: fileString )
# file is field used by Carrierwave
end
How do I handle the object returned by Prawn (?????) to ensure it is a format that can be read by Carrierwave.
fileString = pdf.render_file 'filename' writes the object to the root directory of the app. As I'm on Heroku this is not possible.
file = pdf.render returns ArgumentError: string contains null byte
fileString = StringIO.new( pdf.render_file 'filename' ) returns TypeError: no implicit conversion of nil into String
fileString = StringIO.new( pdf.render ) returns ActiveRecord::RecordInvalid: Validation failed: File You are not allowed to upload nil files, allowed types: jpg, jpeg, gif, png, pdf, doc, docx, xls, xlsx
fileString = File.open( pdf.render ) returns ArgumentError: string contains null byte
....and so on.
What am I missing? StringIO.new( pdf.render ) seems like it should work, but I'm unclear why its generating this error.
It turns out StringIO.new( pdf.render ) should indeed work.
The problem I was having was that the filename was being set incorrectly and, despite following the advise below on Carrierwave's wiki, a bug elsewhere in the code meant that the filename was returning as an empty string. I'd overlooked this an assumed that something else was needed
https://github.com/carrierwaveuploader/carrierwave/wiki/How-to:-Upload-from-a-string-in-Rails-3
my code ended up looking like this
def perform
s = StringIO.new(pdf.render)
def s.original_filename; "my file name"; end
document = Document.new(
object_id: #object.id
)
document.file = s
document.save!
end
You want to create a tempfile (which is fine on Heroku as long as you don't expect it to persist across requests).
def perform
# Create instance of your Carrierwave Uploader
uploader = MyUploader.new
# Generate your PDF
pdf = GenerateReportPdf.new(#object)
# Create a tempfile
tmpfile = Tempfile.new("my_filename")
# set to binary mode to avoid UTF-8 conversion errors
tmpfile.binmode
# Use render to write the file contents
tmpfile.write pdf.render
# Upload the tempfile with your Carrierwave uploader
uploader.store! tmpfile
# Close the tempfile and delete it
tmpfile.close
tmpfile.unlink
end
Here's a way you can use StringIO like Andy Harvey mentioned, but without adding a method to the StringIO intstance's eigenclass.
class VirtualFile < StringIO
attr_accessor :original_filename
def initialize(string, original_filename)
#original_filename = original_filename
super(string)
end
end
def perform
pdf_string = GenerateReportPdf.new(#object)
file = VirtualFile.new(pdf_string, 'filename.pdf')
document = Document.new(object_id: #object.id, file: file)
end
This one took me couple of days, the key is to call render_file controlling the filepath so you can keep track of the file, something like this:
in one of my Models e.g.: Policy i have a list of documents and this is just the method for updating the model connected with the carrierwave e.g.:PolicyDocument < ApplicationRecord mount_uploader :pdf_file, PdfDocumentUploader
def upload_pdf_document_file_to_s3_bucket(document_type, filepath)
policy_document = self.policy_documents.where(policy_document_type: document_type)
.where(status: 'processing')
.where(pdf_file: nil).last
policy_document.pdf_file = File.open(file_path, "r")
policy_document.status = 's3_uploaded'
policy_document.save(validate:false)
policy_document
rescue => e
policy_document.status = 's3_uploaded_failed'
policy_document.save(validate:false)
Rails.logger.error "Error uploading policy documents: #{e.inspect}"
end
end
in one of my Prawn PDF File Generators e.g.: PolicyPdfDocumentX in here please note how im rendering the file and returning the filepath so i can grab from the worker object itself
def generate_prawn_pdf_document
Prawn::Document.new do |pdf|
pdf.draw_text "Hello World PDF File", size: 8, at: [370, 462]
pdf.start_new_page
pdf.image Rails.root.join('app', 'assets', 'images', 'hello-world.png'), width: 550
end
end
def generate_tmp_file(filename)
file_path = File.join(Rails.root, "tmp/pdfs", filename)
self.generate_prawn_pdf_document.render_file(file_path)
return filepath
end
in the "global" Worker for creating files and uploading them in the s3 bucket e.g.: PolicyDocumentGeneratorWorker
def perform(filename, document_type, policy)
#here we create the instance of the prawn pdf generator class
pdf_generator_class = document_type.constantize.new
#here we are creating the file, but also `returning the filepath`
file_path = pdf_generator_class.generate_tmp_file(filename)
#here we are simply updating the model with the new file created
policy.upload_pdf_document_file_to_s3_bucket(document_type, file_path)
end
finally how to test, run rails c and:
the_policy = Policies.where....
PolicyDocumentGeneratorWorker.new.perform('report_x.pdf', 'PolicyPdfDocumentX',the_policy)
NOTE: im using meta-programming in case we have multiple and different file generators, constantize.new is just creating new prawn pdf doc generator instance so is similar to PolicyPdfDocument.new that way we can only have one pdf doc generator worker class that can handle all of your prawn pdf documents so for instance if you need a new document you can simply PolicyDocumentGeneratorWorker.new.perform('report_y.pdf', 'PolicyPdfDocumentY',the_policy)
:D
hope this helps someone to save some time

Traversing directories and reading from files in ruby on rails

I'm having some trouble figuring out how to 1) traverse a directory and 2) taking each file (.txt) and saving it as a string. I'm obviously pretty new to both ruby and rails.
I know that I could save the file with f=File.open("/path/*.txt") and then output it with puts f.read but I would rather save it as a string, not .txt, and dont know how to do this for each file.
Thanks!
You could use Dir.glob and map over the filenames to read each filename into a string using IO.read. This is some pseudo code:
file_names_with_contents = Dir.glob('/path/*.txt').inject({}){|results, file_name| result[file_name] = IO.read(file_name)}
You could prob also use tap here:
file_names_with_contents = {}.tap do |h|
Dir.glob('/path/*.txt').each{|file_name| h[file_name] = IO.read(file_name)}
end
The following based on python os.walk function, which returns a list of tuples with: (dirname, dirs, files ). Since this is ruby, you get a list of arrays with:
[dirname, dirs, files]. This should be easier to process than trying to recursively walk the directory yourself. To run the code, you'll need to provide a demo_folder.
def walk(dir)
dir_list = []
def _walk(dir, dir_list)
fns = Dir.entries(dir)
dirs = []
files = []
dirname = File.expand_path(dir)
list_item = [dirname, dirs, files]
fns.each do |fn|
next if [".",".."].include? fn
path_fn = File.join(dirname, fn)
if File.directory? path_fn
dirs << fn
_walk(path_fn, dir_list)
else
files << fn
end
end
dir_list << list_item
end
_walk(dir, dir_list)
dir_list
end
if __FILE__ == $0
require 'json'
dir_list = walk('demo_folder')
puts JSON.pretty_generate(dir_list)
end
Jake's answer is good enough, but each_with_object will make it slightly shorter. I also made it recursive.
def read_dir dir
Dir.glob("#{dir}/*").each_with_object({}) do |f, h|
if File.file?(f)
h[f] = open(f).read
elsif File.directory?(f)
h[f] = read_dir(f)
end
end
end
When the directory is like:
--+ directory_a
+----file_b
+-+--directory_c
| +-----file_d
+----file_e
then
read_dir(directory_a)
willl return:
{file_b => contents_of_file_b,
directory_c => {file_d => contents_of_file_d},
file_e => contents_of_file_e}

Rails, use the content of file in the controller

I've a file in the config directory, let's say my_policy.txt.
I want to use the content of that file in my controller like a simple string.
#policy = #content of /config/my_policy.txt
How to achieve that goal, does Rails provide its own way to do that?
Thanks
Rails doesn't provide a way, but Ruby does:
#policy = IO.read("#{Rails.root}\config\my_policy.txt")
#policy = File.read(RAILS_ROOT + '/config/my_policy.txt')
To also cache the content (if you don't want to read it every time the variable is used):
def policy
##policy ||= File.read(RAILS_ROOT + '/config/my_policy.txt')
end
If you need something more elegant for configuration, check configatronic.
Read file as string like that?!
def get_file_as_string(filename)
data = ''
f = File.open(filename, "r")
f.each_line do |line|
data += line
end
return data
end
##### MAIN #####
#policy = get_file_as_string 'path/to/my_policy.txt'
# print out the string
puts #policy

Resources