I have written a Ruby script that iterates through folders and searches for file names ending in ".xyz". Within these files I then search for lines with the following structure:
<ClCompile Include="..\..\..\Projects\Project_A\Applications\Modules\Sources\myfile.c"/>
This works so far with the script:
def parse_xyz_files
files = Dir["./**/*.xyz"]
files.each do |file_name|
puts file_name
File.open(file_name) do |f|
f.each_line { |line|
if line =~ /<ClCompile Include=/
puts "Found #{line}"
end
}
end
end
end
Now I would like to extract only the string between double quotes, in this example:
..\..\..\Projects\Project_A\Applications\Modules\Sources\myfile.c
I'm trying to do it with something like this (with match method):
def parse_xyz_files
files = Dir["./**/*.xyz"]
files.each do |file_name|
puts file_name
File.open(file_name) do |f|
f.each_line { |line|
if line =~ /<ClCompile Include=/.match(/"([^"]*)"/)
puts "Found #{line}"
end
}
end
end
end
The regular expression itself seems fine (checked with Rubular). Any idea how to do this in a simple way? I'm relatively new to Ruby.
You can use the String#scan method:
line = '<ClCompile Include="..\..\..\Projects\Project_A\Applications\Modules\Sources\myfile.c"/>'
path = line.scan(/"([^"]*)"/).first.first
or, if your <ClCompile> tag can have other attributes:
path = line.scan(/Include="([^"]*)"/).first.first
But using an XML parser is definitely a much better idea.
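To tie the regex approach back to the loop in the question, here is a minimal sketch. The helper name include_paths is hypothetical, and it assumes each Include value sits on one line and contains no escaped double quotes.

```ruby
# Hypothetical helper: returns the value of every Include attribute found on
# <ClCompile ...> elements in the given text.
def include_paths(text)
  # [^"]* stops at the first closing quote, so other attributes on the same
  # tag are not swallowed the way a greedy .* would swallow them.
  text.scan(/<ClCompile\s+Include="([^"]*)"/).flatten
end

line = '<ClCompile Include="..\..\..\Projects\Project_A\Applications\Modules\Sources\myfile.c"/>'
puts include_paths(line)
```

Inside the original each_line block, the call would be something like include_paths(line).each { |path| puts "Found #{path}" }.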
Use Nokogiri to parse the XML and not regex.
require 'nokogiri'
xml = '<foo><bar><ClCompile Include="..\..\..\Projects\Project_A\Applications\Modules\Sources\myfile.c"/></bar></foo>'
document = Nokogiri::XML xml
document.xpath('//ClCompile/@Include').text
I have a list of names (names.txt) separated by line. After I loop through each line, I'd like to move it to another file (processed.txt).
My current implementation to loop through each line:
open("names.txt") do |csv|
csv.each_line do |line|
url = line.split("\n")
puts url
# Remove line from this file and move it to processed.txt
end
end
def readput
@names = File.readlines("names.txt")
File.open("processed.txt", "w+") do |f|
f.puts(@names)
end
end
You can do it like this:
File.open('processed.txt', 'a') do |file|
open("names.txt") do |csv|
csv.each_line do |line|
url = line.chomp
# Do something interesting with url...
file.puts url
end
end
end
This will result in processed.txt containing all of the urls that were processed with this code.
Note: Removing the line from names.txt is not practical using this method. See How do I remove lines of data in the middle of a text file with Ruby for more information. If this is a real goal of this solution, it will be a much larger implementation with some design considerations that need to be defined.
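If removing lines one at a time is not required, one simpler option is to process the whole file and empty it afterwards. The sketch below assumes that is acceptable and that names.txt fits in memory; the helper name move_lines is hypothetical.

```ruby
require 'tmpdir'

# Hypothetical helper: processes every line of `source`, appends each line to
# `destination`, then empties `source`. Returns the number of lines moved.
def move_lines(source, destination)
  lines = File.readlines(source, chomp: true)
  File.open(destination, 'a') do |out|
    lines.each do |line|
      yield line if block_given? # do something interesting with the line
      out.puts line
    end
  end
  # Only reached when every line was processed without raising.
  File.write(source, '')
  lines.size
end

# Demo in a temporary directory so no real files are touched.
Dir.mktmpdir do |dir|
  names = File.join(dir, 'names.txt')
  processed = File.join(dir, 'processed.txt')
  File.write(names, "alice\nbob\n")
  move_lines(names, processed) { |name| puts name }
end
```

If the block raises partway through, the source file is left untouched, so a rerun would append the earlier lines to the destination a second time. Whether that is acceptable is one of the design considerations mentioned above.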
I have written a ruby script (code below) to scrape from Deliveroo.co.uk.
Right now I run it manually by going to terminal and typing in 'ruby ....rb'.
How do I automate things so that this script runs automatically every hour?
Also, how do I save the output from each run without overwriting the previous output?
Code is below.. thank you.
require 'open-uri'
require 'nokogiri'
require 'csv'
# Store URL to be scraped
url = "https://deliveroo.co.uk/restaurants/london/maida-vale?postcode=W92DE"
# Parse the page with Nokogiri
page = Nokogiri::HTML(open(url))
# Display output onto the screen
name =[]
page.css('span.list-item-title.restaurant-name').each do |line|
name << line.text.strip
end
category = []
page.css('span.restaurant-detail.detail-cat').each do |line|
category << line.text.strip
end
delivery_time = []
page.css('span.restaurant-detail.detail-time').each do |line|
delivery_time << line.text.strip
end
distance = []
page.css('span.restaurant-detail.detail-distance').each do |line|
distance << line.text.strip
end
status = []
page.css('li.restaurant--details').each do |line|
if line.attr("class").include? "unavailable"
sts = "closed"
else
sts = "open"
end
status << sts
end
# Write data to CSV file
CSV.open("deliveroo.csv", "w") do |file|
file << ["Name", "Category", "Delivery Time", "Distance", "Status"]
name.length.times do |i|
file << [name[i], category[i], delivery_time[i], distance[i], status[i]]
end
end
There are two questions here; I'll try to answer each below.
How to run periodically:
What you are looking for is a cron job; there are many resources out there on creating one.
Look into cron itself or gems such as whenever or clockwork.
Save output between multiple runs: In order to save the output you could just write to a file directly in ruby, very similar to what you are doing right now.
The way you're saving it right now is:
CSV.open("deliveroo.csv", "w") do |file|
The "w" opens the file and overwrites any content present in it, try "a" (append) instead.
CSV.open("deliveroo.csv", "a") do |file|
Read more here about opening files in different modes: File opening mode in Ruby
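One catch with append mode is that the header row would be written again on every run. A minimal sketch of one way around that, assuming the header should appear only once; the helper name append_rows is hypothetical.

```ruby
require 'csv'
require 'tmpdir'

# Hypothetical helper: appends rows to a CSV file, writing the header row only
# when the file does not exist yet or is empty.
def append_rows(path, header, rows)
  needs_header = !File.exist?(path) || File.zero?(path)
  CSV.open(path, 'a') do |csv|
    csv << header if needs_header
    rows.each { |row| csv << row }
  end
end

# Demo: two "runs" appending to the same file in a temporary directory.
Dir.mktmpdir do |dir|
  path = File.join(dir, 'deliveroo.csv')
  append_rows(path, %w[Name Status], [['Cafe A', 'open']])
  append_rows(path, %w[Name Status], [['Cafe B', 'closed']])
  puts File.read(path)
end
```

Adding a column with the time of the run would make it easier to tell the hourly snapshots apart later.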
I'm running a rake task to import some file attributes and I'm receiving an error that leads me to believe that the string created for each line contains some sort of new-line character (e.g. \n).
EDIT - New-line character has been confirmed to be the issue.
Here is a sample of what my CSV file might look like:
1|type1,type2|category1
2|type2|category1,category2,category3
3|type2,type4|category3,category8
And here is my code to deal with it:
namespace :data do
desc "import"
task :import => :environment do
file = File.open(Rails.root.join('lib/assets/data.csv'), 'r')
file.each do |line|
attrs = line.split("|")
foo = Model.find(attrs[0])
attrs[1].split(",").each do |type|
foo.add_type!(ModelType.find_by_name(type))
end
attrs[2].split(",").each do |category|
foo.categorize!(ModelCategory.find_by_name(category))
end
end
end
end
ModelType and ModelCategory are both separate models with a :through relationship to Model, built with the methods Model#add_type! and Model#categorize!.
When I run rake data:import, everything works fine up until the final category is reached at the end of the first line. It doesn't matter which category it is, nor how many categories are present in attrs[2] - it only fails on the last one. This is the error I receive:
Called id for nil, which would mistakenly be 4 -- if you really wanted the id of nil, use object_id
Any thoughts on how to fix this or avoid this error?
You can use chomp:
attrs = line.chomp.split("|")
attrs = line.split("|")
if attrs.length > 0
foo = Model.find(attrs[0])
...
end
You probably have an empty line at the end of your CSV
UPDATE
file = File.read(Rails.root.join('lib/assets/data.csv'))
file.split("\r\n").each do |line|
or
file = File.read(Rails.root.join('lib/assets/data.csv'))
file.split("\r").each do |line|
or
file = File.read(Rails.root.join('lib/assets/data.csv'))
file.split("\n").each do |line|
depending on how the CSV was originally generated!
Use String#encode(universal_newline: true) instead of gsub.
It converts CRLF and CR to LF, so lines always break with \n.
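Putting the suggestions together, here is a minimal sketch that normalises line endings, strips the trailing newline, and skips blank lines. The helper name parse_line and the sample data are hypothetical; the Model lookups from the question are left out.

```ruby
# Hypothetical helper: turns one "id|types|categories" line into its parts,
# or returns nil for a blank line.
def parse_line(line)
  fields = line.chomp.split('|')
  return nil if fields.empty?

  id, types, categories = fields
  # to_s guards against lines that are missing a trailing field.
  [id, types.to_s.split(','), categories.to_s.split(',')]
end

data = "1|type1,type2|category1\r\n2|type2|category1,category2\r\n\r\n"

# universal_newline: true rewrites CRLF and CR as LF before iterating.
data.encode(universal_newline: true).each_line do |line|
  parsed = parse_line(line)
  p parsed if parsed
end
```

Because chomp removes the trailing newline before the split, the last category no longer carries a newline, so a lookup such as find_by_name receives the clean name.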
I'm having some trouble figuring out how to 1) traverse a directory and 2) take each file (.txt) and save its contents as a string. I'm obviously pretty new to both Ruby and Rails.
I know that I could open a file with f=File.open("/path/*.txt") and then output it with puts f.read, but I would rather save it as a string, not .txt, and don't know how to do this for each file.
Thanks!
You could use Dir.glob and iterate over the file names, reading each file into a string with IO.read. For example:
file_names_with_contents = Dir.glob('/path/*.txt').inject({}){|results, file_name| results[file_name] = IO.read(file_name); results}
You could probably also use tap here:
file_names_with_contents = {}.tap do |h|
Dir.glob('/path/*.txt').each{|file_name| h[file_name] = IO.read(file_name)}
end
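Another way to build the same hash is Array#to_h with a block. This is a minimal sketch assuming Ruby 2.6 or newer, where to_h accepts a block; the helper name read_files is hypothetical.

```ruby
require 'tmpdir'

# Hypothetical helper: maps each file matching `pattern` to its contents.
def read_files(pattern)
  Dir.glob(pattern).to_h { |file_name| [file_name, File.read(file_name)] }
end

# Demo in a temporary directory so the example does not depend on /path.
Dir.mktmpdir do |dir|
  File.write(File.join(dir, 'a.txt'), 'first file')
  File.write(File.join(dir, 'b.txt'), 'second file')
  read_files(File.join(dir, '*.txt')).each do |name, contents|
    puts "#{File.basename(name)}: #{contents}"
  end
end
```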
The following is based on Python's os.walk function, which yields tuples of the form (dirname, dirs, files). Since this is Ruby, you get a list of arrays of the form
[dirname, dirs, files]. This should be easier to process than trying to recursively walk the directory yourself. To run the code, you'll need to provide a demo_folder.
def walk(dir)
dir_list = []
def _walk(dir, dir_list)
fns = Dir.entries(dir)
dirs = []
files = []
dirname = File.expand_path(dir)
list_item = [dirname, dirs, files]
fns.each do |fn|
next if [".",".."].include? fn
path_fn = File.join(dirname, fn)
if File.directory? path_fn
dirs << fn
_walk(path_fn, dir_list)
else
files << fn
end
end
dir_list << list_item
end
_walk(dir, dir_list)
dir_list
end
if __FILE__ == $0
require 'json'
dir_list = walk('demo_folder')
puts JSON.pretty_generate(dir_list)
end
Jake's answer is good enough, but each_with_object will make it slightly shorter. I also made it recursive.
def read_dir dir
Dir.glob("#{dir}/*").each_with_object({}) do |f, h|
if File.file?(f)
h[f] = File.read(f)
elsif File.directory?(f)
h[f] = read_dir(f)
end
end
end
When the directory is like:
--+ directory_a
+----file_b
+-+--directory_c
| +-----file_d
+----file_e
then
read_dir(directory_a)
will return:
{file_b => contents_of_file_b,
directory_c => {file_d => contents_of_file_d},
file_e => contents_of_file_e}
I'm using a combination of rubyzip and nokogiri to edit a .docx file. I use rubyzip to unzip the .docx file and then nokogiri to parse and change the body of the word/document.xml file, but every time I close rubyzip at the end it corrupts the file and I can't open or repair it. When I unzip the .docx file on my desktop and check word/document.xml, the content is updated to what I changed it to, but all the other files are messed up. Could someone help me with this issue? Here is my code:
require 'rubygems'
require 'zip/zip'
require 'nokogiri'
zip = Zip::ZipFile.open("test.docx")
doc = zip.find_entry("word/document.xml")
xml = Nokogiri::XML.parse(doc.get_input_stream)
wt = xml.root.xpath("//w:t", {"w" => "http://schemas.openxmlformats.org/wordprocessingml/2006/main"}).first
wt.content = "New Text"
zip.get_output_stream("word/document.xml") {|f| f << xml.to_s}
zip.close
I ran into the same corruption problem with rubyzip last night. I solved it by copying everything to a new zip file, replacing files as necessary.
Here's my working proof of concept:
#!/usr/bin/env ruby
require 'rubygems'
require 'zip/zip' # rubyzip gem
require 'nokogiri'
class WordXmlFile
def self.open(path, &block)
self.new(path, &block)
end
def initialize(path, &block)
@replace = {}
if block_given?
@zip = Zip::ZipFile.open(path)
yield(self)
@zip.close
else
@zip = Zip::ZipFile.open(path)
end
end
def merge(rec)
xml = @zip.read("word/document.xml")
doc = Nokogiri::XML(xml) {|x| x.noent}
(doc/"//w:fldSimple").each do |field|
if field.attributes['instr'].value =~ /MERGEFIELD (\S+)/
text_node = (field/".//w:t").first
if text_node
text_node.inner_html = rec[$1].to_s
else
puts "No text node for #{$1}"
end
end
end
@replace["word/document.xml"] = doc.serialize :save_with => 0
end
def save(path)
Zip::ZipFile.open(path, Zip::ZipFile::CREATE) do |out|
@zip.each do |entry|
out.get_output_stream(entry.name) do |o|
if @replace[entry.name]
o.write(@replace[entry.name])
else
o.write(@zip.read(entry.name))
end
end
end
end
@zip.close
end
end
if __FILE__ == $0
file = ARGV[0]
out_file = ARGV[1] || file.sub(/\.docx/, ' Merged.docx')
w = WordXmlFile.open(file)
# w.force_settings is left out: the method is not defined in this proof of concept
w.merge('First_Name' => 'Eric', 'Last_Name' => 'Mason')
w.save(out_file)
end
I stumbled across this post and know nothing about Ruby or Nokogiri, but...
It looks like you are rezipping the new content incorrectly.
I don't know rubyzip, but you need a way to tell it to update the entry word/document.xml
and then resave/rezip the file.
It looks like you are just overwriting the entry with new data, which will of course be a different size and will corrupt the rest of the zip file.
I give an example for Excel in this post: Parse text file and create an excel report,
which may be of use even though I am using a different zip library and VB (I'm still doing exactly what you are trying to do; my code is about halfway down).
Here is the part that applies:
Using z As ZipFile = ZipFile.Read(xlStream.BaseStream)
'Grab Sheet 1 out of the file parts and read it into a string.
Dim myEntry As ZipEntry = z("xl/worksheets/sheet1.xml")
Dim msSheet1 As New MemoryStream
myEntry.Extract(msSheet1)
msSheet1.Position = 0
Dim sr As New StreamReader(msSheet1)
Dim strXMLData As String = sr.ReadToEnd
'Grab the data in the empty sheet and swap out the data that I want
Dim str2 As XElement = CreateSheetData(tbl)
Dim strReplace As String = strXMLData.Replace("<sheetData/>", str2.ToString)
z.UpdateEntry("xl/worksheets/sheet1.xml", strReplace)
'This just rezips the file with the new data it doesnt save to disk
z.Save(fiRet.FullName)
End Using
According to the official GitHub documentation, you should use write_buffer instead of open. There's also a code example at the link.