How to get XML doc from downloaded zip file in rails - ruby-on-rails

I used Typhoeus to stream a zip file to memory, and I am iterating through each file in it to extract the XML doc. To read the XML file I used Nokogiri, but I get an error: Errno::ENOENT: No such file or directory @ rb_sysopen - my_xml_doc.xml.
I looked up the error and saw that Ruby is most likely running the script in the wrong directory. I am a little confused: do I need to save the XML doc to memory first before I can read it as well?
Here is my code to clarify further:
Controller
def index
  url = "http://feed.omgili.com/5Rh5AMTrc4Pv/mainstream/posts/"
  html_response = Typhoeus.get(url)
  doc = Nokogiri::HTML(html_response.response_body)
  path_array = []
  doc.search("a").each do |value|
    path_array << value.content if value.content.include?(".zip")
  end
  path_array.each do |zip_link|
    download_file = File.open zip_link, "wb"
    request = Typhoeus::Request.new("#{url}#{zip_link}")
    binding.pry
    request.on_headers do |response|
      if response.code != 200
        raise "Request failed"
      end
    end
    request.on_body do |chunk|
      download_file.write(chunk)
    end
    request.run
    Zip::File.open(download_file) do |zipfile|
      zipfile.each do |file|
        binding.pry
        doc = Nokogiri::XML(File.read(file.name))
      end
    end
  end
end
file
=> #<Zip::Entry:0x007ff88998373
 @comment="",
 @comment_length=0,
 @compressed_size=49626,
 @compression_method=8,
 @crc=20393847,
 @dirty=false,
 @external_file_attributes=0,
 @extra={},
 @extra_length=0,
 @filepath=nil,
 @follow_symlinks=false,
 @fstype=0,
 @ftype=:file,
 @gp_flags=2056,
 @header_signature=009890,
 @internal_file_attributes=0,
 @last_mod_date=18769,
 @last_mod_time=32626,
 @local_header_offset=0,
 @local_header_size=nil,
 @name="my_xml_doc.xml",
 @name_length=36,
 @restore_ownership=false,
 @restore_permissions=false,
 @restore_times=true,
 @size=138793,
 @time=2016-10-17 15:59:36 -0400,
 @unix_gid=nil,
 @unix_perms=nil,
 @unix_uid=nil,
 @version=20,
 @version_needed_to_extract=20,
 @zipfile="some_zip_file.zip">
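The inspected entry above hints at the cause: file.name is only the entry's name inside the archive, so File.read(file.name) looks for a file of that name on disk, relative to the process's working directory, and raises Errno::ENOENT when nothing has been extracted there. A minimal, stdlib-only sketch of the difference follows. The StringIO stands in for the stream an archive entry exposes (with rubyzip that would be entry.get_input_stream), and the helper names and XML payload are made up for illustration.

```ruby
require "stringio"
require "tmpdir"

# Hypothetical helper: read a bare entry name from disk, the way
# File.read(file.name) does. Returns the contents, or nil if no such file exists.
def read_from_disk(name, dir)
  File.read(File.join(dir, name))
rescue Errno::ENOENT
  nil
end

# Hypothetical helper: read the bytes from an IO-like object instead of a path.
# With rubyzip, entry.get_input_stream would play the role of `io`.
def read_from_stream(io)
  io.read
end

Dir.mktmpdir do |dir|
  # Nothing was extracted into dir, so the name does not resolve to a file.
  p read_from_disk("my_xml_doc.xml", dir)

  # The same content read from an in-memory stream needs no file on disk.
  stream = StringIO.new("<doc><title>example</title></doc>")
  puts read_from_stream(stream)
end
```

The string read from the stream can be handed to Nokogiri::XML directly, so no intermediate file is required.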

This is the solution I came up with:
Gems:
gem 'typhoeus'
gem 'rubyzip'
gem 'redis', '~>3.2'
Controller:
def xml_to_redis_list(url)
  html_response = Typhoeus.get(url)
  doc = Nokogiri::HTML(html_response.response_body)
  @redis = Redis.new
  path_array = []
  doc.search("a").each do |value|
    path_array << value.content if value.content.include?(".zip")
  end
  path_array.each do |zip_link|
    download_file = File.open zip_link, "wb"
    request = Typhoeus::Request.new("#{url}#{zip_link}")
    request.on_headers do |response|
      if response.code != 200
        raise "Request failed"
      end
    end
    request.on_body do |chunk|
      download_file.write(chunk)
    end
    request.run
    while download_file.size == 0
      sleep 1
    end
    zip_download = Zip::File.open(download_file.path)
    Zip::File.open("#{Rails.root}/#{zip_download.name}") do |zip_file|
      zip_file.each do |file|
        xml_string = zip_file.read(file.name)
        check_if_xml_duplicate(xml_string)
        @redis.rpush("NEWS_XML", xml_string)
      end
    end
    File.delete("#{Rails.root}/#{zip_link}")
  end
end
def check_if_xml_duplicate(xml_string)
  @redis.lrem("NEWS_XML", -1, xml_string)
end

Related

Ruby on Rails Encoding::UndefinedConversionError ("\xF8" from ASCII-8BIT to UTF-8)

I saw a bunch of questions on a similar topic, but I couldn't find a solution to my problem. Hopefully someone can help.
I have a Ruby on Rails app. In this app, I have some base64 data that I want to decode and write to a file. When I run the code as a small script with ruby myFile.rb, the program behaves as expected. However, when I run the same code with rails c, I get the following error:
Traceback (most recent call last):
7: from (irb):1:in `<main>'
6: from app/models/node.rb:1:in `<main>'
5: from app/models/node_manager.rb:317:in `<main>'
4: from app/models/node_manager.rb:255:in `convert_base64_to_file'
3: from app/models/node_manager.rb:255:in `open'
2: from app/models/node_manager.rb:255:in `block in convert_base64_to_file'
1: from app/models/node_manager.rb:255:in `write'
Encoding::UndefinedConversionError ("\xF8" from ASCII-8BIT to UTF-8)
Here are my files:
Class I wrote in node_manager.rb:
class NodeManager
  class MacaroonInterceptor < GRPC::ClientInterceptor
    attr_reader :macaroon
    def initialize(macaroon)
      @macaroon = macaroon
      super
    end
    def request_response(request:, call:, method:, metadata:)
      metadata['macaroon'] = macaroon
      yield
    end
    def server_streamer(request:, call:, method:, metadata:)
      metadata['macaroon'] = macaroon
      yield
    end
  end
  def initialize(tls_path, macaroon_path, node_ip)
    @tls_path = tls_path
    @macaroon_path = macaroon_path
    @node_ip = node_ip
    @node = connect(@tls_path, @macaroon_path, @node_ip)
  end
  def connect(tls_path, macaroon_path, node_ip)
    certificate = File.read(File.expand_path(tls_path))
    credentials = GRPC::Core::ChannelCredentials.new(certificate)
    macaroon_binary = File.read(File.expand_path(macaroon_path))
    macaroon = macaroon_binary.each_byte.map { |b| b.to_s(16).rjust(2,'0') }.join
    if stub = Lnrpc::Lightning::Stub.new(
      node_ip,
      credentials,
      interceptors: [NodeManager::MacaroonInterceptor.new(macaroon)],
      channel_args: {"grpc.max_receive_message_length" => 1024 * 1024 * 50}
    )
      return stub
    else
      puts 'error'
      raise "Could not connect to the node"
    end
  end
  def get_info()
    begin
      node_info = @node.get_info(Lnrpc::GetInfoRequest.new())
      raise LoadError if !node_info
      return @node.get_info(Lnrpc::GetInfoRequest.new())
    rescue
      puts "could not connect to the node"
    end
  end
  # Class methods
  def self.convert_base64_to_file(directory: nil, file_name:, base64data:)
    if directory
      file_path = [directory, file_name].join('/')
    else
      file_path = file_name
    end
    file = File.open("file_name", 'w') {|f| f.write(Base64.decode64(base64data)) }
    return file_path
  end
end
script I run - works when I call ruby node_manager.rb and not when I run it with rails c
admin_tls_base64 = "LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUNKekNDQWMyZ0F3SUJBZ0lSQUlWYzdlNHkxVGlpcHpmNWtHd09BUkF3Q2dZSUtvWkl6ajBFQXdJd01URWYKTUIwR0ExVUVDaE1XYkc1a0lHRjFkRzluWlc1bGNtRjBaV1FnWTJWeWRERU9NQXdHQTFVRUF4TUZZV3hwWTJVdwpIaGNOTWpFd056QTRNVEF5TmpFMFdoY05Nakl3T1RBeU1UQXlOakUwV2pBeE1SOHdIUVlEVlFRS0V4WnNibVFnCllYVjBiMmRsYm1WeVlYUmxaQ0JqWlhKME1RNHdEQVlEVlFRREV3VmhiR2xqWlRCWk1CTUdCeXFHU000OUFnRUcKQ0NxR1NNNDlBd0VIQTBJQUJMQ3B2eDdZb1ptZURIQVdjN0ozdmZTTUhPNEJhRnpMV0hJUXhJM2swaVJRS2xVVQpaSkdUcnFCQm1kU3AzcnNRMWsvSmtqOVlLZEVRMHVmcTdNeDBzcEdqZ2NVd2djSXdEZ1lEVlIwUEFRSC9CQVFECkFnS2tNQk1HQTFVZEpRUU1NQW9HQ0NzR0FRVUZCd01CTUE4R0ExVWRFd0VCL3dRRk1BTUJBZjh3SFFZRFZSME8KQkJZRUZJWlUrUHlTYVB0eFdOV0RqckJ5SUZuVXdTL0JNR3NHQTFVZEVRUmtNR0tDQldGc2FXTmxnZ2xzYjJOaApiR2h2YzNTQ0JXRnNhV05sZ2c1d2IyeGhjaTF1TWkxaGJHbGpaWUlFZFc1cGVJSUtkVzVwZUhCaFkydGxkSUlIClluVm1ZMjl1Ym9jRWZ3QUFBWWNRQUFBQUFBQUFBQUFBQUFBQUFBQUFBWWNFckJJQUFqQUtCZ2dxaGtqT1BRUUQKQWdOSUFEQkZBaUVBbDhnOERONStpRTA5c1ZZS2RoblExUmZWeWhhaTdLMG9xK2puV3NyNHpVNENJR2t6Nko4eQprTk4rSTRHT040UGNIWGN1QjRjWEt1aDJxVWFvd2IvZVBxV3YKLS0tLS1FTkQgQ0VSVElGSUNBVEUtLS0tLQo="
admin_mac_base64= "AgEDbG5kAvgBAwoQ4MUVu75H5pCIGbg7qzbRfBIBMBoWCgdhZGRyZXNzEgRyZWFkEgV3cml0ZRoTCgRpbmZvEgRyZWFkEgV3cml0ZRoXCghpbnZvaWNlcxIEcmVhZBIFd3JpdGUaIQoIbWFjYXJvb24SCGdlbmVyYXRlEgRyZWFkEgV3cml0ZRoWCgdtZXNzYWdlEgRyZWFkEgV3cml0ZRoXCghvZmZjaGFpbhIEcmVhZBIFd3JpdGUaFgoHb25jaGFpbhIEcmVhZBIFd3JpdGUaFAoFcGVlcnMSBHJlYWQSBXdyaXRlGhgKBnNpZ25lchIIZ2VuZXJhdGUSBHJlYWQAAAYgzSSZTiO2yEt9aP+zP95czvfNNPgQXhyLNto2X1onfqQ="
admin_tls = NodeManager.convert_base64_to_file(file_name: "test_tls.cert", base64data: admin_tls_base64)
admin_mac = NodeManager.convert_base64_to_file(file_name: "test.macaroon", base64data: admin_mac_base64)
admin_node = NodeManager.new(admin_tls, admin_mac, '127.0.0.1:10001')
puts admin_node.get_info
Thank you for your help,
A simple solution was to open the file in binary write mode ('wb') before calling write:
def self.convert_base64_to_file(directory: nil, file_name:, base64data:)
  if directory
    file_path = [directory, file_name].join('/')
  else
    file_path = file_name
  end
  file = File.open("file_name", 'wb') {|f| f.write(Base64.decode64(base64data)) }
  return file_path
end
I hope that'll help someone!
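As to why binary mode helps: Base64.decode64 returns a string tagged as ASCII-8BIT (binary). In text mode Ruby may try to transcode that string on write, most likely here because Rails sets Encoding.default_internal to UTF-8, and a byte such as \xF8 has no UTF-8 mapping. A small stdlib-only sketch of the binary round trip follows; the helper name, file name, and sample bytes are made up for illustration.

```ruby
require "base64"
require "tmpdir"

# Hypothetical helper mirroring convert_base64_to_file: decode the payload and
# write the raw bytes in binary mode so no encoding conversion is attempted.
def write_base64_binary(path, base64data)
  File.open(path, "wb") { |f| f.write(Base64.decode64(base64data)) }
  path
end

Dir.mktmpdir do |dir|
  payload = Base64.strict_encode64("\xF8\x00\xFF".b) # made-up sample bytes
  path = write_base64_binary(File.join(dir, "sample.bin"), payload)

  decoded = Base64.decode64(payload)
  puts decoded.encoding               # the decoded string is binary, not UTF-8
  puts File.binread(path) == decoded  # compares the bytes on disk with the decoded data
end
```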

AWS::S3::Errors::NoSuchKey: No Such Key error

I'm trying to create a method that deletes files from an S3 bucket, but I get an AWS::S3::Errors::NoSuchKey: No Such Key error when I call .head or .read on an object.
app/models/file_item.rb
def thumbnail
  {
    exists: thumbnailable?,
    small: "http://#{bucket}.s3.amazonaws.com/images/#{id}/small_thumb.png",
    large: "http://#{bucket}.s3.amazonaws.com/images/#{id}/large_thumb.png"
  }
end
lib/adapters/amazons3/accessor.rb
module Adapters
  module AmazonS3
    class Accessor
      S3_BUCKET = AWS::S3.new.buckets[ENV['AMAZON_BUCKET']]
      ...
      def self.delete_file(thumbnail)
        prefix_pattern = %r{http://[MY-S3-HOST]-[a-z]+.s3.amazonaws.com/}
        small_path = thumbnail[:small].sub(prefix_pattern, '')
        large_path = thumbnail[:large].sub(prefix_pattern, '')
        small = S3_BUCKET.objects[small_path]
        large = S3_BUCKET.objects[large_path]
        binding.pry
        S3_BUCKET.objects.delete([small, large])
      end
    end
  end
end
example url1
"http://projectname-staging.s3.amazonaws.com/images/994/small_thumb.png"
example url2
"http://projectname-production.s3.amazonaws.com/images/994/large_thumb.png"
Assuming AWS SDK v1 for Ruby:
small = S3_BUCKET.objects[small_path]
does not actually fetch anything from S3.
From https://docs.aws.amazon.com/AWSRubySDK/latest/AWS/S3/Bucket.html:
bucket.objects['key'] #=> makes no request, returns an S3Object
bucket.objects.each do |obj|
  puts obj.key
end
So you would need to alter your code to something like:
to_delete = []
S3_BUCKET.objects.with_prefix(small_path).each do |obj|
  to_delete << obj.key
end
S3_BUCKET.objects.with_prefix(large_path).each do |obj|
  to_delete << obj.key
end
S3_BUCKET.objects.delete(to_delete)
I just banged out the code, so the idea is there; you might need to correct or polish it a bit.
I was able to come up with a somewhat different solution thanks to @Mircea's answer above.
def self.delete_file(thumbnail)
  folder = thumbnail[:small].match(/(\d+)(?!.*\d)/)
  to_delete = []
  S3_BUCKET.objects.with_prefix("images/#{folder}").each do |thumb|
    to_delete << thumb.key
  end
  # binding.pry
  S3_BUCKET.objects.delete(to_delete)
end
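Both approaches above recover the key or folder from the thumbnail URL with a regex. A possible alternative, sketched here with only the standard library's URI, is to treat the URL path as the object key. The helper names are hypothetical, and the sketch assumes virtual-hosted-style URLs like the two examples in the question, where the bucket is in the host name and everything after the first slash is the key.

```ruby
require "uri"

# Hypothetical helper: derive the S3 object key from a thumbnail URL by
# dropping the leading slash from the URL path.
def s3_key_from_url(url)
  URI.parse(url).path.delete_prefix("/")
end

# Hypothetical helper: the "images/<id>/" prefix shared by both thumbnails.
def thumbnail_prefix(url)
  File.dirname(s3_key_from_url(url)) + "/"
end

url = "http://projectname-staging.s3.amazonaws.com/images/994/small_thumb.png"
puts s3_key_from_url(url)
puts thumbnail_prefix(url)
```

Ending the prefix with a slash keeps a prefix listing for images/994/ from also matching a folder such as images/9940/.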

Log file ruby testing

I have this Ruby script and I have a problem on line 57.
Please let me know where the error is.
#!/usr/bin/ruby
class CommonLog
  # init method takes log filename, computes frequency counts
  # for ips, urls & statuses
  def initialize(log)
    f = File.open(log)
    @filename = log
    @filesize = f.size
    @ip_counts = Hash.new(0)
    @url_counts = Hash.new(0)
    @status_counts = Hash.new(0)
    @total_records = 0
    f.readlines.each do |line|
      tokens = line.split(' ')
      ip = tokens[0]
      url = tokens[-4]
      status = tokens[-2]
      @ip_counts[ip] += 1
      @url_counts[url] += 1
      @status_counts[status] += 1
      @total_records += 1
    end
    f.close
  end
  # displays filename and bytesize
  def file_info
    "Filename: #{@filename}, Bytes Transferred: #{@filesize}"
  end
  # draws ip histogram
  def ip_hist
    @ip_counts.each do |ip, freq|
      puts "#{ip}: #{'*'*freq}"
    end
    puts file_info
  end
  # draws url histogram
  def url_hist
    @url_counts.each do |url,freq|
      puts "#{url} #{'*'*freq}"
    end
    puts file_info
  end
  # draws list of statuses
  def status_list
    msg = ""
    sorted_status_codes = @status_counts.keys.sort
    sorted_status_codes do |code|
      count = @status_counts[code]
      pct = ((count.to_f/@total_records)*100).to_i
      msg += "#{code}: #{percentage}%\n"
    end
    puts msg
    puts file_info
  end
end
def test
  log = CommonLog.new('**TEST LOG FILE PATH**')
  log.ip_histogram
  log.url_histogram
  log.status_codes
end
test()
Thank you in advance for your help; it would be much appreciated.
The log file is on the server, so I removed the path because you won't be able to reach it.
Here is the message I get back after I run the file:
/1.rb:57:in `status_list': undefined method `keys' for nil:NilClass (NoMethodError)
from ./1.rb:73:in `test'
from ./1.rb:75:in `<main>'
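A few things in the posted script look like candidates. Ruby instance variables read as nil when they were never assigned, so "undefined method `keys' for nil" suggests that the variable name used in status_list does not exactly match the one set in initialize. Separately, the loop in status_list has no .each, it interpolates percentage where the computed value is pct, and test calls ip_histogram, url_histogram and status_codes while the class defines ip_hist, url_hist and status_list. Below is a minimal sketch of the status tally with those points addressed; the standalone method name and the sample lines are made up, and it is not a drop-in replacement for the class.

```ruby
# Hypothetical stand-in for CommonLog#status_list: tally status codes from
# log lines and return "code: pct%" strings, sorted by status code.
def status_percentages(lines)
  counts = Hash.new(0)
  lines.each do |line|
    tokens = line.split(' ')
    counts[tokens[-2]] += 1 # same field position the script uses for status
  end
  total = lines.size
  counts.keys.sort.map do |code|
    pct = ((counts[code].to_f / total) * 100).to_i
    "#{code}: #{pct}%"
  end
end

# Made-up sample lines in common log format.
sample = [
  '127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /a.html HTTP/1.0" 200 2326',
  '127.0.0.1 - - [10/Oct/2000:13:55:37 -0700] "GET /b.html HTTP/1.0" 200 512',
  '127.0.0.1 - - [10/Oct/2000:13:55:38 -0700] "GET /c.html HTTP/1.0" 404 128',
  '127.0.0.2 - - [10/Oct/2000:13:55:39 -0700] "GET /a.html HTTP/1.0" 200 2326'
]
puts status_percentages(sample)
```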

Gem Resque Error - Undefined "method perform" after overriding it from the superclass

First of all, thanks to you all for helping programmers like me with your valuable input in solving day-to-day issues.
This is my first question on Stack Overflow; I have been experiencing this problem for almost a week.
We are building a crawler which crawls specific websites and extracts content from them, and we are using mechanize to achieve this. As it was taking a lot of time, we decided to run the crawling process as a background task using resque with the redis gem, but while sending the process to the background I get the error in the title.
My code in lib/parsers/home.rb:
require 'resque'
require File.dirname(__FILE__)+"/../index"
class Home < Index
  Resque.enqueue(Index , :page )
  def self.perform(page)
    super (page)
    search_form = page.form_with :name=>"frmAgent"
    resuts_page = search_form.submit
    total_entries = resuts_page.parser.xpath('//*[@id="PagingTable"]/tr[2]/td[2]').text
    if total_entries =~ /(\d+)\s*$/
      total_entries = $1
    else
      total_entries = "unknown"
    end
    start_res_idx = 1
    while true
      puts "Found #{total_entries} entries"
      detail_links = resuts_page.parser.xpath('//*[@id="MainTable"]/tr/td/a')
      detail_links.each do |d_link|
        if d_link.attribute("class")
          next
        else
          data_page = @agent.get d_link.attribute("href")
          fields = get_fields_from_page data_page
          save_result_page page.uri.to_s, fields
          #break
        end
      end
      site_done
  rescue Exception => e
    puts "error: #{e}"
  end
end
And the superclass in lib/index.rb is:
require 'resque'
require 'mechanize'
require 'mechanize/form'
class Index
  @queue = :Index_queue
  def initialize(site)
    @site = site
    @agent = Mechanize.new
    @agent.user_agent = Mechanize::AGENT_ALIASES['Windows Mozilla']
    @agent.follow_meta_refresh = true
    @rows_parsed = 0
    @rows_total = 0
  rescue Exception => e
    log "Unable to login: #{e.message}"
  end
  def run
    log "Parsing..."
    url = "unknown"
    if @site.url
      url = @site.url
      log "Opening #{url} as a data page"
      @page = @agent.get(url)
      #perform method should be override in subclasses
      @data = self.perform(@page)
    else
      #some sites do not have "datapage" URL
      #for example after login you're already on your very own datapage
      #this is to be addressed in 'perform' method of subclass
      @data = self.perform(nil)
    end
  rescue Exception=>e
    puts "Failed to parse URL '#{url}', exception=>"+e.message
    set_site_status("error "+e.message)
  end
  #overriding method
  def self.perform(page)
  end
  def save_result_page(url, result_params)
    result = Result.find_by_sql(["select * from results where site_id = ? AND ref_code = ?", @site.id, utf8(result_params[:ref_code])]).first
    if result.nil?
      result_params[:site_id] = @site.id
      result_params[:time_crawled] = DateTime.now().strftime "%Y-%m-%d %H:%M:%S"
      result_params[:link] = url
      result = Result.create result_params
    else
      result.result_fields.each do |f|
        f.delete
      end
      result.link = url
      result.time_crawled = DateTime.now().strftime "%Y-%m-%d %H:%M:%S"
      result.html = result_params[:html]
      fields = []
      result_params[:result_fields_attributes].each do |f|
        fields.push ResultField.new(f)
      end
      result.result_fields = fields
      result.save
    end
    @rows_parsed +=1
    msg = "Saved #{@rows_parsed}"
    msg +=" of #{@rows_total}" if @rows_total.to_i > 0
    log msg
    return result
  end
end
What's wrong with this code?
Thanks
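One thing that stands out in the code above, and may be a source of the error: run is an instance method and calls self.perform(...), while perform is defined with def self.perform, which puts it on the class rather than on instances. Resque itself calls perform on the job class, so the class-level definition matches what it expects. A minimal plain-Ruby sketch of the difference follows; BaseJob and HomeJob are hypothetical stand-ins for Index and Home, and no Resque is involved.

```ruby
# Hypothetical classes mirroring the Index/Home layout: perform is defined
# on the class (def self.perform).
class BaseJob
  def self.perform(page)
    "base handled #{page}"
  end

  # Calling self.perform here looks for an *instance* method and fails.
  def run_broken(page)
    self.perform(page)
  rescue NoMethodError => e
    "NoMethodError: #{e.name}"
  end

  # Going through self.class reaches the class-level method instead.
  def run_fixed(page)
    self.class.perform(page)
  end
end

class HomeJob < BaseJob
  # super works in class methods too and calls BaseJob.perform.
  def self.perform(page)
    super(page) + ", then home handled #{page}"
  end
end

job = HomeJob.new
puts job.run_broken("page-1")
puts job.run_fixed("page-1")
```

The same distinction applies to instance variables: inside a class method, @agent refers to a variable on the class object, not to the one set in initialize.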

Contextual Logging with Log4r

Here's how some of my existing logging code with Log4r is working. As you can see in the WorkerX::a_method, any time that I log a message I want the class name and the calling method to be included (I don't want all the caller history or any other noise, which was my purpose behind LgrHelper).
class WorkerX
  include LgrHelper
  def initialize(args = {})
    @logger = Lgr.new({:debug => args[:debug], :logger_type => 'WorkerX'})
  end
  def a_method
    error_msg("some error went down here")
    # This prints out: "WorkerX::a_method - some error went down here"
  end
end
class Lgr
  require 'log4r'
  include Log4r
  def initialize(args = {}) # args: debug boolean, logger type
    @debug = args[:debug]
    @logger_type = args[:logger_type]
    @logger = Log4r::Logger.new(@logger_type)
    format = Log4r::PatternFormatter.new(:pattern => "%l:\t%d - %m")
    outputter = Log4r::StdoutOutputter.new('console', :formatter => format)
    @logger.outputters = outputter
    if @debug then
      @logger.level = DEBUG
    else
      @logger.level = INFO
    end
  end
  def debug(msg)
    @logger.debug(msg)
  end
  def info(msg)
    @logger.info(msg)
  end
  def warn(msg)
    @logger.warn(msg)
  end
  def error(msg)
    @logger.error(msg)
  end
  def level
    @logger.level
  end
end
module LgrHelper
  # This module should only be included in a class that has a @logger instance variable, obviously.
  protected
  def info_msg(msg)
    @logger.info(log_intro_msg(self.method_caller_name) + msg)
  end
  def debug_msg(msg)
    @logger.debug(log_intro_msg(self.method_caller_name) + msg)
  end
  def warn_msg(msg)
    @logger.warn(log_intro_msg(self.method_caller_name) + msg)
  end
  def error_msg(msg)
    @logger.error(log_intro_msg(self.method_caller_name) + msg)
  end
  def log_intro_msg(method)
    msg = class_name
    msg += '::'
    msg += method
    msg += ' - '
    msg
  end
  def class_name
    self.class.name
  end
  def method_caller_name
    if /`(.*)'/.match(caller[1]) then # caller.first
      $1
    else
      nil
    end
  end
end
I really don't like this approach. I'd rather just use the existing @logger instance variable to print the message and have it be smart enough to know the context. How can this, or a similar simpler approach, be done?
My environment is Rails 2.3.11 (for now!).
After posting my answer using extend (see "EDIT", below), I thought I'd try using set_trace_func to keep a sort of stack trace like in the discussion I posted to. Here is my final solution; the set_trace_func call would be put in an initializer or similar.
#!/usr/bin/env ruby
# Keep track of the classes that invoke each "call" event
# and the method they called as an array of arrays.
# The array is in the format: [calling_class, called_method]
set_trace_func proc { |event, file, line, id, bind, klass|
  if event == "call"
    Thread.current[:callstack] ||= []
    Thread.current[:callstack].push [klass, id]
  elsif event == "return"
    Thread.current[:callstack].pop
  end
}
class Lgr
  require 'log4r'
  include Log4r
  def initialize(args = {}) # args: debug boolean, logger type
    @debug = args[:debug]
    @logger_type = args[:logger_type]
    @logger = Log4r::Logger.new(@logger_type)
    format = Log4r::PatternFormatter.new(:pattern => "%l:\t%d - %m")
    outputter = Log4r::StdoutOutputter.new('console', :formatter => format)
    @logger.outputters = outputter
    if @debug then
      @logger.level = DEBUG
    else
      @logger.level = INFO
    end
  end
  def debug(msg)
    @logger.debug(msg)
  end
  def info(msg)
    @logger.info(msg)
  end
  def warn(msg)
    @logger.warn(msg)
  end
  def error(msg)
    @logger.error(msg)
  end
  def level
    @logger.level
  end
  def invoker
    Thread.current[:callstack] ||= []
    ( Thread.current[:callstack][-2] || ['Kernel', 'main'] )
  end
end
class CallingMethodLogger < Lgr
  [:info, :debug, :warn, :error].each do |meth|
    define_method(meth) { |msg| super("#{invoker[0]}::#{invoker[1]} - #{msg}") }
  end
end
class WorkerX
  def initialize(args = {})
    @logger = CallingMethodLogger.new({:debug => args[:debug], :logger_type => 'WorkerX'})
  end
  def a_method
    @logger.error("some error went down here")
    # This prints out: "WorkerX::a_method - some error went down here"
  end
end
w = WorkerX.new
w.a_method
I don't know how much, if any, the calls to the proc will affect the performance of an application; if it ends up being a concern, perhaps something not as intelligent about the calling class (like my old answer, below) will work better.
[EDIT: What follows is my old answer, referenced above.]
How about using extend? Here's a quick-and-dirty script I put together from your code to test it out; I had to reorder things to avoid errors, but the code is the same with the exception of LgrHelper (which I renamed CallingMethodLogger) and the second line of WorkerX's initializer:
#!/usr/bin/env ruby
module CallingMethodLogger
  def info(msg)
    super("#{@logger_type}::#{method_caller_name} - " + msg)
  end
  def debug(msg)
    super("#{@logger_type}::#{method_caller_name} - " + msg)
  end
  def warn(msg)
    super("#{@logger_type}::#{method_caller_name} - " + msg)
  end
  def error(msg)
    super("#{@logger_type}::#{method_caller_name} - " + msg)
  end
  def method_caller_name
    if /`(.*)'/.match(caller[1]) then # caller.first
      $1
    else
      nil
    end
  end
end
class Lgr
  require 'log4r'
  include Log4r
  def initialize(args = {}) # args: debug boolean, logger type
    @debug = args[:debug]
    @logger_type = args[:logger_type]
    @logger = Log4r::Logger.new(@logger_type)
    format = Log4r::PatternFormatter.new(:pattern => "%l:\t%d - %m")
    outputter = Log4r::StdoutOutputter.new('console', :formatter => format)
    @logger.outputters = outputter
    if @debug then
      @logger.level = DEBUG
    else
      @logger.level = INFO
    end
  end
  def debug(msg)
    @logger.debug(msg)
  end
  def info(msg)
    @logger.info(msg)
  end
  def warn(msg)
    @logger.warn(msg)
  end
  def error(msg)
    @logger.error(msg)
  end
  def level
    @logger.level
  end
end
class WorkerX
  def initialize(args = {})
    @logger = Lgr.new({:debug => args[:debug], :logger_type => 'WorkerX'})
    @logger.extend CallingMethodLogger
  end
  def a_method
    @logger.error("some error went down here")
    # This prints out: "WorkerX::a_method - some error went down here"
  end
end
w = WorkerX.new
w.a_method
The output is:
ERROR: 2011-07-24 20:01:40 - WorkerX::a_method - some error went down here
The downside is that, with this method, the caller's class name isn't figured out automatically; it's set explicitly from the @logger_type passed into the Lgr instance. However, you may be able to use another method to get the actual name of the class, perhaps something like the call_stack gem or Kernel#set_trace_func; see this thread.
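On newer Rubies (2.0 and later, so not the Rails 2.3.11 environment in the question), caller_locations gives the calling method's name without parsing backtrace strings. The sketch below is only an illustration of that idea: it uses the standard library's Logger instead of Log4r, the class names are hypothetical, and it takes the class name from an object the caller passes in rather than discovering it automatically.

```ruby
require "logger"
require "stringio"

# Hypothetical wrapper: prefixes each message with the caller's class and
# method name. The caller passes itself so the class is not hard-coded.
class ContextLogger
  def initialize(io)
    @logger = Logger.new(io)
    @logger.formatter = proc { |severity, _time, _prog, msg| "#{severity}: #{msg}\n" }
  end

  def error(owner, msg)
    # caller_locations(1, 1) is the frame that called this method.
    method_name = caller_locations(1, 1).first.base_label
    @logger.error("#{owner.class.name}::#{method_name} - #{msg}")
  end
end

class WorkerDemo
  def initialize(io)
    @logger = ContextLogger.new(io)
  end

  def a_method
    @logger.error(self, "some error went down here")
  end
end

out = StringIO.new
WorkerDemo.new(out).a_method
puts out.string
```

Passing self is a deliberate simplification here; it avoids the per-call tracing overhead of set_trace_func at the cost of one extra argument.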

Resources