I'm developing a web scraper. I wrote some code and don't understand why the loop doesn't work. Can anyone help me with that?
scraper_service.rb:
browser = Watir::Browser.new
browser.goto('some_link_here')
browser.is(class: /event--head-block/).each do |event|
event.is(class: /event--more/).button.click
puts "Hello world"
binding.pry
end
When I executed the code, I didn't see 'Hello world' in the console. In addition, to check whether an element with the class 'event--head-block' is present on the page, I ran browser.element(class: /event--head-block/).exists? and it returned true.
Update
I forgot to say that there are 8-10 elements with the same class name, 'event--head-block'. Could that be the reason?
I inherited a rails app that is deployed using Heroku (I think). I edit it on AWS's Cloud9 IDE and, for now, just do everything in development mode. The app's purpose is to process large amounts of survey data and spit it out onto a PDF report. This works for small reports with like 10 rows of data, but when I load a report that is querying a data upload of 5000+ rows to create an HTML page which gets converted to a PDF, it takes around 105 seconds, much longer than Heroku's 30 seconds allotted for HTTP requests.
Heroku says this on their website, which gave me some hope:
"Heroku supports HTTP 1.1 features such as long-polling and streaming responses. An application has an initial 30 second window to respond with a single byte back to the client. However, each byte transmitted thereafter (either received from the client or sent by your application) resets a rolling 55 second window. If no data is sent during the 55 second window, the connection will be terminated." (Source: https://devcenter.heroku.com/articles/request-timeout#long-polling-and-streaming-responses)
This sounds excellent to me - I can just send a request to the client every second or so in a loop until we're done creating the large PDF report. However, I don't know how to send or receive a byte or so to "reset the rolling 55 second window" they're talking about.
Here's the part of my controller that is sending the request.
return render pdf: pdf_name + " " + pdf_year.to_s,
disposition: 'attachment',
page_height: 1300,
encoding: 'utf8',
page_size: 'A4',
footer: {html: {template: 'recent_grad/footer.html.erb'}, spacing: 0 },
margin: { top: 10, # default 10 (mm)
bottom: 20,
left: 10,
right: 10 },
template: "recent_grad/report.html.erb",
locals: {start: @start, survey: @survey, years: @years, college: @college, department: @department, program: @program, emphasis: @emphasis, questions: @questions}
I'm making other requests to get to this point, but I believe the part that is causing the issue is here where the template is being rendered. My template queries the database in a finite loop that stops when it runs out of survey questions to query from.
My question is this: how can I "send or receive a byte to the client" to tell Heroku "I'm still trying to create this massive PDF so please reset the timer and give me my 55 seconds!" Is it in the form of a query? Because, if so, I am querying the MySql database over and over again in my report.html.erb file.
Also, it used to work without issues and does work on small reports, but now I get the error "504 Gateway Timeout" before the request is complete on the actual page, but my puma console continues to query the database like a mad man. I assume it's a Heroku problem because the 504 error happens exactly every 35 seconds (5 seconds to process the other parts and 30 seconds to try to finish the loop in the template so it can render correctly).
If you need more information or code, please ask! Thanks in advance
EDIT:
Both of the comments below suggest possible duplicates, but neither of them has a real answer with real code; they simply refer to the docs that I am quoting here. I'm looking for a code example (or at least a way to get my foot in the door), not just a link to the docs. Thanks!
EDIT 2:
I tried what @Sergio said and installed Sidekiq. I think I'm really close, but I'm still having some issues with the worker. The worker doesn't have access to ActionView::Base, which is required for the render method in Rails, so it's not working. I can reach the worker method, which means my Sidekiq and Redis servers are running correctly, but it gets caught on the ActionView line with this error:
WARN: NameError: uninitialized constant HardWorker::ActionView
Here's the worker code:
require 'sidekiq'
Sidekiq.configure_client do |config|
# config.redis = { db: 1 }
config.redis = { url: 'redis://172.31.6.51:6379/0' }
end
Sidekiq.configure_server do |config|
# config.redis = { db: 1 }
config.redis = { url: 'redis://172.31.6.51:6379/0' }
end
class HardWorker
include Sidekiq::Worker
def perform(pdf_name, pdf_year)
av = ActionView::Base.new()
av.view_paths = ActionController::Base.view_paths
av.class_eval do
include Rails.application.routes.url_helpers
include ApplicationHelper
end
puts "inside hardworker"
puts pdf_name, pdf_year
av.render pdf: pdf_name + " " + pdf_year.to_s,
disposition: 'attachment',
page_height: 1300,
encoding: 'utf8',
page_size: 'A4',
footer: {html: {template: 'recent_grad/footer.html.erb'}, spacing: 0 },
margin: { top: 10, # default 10 (mm)
bottom: 20,
left: 10,
right: 10 },
template: "recent_grad/report.html.erb",
locals: {start: @start, survey: @survey, years: @years, college: @college, department: @department, program: @program, emphasis: @emphasis, questions: @questions}
end
end
Any suggestions?
EDIT 3:
I did what @Sergio said and attempted to make a PDF from an html.erb file directly and save it to a file. Here's my code:
# /app/controllers/recentgrad_controller.rb
pdf = WickedPdf.new.pdf_from_html_file('home/ec2-user/environment/gradSurvey/gradSurvey/app/views/recent_grad/report.html.erb')
save_path = Rails.root.join('pdfs', pdf_name + pdf_year.to_s + '.pdf')
File.open(save_path, 'wb') do |file|
file << pdf
end
And the error output:
RuntimeError (Failed to execute:
["/usr/local/rvm/gems/ruby-2.4.1@gradSurvey/bin/wkhtmltopdf", "file:///home/ec2-user/environment/gradSurvey/gradSurvey/app/views/recent_grad/report.html.erb", "/tmp/wicked_pdf_generated_file20190523-15416-hvb3zg.pdf"]
Error: PDF could not be generated!
Command Error: Loading pages (1/6)
Error: Failed loading page file:///home/ec2-user/environment/gradSurvey/gradSurvey/app/views/recent_grad/report.html.erb (sometimes it will work just to ignore this error with --load-error-handling ignore)
Exit with code 1 due to network error: ContentNotFoundError
):
I have no idea what it means when it says "sometimes it will work just to ignore this error with --load-error-handling ignore". The file definitely exists and I've tried maybe 5 variations of the file path.
I've had to do something like this several times. In all cases, I ended up writing a background job that does all the heavy lifting generation. And because it's not a web request, it's not affected by the 30 seconds timeout. It goes something like this:
1. Client (your JavaScript code) requests a new report.
2. Server generates a job description and enqueues it for your worker to pick up.
3. Worker picks the job from the queue and starts working (querying the database, etc.).
4. In the meantime, the client periodically asks the server "is my report done yet?". Server responds with "not yet, try again later".
5. Worker finishes generating the report. It uploads the file to some storage (S3, for example), sets the job status to "completed" and the job result to the download link for the uploaded report file.
6. Server, seeing that the job is completed, can now respond to client status update requests with "yes, it's done now. Here's the url. Have a good day."
Everybody's happy. And nobody had to do any streaming or playing with heroku's rolling response timeouts.
The scenario above uses short-polling. I find it the easiest to implement. But it is, of course, a bit wasteful with regard to resources. You can use long-polling or websockets or other fancy things.
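To make the flow above concrete, here is a minimal plain-Ruby sketch of it. Everything in it is an assumption for illustration: the ReportJobs class, the wait_for_report helper, and the example.invalid URL are hypothetical, the in-memory hash stands in for a real job table or Redis, and the thread stands in for a real worker such as Sidekiq.

```ruby
require 'securerandom'

# Hypothetical in-memory job store; a real app would keep this in a
# database table or Redis so web and worker processes can both see it.
class ReportJobs
  def initialize
    @jobs = {}
    @lock = Mutex.new
  end

  # Server side: register the job and hand it to a background thread,
  # which stands in for a real worker process here.
  def enqueue(rows)
    job_id = SecureRandom.hex(8)
    @lock.synchronize { @jobs[job_id] = { status: 'pending', url: nil } }
    Thread.new { run(job_id, rows) }
    job_id
  end

  # Server side: what an "is my report done yet?" endpoint would return.
  def status(job_id)
    @lock.synchronize { @jobs.fetch(job_id).dup }
  end

  private

  # Worker side: do the slow work, then publish the result.
  def run(job_id, rows)
    rows.each { |row| row.to_s } # placeholder for the expensive generation step
    url = "https://example.invalid/reports/#{job_id}.pdf" # hypothetical download link
    @lock.synchronize { @jobs[job_id] = { status: 'completed', url: url } }
  end
end

# Client side: short-polling loop that gives up after max_polls attempts.
def wait_for_report(jobs, job_id, interval: 0.05, max_polls: 100)
  max_polls.times do
    current = jobs.status(job_id)
    return current[:url] if current[:status] == 'completed'
    sleep interval
  end
  nil
end
```

In a real app the client loop would be JavaScript hitting a status endpoint, and each status check is a quick request that finishes well inside the 30 seconds.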
Check my response here in case it works for you. I didn't want to change the user workflow by adding a background job and then a place or notification to get the result.
I use Rails controller streaming support with the Live module and set the right response headers. I fetch the data from an Enumerable object.
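This is not the ActionController::Live code itself, just a plain Rack-style sketch of the same idea: the response body is an Enumerable that produces one chunk at a time, so bytes can go out while the rest is still being built. The names streaming_body and report_app, the CSV layout, and the headers are assumptions for illustration. Whether each chunk really reaches the client immediately depends on the server and any buffering middleware.

```ruby
# Builds a lazily evaluated body: each row is formatted only when the
# consumer asks for the next chunk.
def streaming_body(rows)
  Enumerator.new do |out|
    out << "id,value\n"
    rows.each { |id, value| out << "#{id},#{value}\n" }
  end
end

# Hypothetical Rack-style app: returns status, headers, and an each-able body.
def report_app(rows)
  lambda do |_env|
    headers = { 'content-type' => 'text/csv', 'cache-control' => 'no-cache' }
    [200, headers, streaming_body(rows)]
  end
end
```

Calling report_app(rows).call({}) returns the usual three-element Rack response, and the server iterates the body with each, writing chunks as they are produced.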
I have a select2 v4 that loads options through AJAX.
I am running a Cucumber test where I need to select 2 options from the list, but I can't seem to make the list open up and load (it normally gets populated when I type 2 or more characters).
I have tried:
As suggested here:
@session.execute_script("$('#publish_to').select2('open')")
and
@session.first(".input.publish_to .select2-container").click
and
@session.first("#publish_to").find(".select2-choice").click
which do not give me an error, but I am not getting the options to select, so I am assuming that the click is not really working. Things I have tried to select the options:
# This one cannot find the css:
@session.find(".select2-results__options", text: client.email).click
# This one gives me a Timeout error
@session.evaluate_script "$('#publish_to').val(#{client.id}).trigger('change')"
# This one gives me a Timeout error
@session.evaluate_script "$('.select2-search__field').trigger('keydown').val('#{client.email}').trigger('keyup')";
sleep 10
@session.find('.select2-search__option', text: client.email).click
Anything with trigger gives me a Timeout error, so I tried waiting for jQuery.active but I never got a true even waiting for 2 minutes:
counter = 0
timeout_in_sec = 120
while counter < timeout_in_sec && @session.evaluate_script('jQuery.active').zero?
sleep 1.second
counter+=1
end
I tried using the gem capybara-select2 running:
@session.select2 client.email, css: '#publish_to', search: true
but I get an undefined method `select2' error, even though I have World(CapybaraSelect2) in my env.rb
I am using Cucumber v3.1.2 with ruby gem 'cucumber-rails'
The poltergeist driver is roughly equivalent to a 7 year old version of Safari which means it doesn't support a lot of current JS/CSS. This means your issue could simply be that select2 is no longer compatible with Poltergeist (without a lot of polyfilling). You're going to be much better off updating to using a real browser (stable - chrome via selenium, etc) or one of the direct to Chrome drivers (highly beta) that have spun off Poltergeist (Apparition is one of them). Those will allow you to run with a visible browser (useful for debugging) or headless.
The following code uses Chrome via selenium and interacts with the select2 demo site to select an entry that is loaded via Ajax.
require "selenium/webdriver"
require "capybara/dsl"
sess = Capybara::Session.new(:selenium_chrome)
sess.visit("https://select2.org/data-sources/ajax")
sess.first('.select2-container', minimum: 1).click
sess.find('.select2-dropdown input.select2-search__field').send_keys("capy")
sleep 5 # just to watch the browser search
sess.find('.select2-results__option', text: 'teamcapybara/capybara').click
sess.assert_selector(:css, '.select2-selection__rendered', text: 'teamcapybara/capybara')
sleep 5 # just to see the effect
I wrote a ruby script to download an image URL:
require 'open-uri'
imageAddress = ARGV[0]
targetPath = ARGV[1]
fullFileNamePath = "#{targetPath}test.jpg"
begin
File.open(fullFileNamePath, 'wb') do |fo|
fo.write open(imageAddress).read
end
rescue OpenURI::HTTPError => ex
puts ex
File.delete(fullFileNamePath)
end
Example Usage:
ruby download_image.rb "https://images.genius.com/b015b15e476c92d10a834d523575d3c9.1000x1000x1.jpg" "/Users/Me/Downloads/"
The problem is, sometimes I run across this output error:
520 Origin Error
Then, when I try the same URL in my browser, I get something like this:
If I reload the page or click the 'Retry for a live version' button in the above image, the page loads.
Then if I run the script again it downloads the image just fine.
So how can I replicate this page reload / 'Retry for a live version' behavior using ruby and without switching to my browser? Running the script again doesn't do the job.
It sounds like you are looking for a delay command. If the script fails (or encounters '520 Origin Error'), wait and retry.
This is a quickly built recursive function; you may want to add a check on how many times you have looped and break after so many. (Also not tested, may contain errors, meant as an example.)
def getFile(imageAddress, fullFileNamePath)
  begin
    File.open(fullFileNamePath, 'wb') do |fo|
      fo.write open(imageAddress).read
    end
  rescue OpenURI::HTTPError => ex
    puts ex
    File.delete(fullFileNamePath)
    # ex is an exception object, so compare its message rather than ex itself
    if ex.message.start_with?('520')
      sleep(30) # generally a good time to pause
      getFile(imageAddress, fullFileNamePath)
    end
  end
end
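As a sketch of the "break after so many" check mentioned above, the retry can also be written as a loop with an attempt limit instead of unbounded recursion. The helper name with_520_retries and its parameters are hypothetical, and the sleeper argument exists only so the delay can be swapped out when trying the helper without waiting.

```ruby
require 'open-uri'

# Hypothetical helper: runs the block and retries it when it raises an
# OpenURI::HTTPError whose message starts with "520", up to max_attempts
# attempts in total. Any other error, or the final failure, is re-raised.
def with_520_retries(max_attempts: 3, delay: 30, sleeper: ->(seconds) { sleep(seconds) })
  attempts = 0
  begin
    attempts += 1
    yield
  rescue OpenURI::HTTPError => ex
    raise unless ex.message.start_with?('520') && attempts < max_attempts
    sleeper.call(delay) # wait before trying the same request again
    retry
  end
end
```

The download itself would go inside the block, for example the File.open ... fo.write call from the question, so the file is rewritten on each attempt.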
Using Rails 3.1.1 and Heroku
I have 1,000 products in my app. They all have a very slow controller, which is effectively solved by fragment caching. Although the data doesn't change very often, it still needs to expire periodically (which I do by sweeping), in my case once a week.
Now, after sweeping the cached views I don't want my users to create the new fragments by trying to access the products one after another (takes about 6-8 secs at the first load, 2-3 sec for the cached load). I assume I can do that with some sort of script that will load each Product Page one by one and thus make the server create those fragments.
I can imagine this can be handled in three ways:
1. Run a script on my local machine that will try to access each URL with some sort of get-command. Downside: not very pretty, and it will affect visitor stats in a way I would not prefer.
2. Run the same type of script on the server after the sweeper, so that it loads each Product. How can I do that, in that case?
3. Use a smart Rails command to do this automatically. Is there such an elegant command?
I made this script and it works. The "product.slug" is because I have friendly_id installed. It will produce url-variables with names such as www.mydomain.com/productabc-123/ which will be read by Nokogiri (Nokogiri gem is needed for this solution).
PLEASE NOTE THAT I SWITCHED FROM FRAGMENT CACHING TO ACTION CACHING IN THIS SOLUTION (as opposed to the question, where I am using fragment caching). The important difference for this is when I check the cache if Rails.cache.exist?('views/www.mydomain.com/' + product.slug). For fragment_caching it should be the fragment name there instead.
require 'nokogiri'
require 'open-uri'
Product.all.each do |product|
  url = 'http://www.mydomain.com/' + product.slug
  begin
    if Rails.cache.exist?('views/www.mydomain.com/' + product.slug)
      puts url + " is already in cache"
    else
      doc = Nokogiri::HTML(open(url))
      puts "Reads " + url
      # Verifies if the caching worked. Only for troubleshooting
      if Rails.cache.exist?('views/www.mydomain.com/' + product.slug)
        puts "--->" + url + " is NOW in the cache"
      else
        puts "--->" + url + " is still not in the cache!"
      end
      sleep 1
    end
  # Timeout::Error must come first: a bare rescue above it would catch it too
  rescue Timeout::Error
    puts 'Timeout rescue of ' + url
    puts 'Sleep for 5 sec'
    sleep 5
    retry
  rescue
    puts 'Normal rescue of ' + url
  end
end
Create a script that runs as rake task, or better yet a worker, that runs and curls the page. There is no need to include a gem when you can just call curl
`curl -A "CacheRefresher" #{ENV['HOSTNAME']}/api/v1/#{klass.name.underscore.pluralize}/#{id} >/dev/null 2>&1`
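If you would rather stay in Ruby than shell out, a rough equivalent using only the standard library could look like the sketch below. The function names warm_up_urls and warm_cache are hypothetical. The "CacheRefresher" User-Agent is borrowed from the curl call above so the warm-up traffic can be filtered out of visitor stats.

```ruby
require 'net/http'
require 'uri'

# Builds the list of URLs to warm from a base URL and a list of slugs.
def warm_up_urls(base_url, slugs)
  slugs.map { |slug| URI.join(base_url, slug).to_s }
end

# Requests each URL with an identifiable User-Agent and returns a hash
# of url => HTTP status code (as a string, e.g. "200").
def warm_cache(urls, user_agent: 'CacheRefresher')
  urls.each_with_object({}) do |url, results|
    uri = URI(url)
    request = Net::HTTP::Get.new(uri)
    request['User-Agent'] = user_agent
    response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
      http.request(request)
    end
    results[url] = response.code
  end
end
```

From a rake task or worker this might be called as warm_cache(warm_up_urls('http://www.mydomain.com/', Product.all.map(&:slug))), assuming the slugs map directly onto page paths as in the script above.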