I build a website-crawler that (later on) uses these links to read out information.
The current rake-task goes through all the possible pages one by one and checks if the requests goes trough (valid response) or returns a 404/503 error (invalid page). If it's valid the pages url gets saved into my database.
Now as you can see the task requests 50,000 pages in total thus requires some time.
I have read about Sidekiq and how it can perform these tasks asynchronously thus making this a lot faster.
My question: As you can see my task builds the counter after each loop. This will not work with Sidekiq I guess as it will only perform this script independent of itself various times, am I right?
How would I go around the problem of each instance needing its own counter then?
Hopefully my question makes sense - Thank you very much!
desc "Validate Pages"
task validate_url: :environment do
require 'rubygems'
require 'open-uri'
require 'nokogiri'
counter = 1
base_url = "http://example.net/file"
until counter > 50000 do
begin
url = "#{base_url}_#{counter}/"
open(url)
page = Page.new
page.url = url
page.save!
puts "Saved #{url} !"
counter += 1
rescue OpenURI::HTTPError => ex
logger ||= Logger.new("validations.log")
if ex.io.status[0] == "503"
logger.info "#{ex} # #{counter}"
end
puts "#{ex} # #{counter}"
counter += 1
rescue SocketError => ex
logger ||= Logger.new("validations.log")
logger.info "#{ex} # #{counter}"
puts "#{ex} # #{counter}"
counter += 1
end
end
end
A simple Redis INCR operation will create and/or increment a global counter for your jobs to use. You can use Sidekiq's redis connection to implement a counter trivially:
Sidekiq.redis do |conn|
conn.incr("my-counter")
end
If you want to use it async - that means you will have many instances of same job. The fastest approach - to use something like redis. This will give you simple and fast way to check\update counter for your needs. But also make sure you took care about counter: If one of your jobs using it, lock it for other jobs, so there wont be wrong results, etc
Related
I wrote a ruby script to download an image URL:
require 'open-uri'
imageAddress = ARGV[0]
targetPath = ARGV[1]
fullFileNamePath = "#{targetPath}test.jpg"
begin
File.open(fullFileNamePath, 'wb') do |fo|
fo.write open(imageAddress).read
end
rescue OpenURI::HTTPError => ex
puts ex
File.delete(fullFileNamePath)
end
Example Usage:
ruby download_image.rb "https://images.genius.com/b015b15e476c92d10a834d523575d3c9.1000x1000x1.jpg" "/Users/Me/Downloads/"
The problem is, sometimes I run across this output error:
520 Origin Error
Then, when I try the same URL in my browser, I get something like this:
If I reload the page or click the 'Retry for a live version' button in the above image, the page loads.
Then if I run the script again it downloads the image just fine.
So how can I replicate this page reload / 'Retry for a live version' behavior using ruby and without switching to my browser? Running the script again doesn't do the job.
It sounds like you are looking for a delay command. If the script fails (or encounters '520 Origin Error') wait and re-try.
This is a quick built recursive function, you may want to add other checks for how many times you have looped, breaking after so many. (Also not tested, may contain errors, meant as an example)
def getFile(params_you_need)
begin
File.open(fullFileNamePath, 'wb') do |fo|
fo.write open(imageAddress).read
end
rescue OpenURI::HTTPError => ex
puts ex
File.delete(fullFileNamePath)
if ex == '520 Origin Error'
sleep(30) #generally a good time to pause
getFile(params_you_need)
end
end
end
I have a rails application running with puma server. Is there any way, we can see how many number of threads used in application currently ?
I was wondering about the same thing a while ago and came upon this issue. The author included the code they ended up using to collect those stats:
module PumaThreadLogger
def initialize *args
ret = super *args
Thread.new do
while true
# Every X seconds, write out what the state of this dyno is in a format that Librato understands.
sleep 5
thread_count = 0
backlog = 0
waiting = 0
# I don't do the logging or string stuff inside of the mutex. I want to get out of there as fast as possible
#mutex.synchronize {
thread_count = #workers.size
backlog = #todo.size
waiting = #waiting
}
# For some reason, even a single Puma server (not clustered) has two booted ThreadPools.
# One of them is empty, and the other is actually doing work
# The check above ignores the empty one
if (thread_count > 0)
# It might be cool if we knew the Puma worker index for this worker, but that didn't look easy to me.
# The good news: By using the PID we can differentiate two different workers on two different dynos with the same name
# (which might happen if one is shutting down and the other is starting)
source_name = "#{Process.pid}"
# If we have a dyno name, prepend it to the source to make it easier to group in the log output
dyno_name = ENV['DYNO']
if (dyno_name)
source_name="#{dyno_name}."+source_name
end
msg = "source=#{source_name} "
msg += "sample#puma.backlog=#{backlog} sample#puma.active_connections=#{thread_count - waiting} sample#puma.total_threads=#{thread_count}"
Rails.logger.info msg
end
end
end
ret
end
end
module Puma
class ThreadPool
prepend PumaThreadLogger
end
end
This code contains logic that is specific to heroku, but the core of collecting the #workers.size and logging it will work in any environment.
Using Rails 3.1.1 and Herkou
I have 1.000 products in my app. They all have a very slow controller which is effectively solved by fragment caching. Although the data doesn't change very often, it still needs to expire (which I do by sweeping) periodically, in my case once a week.
Now, after sweeping the cached views I don't want my users to create the new fragments by trying to access the products one after another (takes about 6-8 secs at the first load, 2-3 sec for the cached load). I assume I can do that with some sort of script that will load each Product Page one by one and thus make the server create those fragments.
I can imagine this can be handled in three ways:
Run a script on my local machine that will try to access each url with some sort of get-command - Downside: Not very pretty and will affect visitor stats in a way I would not prefer.
Run the same type script on the server after the sweeper, that will load each Product. How can I do that, in that case?
Using a smart Rails command to do this automatically. Is there such an elegant command?
I made this script and it works. The "product.slug" is because I have friendly_id installed. It will produce url-variables with names such as www.mydomain.com/productabc-123/ which will be read by Nokogiri (Nokogiri gem is needed for this solution).
PLEASE NOTE THAT I SWITCHED FROM FRAGMENT CACHING TO ACTION CACHING IN THIS SOLUTION (as opposed to the question, where I am using fragment caching). The important difference for this is when I check the cache if Rails.cache.exist?('views/www.mydomain.com/' + product.slug). For fragment_caching it should be the fragment name there instead.
require 'nokogiri'
require 'open-uri'
Product.all.each do |product|
url = 'http://www.mydomain.com/' + product.slug
begin
if Rails.cache.exist?('views/www.mydomain.com/' + product.slug)
puts url + " is already in cache"
else
doc = Nokogiri::HTML(open(url))
puts "Reads " + url
# Verifies if the caching worked. Only for trouble shooting
if Rails.cache.exist?('views/www.mydomain.com/' + product.slug)
puts "--->" + url + " is NOW in the cache"
else
puts "--->" + url + " is still not in the cache!"
end
sleep 1
end
rescue
puts 'Normal rescue of ' + url
rescue Timeout::Error
puts 'Timeout rescue of ' + url
puts 'Sleep for 5 sec'
sleep 5
retry
end
end
Create a script that runs as rake task, or better yet a worker, that runs and curls the page. There is no need to include a gem when you can just call curl
`curl -A "CacheRefresher" #{ENV['HOSTNAME']}/api/v1/#{klass.name.underscore.pluralize}/#{id} >/dev/null 2>&1`
I use mechanize gem to crawl websites. I wrote a very simple, one-threaded crawler inside a Rails rake task because I needed to access to Rails models.
The crawler runs just fine, but after watching it running for a while I can see that it eats more and more RAM over time, which is bad.
I use God gem to monitor my crawler.
Below is my rake task code, I'm wondering if it exposes any chance of memory leaking?
task :abc => :environment do
prefix_url = 'http://example.com/abc-'
postfix_url = '.html'
from_page_id = (AppConfig.last_crawled_id || 1) + 1
to_page_id = 100000
agent = Mechanize.new
agent.user_agent_alias = 'Mac Safari'
(from_page_id..to_page_id).each do |i|
url = "#{prefix_url}#{i}#{postfix_url}"
puts "#{Time.now} - Crawl #{url}"
page = agent.get(url)
page.search('#content > ul').each do |s|
var = s.css('li')[0].text()
value = s.css('li')[1].text()
MyModel.create :var => var, :value => value
end
AppConfig.last_crawled_id = i
end
# Finish crawling, let's stop
`god stop crawl_abc`
end
Unless you've got the very latest version of mechanize (2.1.1 was released only a day or so ago) by default mechanize operates with an unlimited history size, ie it keeps all the pages you visited and so will gradually use more and more memory.
In your case there isn't any point to this, so calling max_history= on your agent should limit how much memory is used in this fashion
Situation:
In a typical cluster setup, I have a 5 instances of mongrel running behind Apache 2.
In one of my initializer files, I schedule a cron task using Rufus::Scheduler which basically sends out a couple of emails.
Problem:
The task runs 5 times, once for each mongrel instance and each recipient ends up getting 5 mails (despite the fact I store logs of each sent mail and check the log before sending). Is it possible that since all 5 instances run the task at exact same time, they end up reading the email logs before they are written?
I am looking for a solution that will make the tasks run only once. I also have a Starling daemon up and running which can be utilized.
The rooster rails plugin specifically addresses your issue. It uses rufus-scheduler and ensures the environment is loaded only once.
The way I am doing it right now:
Try to open a file in exclusive locked mode
When lock is acquired, check for messages in Starling
If message exists, other process has already scheduled the job
Set the message again to the queue and exit.
If message is not found, schedule the job, set the message and exit
Here is the code that does it:
starling = MemCache.new("#{Settings[:starling][:host]}:#{Settings[:starling][:port]}")
mutex_filename = "#{RAILS_ROOT}/config/file.lock"
scheduler = Rufus::Scheduler.start_new
# The filelock method, taken from Ruby Cookbook
# This will ensure unblocking of the files
def flock(file, mode)
success = file.flock(mode)
if success
begin
yield file
ensure
file.flock(File::LOCK_UN)
end
end
return success
end
# open_lock method, taken from Ruby Cookbook
# This will create and hold the locks
def open_lock(filename, openmode = "r", lockmode = nil)
if openmode == 'r' || openmode == 'rb'
lockmode ||= File::LOCK_SH
else
lockmode ||= File::LOCK_EX
end
value = nil
# Kernerl's open method, gives IO Object, in our case, a file
open(filename, openmode) do |f|
flock(f, lockmode) do
begin
value = yield f
ensure
f.flock(File::LOCK_UN) # Comment this line out on Windows.
end
end
return value
end
end
# The actual scheduler
open_lock(mutex_filename, 'r+') do |f|
puts f.read
digest_schedule_message = starling.get("digest_scheduler")
if digest_schedule_message
puts "Found digest message in Starling. Releasing lock. '#{Time.now}'"
puts "Message: #{digest_schedule_message.inspect}"
# Read the message and set it back, so that other processes can read it too
starling.set "digest_scheduler", digest_schedule_message
else
# Schedule job
puts "Scheduling digest emails now. '#{Time.now}'"
scheduler.cron("0 9 * * *") do
puts "Begin sending digests..."
WeeklyDigest.new.send_digest!
puts "Done sending digests."
end
# Add message in queue
puts "Done Scheduling. Sending the message to Starling. '#{Time.now}'"
starling.set "digest_scheduler", :date => Date.today
end
end
# Sleep will ensure all instances have gone thorugh their wait-acquire lock-schedule(or not) cycle
# This will ensure that on next reboot, starling won't have any stale messages
puts "Waiting to clear digest messages from Starling."
sleep(20)
puts "All digest messages cleared, proceeding with boot."
starling.get("digest_scheduler")
Why dont you use mod_passenger (phusion)? I moved from mongrel to phusion and it worked perfect (with a timeamount of < 5 minutes)!