Jobs update with Dashing and Ruby

I use Dashing to monitor trends and website statistics.
I created jobs to check Google News trends and Twitter trends.
The data displays correctly, but it only appears on the first load and never updates afterwards. Here is the code for twitter_trends.rb:
require 'nokogiri'
require 'open-uri'

url = 'http://trends24.in/france/~cloud'
data = Nokogiri::HTML(open(url))
list = data.xpath('//ol/li')
tags = list.collect do |tag|
  tag.xpath('a').text
end
tags = tags.take(10)

tag_counts = Hash.new({value: 0})

SCHEDULER.every '10s' do
  tag = tags.sample
  tag_counts[tag] = {label: tag}
  send_event('twitter_trends', {items: tag_counts.values})
end
I think I am using rufus-scheduler incorrectly to schedule my job (https://gist.github.com/pushmatrix/3978821#file-sample_job-rb).
How can I make the data update correctly on a regular basis?

Your scheduler looks fine, but it looks like you're making only one call to the website:
data = Nokogiri::HTML(open(url))
and never calling it again. Is your intent to check that site only once, during the initial processing?
I assume you'd really want to wrap more of your logic inside the scheduler block - only the code in there is rerun each time the scheduled job fires.

When you moved everything into the scheduler, you are only taking one sample every 10 seconds (http://ruby-doc.org/core-2.2.0/Array.html#method-i-sample) and then adding it to tag_counts, which is recreated on each run. The thing to remember about schedulers is that each run is basically a clean slate. I'd recommend looping through tags and adding each one to tag_counts instead of sampling; sampling is unnecessary anyway, since you already reduce the list to 10 items each time the scheduler runs.
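A minimal sketch of that idea (the helper name is illustrative): keep the payload-building logic in a small pure method, then call it from the scheduled block, which re-fetches and re-parses the page on every run.

```ruby
# Illustrative helper: turn the scraped tag names into the widget payload,
# one item per tag instead of a single random sample.
def tags_to_items(tags)
  tags.take(10).map { |tag| { label: tag } }
end

# In the Dashing job file you would then do something like this (SCHEDULER
# and send_event are provided by Dashing; the fetch happens inside the
# block so it reruns on every tick):
#
# SCHEDULER.every '10m', first_in: 0 do
#   doc  = Nokogiri::HTML(open('http://trends24.in/france/~cloud'))
#   tags = doc.xpath('//ol/li').map { |li| li.xpath('a').text }
#   send_event('twitter_trends', items: tags_to_items(tags))
# end
```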

If I move the SCHEDULER block up like this (right after url), it works, but only one random item appears every 10 seconds.
require 'nokogiri'
require 'open-uri'

url = 'http://trends24.in/france/~cloud'

SCHEDULER.every '10s' do
  data = Nokogiri::HTML(open(url))
  list = data.xpath('//ol/li')
  tags = list.collect do |tag|
    tag.xpath('a').text
  end
  tags = tags.take(10)
  tag_counts = Hash.new({value: 0})
  tag = tags.sample
  tag_counts[tag] = {label: tag}
  send_event('twitter_trends', {items: tag_counts.values})
end
How can I display a list of 10 items that updates regularly?

Related

How do I get the browser to wait with Capybara & Kimurai?

I'm scraping [this page][1] for details of schools contained in the CSS selectors .box .column, inside a div .schools that is loaded dynamically and takes some time to appear.
I've done this with the watir gem and had no problems. Here's the code for reference.
browser = Watir::Browser.new
browser.goto('https://educationdestinationmalaysia.com/schools/pre-university')
js_doc = browser.element(css: '.schools').wait_until(&:present?)
schools_list = Nokogiri::HTML(js_doc.inner_html)
school_cards = schools_list.css('.box .columns .column:nth-child(2)')
I'm now trying to achieve the same with the kimurai gem but I'm not really familiar with Capybara.
What I've Tried
Changing the default max wait time
def parse(response, url:, data: {})
  Capybara.default_max_wait_time = 20
  puts browser.has_css?('div.schools')
end
using_wait_time
browser.using_wait_time(20) do
  puts browser.has_css?('.schools')
end
Passing in a wait argument to has_css?
browser.has_css?('.schools', wait: 20)
Thanks for reading!
[1]: https://educationdestinationmalaysia.com/schools/pre-university
Your Watir code
js_doc = browser.element(css: '.schools').wait_until(&:present?)
returns the element, but in your Capybara code you're calling predicate methods (has_css?, has_xpath?, has_selector?, etc.) that just return true or false. Those predicate methods only wait if Capybara.predicates_wait is true. Is there a specific reason you're using the predicates, though? Instead, you can just find the element you're interested in, which will wait up to Capybara.default_max_wait_time, or you can specify a custom wait option. The "equivalent" to your Watir example of
js_doc = browser.element(css: '.schools').wait_until(&:present?)
schools_list = Nokogiri::HTML(js_doc.inner_html)
school_cards = schools_list.css('.box .columns .column:nth-child(2)')
assuming you have Capybara.default_max_wait_time set high enough for your app and testing setup, would be
school_cards = browser.find('.schools').all('.box .columns .column:nth-child(2)')
If you do need to extend the wait for one of the finds you could do
school_cards = browser.find('.schools', wait: 10).all('.box .columns .column:nth-child(2)')
to wait up to 10 seconds for the .schools element to appear. This could also just be collapsed into
school_cards = browser.all('.schools .box .columns .column:nth-child(2)')
which will also wait (up to Capybara.default_max_wait_time) for at least one matching element to exist before returning, although depending on your exact HTML
school_cards = browser.all('.schools .column:nth-child(2)')
may be just as good and less fragile.
Note: you do have to be using a Kimurai engine that supports JS - https://github.com/vifreefly/kimuraframework#available-engines - otherwise you won't be able to interact with dynamic websites.

Run scripts in parallel in ruby

I need to convert videos in 4 threads.
For example, I have an Active Record model Video with titles Video1, Video2, Video3, Video4, Video5.
So I need to execute something like this:
bundle exec script/video_converter start
where the script processes unconverted videos in 4 threads, for example:
Video.where(state: 'unconverted').first.process
But when one of the 4 videos has been converted, the next video must automatically be added to a thread.
What is the best solution for this? The Sidekiq gem? The daemons gem plus manually managed Ruby threads?
For now I am using this script:
THREAD_COUNT = 4
SLEEP_TIME = 5
logger = CONVERTATION_LOG
spawns = []

loop do
  videos = Video.where(state: 'unconverted').limit(THREAD_COUNT).reorder("ID DESC")
  videos.each do |video|
    spawns << Spawnling.new do
      result = video.process
      if result.nil?
        video.create_thumbnail!
      else
        video.failured!
      end
    end
  end
  Spawnling.wait(spawns)
  sleep(SLEEP_TIME)
end
But this script waits for all 4 videos and only then takes the next 4. I want the next video to be picked up automatically in a free thread as soon as one of the 4 finishes converting.
If your goal is to keep processing videos using just 4 threads (or however many Spawnling is configured to use - it supports both fork and thread), you could push all your video records onto a Queue, spawn 4 workers, and let each of them keep popping records until the queue is empty.
require "rails"
require "spawnling"

# In your case, videos are read from the DB; the array below is for illustration
videos = ["v1", "v2", "v3", "v4", "v5", "v6", "..."]

THREAD_COUNT = 4
spawns = []
q = Queue.new
videos.each { |i| q.push(i) }

THREAD_COUNT.times do
  spawns << Spawnling.new do
    until q.empty?
      # non-blocking pop: raises ThreadError if another worker drained the
      # queue between the empty? check and the pop, so just stop then
      v = q.pop(true) rescue break
      # simulate processing
      puts "Processing video #{v}"
      # simulate processing time
      sleep(rand(10))
    end
  end
end

Spawnling.wait(spawns)
This answer was inspired by this one.
PS: I have added a few requires and defined a videos array to make the above a self-contained running example.
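If you'd rather not depend on Spawnling at all, the same queue-of-workers pattern can be sketched with plain Ruby threads; the names below are stand-ins (processed collects results in place of video.process):

```ruby
THREAD_COUNT = 4
videos = (1..9).map { |i| "Video#{i}" }   # stand-in for the AR records

queue = Queue.new
videos.each { |v| queue << v }

processed = Queue.new   # Queue is thread-safe, so use it to collect results

workers = Array.new(THREAD_COUNT) do
  Thread.new do
    loop do
      # non-blocking pop; raises ThreadError once the queue is empty
      video = begin
                queue.pop(true)
              rescue ThreadError
                break
              end
      processed << video   # stand-in for video.process
    end
  end
end
workers.each(&:join)
```

As soon as any worker finishes one video it immediately pops the next, which is exactly the "don't wait for all 4" behavior the question asks for.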

How to get all connections since a certain time period

I'm trying to get all connections (interactions) on a Facebook page since a certain time period. I'm using the koala gem and filtering the request with since: 1.month.ago.to_i, which seems to work fine. However, this gives me 25 results at a time. If I raise the limit to 446 (the maximum, it seems) that works better. But if I use .next_page to fetch the next set of results within the given time range, it ignores the time range and simply returns the next set of results.
For example, let's say I don't increase the limit and I have 25 results per request. I do something like:
@api.get_connections(@fan_page_id, "feed", {since: 1.month.ago.to_i})
Let's assume there are 30 results for this and the first request gets me 25 (the default limit). Then, if I do this:
@api.get_connections(@fan_page_id, "feed", {since: 1.month.ago.to_i}).next_page
instead of returning the last 5 results, it returns 25 more, 20 of which are outside since: 1.month.ago.to_i. I have a while loop cycling through the pages, but I don't know where to stop, since it keeps returning results as long as I keep calling .next_page.
Is there a better way of doing this?
If not, what's the best way to check that the post I'm looking at in the loop is still within the time range I want, and to break out if not?
Here's my code:
def perform(fan_page_id, pagination_options = {})
  @since_date = pagination_options[:since_date] if pagination_options[:since_date]
  @limit = pagination_options[:limit] if pagination_options[:limit]
  @oauth = Koala::Facebook::OAuth.new
  @api = Koala::Facebook::API.new @oauth.get_app_access_token
  fb_page = @api.get_object(fan_page_id)
  @fan_page_id = fb_page["id"]
  # Collect all the users who liked, commented, or liked *and* commented on a post
  process_posts(@api.get_connections(@fan_page_id, "feed", {since: @since_date})) do |post|
    ## do stuff based on each post
  end
end

private

# Take each post from the specified feed and perform the provided
# code on each post in that feed.
#
# @param [Koala::Facebook::API::GraphCollection] feed An API response containing a page's feed
def process_posts(feed, options = {})
  raise ArgumentError unless block_given?
  current_feed = feed
  begin
    current_feed.each { |post| yield(post) }
    current_feed = current_feed.next_page
  end while current_feed.any?
end
current = @api.get_connections(@fan_page_id, "feed", {since: 1.month.ago.to_i})
next_feed = current.next_page
next_feed = next_feed.next_page
.....
(next is a reserved word in Ruby, so the paging variable needs another name.) Please try these; I think they work.
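One hedged way to answer the "where do I stop" part: walk the pages yourself and bail out as soon as a post's created_time falls before your cutoff. The helper below is illustrative; it only assumes the feed object responds to each and next_page (as Koala's GraphCollection does) and that posts carry a "created_time" string.

```ruby
require 'time'

# Illustrative helper: collect posts page by page, stopping at the first
# post older than `since` (a Unix timestamp). Feed pages arrive newest
# first, so everything after the first too-old post is also too old.
def collect_posts_since(feed, since)
  posts = []
  while feed && feed.any?
    feed.each do |post|
      return posts if Time.parse(post["created_time"]).to_i < since
      posts << post
    end
    feed = feed.next_page
  end
  posts
end
```

With Koala you would then call something like collect_posts_since(@api.get_connections(@fan_page_id, "feed", {since: 1.month.ago.to_i}), 1.month.ago.to_i); the while feed guard also avoids calling .any? on the nil that next_page returns after the last page.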

Nokogiri Timeout::Error when scraping own site

Nokogiri works fine for me in the console, but if I put it anywhere - model, view, or controller - it times out.
I'd like to use it in one of two ways...
Controller
def show
  @design = Design.find(params[:id])
  doc = Nokogiri::HTML(open(design_url(@design)))
  images = doc.css('.well img') ? doc.css('.well img').map{ |i| i['src'] } : []
end
or...
Model
def first_image
  doc = Nokogiri::HTML(open("http://localhost:3000/blog/#{self.id}"))
  image = doc.css('.well img')[0] ? doc.css('.well img')[0]['src'] : nil
  self.update_attribute(:photo_url, image)
end
Both result in a timeout, though they work perfectly in the console.
When you run your Nokogiri code from the console, you're referencing your development server at localhost:3000. Thus, there are two instances running: one making the call (your console) and one answering it (your server).
When you run it from within your app, you are referencing the app itself, which causes a deadlock: there is no available resource to respond to your call, because that resource is the one making the call. You would need to run multiple instances with something like Unicorn (or simply another localhost instance on a different port), and at least one of those instances must be free to answer the Nokogiri request.
If you plan to run this in production, just know that this setup requires an available resource to answer the Nokogiri request, so you're essentially tying up 2 instances with each call. If you have 4 instances and all 4 happen to make the call at the same time, your whole application is stuck. You'll probably experience pretty severe degradation with only 1 or 2 concurrent calls as well...
I'm not sure what the default timeout value is, but you can specify one like below.
require 'net/http'
http = Net::HTTP.new('localhost')
http.open_timeout = 100
http.read_timeout = 100
Nokogiri.parse(http.get("/blog/#{self.id}").body)
Finally, you can pinpoint the problem, since you control the timeout value.
So, with Tyler's advice I dug into what I was doing a bit more. Because of the disconnect ckeditor has with the images, due to carrierwave and S3, I can't get any info directly from the uploader (at least it seems that way to me).
Instead, I'm sticking with Nokogiri, and it's working wonderfully. I realized what I was actually doing with the open() command, and it was completely unnecessary. Nokogiri parses HTML; I can give it HTML in the form of @design.content! Duh, on my part.
So, this is how I'm scraping my own site to get the images associated with a blog entry:
designs_controller.rb
def create
  params[:design][:photo_url] = Nokogiri::HTML(params[:design][:content]).css('img').map{ |i| i['src'] }[0]
  @design = Design.new(params[:design])
  if @design.save
    flash[:success] = "Design created"
    redirect_to designs_url
  else
    render 'designs/new'
  end
end

def show
  @design = Design.find(params[:id])
  @categories = @design.categories
  @tags = @categories.map { |c| c.name }
  @related = Design.joins(:categories).where('categories.name' => @tags).reject { |d| d.id == @design.id }.uniq
  set_meta_tags og: {
    title: @design.name,
    type: 'article',
    url: design_url(@design),
    image: Nokogiri::HTML(@design.content).css('img').map{ |i| i['src'] },
    article: {
      published_time: @design.published_at.to_datetime,
      modified_time: @design.updated_at.to_datetime,
      author: 'Alphabetic Design',
      section: 'Designs',
      tag: @tags
    }
  }
end
The update action uses the same Nokogiri code as the create action.
Seems kind of obvious now that I'm looking at it, lol. I dwelled on this for longer than I'd like to admit...

Delaying a method based on page results

I am retrieving results from NCBI's online BLAST tool with 'net/http' and 'uri'. To do this I have to search through an HTML page to check whether one of the lines is "Status=WAITING" or "Status=READY". When the BLAST tool has finished, the status changes to READY and the results are posted on the HTML page.
I have a working version that checks the status and then retrieves the information I need, but it is inefficient and split across two methods when I believe they could be combined into one.
def waitForBlast(rid)
  get = Net::HTTP.post_form(URI.parse('http://www.ncbi.nlm.nih.gov/blast/Blast.cgi?'), {:RID => "#{rid}", :CMD => 'Get'})
  get.body.each{ |line| (waitForBlast(rid) if line.strip == "Status=WAITING") if line[/Status=/] }
end

def returnBlast(rid)
  blast_array = Array.new
  get = Net::HTTP.post_form(URI.parse('http://www.ncbi.nlm.nih.gov/blast/Blast.cgi?'), {:RID => "#{rid}", :CMD => 'Get'})
  get.body.each{ |line| blast_array.push(line[/<a href=#\d+>/][/\d+/]) if line[/<a href=#\d+>/] }
  return blast_array
end
The first method checks the status and is my main concern, because it is recursive. I believe (correct me if I'm wrong) that as designed it takes too much computing power, when all I need is a way to recheck the status within the same method (adding a time delay would be a bonus). The second method is fine, but I would prefer it combined with the first somehow. Any help appreciated.
Take a look at this implementation. This is what he does:
res = 'http://www.ncbi.nlm.nih.gov/blast/Blast.cgi?CMD=Get&FORMAT_OBJECT=SearchInfo&RID=' + @rid
while status = open(res).read.scan(/Status=(.*?)$/).to_s == 'WAITING'
  @logger.debug("Status=WAITING")
  sleep(3)
end
I think using the string scanner might be a bit more efficient than iterating over every line in the page, but I haven't looked at its implementation, so I may be wrong.
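To fold the two methods into one (with the delay the question asks about), the polling and the extraction can share a single fetch-and-check loop. This sketch is illustrative: fetch is any callable returning the page body, so the Net::HTTP.post_form call from the question can be passed in as a lambda (and swapped for a stub in tests).

```ruby
# Illustrative combined method: poll until the page reports Status=READY,
# then scan the same body for the result anchors returnBlast was after.
def wait_and_return_blast(fetch, delay: 3, max_tries: 100)
  max_tries.times do
    body = fetch.call
    case body[/Status=(\w+)/, 1]
    when "READY"
      return body.scan(/<a href=#(\d+)>/).flatten
    when "WAITING"
      sleep(delay)
    else
      return []   # unknown or failed status
    end
  end
  []
end

# Usage against NCBI would look something like:
# fetch = -> { Net::HTTP.post_form(URI.parse('http://www.ncbi.nlm.nih.gov/blast/Blast.cgi?'),
#                                  { :RID => rid, :CMD => 'Get' }).body }
# hits = wait_and_return_blast(fetch)
```

Because the loop re-fetches the whole body each try, the READY branch already has the results in hand, which removes both the recursion and the second request.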
