I am trying to detect when a new post has been added to a blog. I am using Mechanize for the scraping. Currently this is straightforward if you know the parent tags of a post title, e.g. <article><header><h1>Blog Title here</h1></header></article>: you can just diff the titles you have now against the ones from the last time you checked. But I want to do this programmatically. Is there a way to determine which section or tags of a page hold the titles of the blog posts, without explicitly giving the hierarchy of tags to the script?
Suppose there is a blog named blog.example.com, with these posts:
<article><header><h1>Blog Title here1</h1></header></article>
<article><header><h1>Blog Title here2</h1></header></article>
<article><header><h1>Blog Title here3</h1></header></article>
Using SelectorGadget you can find out which CSS selector matches each article. To scrape the articles you can use the Nokogiri or Mechanize gem.
Suppose a Mechanize bot visits blog.example.com, collects all the articles and inserts them into your database.
require 'nokogiri'
require 'open-uri'

url = "http://www.eslemployment.com/country/esl-jobs-Vietnam.html"
# On Ruby 3+, open-uri requires URI.open; plain open no longer fetches URLs
doc = Nokogiri::HTML(URI.open(url))
data = []
# Follow the first five job links on the listing page
doc.css("#joblist td:nth-child(1) a").first(5).each do |titlecss|
  country = "8"
  jobtype = "1"
  urlnext = titlecss.attr('href')
  docnext = Nokogiri::HTML(URI.open(urlnext))
  # Drop the nested divs before reading the description
  docnext.css('#jobdescription div').remove
  docnext.css('#detailjob , #job-summary').each do |detailscss|
    docnext.css('#pagemsg h1').each do |titlenextcss|
      data << JobPost.create(
        :title => titlenextcss.text,
        :jobslink => urlnext,
        :description => detailscss.inner_html,
        :country_id => country,
        :job_type_id => jobtype
      )
    end
  end
end
The code above is an example using the Nokogiri gem. It collects the jobs from www.eslemployment.com. Now, your question is how to detect that a new article has been added.
This code collects all the jobs from a page and adds them to the database. I use a "distinct" check in the model, so only new jobs are added to the database and no duplicates are stored. When a new job is added, you can send a notification saying which job it was.
This is not an efficient way, but it will work. Otherwise you can use the RSS feed of that blog, which is the proper way to detect new posts.
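The "diff against the last check" idea can be sketched without any scraping library. This is only a minimal sketch: new_posts is a made-up helper name, and the titles are assumed to have been extracted already (by Nokogiri, Mechanize, or an RSS feed).

```ruby
require 'set'

# Hypothetical helper: return the titles that were not present at the
# previous check, keeping the order in which they appear on the page.
def new_posts(current_titles, seen_titles)
  seen = seen_titles.to_set
  current_titles.uniq.reject { |title| seen.include?(title) }
end

# Made-up sample data standing in for two consecutive scrapes
previous = ["First post", "Second post"]
current  = ["Third post", "Second post", "First post"]

fresh = new_posts(current, previous)
puts fresh.inspect
```

In a real app the seen titles would come from the database rather than from an in-memory array.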
I'm using the Mechanize gem to automate interaction with a website form.
The site I'm trying to interact with is http://www.tastekid.com/like/books
I'm trying to automatically submit a string to query in the form and return the suggested books in an array.
Following the guide, I've pretty-printed the page to find the form name, but I only find a form with no name (nil):
require 'rubygems'
require 'mechanize'
agent = Mechanize.new
page = agent.get('http://www.tastekid.com/like/books')
pp page
How do I enter a string, submit the form and return the results in the form of an array?
These answers feel a little cluttered to me, so let me try to make it simpler:
page = agent.get 'http://www.tastekid.com/like/books'
There's only one form, so:
form = page.form
form['q'] = 'twilight'
Submit the form:
page = form.submit
Print the text of the links:
puts page.search('.books a').map &:text
Following the guide, you can get the form:
form = page.form
I didn't see a name on the form, and I actually got two forms back: one on the page and one hidden.
I called
form.fields.first.methods.sort #not the hidden form
and saw that I could call value on the field, so I set it as such:
form.fields.first.value = "Blood Meridian"
Then I submitted and pretty-printed:
page = agent.submit(form)
pp page
This should work for you!
You could use the form_with method to locate the form you want. For example:
require 'mechanize'
agent = Mechanize.new
page = agent.get('http://www.tastekid.com/like/books')
the_form_you_want = page.form_with(:id => "searchFrm") # form_with
the_form_you_want.q = 'No Country for Old Men'
page = agent.submit(the_form_you_want)
pp page
It looks like the book titles all have the same class attribute. To extract the book titles, use the links_with method and pass in the class as a locator:
arr = []
page.links_with(:class => "rsrc").each do |link|
arr << link.text
end
But @aceofbassgreg is right. You'll need to read up on the Mechanize and Nokogiri documentation...
I'm doing some work with Adcourier. They send me an xml feed with some job data, i.e. job_title, job_description and so on.
I'd like to provide them with a url in my application, i.e. myapp:3000/job/inbox. When they send their feed to that URL, it takes the data and stores it in my database on a Job object that I already created.
What's the best way to structure this? I'm quite new to MVC and I'm not sure where something like this would fit.
How can I get an action to interpret the XML feed from an external source? I use Nokogiri to handle local XML documents, but never ones from a feed.
I was thinking about using http://api.rubyonrails.org/classes/ActionDispatch/Request.html#method-i-raw_post to handle the post. Does anyone have any thoughts on this?
In your jobs controller, add an inbox action that gets the correct parameter(s) from the POST request and saves them (or whatever you need to do with them).
def inbox
  # Xml::ParseStuff is a placeholder for whatever XML parsing you choose
  data = Xml::ParseStuff(params[:data])
  title = data[:title]
  description = data[:description]
  # save returns false when validation fails; create always returns an object
  if Job.new(:title => title, :description => description).save
    render :plain => "Thanks!"
  else
    render :plain => "Data was not valid :("
  end
end
Next, set up your routes.rb to send POST requests for that URL to the correct location:
resources :jobs do
  collection do
    post 'inbox'
  end
end
Note that I just made up the XML parsing part here; search around a bit to find the best solution/gem for parsing your request.
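To make the made-up parsing step a little more concrete, here is a minimal sketch. It uses REXML only because it ships with Ruby; Nokogiri would work along the same lines. parse_job_feed is a hypothetical helper, and the <job> root element is an assumption, since only the job_title and job_description fields are mentioned in the question.

```ruby
require 'rexml/document'

# Hypothetical helper: pull the two fields named in the question out of
# an XML body. The <job> root element is assumed; adjust the paths to
# match the real feed.
def parse_job_feed(xml)
  doc = REXML::Document.new(xml)
  {
    title: doc.elements['job/job_title']&.text,
    description: doc.elements['job/job_description']&.text
  }
end

# Made-up payload standing in for the posted feed
sample = "<job><job_title>Ruby Developer</job_title>" \
         "<job_description>Build things</job_description></job>"

attrs = parse_job_feed(sample)
puts attrs.inspect
```

Inside the action, the XML body could be read from request.raw_post (as suggested in the question) and passed to a helper like this one.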
In an ActiveAdmin page, I would like to include a link to a list of related resources. For example, given that a
Site has_many Sections and,
Section belongs_to a Site (in my ActiveRecord models),
I would like my Site's show page to include a link to Sections within the site, which would go to the Section index page, with the Site filter preset.
Note that
I do not want to use ActiveAdmin's belongs_to function;
I don't want nested resources for a number of reasons (depth of nesting > 2, as well as usability concerns).
What I want is to generate a URL similar to the one ActiveAdmin generates if I first go to the Sections index page and then filter by Site.
The query parameter list generated by ActiveAdmin's filtering feature is pretty crazy; is there a helper method I could use to achieve this goal?
Thanks!
I use this syntax:
link_to "Section", admin_sections_path(q: { site_id_eq: site.id })
I worked out a reasonably satisfactory solution after poking around in meta_search for a bit. Syntax is a bit clunky, but it does the trick.
index do
...
column "Sections" do |site|
link_to "Sections (#{site.sections.count})", :controller => "sections", :action => "index", 'q[site_id_eq]' => "#{site.id}".html_safe
end
end
As jgshurts pointed out, the trick is identifying that q[site_id_eq] query parameter.
However, if you don't like the clunky syntax, you can also just use a path helper:
link_to "Sections (#{site.sections.count})", admin_sections_path('q[site_id_eq]' => site.id)
The UrlHelper#link_to documentation shows additional examples of this.
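Both syntaxes above end up producing the same query string, with the filter nested under the q parameter. A minimal sketch in plain Ruby, where filter_query is a made-up helper and /admin/sections is assumed to be the default ActiveAdmin path for the resource:

```ruby
require 'uri'

# Hypothetical helper: build the query string for a set of filters,
# nesting each one under the "q" parameter as q[name]=value.
def filter_query(filters)
  pairs = filters.map { |name, value| ["q[#{name}]", value] }
  URI.encode_www_form(pairs)
end

query = filter_query(site_id_eq: 42)
puts "/admin/sections?#{query}"
# prints /admin/sections?q%5Bsite_id_eq%5D=42
```

The %5B and %5D are just the URL-encoded square brackets; Rails decodes them back into the nested q hash.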
#auto_link(resource, content = display_name(resource)) ⇒ Object
Automatically links objects to their resource controllers. If the
resource has not been registered, a string representation of the
object is returned.
The default content in the link is returned from
ActiveAdmin::ViewHelpers::DisplayHelper#display_name
You can pass in the content to display
eg: auto_link(@post, "My Link")
ActiveAdmin.register Girl do
  index do
    selectable_column
    column :name do |girl|
      auto_link(girl, girl.name)
    end
    column :email
    column :created_at
    actions
  end
end
Useful-link: http://www.rubydoc.info/github/gregbell/active_admin/ActiveAdmin/ViewHelpers/AutoLinkHelper
Note: This is tested with ActiveAdmin (v1.1.0 and 2.0.0.alpha)
I hope this works with other versions as well. Please update this answer if you are sure it works with other versions you know.
I want to scrape Groupon deals using Nokogiri. I want to scrape all these deals at the following link:
http://www.groupon.com/getaways?d=travel_countmein
On top of that, I want to access each individual link and scrape the title and price. Conceptually, is there a way to code a single rake task to do this?
I understand that there needs to be a loop of some sort, but I don't know how to parse the url for each deal from the main getaway page.
I've already written a scraper for the title and price:
task :fetch_travel => :environment do
  require 'nokogiri'
  require 'open-uri'
  url = "http://www.groupon.com/deals/ga-flamingo-conferences-resort-spa?c=all&p=0"
  # On Ruby 3+, open-uri requires URI.open; plain open no longer fetches URLs
  doc = Nokogiri::HTML(URI.open(url))
  title = doc.at_css("#content//h2/a").text
  price = doc.at_css("#amount").text[/[0-9\.]+/]
  link = doc.at_css("#content//h2/a")[:href]
  desc = doc.at_css(".descriptor").text
  Traveldeal.create(:title => title, :price => price, :url => link, :description => desc)
end
I figured out that this requires a nested loop, where the inner loop is the code above and the outer loop parses the main getaways page for each deal's URL to be used in the inner loop.
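The outer loop can be sketched roughly like this. It is only a sketch: deal_urls and each_deal are made-up helper names, and the href values are assumed to have been collected from the listing page already (for example with something like doc.css(selector).map { |a| a[:href] }, where the selector depends on the page markup and is not known here).

```ruby
require 'uri'

# Hypothetical helper: turn raw href values from the listing page into
# absolute, de-duplicated deal URLs.
def deal_urls(base_url, hrefs)
  hrefs.compact.map { |href| URI.join(base_url, href).to_s }.uniq
end

# Outer loop: yield each deal URL once, so that the per-deal scraper
# (the inner step) can be run on it.
def each_deal(base_url, hrefs)
  deal_urls(base_url, hrefs).each { |url| yield url }
end

listing = "http://www.groupon.com/getaways?d=travel_countmein"
# Sample hrefs; a real run would take these from the parsed listing page
hrefs = [
  "/deals/ga-flamingo-conferences-resort-spa?c=all&p=0",
  "/deals/ga-flamingo-conferences-resort-spa?c=all&p=0",
  nil
]

visited = []
each_deal(listing, hrefs) { |url| visited << url }
puts visited.inspect
```

Inside the block, the existing title/price scraper would be called with url instead of the hard-coded deal address.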
I have a model called Book, which has_many :photos (file attachments handled by paperclip).
I'm currently building a client which will communicate with my Rails app through JSON, using Paul Dix's Typhoeus gem, which uses libcurl.
POSTing a new Book object was easy enough. To create a new book record with the title "Hello There" I could do something as simple as this:
require 'rubygems'
require 'json'
require 'typhoeus'
class Remote
include Typhoeus
end
p Remote.post("http://localhost:3000/books.json",
{ :params =>
{ :book => { :title => "Hello There" }}})
My problems begin when I attempt to add the photos to this query. Simply POSTing the file attachments through the HTML form creates a query like this:
Parameters: {"commit"=>"Submit", "action"=>"create", "controller"=>"books", "book"=>{"title"=>"Hello There", "photo_attributes"=>[{"image"=>#<File:/var/folders/1V/1V8Kw+LEHUCKonqJ-dp3oE+++TI/-Tmp-/RackMultipart20090917-3026-i6d6b9-0>}]}}
And so my assumption is I'm looking to recreate the same query in the Remote.post call.
I'm thinking that I'm letting the syntax of the array of hashes within a hash get the best of me. I've been attempting to do variations of what I was expecting would work, which would be something like:
p Remote.post("http://localhost:3000/books.json",
{ :params =>
{ :book => { :title => "Hello There",
:photo_attributes => [{ :image => "/path/to/image/here" }] }}})
But this seems to concatenate into a string what I'm trying to make into a hash, and returns (no matter what I do in the :image => "" hash):
NoMethodError (undefined method `stringify_keys!' for "image/path/to/image/here":String):
But I also don't want to waste too much time figuring out what is wrong with my syntax here if this isn't going to work anyway, so I figured I'd come here.
My question is:
Am I on the right track? If I clear up this syntax to post an array of hashes instead of an oddly concatenated string, should that be enough to pass the images into the Book object?
Or am I approaching this wrong?
Actually, you can't post files over XHR; there's a security precaution in JavaScript that prevents it from handling any files at all. The trick to get around this is to post the file to a hidden iframe, and the iframe does a regular post to the server, avoiding the full page refresh. The technique is detailed in several places; possibly try this one (they are using PHP, but the principle remains the same, and there is a lengthy discussion which is helpful):
Posting files to a hidden iframe
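On the array-of-hashes syntax from the question: Rails expects nested parameters as bracketed keys, with an array of hashes written as book[photo_attributes][][image]. Whether the HTTP client flattens the nested hash this way may depend on the library and version, so here is a minimal sketch of the key layout only. flatten_params is a made-up helper, and it does not cover sending the file contents, which would additionally need a multipart body.

```ruby
# Hypothetical helper: flatten a nested params hash into the bracketed
# key/value pairs that Rails parses back into hashes and arrays.
def flatten_params(value, prefix = nil)
  case value
  when Hash
    value.flat_map do |key, inner|
      flatten_params(inner, prefix ? "#{prefix}[#{key}]" : key.to_s)
    end
  when Array
    # Each array element is keyed with empty brackets
    value.flat_map { |inner| flatten_params(inner, "#{prefix}[]") }
  else
    [[prefix, value.to_s]]
  end
end

params = {
  :book => {
    :title => "Hello There",
    :photo_attributes => [{ :image => "/path/to/image/here" }]
  }
}

pairs = flatten_params(params)
pairs.each { |key, value| puts "#{key}=#{value}" }
```

If the server receives keys in this shape, photo_attributes arrives as an array of hashes rather than one concatenated string.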