How to "crawl" only the root URL with Anemone? - ruby-on-rails

In the example below I would like anemone to only execute on the root URL (example.com). I am unsure if I should apply the on_page_like method and if so what pattern I would need.
require 'anemone'
Anemone.crawl("http://www.example.com/") do |anemone|
anemone.on_pages_like(???) do |page|
# some code to execute
end
end

require 'anemone'
Anemone.crawl("http://www.example.com/", :depth_limit => 1) do |anemone|
# some code to execute
end
You can also specify the following in the options hash, below are the defaults:
# run 4 Tentacle threads to fetch pages
:threads => 4,
# disable verbose output
:verbose => false,
# don't throw away the page response body after scanning it for links
:discard_page_bodies => false,
# identify self as Anemone/VERSION
:user_agent => "Anemone/#{Anemone::VERSION}",
# no delay between requests
:delay => 0,
# don't obey the robots exclusion protocol
:obey_robots_txt => false,
# by default, don't limit the depth of the crawl
:depth_limit => false,
# number of times HTTP redirects will be followed
:redirect_limit => 5,
# storage engine defaults to Hash in +process_options+ if none specified
:storage => nil,
# Hash of cookie name => value to send with HTTP requests
:cookies => nil,
# accept cookies from the server and send them back?
:accept_cookies => false,
# skip any link with a query string? e.g. http://foo.com/?u=user
:skip_query_strings => false,
# proxy server hostname
:proxy_host => nil,
# proxy server port number
:proxy_port => false,
# HTTP read timeout in seconds
:read_timeout => nil
My personal experience is that Anemone was not very fast and had a lot of corner cases. The docs are lacking (as you have experienced) and the author doesn't seem to be maintaining the project. YMMV. I tried Nutch shortly but didn't play aroud as much but it seemed faster. No benchmarks, sorry.

Related

How to tag every log call in rails with request id (lograge)

I'm using lograge with rails and I have configured my logs using JSON format. I would like every time I call logger.info,logger.warn, etc. to include the request uuid. The way rails handles this with tagged logging falls short of what I would like because it does not seem able to merge the request uuid with the remainder of the JSON payload, instead prepending it on the line in non-JSON format.
For instance, if I call logger.info(client: :ig) I would expect the following log output:
{"request_id": <request uuid>, "client": "ig"}
But instead rails will prepend the request uuid (when configured via config.log_tags = [:uuid]) like so:
[<request uuid>] {"client": "ig"}
Does anyone know if there is a way to get the tagging behavior to merge with the JSON payload instead of prepending it on the same line? I'd like to configure our logs to forward to Splunk using just a simple JSON formatter and not deal with this prepending format.
Also, I have configured lograge to include request_id set to the request uuid in a lambda passed to custom_options in config/application.rb. This works but only when rails logs the request. If I explicitly call one of the logging methods anywhere else, the request_id is not included.
# application.rb
config.lograge.enabled = true
config.lograge.formatter = Lograge::Formatters::Json.new
config.lograge.custom_options = lambda do |e|
{
params: e.payload[:params].except("controller", "action", "utf8"),
request_id: e.payload[:request_id] # added this in `append_info_to_payload` in ApplicationController
}
end
Then in config/environments/production.rb
config.log_tags = [ -> (req) { { request_id: req.env["action_dispatch.request_id"] } } ]
Any help is appreciated, thanks.
The problem is that the payload doesn't have the request_id.
As you can see in:
./actionpack-3.2.11/lib/action_controller/metal/instrumentation.rb:18-25
raw_payload = {
:controller => self.class.name,
:action => self.action_name,
:params => request.filtered_parameters,
:format => request.format.try(:ref),
:method => request.method,
:path => (request.fullpath rescue "unknown")
}
I override this method (copy ./actionpack-3.2.11/lib/action_controller/metal/instrumentation.rb in config/initializer.rb) and add your param.
raw_payload = {
:controller => self.class.name,
:action => self.action_name,
:params => request.filtered_parameters,
:format => request.format.try(:ref),
:method => request.method,
:path => (request.fullpath rescue "unknown"),
:request_id => env["action_dispatch.request_id"]
}
Maybe there are a better way for override instrumentation, but it is enough.
Is easier to override initializer with class_eval as you can see in:
Access to Rails request inside ActiveSupport::LogSubscriber subclass

Ruby Geocoder connection refused

When I search for an IP address, I get the following error:
pry(main)> Geocoder.search("130.132.173.68")
Geocoding API connection refused.
=> []
If I search for anything else eg.
pry(main)> Geocoder.search("new haven")
I get the correct results.
Is there a problem with the freegeoip service?
Make sure you have an initializer setup for Geocoder with all of the config settings. This was the problem for me. Something along the lines like this:
# config/initializers/geocoder.rb
Geocoder.configure(
# geocoding options
:timeout => 7, # geocoding service timeout (secs)
:lookup => :google, # name of geocoding service (symbol)
:language => :en, # ISO-639 language code
:use_https => true, # use HTTPS for lookup requests? (if supported)
:http_proxy => '', # HTTP proxy server (user:pass#host:port)
:https_proxy => '', # HTTPS proxy server (user:pass#host:port)
:api_key => nil, # API key for geocoding service
:cache => nil, # cache object (must respond to #[], #[]=, and #keys)
:cache_prefix => "geocoder:", # prefix (string) to use for all cache keys
# IP address geocoding service (see below for supported options):
#:ip_lookup => :maxmind,
# to use an API key:
#:api_key => "...",
# exceptions that should not be rescued by default
# (if you want to implement custom error handling);
# supports SocketError and TimeoutError
# :always_raise => [],
# calculation options
# :units => :mi, # :km for kilometers or :mi for miles
# :distances => :linear # :spherical or :linear
)

Ruby add_item for eBay

I am attempting to write a ruby on rails app that posts an item to eBay. Cody Fauser/Garry Tan have a gem called ebayApi which is built on top of the ebay gem. When I attempt to post an item, I am getting an error back from ebay that says the condition ID is required for this category. I have found a category that does not require the condition, and I can post to that category. Searching through the eBay API documentation, I have found a tag conditionID under the "item" class. However, in the documentation for ebayAPI, there is no such tag. Looking back at the ebay API documentation, there is an older way to specify condition, using lookup_attributes. I have noted that the return xml is coming in API version 745, and Garry Gan's updated of the ruby interface is running version 609. I have tried using the lookup, and seem to get the same error (condition required). I am using the following code to specify the item:
#ebay = Ebay::Api.new :auth_token => #seller.ebay_token
item = Ebay::Types::Item.new( :primary_category => Ebay::Types::Category.new(:category_id => #ebayTemplate.categoryID),
:title => #ebayTemplate.name,
:description => #ebayTemplate.description,
:location => #ebayTemplate.location,
:start_price => Money.new((#ebayTemplate.startPrice*100).to_d, #ebayTemplate.currency),
:quantity => 1,
:listing_duration => #ebayTemplate.listingDuration,
:country => #ebayTemplate.country,
:currency => #ebayTemplate.currency,
:payment_methods => ['VisaMC', 'PayPal'],
:paypal_email_address => '********#gmail.com',
:dispatch_time_max => 3,
:lookup_attributes => [Ebay::Types::LookupAttribute.new( :name => "Condition", :value => "New")],
# :attribute_sets => [
# Ebay::Types::AttributeSet.new(
# :attribute_set_id => 2919,
# :attributes => [
# Ebay::Types::Attribute.new(
# :attribute_id => 10244,
# :values => [ Ebay::Types::Val.new(:value_id => 10425) ]
# )
# ]
# )
# ],
:shipping_details => Ebay::Types::ShippingDetails.new(
:shipping_service_options => [
# ShippingServiceOptions.new(
# :shipping_service_priority => 2, # Display priority in the listing
# :shipping_service => 'UPSNextDay',
# :shipping_service_cost => Money.new(1000, 'USD'),
# :shipping_surcharge => Money.new(299, 'USD')
# ),
Ebay::Types::ShippingServiceOptions.new(
:shipping_service_priority => 1, # Display priority in the listing
:shipping_service => #ebayTemplate.shipSvc,
:shipping_service_cost => Money.new((#ebayTemplate.shipSvcCost*100).to_d, #ebayTemplate.currency),
:shipping_surcharge => Money.new((#ebayTemplate.shipSurcharge*100).to_d, #ebayTemplate.currency)
)
],
:international_shipping_service_options => [
Ebay::Types::InternationalShippingServiceOptions.new(
:shipping_service => 'USPSPriorityMailInternational',
:shipping_service_cost => Money.new((#ebayTemplate.shipSvcCost*100).to_d, #ebayTemplate.currency),
:shipping_service_priority => 2,
:ship_to_location => #ebayTemplate.shipToLocation
)
]
),
:return_policy => Ebay::Types::ReturnPolicy.new (
:description => 'this product for suckers only!',
:returns_accepted_option => 'ReturnsAccepted'
)
#:condition_id => 1000
)
#response = #ebay.add_item(:item => item)
As you can see, it is just a mutation of the example given by Cody Fauser. The condition_id at the bottom will bring up an error as there is no such attribute. It seems to me there is no facility for this in the gem since the requirement came into existence after the gem was created. I have not been able to find any other gems to connect with ebay. I have also noticed, there are very little complaints about this even though people are still downloading the gem (10 people downloaded it today). I think there are quite a number of people writing for ebay. Is there a key word I am missing to specify the condition? A work around that people have been using? Another gem I have missed?
There is an existing item_conditions_codes.rb in the gem's type directory and only has two values New and Used. Guess you could add more values in there. However still needs mapping to ID's per the updating (and changed from Attributes) method
You have to modify in the gem library in .. ruby/1.8/gems/ebayapi-0.12.0/lib/ebay/types/item.rb
and add the following new lines
# added to allow ConditionID to be pushed into XML
numeric_node :condition_id, 'ConditionID', :optional => true
then in your ruby ebay code use the following convention
:condition_id => 1500,
At least that seems to work for me right now.

Rails 3 additional session configuration options (key, expires_after, secure)

Can someone point out what the new Rails 3.x session configuration options are?
I'm trying to duplicate the same configuration that I have in my Rails 2.3.x application.
This is the configuration that I used in the application:
#environment.rb
config.action_controller.session_store = :active_record_store
config.action_controller.session = {
:key => '_something', #non-secure for development
:secret => 'really long random string'
}
# production.rb - override environment.rb for production
config.action_controller.session = {
:key => '_something_secure',
:secret => 'really long random string',
:expire_after => 60*60,#time in seconds
:secure => true #The session will now not be sent or received on HTTP requests.
}
However, in Rails 3.x, I can only find mention of the following:
AppName::Application.config.session_store :active_record_store
AppName::Application.config.secret_token = 'really long random string'
AppName::Application.config.cookie_secret = 'another really long random string'
Are there other config settings to control the key, expire_after time, and secure option?
Regarding the latter, if "config.force_ssl = true" is set in production.rb, I assume the secure option is no longer required?
Thanks very much!
You now configure the Cookie-based session store through an initializer, probably in config/initializers/session_store.rb. In Rails 3 the session store is a piece of middleware, and the configuration options are passed in with a single call to config.session_store:
Your::Application.config.session_store :cookie_store, :key => '_session'
You can put any extra options you want in the hash with :key, e.g.
Your::Application.config.session_store :cookie_store, {
:key => '_session_id',
:path => '/',
:domain => nil,
:expire_after => nil,
:secure => false,
:httponly => true,
:cookie_only => true
}
(Those are just the standard defaults)
If you force SSL in production then setting secure on the cookie shouldn't really make a difference in practice, but you might want to set it just to be on the safe side...
Your::Application.config.session_store :cookie_store, {
:key => '_session_id',
:secure => Rails.env.production?
}

Rails 3: Can't seem to write cookies for top level domain :(

I setup the cookie store to domain => :all, like I could find in documentation and it seems to work, because devise's authentication works across the multiple domain.
MyApp::Application.config.session_store :cookie_store, :key => '_MyApp.com_session', :domain => :all
However when I am trying myself to write to a cookie, it always write down the sub domain... I don't get it:
I write the cookie in the simplest manner possible:
cookies.permanent[:remember_locale] = locale
But no matter what it won't set it for the top level domain whereas the one dropped by devise seems to manage it without a problem :(
Alex
ps: I am using rails 3.0.3
The configuration for the session_store only applies to the session cookie. When setting a separate cookie you have to specify the domain for that cookie as well.
cookies.permanent[:remember_locale] = { :value => locale, :domain => :all }
Note (pulled from rails source):
# Please note that if you specify a :domain when setting a cookie, you must also specify the domain when deleting the cookie:
#
# cookies[:key] = {
# :value => 'a yummy cookie',
# :expires => 1.year.from_now,
# :domain => 'domain.com'
# }
#
# cookies.delete(:key, :domain => 'domain.com')

Resources