How to create nested loops under unpredictable results - ruby-on-rails

I'm working on a web crawler application. It will list all links of a given domain as part of a categorized site map. I'm using the Nokogiri gem for parsing and searching the HTML. This code works for a single page:
require "nokogiri"
require "open-uri"

doc = Nokogiri::HTML(URI.open("url"))
links = doc.css("a")
unless links.blank?
  links.each do |t|
    # to_s guards against anchors that have no href attribute
    if t["href"].to_s.start_with?("/")
      # link stuff
    end
  end
end
At the commented line, I can do another doc = Nokogiri::HTML(open(t_URL)) and receive the second set of links, and so on. But what about the 3rd, 4th, or 5th levels?
How will I crawl all the other pages of the site, including pages that are only linked from pages found earlier? The number of links per page is not predictable, so I can't use each or times. How can I keep visiting all pages and their nested pages, and track the links of all of them?

All you need to do is keep track of the absolute URLs in a hash, so that no page is scraped twice. The value of the hash could be a count, or a timestamp recording when you last scraped each page. Note that when you scrape, you should select just the anchors that have an href:
to_visit = ["url"]  # queue of absolute URLs still to scrape
visited = {}        # absolute URL => time it was scraped
until to_visit.empty?
  url = to_visit.shift
  next if visited.key?(url)  # without this check, pages linking to each other loop forever
  visited[url] = Time.now
  doc = Nokogiri::HTML(URI.open(url))
  doc.css("a[href]").each do |link|
    absolute = make_absolute(link)
    to_visit << absolute unless visited.key?(absolute)  # add this page to the to_visit list
  end
end
Here make_absolute is a method you'll need to define; it should create a full URL complete with protocol, host, port, and path.
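One possible make_absolute, as a minimal sketch built on the standard library's URI.join. The two-argument signature (page URL plus raw href string) is an assumption, not part of the answer above; the page URL is passed in because a relative href can only be resolved against the page it was found on. Dropping the fragment and skipping non-HTTP links are also choices made here for illustration.

```ruby
require "uri"

# Hypothetical helper: resolve an href found on the page at base_url.
# Returns an absolute http(s) URL string, or nil if the href is unusable.
def make_absolute(base_url, href)
  absolute = URI.join(base_url, href.to_s.strip)
  # URI::HTTPS is a subclass of URI::HTTP, so this accepts both schemes
  return nil unless absolute.is_a?(URI::HTTP)

  absolute.fragment = nil # "#section" anchors point at the same page
  absolute.to_s
rescue URI::Error
  nil # malformed hrefs are skipped rather than aborting the crawl
end
```

With a Nokogiri node this would be called as make_absolute(page_url, link["href"]), and nil results would be skipped before queuing.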

As you mentioned, each and times are meant for cases where the number of iterations is fixed in advance. When it is not, use an open-ended loop such as loop, while, or until, and break out of it once all links have been found.
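To show the open-ended loop without any networking, here is a small sketch that walks an in-memory link graph. The SITE hash and the crawl method are made up for illustration; in a real crawler the lookup would be replaced by fetching and parsing the page.

```ruby
require "set"

# Hypothetical stand-in for a website: page path => paths it links to.
SITE = {
  "/"  => ["/a", "/b"],
  "/a" => ["/b", "/c"],
  "/b" => ["/"],
  "/c" => []
}.freeze

# Visits every page reachable from start exactly once, in breadth-first order.
def crawl(start, site)
  visited = Set.new
  queue = [start]
  until queue.empty?            # runs until no unvisited links remain
    page = queue.shift
    next if visited.include?(page)

    visited << page
    site.fetch(page, []).each do |link|
      queue << link unless visited.include?(link)
    end
  end
  visited.to_a
end
```

The loop ends on its own when the queue is empty, so no page count or nesting depth has to be known up front.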

Related

URLs are UI in Rails 5

I came across this blog post recently and wanted to incorporate its ideas into my Rails project: URLs should be short, human readable, shareable, and shorten-able. Specifically, I want to learn how to make URLs shorten-able with Rails. The example he gives is that https://stackoverflow.com/users/6380/scott-hanselman and https://stackoverflow.com/users/6380 are the same URL; the text after the ID is ignored, and scott-hanselman is added after navigating to the page. This improves readability and share-ability.
I would like the show action in my resource URLs to auto-add the page's <title> after the ID when navigating to the page, but ignore it when the user pastes the URL into the address bar. This allows for malleable titles.
Example below. All these URLs should bring you to the resource with an ID of '1'
host/resource/1/exciting-blog-post
host/resource/1
host/resource/1/exciting-blog-post.html
host/resource/1/new-title-on-post
Edit:
The biggest difficulty I am having is editing the URL after the user submits it, i.e. transforming resource/1 to resource/1/name_column.
I have been able to redirect incorrect routes using the following in config/routes.rb - get "/events/:id/*other", to: redirect('events/%{id}')
Ok this was really tricky to figure out, didn't even know I had access to a lot of these parameters before. FriendlyID is not required, and not even capable of solving this issue.
The resource I'm using below is "events".
First edit your config/routes.rb to accept id/other_stuff
Rails.application.routes.draw do
resources :events
get "/events/:id/*other" => "events#show" #if any txt is trailing id, also send this route to events#show
end
Next, modify the show action in events_controller.rb to redirect if the URL is incorrect.
def show
  # @event is assumed to be loaded already (for example in a before_action)
  # redirect if the parameterized name is not seen in the URL
  if request.format.html?
    name_param = @event.name.parameterize
    url = request.original_url
    id_end_indx = url.index(@event.id.to_s) + @event.id.to_s.length + 1 # +1 for the '/' character
    # all URL text after the id does not match name.parameterize
    if url[id_end_indx..-1] != name_param
      redirect_to "/events/#{@event.id}/#{name_param}"
    end
  end
end
This will result in the exact same behavior as the Stack Overflow examples gave in the question.
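The redirect decision can also be expressed as a comparison against one canonical path, which is easy to try outside Rails. This is only a sketch: slugify is a simplified stand-in for ActiveSupport's parameterize, and canonical_event_path and redirect_target are hypothetical helper names.

```ruby
# Simplified stand-in for String#parameterize (ASCII input only).
def slugify(title)
  title.downcase.gsub(/[^a-z0-9]+/, "-").gsub(/\A-+|-+\z/, "")
end

# The one URL path that should be shown for an event.
def canonical_event_path(id, name)
  "/events/#{id}/#{slugify(name)}"
end

# Returns the path to redirect to, or nil when the request path is already canonical.
def redirect_target(request_path, id, name)
  canonical = canonical_event_path(id, name)
  request_path == canonical ? nil : canonical
end
```

Comparing whole paths avoids searching the URL for the id, which could match the wrong digits (a port number, for example).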

Rails: Find all occurrences of internal 404 links (is Rails "leaking" URLs?)

I came across something curious and was wondering how to fix it within Rails.
In my app, I have a Country model; it already contains records for all countries I'll ever need. However, many of them don't contain any data or are otherwise not yet relevant. These countries yield a 404 error, as they would if the record didn't exist in the first place:
begin
  @country = Country.friendly.find(params[:id])
rescue ActiveRecord::RecordNotFound
  @country = nil
end
if !@country.nil? && Country.with_data.include?(@country)
  # render the view
else
  render :file => "#{Rails.root}/public/404", :status => :not_found
end
So assuming that e.g. France contains data while Andorra doesn't, the following should happen:
mysite/country/france --> HTTP 200 OK
mysite/country/andorra --> HTTP 404 Not found
mysite/country/randomstring123 --> HTTP 404 Not found
This all works fine. However, what's curious is that when I track my site in Google Webmaster Tools, it is actually aware of some of the URLs that point to "empty" countries, and shows them to me as 404-yielding "crawling errors". (E.g., it knows mysite/country/andorra.) What I can't see is where Google got those URLs from. Those links are also not included in the WT "Internal Links" section, so that doesn't help.
The routes.rb excludes the index action:
resources :countries, path: "country", except: :index
I generate a sitemap with a custom controller, but it excludes the countries in question.
I conclude that there are two likely options:
It is possible that an earlier version of the sitemap controller included "empty" countries. They might then still be tried by Google eventually (as are some old URLs from a much outdated site structure > 9 months ago).
Otherwise, Rails would somehow have to "leak" URLs to these empty countries. Is there any "internal" way to check that? I'll also run external 404 checks, but it would be good to know if I can get an efficient output similar to rake routes.

Changing urls in ruby on rails depending on different conditions

I'm new to Ruby on Rails. I wanted to know if there is a way to change the URL displayed depending on the client's response. Here's an example:
I'm making a project showing listings in various places...
Now in general I have a home page, a search page, and a detail page for listings. So, respective URLs are officespace/home, officespace/search?conditions, officespace/detailpage?id=(controller-officespace)[&Conditions eg.---price,size,place,type...]
So, every time the client makes a request for search, the same URL is shown, of course with the given conditions.
Now I want that if the client asks for only the place and mentions nothing about size, price, etc., the url should be /listing/location_name.
If he mentions other conditions, then it'll be listing/(office_type)/size(x sq feet)_office_for_rent_in_(locationname)
B.t.w. (I already have a controller named listings and its purpose is something else.)
And so on. Actually, I want to change URLs for a number of things. Anyway, please help me. And please don't refer me to the manuals. I've already read them and they didn't give any direct help.
This is an interesting routing challenge. Essentially, your goal is to create a special expression that will match the kinds of URLs you want to display in the user's browser. These expressions will be used in match formulas in config/routes.rb. Then, you'll need to make sure the form actions and links on relevant search pages link to those specialized URLs and NOT the default pages. Here's an example to get started:
routes.rb
# :via is required for match in Rails 4 and later
match "/listing/:officeType/size/:squarefeet/office_for/:saleOrRent/in/:locationName" => "searches#index", :via => :get
match "/listing/*locationName" => "searches#index", :via => :get
resources :searches
Since you explicitly mentioned that your listings controller is for something else, I just named our new controller searches. Inside the code for the index method for this controller, you have to decide how you want to collect the relevant data to pass along to your view. Everything marked with a : in the match expressions above will be passed to the controller in the params hash as if it were an HTTP GET query string parameter. Thus we can do the following:
searches_controller.rb
def index
  if params[:squarefeet] && params[:officeType] && params[:locationName]
    @listings = Listing.where("squarefeet >= ?", params[:squarefeet].to_i).
      where(:officeType => params[:officeType],
            :locationName => params[:locationName])
  elsif params[:locationName]
    @listings = Listing.where(:locationName => params[:locationName])
  else
    @listings = Listing.all
  end
end
And to send the user to one of those links:
views/searches/index.html.erb
<%= link_to "Click here for a great office!", "/listing/corporate/size/3200/office_for/rent/in/Dallas" %>
The above example would only work if your Listing model is set up exactly the same way as my arbitrary guess, but hopefully you can work from there to figure out what your code needs to look like. Note that I wasn't able to get the underscores in there. The routes only match segments separated by slashes as far as I can tell. Keep working on it and you may find a way past that.
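To see how the segments marked with : end up as params, here is a rough plain-Ruby imitation of the two routes using regular expressions with named captures. It is not how the Rails router is implemented; LISTING_PATTERNS and listing_params are names invented for this sketch.

```ruby
# Hypothetical imitation of the two match routes, tried in the same order.
LISTING_PATTERNS = [
  %r{\A/listing/(?<officeType>[^/]+)/size/(?<squarefeet>[^/]+)/office_for/(?<saleOrRent>[^/]+)/in/(?<locationName>[^/]+)\z},
  %r{\A/listing/(?<locationName>.+)\z} # like *locationName, this may span slashes
].freeze

# Returns a params-like hash for the first pattern that matches, or {} for none.
def listing_params(path)
  LISTING_PATTERNS.each do |pattern|
    match = pattern.match(path)
    return match.named_captures.transform_keys(&:to_sym) if match
  end
  {}
end
```

As with the real routes, the more specific pattern has to come first, because the glob pattern would otherwise swallow every /listing/... path.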

Searching based on field content - Ruby on Rails

I am creating a Ruby on Rails application, and in my view I have a list of link_to links, each for a different console.
In my database table I have a field called console.
What I want to do is when a user clicks on a link e.g. Playstation 3, it will return back all records that have Playstation 3 listed in that table column.
I was wondering how I would go about doing this, I have tried searching on the internet but have not found anything similar.
It is for a project that I don't have long to complete. I was wondering what I would put in the link_to calls in the view and what I would put in the games_controller.
Any help would be much appreciated.
The basic gist is to have a controller action which will return the list of games filtered by console. For example,
# games_controller.rb
def index
  # where returns every matching record; find_by_console would return only the first one
  @games = Game.where(:console => params[:console])
end
Then you can create a link for any particular console as such:
link_to 'XBOX', games_path(:console => 'XBOX')
This should result in a GET request to the URL /games?console=XBOX
If you've got a pre-defined set of consoles, you might look into making them into constants inside a Consoles module to avoid having to hardcode them everywhere.
UPDATE:
Since you are trying to implement both searching and filtering in the same chain, you need to make sure that the console filter is only applied when params[:console] is present.
# games_controller.rb
def index
  @games = Game.search(params[:search])
  @games = @games.where(:console => params[:console]) unless params[:console].blank?
end
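The "apply each filter only when its parameter is present" idea can be tried without a database. In this sketch GameRow is a plain Struct standing in for the ActiveRecord model, and filter_games is a hypothetical helper; the title-substring search is only an assumption about what Game.search does.

```ruby
# Hypothetical stand-in for a row of the games table.
GameRow = Struct.new(:title, :console)

# Narrows the list step by step, skipping any filter whose parameter is blank.
def filter_games(games, search: nil, console: nil)
  result = games
  unless search.to_s.strip.empty?
    result = result.select { |g| g.title.downcase.include?(search.downcase) }
  end
  unless console.to_s.strip.empty?
    result = result.select { |g| g.console == console }
  end
  result
end
```

Calling filter_games(games) with no parameters returns the full list, which matches what the index page should show before any link is clicked.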

Referral program - cookies and more (Rails)

I'm building a referral program for my Ruby on Rails app, such that a user can share a link that contains their user ID (app.com/?r=ID). If a referrer ID is present when a visitor lands on app's homepage, the signup form on the homepage contains a hidden field that populates with the referrer's ID. The controller then detects the ID and creates a new referral in a referral table if the referred visitor signs up. It works, and here's that chunk of code:
@referrer = User.find(params[:r]) rescue nil
unless @referrer.nil?
  @referral = Referral.new(:referrer_id => @referrer.id)
end
Pretty simple stuff, but it's pretty easy to break (ex: if visitor navigates away from the homepage, referrer ID is lost). I feel like cookies could be a more robust method, where a cookie containing the referrer's ID is stored on the referred user's computer for x days. This is pretty commonplace, especially with affiliate programs like Groupon, but I have never worked with cookies and have no idea where to start.
Also, is there any good way to mask or change the URLs of the referral system? Instead of having app.com/?r=1842, I would prefer something like app.com/x39f3, where x39f3 is a randomly generated sequence of characters associated with a given user, without the ?r= portion.
Any help is greatly appreciated. Thanks!
To answer the cookie question, it's quite easy to set them:
cookies['app-referrer-id'] = params[:r] if params[:r].present?
The guard keeps a later request without ?r= from overwriting a referrer that is already stored. Reading a cookie back uses the same format (but without the assignment). I would suggest putting this code in a before_filter (called before_action in Rails 4 and later) in your application controller. This way, the cookie will be set regardless of the page on which your visitor first lands.
With regard to changing the structure of the URLs to the suggested format, you would need to have the referral codes match a specific pattern, otherwise you are likely to run into routing problems. If, for example, they matched the format of three letters followed by three digits, you could put the following in your routes file:
match '/:referrer_id' => 'app#index', :via => :get, :constraints => {:referrer_id => /[a-zA-Z]{3}[0-9]{3}/}
The reference to app#index should be changed to the controller in which you handle referrals and you can access the referrer_id through params[:referrer_id].
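For codes that fit the three-letters-plus-three-digits constraint, a generator could look like the sketch below. REFERRAL_CODE_FORMAT and generate_referral_code are hypothetical names, and checking a new code for uniqueness against existing users is left out.

```ruby
require "securerandom"

# Same shape as the route constraint, anchored for use outside the router.
REFERRAL_CODE_FORMAT = /\A[a-zA-Z]{3}[0-9]{3}\z/

LETTERS = ("a".."z").to_a.freeze

# Builds a random code such as three lowercase letters followed by three digits.
def generate_referral_code
  letters = Array.new(3) { LETTERS[SecureRandom.random_number(LETTERS.size)] }.join
  digits = format("%03d", SecureRandom.random_number(1000)) # zero-padded, 000..999
  letters + digits
end
```

The code would be stored on the user record when the account is created and looked up from params[:referrer_id] in the controller.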
Hope this is of some use.
Robin

Resources