Rails: how to save an external web page

In a Rails app, given an external URL, I need to make a local copy of web pages not created by my app, much like "Save as" in a browser. I looked into system("wget -r -l 1 http://google.com"). It might work, but it copies far too much for the pages I tried (roughly 10x too much). I need to follow the references to the resources that make the page display properly, but I don't want to follow all the a hrefs to other pages. Is there a package for this?

This wget command usually works for me, so maybe it will for you:
wget -nd -pHEKk "http://www.google.com/"
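Here -p downloads only what the page needs to display (images, stylesheets, scripts) instead of recursing into linked pages, -H lets wget fetch those from other hosts, -k converts links for local viewing, and -nd saves everything into one directory. If absolute references to the base URL remain once you have all the files, you'll have to parse the files and replace them with ./, which shouldn't be too hard (I don't do Ruby, so I can't help with that part).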
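The replacement step mentioned above could be sketched in Ruby roughly as follows. This is a minimal sketch, not a tested mirroring tool: localize_references and localize_directory are hypothetical helper names, and it assumes plain string replacement is good enough for the pages in question. Because -nd saves every file into one flat directory, references with nested paths may need further adjustment.

```ruby
# Hypothetical helper: rewrite absolute references to base_url as "./".
def localize_references(html, base_url)
  base = base_url.chomp('/')
  # Match the base URL followed by "/" so that longer host names
  # that merely start with the same text are left alone.
  html.gsub("#{base}/", './')
end

# Hypothetical helper: apply the rewrite to every .html file in a directory,
# for example the directory wget saved its files into.
def localize_directory(dir, base_url)
  Dir.glob(File.join(dir, '*.html')).each do |path|
    File.write(path, localize_references(File.read(path), base_url))
  end
end

puts localize_references('<img src="http://www.google.com/images/logo.png">',
                         'http://www.google.com/')
```

From a Rails app, you would run the wget command first (for example with system) and then call localize_directory on the output directory.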

You could also use something like Nokogiri or Hpricot. An example with Nokogiri:
require 'nokogiri'
require 'open-uri'
# URI.open is required on Ruby 3.0+, where a bare open() no longer accepts URLs.
doc = Nokogiri::HTML(URI.open('http://www.google.com/'))
This gives you a Nokogiri document object that you can query with its methods, such as css and xpath.

Related

How to import without specifying a path in Ruby

While developing a Ruby app, I noticed that some apps import other modules without specifying a path, like this:
require 'test'
I always put my modules in a certain directory and require them with a path, like this:
require './sample/test'
How can I import without a path?
Is there an environment setting for this?
Am I missing something important?
If anyone has an opinion, please let me know.
Thanks

As was mentioned in the comments, Ruby looks at the $LOAD_PATH global variable to know where to look for required libraries.
Under most circumstances, you should not mess with it and just leave it as is.
When you see other libraries using require without a leading dot for relative path (e.g., require 'sinatra'), it usually means one of these:
They load an installed gem; RubyGems puts the gem's lib directory on the $LOAD_PATH, so the file can be found and loaded.
The code you see is a part of a gem, and it loads one of its own files.
The code you see is a part of a larger framework (e.g., Rails), which has altered the $LOAD_PATH variable.
You can see all the folders available in the $LOAD_PATH like this:
pp $LOAD_PATH
If, for some reason, you insist on altering the $LOAD_PATH to load your own local files, you can do so like this:
$LOAD_PATH.unshift __dir__
require 'my-local-file'
Bottom line, my recommendation is: do not use this technique just so that your require statements "look nicer", and if you have functionality that can be encapsulated in a gem (private or public), do that.
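To see the mechanism in isolation, here is a minimal, self-contained sketch. The file name my_local_file.rb and the constant MY_LOCAL_VALUE are made up for the illustration, and a temporary directory stands in for your project folder.

```ruby
require 'tmpdir'

Dir.mktmpdir do |dir|
  # Hypothetical local file, created here only so the example is self-contained.
  File.write(File.join(dir, 'my_local_file.rb'), "MY_LOCAL_VALUE = 42\n")

  # Once the directory is on the load path, a bare name resolves to the file.
  $LOAD_PATH.unshift(dir)
  require 'my_local_file'
  puts MY_LOCAL_VALUE

  # Remove the temporary entry again so the load path is left as it was.
  $LOAD_PATH.delete(dir)
end
```

For files inside your own project, require_relative 'sample/test' resolves the path relative to the current file and leaves $LOAD_PATH untouched, which is usually the simpler choice.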

Scrape JSON from a view-source page

So I'm trying to scrape JSON that exists in a website's source and use it in my own site.
Here's an example site:
view-source:http://www.viagogo.co.uk/Theatre-Tickets/Musicals/The-Lion-King/The-Lion-King-London-Tickets/E-1545516
If you look partway down, there is a var eventListings.
I would like to get all the code that is assigned to that var.
So far all I have is this:
url = "http://www.viagogo.co.uk/Theatre-Tickets/Musicals/The-Lion-King/The-Lion-King-London-Tickets/E-1545516"
doc = open(url).read
Any ideas how I can get this?
Thanks

The code you have so far will (basically) work using open-uri from the Ruby standard library. As with any standard library module, add require 'open-uri' at the top of the file in which you use it. On Ruby 3.0 and later, call URI.open(url) instead of open(url), because a bare open no longer accepts URLs.
OpenURI sees its job as giving you the contents of the file. If you are comfortable searching the raw text for the particular contents you are looking for, that may be enough. There are a few gems, though, that assume you are likely to get back HTML and provide special support for finding HTML elements and inspecting their contents. This post uses mechanize, which in turn is built on top of nokogiri. It is likely to be easier to write working code with this library, but when deciding whether to use it, be aware that installing nokogiri may be difficult in your staging or production environment.
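If searching the raw text is enough, a regular expression plus the standard json library may do. The following is only a sketch under assumptions: extract_js_var is a hypothetical helper, the sample markup is invented, and it presumes the value assigned to the variable is valid JSON that ends with a semicolon at the end of a line. The real page may not be structured that way.

```ruby
require 'json'

# Hypothetical helper: find "var <name> = <value>;" in raw page source
# and parse <value> as JSON. Returns nil when the variable is not found.
def extract_js_var(source, name)
  match = source.match(/var\s+#{Regexp.escape(name)}\s*=\s*(.+?);\s*$/m)
  match && JSON.parse(match[1])
end

# Invented sample standing in for the downloaded page source.
sample = <<~HTML
  <script>
    var eventListings = [{"id": 1, "price": 45.5}, {"id": 2, "price": 60}];
  </script>
HTML

listings = extract_js_var(sample, 'eventListings')
puts listings.first['price']
```

With the real page, you would pass in the string returned by URI.open(url).read in place of sample.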

Why are there rails.rb files all over the place?

I was digging around my Rails applications and noticed that there are rails.rb files all over the place, for example in my Ruby gems directories:
...gems\devise-2.0.4\lib\devise\rails.rb
...gems\cucumber-rails-1.3.0\lib\cucumber\rails.rb
...gems\railties-3.2.3\lib\rails.rb
I am assuming that these are executed whenever you issue a command like "rails xxx", so that all these extra rails.rb files combine with the original rails.rb file to essentially make one big rails.rb file. So when we type "rails xxx", does it go through them all?
Just looking for some confirmation PLUS a little more knowledge about this. Thanks.

The best way to understand what these rails.rb files are doing is to read the source code:
railties
devise
cucumber-rails
As you can see, the file has a different scope in each library. What they have in common is that rails.rb normally contains the code required to initialize the library when it is loaded from a Rails project.
BTW, this has nothing to do with the script/rails command and there is no "big rails.rb" file.

The files are not generated but are simply source files of these libraries you are using.
In this case they are probably rails-related classes that either extend Rails in some way or modify it or make the library interact with Rails.
Rails is a very common framework in Ruby land so most if not all libraries will have some sort of integration with Rails.
These files are by no means all loaded when you run rails XXX. Rather, when your application loads these libraries, their rails.rb files may be executed to provide some sort of integration with Rails.

Making my first gem - Where do I start?

So I have this idea for a RubyGem that I think would be an awesome experience to learn more about Ruby and Rails but...I have no idea where to start.
My idea is to generate a folder "articles" where you can put markdown files. From this folder the main blog page displays only the titles as links to the articles themselves.
It sounds simple but I honestly have no idea where to start. What articles do you recommend I read if I want to insert lines into routes.rb, generate a folder and display markdown in Rails?

I would recommend one of these tutorials for gem creation:
http://net.tutsplus.com/tutorials/ruby/gem-creation-with-bundler/
http://railscasts.com/episodes/245-new-gem-with-bundler
To modify the routes.rb file, you'll just need File.open to read lines in. Use regular expressions to determine where you want to insert your line, and write the file back out.
To create a folder, look at the documentation for Dir.mkdir (Dir.new only opens a directory that already exists).
For Markdown in Ruby/Rails, I like the rdiscount gem: https://github.com/rtomayko/rdiscount
Railties provide a nice way to do certain things like this. You'll probably use http://api.rubyonrails.org quite a bit. There is some Railtie documentation on that site here: http://api.rubyonrails.org/classes/Rails/Railtie.html.
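The routes.rb edit and the folder creation described above might look roughly like this with the standard library alone. It is a sketch, not a finished generator: add_route is a hypothetical helper, the route line is only an example, and a temporary directory stands in for the Rails application root.

```ruby
require 'fileutils'
require 'tmpdir'

# Hypothetical helper: insert route_line right after the "routes.draw do"
# line, unless the file already contains it.
def add_route(routes_path, route_line)
  source = File.read(routes_path)
  return false if source.include?(route_line)

  updated = source.sub(/^(.*routes\.draw do.*\n)/) { "#{$1}  #{route_line}\n" }
  File.write(routes_path, updated)
  updated != source
end

routes_source, articles_created = Dir.mktmpdir do |app_root|
  # Stand-in for an existing config/routes.rb.
  routes_path = File.join(app_root, 'config', 'routes.rb')
  FileUtils.mkdir_p(File.dirname(routes_path))
  File.write(routes_path, "MyApp::Application.routes.draw do\nend\n")

  add_route(routes_path, 'resources :articles, only: [:index, :show]')

  # Folder that will hold the markdown files.
  articles_dir = File.join(app_root, 'articles')
  FileUtils.mkdir_p(articles_dir)

  [File.read(routes_path), Dir.exist?(articles_dir)]
end

puts routes_source
```

FileUtils.mkdir_p is used instead of Dir.mkdir because it also creates missing parent directories and does not fail when the folder already exists.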

I recommend reading the RubyGems guides – especially What is a gem?, Make your own gem and Patterns.
Since you're likely already using Bundler, you can run bundle gem <name> to generate a gem project with stuff already in place. It does save work, but refer to the guides if there's something you don't understand.
Also, watch some open source projects on GitHub – observing other developers and taking note of how they do things certainly helps.

The simplest way is probably to read other gems that do something similar to what you want to accomplish. Start with their .gemspec files, which list the files the gem needs in order to work, along with its gem dependencies.

Responding more to how to get started with creating gems, the following are 2 popular, documented gems that can help you.
https://github.com/seattlerb/hoe
https://github.com/technicalpickles/jeweler
Also, though it does more than you're trying to do with your gem (it's a static site generator), https://github.com/mojombo/jekyll is a very popular gem in which you place .markdown files into a _posts/ directory, and they are converted to static HTML pages via rake. I would imagine you could find at least some of the functionality you're after there.

How to respect Robots.txt using Nokogiri?

I'm testing it, and Nokogiri does not seem to respect the robots.txt file. Is there some way to make it respect it? It seems like a common question, but I could not find any answer online.

Nokogiri parses the HTML or webpage that you give it. It does not know anything about the robots.txt file for the domain where the page you happen to have requested resides.
I presume that you want to skip in-site links that robots.txt disallows?
Since you've tagged this Rails, I'll assume you use Ruby. In that case you can use the Mechanize library which has the facility to use the robots.txt file.
There is also the original Perl version and other language ports if you prefer those.
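Since Nokogiri only parses what it is handed, the robots.txt check has to happen before a page is fetched. To illustrate what such a check involves, here is a deliberately simplified standard-library sketch. It is not how Mechanize implements it: disallowed_prefixes and allowed? are hypothetical helpers, the sample file is invented, and only Disallow prefixes in the "User-agent: *" group are considered (Allow rules, wildcards and Crawl-delay are ignored).

```ruby
# Hypothetical, simplified helper: collect the Disallow prefixes that apply
# to all user agents ("User-agent: *").
def disallowed_prefixes(robots_txt)
  prefixes = []
  applies = false
  prev_agent = false
  robots_txt.each_line do |line|
    # Drop comments, then split "Field: value" once at the first colon.
    field, value = line.sub(/#.*/, '').strip.split(':', 2).map { |part| part.to_s.strip }
    next if field.nil? || value.nil?

    case field.downcase
    when 'user-agent'
      # Consecutive User-agent lines share one group of rules.
      applies = (value == '*') || (prev_agent && applies)
      prev_agent = true
    when 'disallow'
      prefixes << value if applies && !value.empty?
      prev_agent = false
    else
      prev_agent = false
    end
  end
  prefixes
end

# True when no collected Disallow prefix matches the start of the path.
def allowed?(path, robots_txt)
  disallowed_prefixes(robots_txt).none? { |prefix| path.start_with?(prefix) }
end

# Invented sample robots.txt.
sample_robots = <<~ROBOTS
  User-agent: *
  Disallow: /private/
  Disallow: /tmp/   # scratch space

  User-agent: somebot
  Disallow: /
ROBOTS

puts allowed?('/articles/1', sample_robots)
puts allowed?('/private/report', sample_robots)
```

In a crawler you would fetch /robots.txt once per host, run each candidate path through allowed?, and only then download the page and hand it to Nokogiri. For anything beyond a quick experiment, a library that handles the full rules, such as Mechanize, is the safer choice.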
