I'm testing it and Nokogiri does not seem to respect the robots.txt file. Is there some way to make it respect it? It seems like a common question, but I could not find an answer online.
Nokogiri parses the HTML or webpage that you give it. It does not know anything about the robots.txt file for the domain where the page you happen to have requested resides.
I presume that you want to skip in-site links that robots.txt disallows?
Since you've tagged this Rails, I'll assume you use Ruby. In that case you can use the Mechanize library which has the facility to use the robots.txt file.
There is also the original Perl version and other language ports if you prefer those.
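Mechanize handles this for you through the agent's robots setting. To show roughly what "respecting robots.txt" involves, here is a minimal standard-library sketch. It is a simplification, not Mechanize's implementation: it only reads Disallow rules for the wildcard user agent and does plain prefix matching, ignoring Allow rules and wildcards. The method names and sample rules are made up for illustration.

```ruby
# Collect Disallow prefixes from groups that target "User-agent: *".
# Simplified on purpose: Allow rules and pattern wildcards are ignored.
def disallowed_prefixes(robots_txt)
  prefixes = []
  applies = false             # are we inside a group that targets "*"?
  previous_was_agent = false  # consecutive User-agent lines share one group
  robots_txt.each_line do |line|
    line = line.sub(/#.*/, "").strip  # drop comments and surrounding whitespace
    next if line.empty?
    field, value = line.split(":", 2).map(&:strip)
    next if value.nil?
    if field.downcase == "user-agent"
      wildcard = (value == "*")
      applies = previous_was_agent ? (applies || wildcard) : wildcard
      previous_was_agent = true
    else
      prefixes << value if field.downcase == "disallow" && applies && !value.empty?
      previous_was_agent = false
    end
  end
  prefixes
end

# A path is allowed when no collected prefix matches its beginning.
def allowed?(robots_txt, path)
  disallowed_prefixes(robots_txt).none? { |prefix| path.start_with?(prefix) }
end

# Hypothetical robots.txt content
robots_txt = <<~ROBOTS
  User-agent: *
  Disallow: /private/
  Disallow: /tmp/
ROBOTS

puts allowed?(robots_txt, "/private/report.html")  # false
puts allowed?(robots_txt, "/articles/1")           # true
```

You would run a check like this before fetching each page and only hand the allowed pages to Nokogiri.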
So I'm trying to scrape JSON that exists in a website's source and use it in my own site.
Here's an example site:
view-source:http://www.viagogo.co.uk/Theatre-Tickets/Musicals/The-Lion-King/The-Lion-King-London-Tickets/E-1545516
If you look partway down, there is a var eventListings.
I would like to get all the code that exists in that var.
So far all I have is this:
url = "http://www.viagogo.co.uk/Theatre-Tickets/Musicals/The-Lion-King/The-Lion-King-London-Tickets/E-1545516"
doc = open(url).read
Any ideas how I can get this?
Thanks
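The code you have so far will (basically) function using open-uri from the Ruby standard library. As with any standard library module, require 'open-uri' at the top of the file in which you use it. On Ruby 3.0 and later, call URI.open(url) instead of open(url), because Kernel#open no longer accepts URLs.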
OpenURI treats its job as giving you the contents of the file. If you are comfortable using tools to search the raw text for the particular contents you are looking for, that may be enough. There are a few gems, though, that assume you are likely to get back HTML and provide special support for finding HTML elements and inspecting their contents. This post uses mechanize, which in turn is built on top of nokogiri. It is likely to be easier to write working code with this library, but when deciding whether to use it, be aware that installing nokogiri may be difficult in your staging or production environment.
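If searching the raw text is enough, a regular expression plus the standard json library may do. The sketch below rests on assumptions: the HTML fragment is a made-up stand-in for the downloaded page, the assignment is assumed to end with a semicolon at the end of a line, and the value is assumed to be strict JSON rather than arbitrary JavaScript. The helper name extract_js_var is hypothetical.

```ruby
require 'json'

# Hypothetical fragment standing in for the page source you downloaded
html = <<~HTML
  <script>
    var eventListings = [{"id": 1, "price": 45.0}, {"id": 2, "price": 60.5}];
    var other = 1;
  </script>
HTML

# Find "var <name> = ...;" and parse the right-hand side as JSON.
# Returns nil when the variable is not present in the source.
def extract_js_var(source, name)
  match = source.match(/var\s+#{Regexp.escape(name)}\s*=\s*(.*?);\s*$/m)
  match && JSON.parse(match[1])
end

listings = extract_js_var(html, "eventListings")
puts listings.length       # 2
puts listings.first["id"]  # 1
```

If the real value contains a semicolon at the end of a line inside a string, or uses JavaScript syntax that is not valid JSON, this approach breaks and you would need something more robust.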
So I have this idea for a RubyGem that I think would be an awesome experience to learn more about Ruby and Rails but...I have no idea where to start.
My idea is to generate a folder "articles" where you can put markdown files. From this folder the main blog page displays only the titles as links to the articles themselves.
It sounds simple but I honestly have no idea where to start. What articles do you recommend I read if I want to insert lines into routes.rb, generate a folder and display markdown in Rails?
I would recommend one of these tutorials for gem creation:
http://net.tutsplus.com/tutorials/ruby/gem-creation-with-bundler/
http://railscasts.com/episodes/245-new-gem-with-bundler
To modify the routes.rb file, you'll just need File.open to read lines in. Use regular expressions to determine where you want to insert your line, and write the file back out.
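The read, match and write-back steps could look roughly like this. It is only a sketch: the helper name insert_route is made up, and it assumes the routes file has a single line containing "routes.draw do", which is where the new line is inserted. The demo works on a temporary file rather than a real config/routes.rb.

```ruby
require 'tmpdir'

# Insert new_line directly after the line that opens the routes.draw block.
# Returns false without touching the file if the line is already present.
def insert_route(path, new_line)
  lines = File.readlines(path)
  index = lines.index { |line| line =~ /routes\.draw do/ }
  raise "could not find a routes.draw block in #{path}" unless index
  return false if lines.any? { |line| line.strip == new_line.strip }
  lines.insert(index + 1, "  #{new_line.strip}\n")
  File.write(path, lines.join)
  true
end

# Demo against a throwaway file
Dir.mktmpdir do |dir|
  path = File.join(dir, "routes.rb")
  File.write(path, "Rails.application.routes.draw do\n  root to: 'home#index'\nend\n")
  insert_route(path, "resources :articles")
  puts File.read(path)
end
```

Checking for an existing line first keeps the generator safe to run more than once.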
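To create a folder, look at the documentation for Dir.mkdir (Dir.new only opens a directory that already exists). FileUtils.mkdir_p is also handy because it does not fail if the folder is already there.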
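For the folder itself, here is a small sketch using FileUtils.mkdir_p from the standard library, plus a helper that turns the markdown file names into titles for the main blog page. The folder name "articles" comes from the question; the helper names and the rule for deriving a title from a file name are assumptions for illustration.

```ruby
require 'fileutils'
require 'tmpdir'

# Create <root>/articles if it is missing and return its path.
def ensure_articles_dir(root)
  dir = File.join(root, "articles")
  FileUtils.mkdir_p(dir)  # no error if the folder already exists
  dir
end

# Derive a display title from each markdown file name,
# e.g. "first-post.md" becomes "First post".
def article_titles(dir)
  Dir.glob(File.join(dir, "*.{md,markdown}")).sort.map do |file|
    File.basename(file, ".*").gsub(/[-_]/, " ").capitalize
  end
end

# Demo in a throwaway directory
Dir.mktmpdir do |root|
  dir = ensure_articles_dir(root)
  File.write(File.join(dir, "first-post.md"), "# Hello\n")
  puts article_titles(dir).inspect
end
```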
For Markdown in Ruby/Rails, I like the rdiscount gem: https://github.com/rtomayko/rdiscount
Railties provide a nice way to do certain things like this. You'll probably use http://api.rubyonrails.org quite a bit. There is some Railtie documentation on that site here: http://api.rubyonrails.org/classes/Rails/Railtie.html.
I recommend reading the RubyGems guides – especially What is a gem?, Make your own gem and Patterns.
Since you're likely already using Bundler, you can run bundle gem <name> to generate a gem project with stuff already in place. It does save work, but refer to the guides if there's something you don't understand.
Also, watch some open source projects on GitHub – observing other developers and taking note of how they do things certainly helps.
The simplest way is probably to read other gems that do anything similar to what you want to accomplish. Start with their .gemspec files that will list all the other files which are needed for the gem to work, and a list of gem dependencies.
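For orientation, a bare-bones gemspec might look like the sketch below. The gem name, summary and author are placeholders, and the rdiscount dependency is just one option for the markdown rendering; real gemspecs usually carry more metadata.

```ruby
require 'rubygems'

# Minimal specification; every value here is a placeholder
spec = Gem::Specification.new do |s|
  s.name    = "markdown_articles"   # hypothetical gem name
  s.version = "0.1.0"
  s.summary = "Serves markdown files from an articles folder"
  s.authors = ["Your Name"]
  s.files   = Dir["lib/**/*.rb"]    # the files shipped with the gem
  s.add_dependency "rdiscount"      # runtime dependency for markdown rendering
end

puts spec.name
puts spec.dependencies.map(&:name).inspect
```

In a real project this lives in a file such as markdown_articles.gemspec at the root of the gem, with the block as the last expression in the file.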
Responding more to how to get started with creating gems, the following are 2 popular, documented gems that can help you.
https://github.com/seattlerb/hoe
https://github.com/technicalpickles/jeweler
Also, though it does more than you're trying to do with your gem (it's a static site generator), https://github.com/mojombo/jekyll is a very popular gem where you place .markdown files into a _posts/ directory and the jekyll command converts them to static HTML pages. I would imagine you could find at least some of the functionality you're after there.
I need to take a database text field and parse it for
duplication and garbage
malice
whitelisted selectors
compress and output as a css file
Since there might be a Rails way I'm unaware of, or something ready-made, I'm asking before I waste time trying to reinvent the wheel. My searching revealed nothing: most of what exists in Rails seems aimed at the view level, and CSS seems to be an unattended niche in this area (there is plenty for HTML, though).
I'm aware of the sanitize gem (it doesn't do CSS out of the box, so it is yet another thing I'd need to map out and code) and the built-in Rails helpers (not many tutorials, aimed mostly at the view level). I need a gem, lib, module or something similar that I can work with in a controller or queue.
EDIT:
Without getting too deep into the specifics of the project: administrative users can add CSS for their portions of the site. As part of the flow I'm going to save the raw CSS and then process and save the processed CSS. The db stuff is mostly archival; the CSS file is output immediately. Because there are few places to add modified CSS and only admins have access to it, this sort of works, but I'm looking to make it more robust in the future, so that admins who may not be as conversant with the security needs or as CSS-aware can operate it.
The most basic example is that it is just a text field on an admin page. The admin cuts and pastes CSS there, submits, and the application turns it into a CSS file that gets included with the designated pages. This works because the current admins know the application, the CSS of the application, and what they can and cannot change. The goal is to make this more robust for future admins who might not be as savvy.
To simply sanitize CSS, you can use the SanitizeHelper built into Rails: http://api.rubyonrails.org/classes/ActionView/Helpers/SanitizeHelper.html#method-i-sanitize_css
Have you looked at Sass? It has all of the parsing logic built in, for a superset of CSS. You could add a feature (Sass support) and save yourself the need to parse/validate the CSS all in one go.
You can generate output CSS from Sass (or just plain CSS, since Sass [with the SCSS syntax] is a fully-backward-compatible superset of CSS) like this:
output_css = Sass::Engine.new(sass_content, :syntax => :scss).render
There are a bunch of options that you'll probably want to look into at http://sass-lang.com/
Another option is Less. The new Twitter Bootstrap framework uses Less, and Rails 3.1 uses Sass. The biggest difference is that the official Less parser/compiler is built in JavaScript, so you could actually validate and compile in the user's browser while they work and show them any errors before they save. Of course then you need to run a JavaScript engine (e.g. V8) in your Rails application if you want to use Less to validate the incoming CSS still.
I was just wondering if anyone knew of any good libraries for parsing .doc files (and similar formats, like .odt) to extract text, yet also keep formatting information where possible for display on a website.
Capability of doing similarly for PDFs would be a bonus, but I'm not looking as much for that.
This is for a Rails project, if that helps at all.
Thanks in advance!
Apache's POI is a very popular way to access Word and Excel documents. There's a Ruby POI binding that might be worth investigating, but it looks like you'll have to build it yourself. And the API doesn't seem very Ruby-like since it's virtually a direct port from the Java code. And it seems to only have been tested against Ruby 1.8.2.
Can anyone recommend a way of creating a view where users can upload images to my app through a WYSIWYG editor?
I've tried solving this using CK Editor and Paperclip but am having lots of trouble... Maybe I'm going about this the wrong way.
If someone's done this before I'd really like to know how! I don't have an editor or file storage mechanism preference, so fire away...
This is all dependent on the WYSIWYG's file upload API. From there, just build an ImagesController to handle requests from that API, use whatever system (Paperclip is good) to handle those files internally, and you should be good to go. You won't find a plug-and-play solution; you'll have to hand-roll it.
Turns out that, with more targeted Google searching, you can find a preexisting solution. Here's one for TinyMCE and Rails. You may, however, end up finding that it doesn't meet your needs, in which case I would not be surprised to find that creating your own solution would be simpler than you expect :)
You could try Bootsy. It's a WYSIWYG editor with image upload capability. Includes a (rather simple) image manager as well.
https://github.com/volmer/bootsy
There is another solution for Rails out there:
https://github.com/spohlenz/tinymce-rails
You can load it as a gem and configure it via a YAML file. It also comes with an extra language gem.