I'm hoping to scrape together several tens of thousands of pages of government data (in several thousand folders) that are online and put it all into a single file. To speed up the process, I figured I'd download the site to my hard drive first before crawling it with something like Anemone + Nokogiri. When I tried the sample code with the government site's online URL, everything worked fine, but when I change the URL to my local file path, the code runs but doesn't produce any output. Here's the code:
url="file:///C:/2011/index.html"
Anemone.crawl(url) do |anemone|
titles = []
anemone.on_every_page { |page| titles.push page.doc.at
('title').inner_html rescue nil }
anemone.after_crawl { puts titles.compact }
end
So nothing gets output with the local file path, but it works successfully if I plug in the corresponding online URL. Is Anemone somehow unable to crawl local directory structures? If not, are there other suggested ways of doing this crawling/scraping, or should I simply run Anemone on the online version of the site? Thanks.
You have a couple of problems with this approach:
Anemone expects a web address to issue HTTP requests against, and you are passing it a file. You can just load the file with Nokogiri instead and do the parsing through it.
The links in the files might be full URLs rather than relative paths, in which case you would still need to issue HTTP requests.
What you could do is download the files locally, then traverse them with Nokogiri, converting each link to a local path for Nokogiri to load next (see the sketch below).
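Here's a minimal sketch of that last idea. The root path, the link selector, and the traversal rules are all assumptions, so adapt them to the layout of your downloaded copy:

require 'nokogiri'

# Hypothetical starting point; adjust to wherever you saved the site.
root   = "C:/2011"
queue  = [File.join(root, "index.html")]
seen   = {}
titles = []

until queue.empty?
  path = queue.shift
  next if seen[path] || !File.exist?(path)
  seen[path] = true

  doc   = Nokogiri::HTML(File.read(path))
  title = doc.at('title')
  titles << title.inner_html if title

  # Turn each relative link into a local file path and queue it up.
  doc.css('a[href]').each do |a|
    href = a['href']
    next if href =~ %r{\Ahttps?://}    # skip absolute URLs
    next unless href =~ /\.html?\z/i   # only follow pages
    queue << File.expand_path(href, File.dirname(path))
  end
end

puts titles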
My goal is essentially to upload a video file, like this, but with Ruby on Rails instead of PHP. I can successfully send JSON data to my server, but haven't been able to get file uploads working. The end objective is to have the file be in the server's /tmp directory, just like files uploaded via a webpage file_field_tag.
This image shows what I've tried so far. The result is that on the server the parameter list is empty, unlike when you use a file_field_tag. In the PHP example, they are able to get the contents of the file from the input stream... maybe there is something similar in Rails?
I know my API works, as I was able to successfully make a request using a JavaScript XMLHttpRequest, so I'm led to believe the solution involves working around what App Inventor offers for HTTP requests.
Edit: Removed unsupported header since the PHP example doesn't use headers anyway.
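For what it's worth, here is a minimal sketch of a controller action that reads the raw request body (the Rails analogue of PHP's php://input) and writes it to /tmp. The controller name, route, and filename scheme are all hypothetical, and it assumes the client POSTs the raw file bytes without multipart encoding:

# Hypothetical API endpoint; assumes the raw file bytes arrive as the POST body.
class UploadsController < ApplicationController
  # On Rails 4+ this is skip_before_action instead.
  skip_before_filter :verify_authenticity_token  # plain API call, no CSRF token

  def create
    path = "/tmp/upload-#{Time.now.to_i}"
    # request.body is the raw POST body, similar to PHP's php://input
    File.open(path, 'wb') { |f| f.write(request.body.read) }
    render :json => { :saved_to => path }
  end
end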
I have a Rails application at e.g. example.com. I am using a cloud storage provider for all kinds of files (videos, images, ...).
Now I would like to make them available for download without exposing the URL of the actual storage location.
So I was thinking of a kind of proxy: a simple controller which could look like this:
require 'open-uri'

data = open(params[:file])
filename = "#{RAILS_ROOT}/tmp/my_temp_file"
File.open(filename, 'wb') do |f|
  f.write data.read
end
send_file filename, ...options...
(code taken from a link).
The point being that I would have to download the file first.
So I was wondering if it would be possible to stream the file right away without downloading from the cloud storage first.
Best,
Philip
I was working on this exact issue a while ago and came to the conclusion that it is not possible without downloading the file to your server and then passing it on to the client, as you say.
I'd recommend generating a signed, expiring download link that you insert into a hidden iframe whenever a user clicks a download link on your page. That way they get the experience of downloading from your page, without the file making an unnecessary round trip through your server.
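As a rough sketch, assuming S3 with the aws-s3 gem (the bucket name and helper are made up, and the connection is assumed to be established elsewhere):

require 'aws/s3'

# Returns a short-lived signed URL; point a hidden iframe's src at it
# so the browser starts the download without a permanent URL ever showing.
def signed_download_url(key)
  AWS::S3::S3Object.url_for(key, 'my-bucket', :expires_in => 10.minutes)
end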
I've looked around a bit and can't seem to figure out how to link to a static file while using Silex. I've seen some similar questions/answers regarding Symfony, but they involved YML routing files, which I don't use with Silex.
My Situation
I have some files in a /docs folder. Logged-in users can upload new PDF files (so I don't know ahead of time what all the filenames will be; they're constantly changing).
My Intent
I need to be able to link to these PDF files, so that a click on a link somewhere will open www.myurl.com/docs/myfile.pdf.
The Problem
Due to the routing system in Silex, the URL is treated as a route (obviously) and a Page Not Found error is thrown.
Thanks in advance for any feedback!
You need to configure your web server so that it does not forward requests for existing files to the front controller. The web servers section of the Silex documentation has example configurations for the most popular web servers.
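For Apache with mod_rewrite, for instance, the usual pattern looks roughly like this (adapted from that documentation; treat it as a sketch and adjust the paths to your layout):

<IfModule mod_rewrite.c>
    RewriteEngine On
    # If the requested file exists on disk (e.g. /docs/myfile.pdf),
    # serve it directly; otherwise hand the request to the front controller.
    RewriteCond %{REQUEST_FILENAME} !-f
    RewriteRule ^ index.php [QSA,L]
</IfModule>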
As for the link itself, just link to the file directly, something along these lines:
<a href="/docs/{{ filename }}">{{ filename }}</a>
I'm writing a Rails application that serves files stored on a remote server to the end user.
In my case the files are stored on S3, but the user requests the file via the Rails application (hiding the actual URL). If the file were on my server's local file system, I could use the Apache header X-Sendfile to free up the Ruby process for other requests while Apache took over the task of sending the file to the client. But in my case, where the file is not on the local file system but on S3, it seems that I'm forced to download it temporarily inside Rails before sending it to the client.
Isn't there a way for Apache to serve a "remote" file to the client, one that is not actually on the server itself? I don't mind if Apache has to download the file for this to work, as long as I don't have to tie up the Ruby process while it's going on.
Any suggestions?
Thomas, I have similar requirements/issues and I think I can answer your problem. First (and I'm not 100% sure you care about this part), hiding the S3 URL is quite easy, as Amazon allows you to point CNAMEs at your bucket and use a custom URL instead of the Amazon one. To do that, you need to point your DNS at the correct Amazon hostname. When I set mine up it was similar to this: files.domain.com points to files.domain.com.s3.amazonaws.com. Then you need to create the bucket with the name of your custom URL (files.domain.com in this example). How to call that URL will differ depending on which gem you use, but one word of warning: the attachment_fu plugin I was using was incorrectly sending me to files.domain.com/files.domain.com/name_of_file.... I couldn't find the setting to fix it, so a simple .sub on the S3 portion of the plugin fixed it.
On to your other questions: to execute some Rails code (like recording the hit in the db) before the download, you can simply do this:
def download
  file = File.find(...) # look up the file record here
  # code to record 'hit' to database
  redirect_to S3Object.url_for(file.filename,
                               bucket,
                               :expires_in => 3.hours)
end
That code will still cause the file to be served by S3, but it gives you the ability to run some Ruby first. (Of course the above code won't work as-is; you will need to point it at the correct file and bucket, and my Amazon keys are stored in a config file. The above also uses the syntax of the AWS::S3 gem - http://amazon.rubyforge.org/.)
Second, the Content-Disposition: attachment issue is a bit trickier. Hopefully your situation is simpler than mine and the following solution will work. Assuming the object 'file' (in this example) is the correct S3 object, you can set the disposition to attachment with:
file.content_disposition = "attachment"
file.save
The above code can be executed after the file already exists on the S3 server (unlike some other headers and permissions), which is nice, and it can also be set when you upload the file (syntax depends on your plugin). I'm still trying to find a way to tell S3 to send it as an attachment only when requested (not every time); if you find one, please let me know your solution. I need to be able to sometimes download a file and other times embed it (an image, for example) into HTML. I'm not using the above-mentioned redirect, but fortunately it seems that if you embed a file carrying the content-disposition/attachment header (via an HTML image tag, say), the browser still displays the image normally (though I haven't thoroughly tested that across enough browsers to send it into the wild).
Hope that helps! Good luck.
I'm familiar with tools like Deadweight for finding CSS not in use in your Rails app, but does anything exist for images? I'm sitting in a project with a massive directory of assets from working with a variety of designers and I'm trying to trim the fat in this project. It's especially a pain when moving assets to our CDN.
Any thoughts?
It depends greatly on the code using the images. It's always possible that a filename is computed (by concatenating two values, string substitution, etc.), so simply grepping by filename isn't necessarily enough.
You could try running wget (probably already installed if you've got a Linux machine; otherwise see http://users.ugent.be/~bpuype/wget/ ) to mirror your whole site. Do this on the same machine or network if you can; it'll crawl your whole site and grab all the images:
# mirror mysite.com accepting only jpg, png and gif files
wget -A jpg,png,gif --mirror www.mysite.com
Once you've done that, you'll have a second copy of your site's hierarchy containing every image that is actively linked from any page reachable by crawling your site. You can then back up your source image directory and replace it with wget's copy. Next, monitor your log files for 404s pertaining to gif/jpg/png files; an example is below. Hope that helps.
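Something like this, for instance, assuming an Apache-style access log at a typical path (adjust the path to your server's log location and format):

# 404s for image files in the access log
grep ' 404 ' /var/log/apache2/access.log | grep -E '\.(gif|jpe?g|png)'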
Finding unused images should be easier than CSS.
Just find *.jpg, *.png, and *.gif files with glob, put those filenames into an array, and search for each of them in your HTML, CSS, and JS files. Remove a filename whenever it's found, and what remains is your unused list; then move those images to another folder with the same directory structure (good for restoring them, just in case).
Basically like this; of course it won't work for filenames that are encrypted/encoded/obfuscated.
require "fileutils"
img=Dir.glob("**/*.jpg")+Dir.glob("**/*.png")+Dir.glob("**/*.gif")
data=Dir.glob("**/*.htm*")+Dir.glob("**/*.css")+Dir.glob("**/*.js")
puts img.length.to_s+" images found & "+data.length.to_s+" files found to search against"
content=""
data.each do |f|
content+=File.open(f, 'r').read
end
img.each do |m|
if not content=~ Regexp.new("\\b"+File.basename(m)+"\\b")
FileUtils.mkdir_p "../unused/"+File.dirname(m)
FileUtils.mv m,"../unused/"+m
puts "Image "+m+" moved to ../unused/"+File.dirname(m)+" folder"
end
end
PS: I used FileUtils because the plain mkdir and mv don't work in my Windows version of Ruby.
And I'm not good at Ruby, so please double-check it before you use it.
Here are the sample results of a run in the root folder of a sample Rails app on my Windows machine:
---\ruby>ruby img_coverage.rb
5 images found & 12 files found to search against
Image depot/public/images/test.jpg moved to ../unused/depot/public/images folder
If your image URLs often come from computed/concatenated strings and other things that are hard to track programmatically within your source code, and your application is in heavy use, you could try a soft "honeypot" approach like this (a sketch of the script follows the list):
Move all the assets to a different directory, e.g. /attic
Set up an empty /images directory (or what your asset directory is called)
Set up a .htaccess file (if you're on Apache, of course) that, using the -f file-exists check, redirects all requests for nonexistent image files to a script
The script copies the requested file from the /attic into the /images directory and displays it
The next request to that image will go directly to the image, because it exists now
After some time and sufficient usage, all needed images should have been copied to the assets directory.
It's a "soft" approach of course because a dialog / situation could have not been opened/entered/used by any user during that time (things like error message icons for example). But it will recognize all used files, no matter where they're requested from, and might help sort out much of the unneeded files.
If your file manager supports it, try sorting your images directory by the files' "last accessed" date. Files that haven't been accessed in a long time most likely aren't used any longer.
Along the same lines, you can also filter or grep through your web server's logs and make a list of the image files it has served up in the last several months. Any images not on that list are likely unused. For example:
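(Assuming Apache combined logs under /var/log/apache2 and images served from /images; adjust both to your setup.)

# list every image file served in the retained logs, deduplicated
zgrep -hoE '/images/[^ ]+\.(gif|jpe?g|png)' /var/log/apache2/access.log* | sort -u > served_images.txt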