Downloading all files from URL with extension filter - ruby-on-rails

I want to download all files from an FTP or HTTP server, filtered by extension.
For example, I have one URL that contains many MKV files, and I want to set an extension filter to download all the MKV files (or all the JPGs) from that URL.
I can use open-uri for this, but that method only downloads one file and saves it:
require 'open-uri'
download = open('https://img.webmd.com/dtmcms/live/webmd/consumer_assets/site_images/article_thumbnails/video/caring_for_your_kitten_video/650x350_caring_for_your_kitten_video.jpg')
IO.copy_stream(download, '650x350_caring_for_your_kitten_video.jpg')

That's a limitation on the server side. If your server doesn't allow directory listing (which most HTTP servers don't), there is not a lot you can do.
So to answer your question: open-uri will not let you list files.
As for FTP, you are able to list all the files in a directory. I'm not sure whether you can pass in wildcards; if not, you will have to use select to filter on the filenames you want, as in the sketch below.
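A minimal sketch of that approach with Net::FTP; the host, credentials, and directory here are placeholders:
require 'net/ftp'

Net::FTP.open('ftp.example.com') do |ftp|
  ftp.login('user', 'password')
  ftp.chdir('/videos')

  # nlst returns plain filenames; select keeps only the wanted extension
  mkv_files = ftp.nlst.select { |name| File.extname(name) == '.mkv' }

  mkv_files.each do |name|
    ftp.getbinaryfile(name, name)
  end
end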

I found a solution: collect all the links and write them to a text file, then use a system command to hand that text file to a download manager. For example, to collect the .mkv links from a URL's pages:
require 'mechanize'

mechanize = Mechanize.new
page = mechanize.get("URL")
page.search('a').each do |link|
  next unless link['href'] # skip anchors without an href
  # resolve relative hrefs against the page's URL
  uri = mechanize.resolve(link['href']).to_s
  if uri.include?(".mkv")
    File.open("url.txt", "a") do |file|
      file.puts uri
    end
  end
end
puts "The file has been created!"
# exec 'idman /d URL'
# puts "Done!"
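If no external download manager is available, the collected links can also be fetched directly with open-uri; a rough sketch, reusing the url.txt written above:
require 'open-uri'

File.readlines("url.txt").each do |line|
  url = line.chomp
  filename = File.basename(URI.parse(url).path)
  File.open(filename, "wb") do |file|
    IO.copy_stream(open(url), file)
  end
end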

Related

In Rails I want to read an Excel file from a live path like http://www.carsa.jp/admin/data.xlsx

I want to read an Excel file hosted at a live URL on another website.
When I hit that URL in the browser, the file downloads. In my Rails app, however, it gives the error below:
No such file or directory # rb_sysopen - http://www.carsa.jp/admin/data.xlsx (Errno::ENOENT)
My Rails app code is as follows:
data = Roo::Excelx.new('http://www.carsa.jp/admin/data.xlsx')
header = data.row(1)
puts header
Note: if I download the file and place it within my application it works fine, but the requirement is to read it from the third-party website in a scheduled job, as per the script above.
data = Roo::Excelx.new('lib/data.xlsx')
header = data.row(1)
puts header
Try using Roo::Spreadsheet.open instead of Roo::Excelx.new. According to the Roo README:
Roo::Spreadsheet.open can accept both paths and File instances.
This should do the trick:
Roo::Spreadsheet.open('http://www.carsa.jp/admin/data.xlsx')
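Dropped into the original snippet, that would look like this (assuming the roo gem is required):
require 'roo'

data = Roo::Spreadsheet.open('http://www.carsa.jp/admin/data.xlsx')
header = data.row(1)
puts header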

Rails FTP OPEN CSV

I have the following code to connect my Rails app to my FTP server. This works great. However, I want to use open-uri to open the CSV file so I can parse it. Any ideas how to do this? I think it's an easy thing to do, but I'm missing something.
require 'net/ftp'
ftp = Net::FTP.new
ftp.connect("xxx.xxx.xx.xxx",21)
ftp.login("xxxxx","xxxx")
ftp.chdir("/")
ftp.passive = true
puts ftp.list("TEST.csv")
You'll need to use #gettextfile.
A) Get the file to a local temporary file and read its content
require 'tmpdir'

# Creating a tmp file can be done differently as well.
# It may also be omitted, in which case `gettextfile`
# will create a file in the current directory.
Dir::Tmpname.create(['TEST', '.csv']) do |file_name|
  ftp.gettextfile('TEST.csv', file_name)
  content = File.read(file_name)
end
B) Pass a block to gettextfile and get the content one line at a time
content = ''
ftp.gettextfile('TEST.csv') do |line|
  content << line
end
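Either way, once content holds the file, parsing it with Ruby's stdlib CSV class is straightforward (this assumes the file has a header row):
require 'csv'

CSV.parse(content, headers: true) do |row|
  puts row.inspect
end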

FTP getbinaryfile by file extension?

I am using Net::FTP's getbinaryfile functionality to pull in a zip file with FTP. My system is not aware of the full file name, so I would simply like to search the folder for the zip file extension. Usually I would simply input the filename as *.zip. This does not seem to work.
ftp = Net::FTP.new(domain)
path = "#{Rails.root}/public/ftp/#{self.id}.zip"
ftp.getbinaryfile("*.zip", path)
I used the following code to return the zip file name in the FTP folder. Then, using the same code as above, I was able to run getbinaryfile with the correct zip file name.
files = ftp.nlst("*.zip")
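Putting the two steps together (the local path is the one from the question; this assumes at least one match):
ftp.nlst("*.zip").each do |filename|
  ftp.getbinaryfile(filename, "#{Rails.root}/public/ftp/#{filename}")
end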
I use the following to get all zip files (I'm using SFTP, but hopefully this will point you in the right direction):
Net::SFTP.start(domain, user, :password => 'pass') do |sftp|
  sftp.dir.glob("/yourdirectory", "*.zip").each do |entry|
    # entry.name is relative to the directory passed to glob
    sftp.download!("/yourdirectory/#{entry.name}", "/local/spot/#{entry.name}")
  end
end

ruby reading files from S3 with open-URI

I'm having some problems reading a file from S3. I want to be able to load the ID3 tags remotely, but using open-uri doesn't work; it gives me the following error:
ruby-1.8.7-p302 > c=TagLib2::File.new(open(URI.parse("http://recordtemple.com.s3.amazonaws.com/music/745/original/The%20Stranger.mp3?1292096514")))
TypeError: can't convert Tempfile into String
from (irb):8:in `initialize'
from (irb):8:in `new'
from (irb):8
However, if I download the same file and put it on my desktop (i.e. no need for open-uri), it works just fine:
c=TagLib2::File.new("/Users/momofwombie/Desktop/blah.mp3")
Is there something else I should be doing to read a remote file?
UPDATE: I just found this link, which may explain a little bit, but surely there must be some way to do this...
Read header data from files on remote server
Might want to check out AWS::S3, a Ruby library for Amazon's Simple Storage Service.
Do an AWS::S3::S3Object.find for the file and then use about to retrieve the metadata.
This solution assumes you have AWS credentials and permission to access the S3 bucket that contains the files in question.
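A rough sketch with the classic aws-s3 gem; the bucket and key are guessed from the question's URL, and the credentials are placeholders:
require 'aws/s3'

AWS::S3::Base.establish_connection!(
  :access_key_id     => 'YOUR_ACCESS_KEY',
  :secret_access_key => 'YOUR_SECRET_KEY'
)

key    = 'music/745/original/The Stranger.mp3'
bucket = 'recordtemple.com'

song = AWS::S3::S3Object.find(key, bucket)
puts AWS::S3::S3Object.about(key, bucket) # metadata/headers hash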
TagLib2::File.new doesn't take a file handle, which is what you are passing to it when you use open without read.
Tack read onto the open and you'll get the contents of the URL, but TagLib2::File doesn't know what to do with that either, so you are forced to read the contents of the URL and save them to a file.
I also noticed you are unnecessarily complicating your use of OpenURI. You don't have to parse the URL using URI before passing it to open. Just pass the URL string.
require 'open-uri'

fname = File.basename($0) << '.' << $$.to_s
File.open(fname, 'wb') do |fo|
  fo.print open("http://recordtemple.com.s3.amazonaws.com/music/745/original/The%20Stranger.mp3?1292096514").read
end
c = TagLib2::File.new(fname)
# do more processing...
File.delete(fname)
I don't have TagLib2 installed but I ran the rest of the code and the mp3 file downloaded to my disk and is playable. The File.delete would clean up afterwards, which should put you in the state you want to be in.
This solution isn't going to work much longer. Paperclip > 3.0.0 has removed to_file. I'm using S3 & Heroku. What I ended up doing was copying the file to a temporary location and parsing it from there. Here is my code:
dest = Tempfile.new(upload.spreadsheet_file_name)
dest.binmode
upload.spreadsheet.copy_to_local_file(:default_style, dest.path)
file_loc = dest.path
...
CSV.foreach(file_loc, :headers => true, :skip_blanks => true) do |row|
  # process each row
end
This seems to work instead of open-URI:
Mp3Info.open(mp3.to_file.path) do |mp3info|
  puts mp3info.tag.artist
end
Paperclip has a to_file method that downloads the file from S3.

Why does using OpenURI to download a file result in a partial file?

I'm trying to use OpenURI to download a file from S3, and then save it locally so I can send the file as an attachment with ActionMailer.
Something strange is going on: the downloaded and attached images are corrupt, with the bottom parts of the images missing.
Here's the code:
require 'open-uri'

open("#{Rails.root.to_s}/tmp/#{a.attachment_file_name}", "wb") do |file|
  source_url = a.authenticated_url()
  io = open(URI.parse(source_url).to_s)
  file << io.read
  attachments[a.attachment_file_name] = File.read("#{Rails.root.to_s}/tmp/#{a.attachment_file_name}")
end
a is the attachment from ActionMailer.
What can I try next?
It looks like you're trying to read the file before it's been closed, which could leave part of the file buffer unwritten.
I'd do it like this:
require 'open-uri'

source_url = a.authenticated_url()
attachment_file = "#{Rails.root.to_s}/tmp/#{a.attachment_file_name}"
open(attachment_file, "wb") do |file|
  file.print open(source_url, &:read)
end
attachments[a.attachment_file_name] = File.read(attachment_file)
It looks like source_url = a.authenticated_url() will be a string, so parsing the string into a URI then doing to_s on it will be redundant unless URI is doing some normalizing, which I don't think it does.
Based on my sysadmin experience: a side task is cleaning up the downloaded/spooled files. They could be deleted immediately after being attached, or a daily cron job could delete all spooled files more than one day old, as sketched next.
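A hypothetical Ruby version of that cleanup; the tmp/spool directory is made up, so point it wherever the attachments are actually written:
# delete spooled files more than one day old
Dir.glob("#{Rails.root}/tmp/spool/*").each do |f|
  File.delete(f) if File.file?(f) && File.mtime(f) < Time.now - 86_400
end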
An additional concern is that there is no error handling in case the URL can't be read, which would cause the attachment to fail. Using a temp spool file, you could at least check for the existence of the file. Even better, be prepared to handle an exception if the server returns a 400- or 500-series error, as sketched below.
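A minimal sketch of that error handling; OpenURI raises OpenURI::HTTPError for 4xx/5xx responses:
require 'open-uri'

begin
  content = open(source_url, &:read)
rescue OpenURI::HTTPError => e
  # e.message contains the status line, e.g. "404 Not Found"
  Rails.logger.error("Could not fetch #{source_url}: #{e.message}")
  content = nil
end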
To avoid using a temporary spool file, try this untested code:
require 'open-uri'
source_url = a.authenticated_url()
attachments[a.attachment_file_name] = open(source_url, &:read)
