How to recursively download FTP folder in parallel in Ruby? - ruby-on-rails

I need to cache an ftp folder locally in ruby. Right now I'm using ftp_sync to download the ftp folder but it's painfully slow, do you guys know any library that can download the folder files in parallel?
Thanks!

The syncftp gem may help you:
http://rubydoc.info/gems/syncftp/0.0.3/frames
Ruby has a decent built-in FTP library in case you want to roll your own:
http://www.ruby-doc.org/stdlib-1.9.3/libdoc/net/ftp/rdoc/Net/FTP.html
To download files in parallel, you can use multiple threads with timeouts:
Ruby Net::FTP Timeout Threads
A great way to get parallel work done is Celluloid, the concurrent framework:
https://github.com/celluloid/celluloid
All that said, if the download speed is limited to your overall network bandwidth, then none of these approaches will help much.
To speed up the transfers in this case, be sure you're only downloading the information that's changed: new files and changed sections of existing files.
Segmented downloading can give massive speedups in some cases, such as downloaded big log files where only a small percentage of the file has changed, and the changes are all at the end of the file, and are all appends.
You can also consider shelling out to the command line. There are many tools that can help you with this. A good general-purpose one is "curl", which supports simple ranges for FTP files as well, for example you can get the first 100 bytes of a document using FTP like this:
curl -r 0-99 ftp://www.get.this/README
Are you open to other protocols besides FTP? Take a look at the "rsync" command, which is excellent for download synchronization. The rsync command has many optimizations to transfer just the changed data. For example rsync can sync a remote directory to a local directory like this:
rsync -auvC me#my.com:/remote/foo/ /local/foo/

Take a look at Curb. It's a wrapper around Curl, and can do multiple connections in parallel.
This is a modified version of one of their examples:
require 'curb'
urls = %w[
http://ftp.ruby-lang.org/pub/ruby/1.9/ruby-1.9.3-p286.tar.bz2
http://www.python.org/ftp/python/2.7.3/Python-2.7.3.tar.bz2
]
responses = {}
m = Curl::Multi.new
# add a few easy handles
urls.each do |url|
responses[url] = Curl::Easy.new(url)
puts "Queuing #{ url }..."
m.add(responses[url])
end
spinner_counter = 0
spinner = %w[ | / - \ ]
m.perform do
print 'Performing downloads ', spinner[spinner_counter], "\r"
spinner_counter = (spinner_counter + 1) % spinner.size
end
puts
urls.each do |url|
print "[#{ url } #{ responses[url].total_time } seconds] Saving #{ responses[url].body_str.size } bytes..."
File.open(File.basename(url), 'wb') { |fo| fo.write(responses[url].body_str) }
puts 'done.'
end
That'll pull in both the Ruby and Python source (which are pretty big so they'll take about a minute, depending on your internet connection and host). You won't see any files appear until the last block, where they get written out.

Related

reading large csv files in a rails app takes up a lot of memory - Strategy to reduce memory consumption?

I have a rails app which allows users to upload csv files and schedule the reading of multiple csv files with help of delayed_job gem. The problem is the app reads each file in its entirity into memory and then writes to the database. If its just 1 file being read its fine, but when multiple files are read the RAM on the server gets full and causes the app to hang.
I am trying to find a solution for this problem.
One solution I researched is to break the csv file into smaller parts and save them on the server, and read the smaller files. see this link
example: split -b 40k myfile segment
Not my preferred solution. Are there any other approaches to solve this where I dont have to break the file. Solutions must be ruby code.
Thanks,
You can make use of CSV.foreach to read just chunks of your CSV file:
path = Rails.root.join('data/uploads/.../upload.csv') # or, whatever
CSV.foreach(path) do |row|
# process row[i] here
end
If it's run in a background job, you could additionally call GC.start every n rows.
How it works
CSV.foreach operates on an IO stream, as you can see here:
def IO.foreach(path, options = Hash.new, &block)
# ...
open(path, options) do |csv|
csv.each(&block)
end
end
The csv.each part is a call to IO#each, which reads the file line by line (rb_io_getline_1 invokation) and leaves the line read to be garbage collected:
static VALUE
rb_io_each_line(int argc, VALUE *argv, VALUE io)
{
// ...
while (!NIL_P(str = rb_io_getline_1(rs, limit, io))) {
rb_yield(str);
}
// ...
}

Racket: How to retrieve the path of the running file?

I need a way to get the path of the running script (the directory that contains the source file), but
(current-directory)
never points there (in this case an external drive), but rather to some predefined location.
I created a file to try all the 'find-system-path's, but none of them are the running file! The Racket docs are not helping.
#lang web-server/insta
(define (start request)
(local [{define (build-ul items)
`(ul ,#(map itemize items))}
{define (itemize item)
`(li ,(some-system-path->string (find-system-path item)))}]
(response/xexpr
`(html
(head (title "Directories"))
(body (h1 ,"Some Paths")
(p ,(build-ul special-paths)))))))
(define special-paths (list 'home-dir
'pref-dir
'pref-file
'temp-dir
'init-dir
'init-file
;'links-file ; not available for Linux
'addon-dir
'doc-dir
'desk-dir
'sys-dir
'exec-file
'run-file
'collects-dir
'orig-dir))
The purpose is for a local web-server application (music server) that will modify sub-directories under the directory that contains the source file. I will be carrying the app on a USB stick, so it needs to be able to locate its own directory as I carry it between machines and operating systems with Racket installed.
Easy way: take the running script name, make it into a complete path,then take its directory:
(path-only (path->complete-path (find-system-path 'run-file)))
But you're more likely interested not in the file that was used to execute things (the web server), but in the actual source file that you're putting your code in. Ie, you want some resources to be close to your source. An older way of doing this is:
(require mzlib/etc)
(this-expression-source-directory)
A better way of doing this is to use `runtime-path', which is a way to define such resources:
(require racket/runtime-path)
(define-runtime-path my-picture "pic.png")
This is better since it also registers the path as something that your script depends on -- so if you were to package your code as an installer, for example, Racket would know to package up that png file too.
And finally, you can use it to point at a whole directory:
(define-runtime-path HERE ".")
... (build-path HERE "pic.png") ...
If you want the absolute path, then I think this should do it:
(build-path (find-system-path 'orig-dir)
(find-system-path 'run-file))

Downloading a YouTube video through Wget

I am trying to download YouTube videos through Wget. The first thing necessary is to capture the URL of the actual video resource. Suppose I want to download this video: video. Opening up the page in the Firebug console reveals something like this:
The link which I have encircled looks like the link to the resource, for there we see only the video: http://www.youtube.com/v/r-KBncrOggI?version=3&autohide=1. However, when I am trying to download this resource with Wget, a 4 KB file of name r-KBncrOggI#version=3&autohide=1 gets stored in my hard-drive, nothing else. What should I do to get the actual video?
And secondly, is there a way to capture different resources for videos of different resolutions, like 360px, 480px, etc.?
Here is one VERY simplified, yet functional version of the youtube-download utility I cited on my another answer:
#!/usr/bin/env perl
use strict;
use warnings;
# CPAN modules we depend on
use JSON::XS;
use LWP::UserAgent;
use URI::Escape;
# Initialize the User Agent
# YouTube servers are weird, so *don't* parse headers!
my $ua = LWP::UserAgent->new(parse_head => 0);
# fetch video page or abort
my $res = $ua->get($ARGV[0]);
die "bad HTTP response" unless $res->is_success;
# scrape video metadata
if ($res->content =~ /\byt\.playerConfig\s*=\s*({.+?});/sx) {
# parse as JSON or abort
my $json = eval { decode_json $1 };
die "bad JSON: $1" if $#;
# inside the JSON 'args' property, there's an encoded
# url_encoded_fmt_stream_map property which points
# to stream URLs and signatures
while ($json->{args}{url_encoded_fmt_stream_map} =~ /\burl=(http.+?)&sig=([0-9A-F\.]+)/gx) {
# decode URL and attach signature
my $url = uri_unescape($1) . "&signature=$2";
print $url, "\n";
}
}
Usage example (it returns several URLs to streams with different encoding/quality):
$ perl youtube.pl http://www.youtube.com/watch?v=r-KBncrOggI | head -n 1
http://r19---sn-bg07sner.c.youtube.com/videoplayback?fexp=923014%2C916623%2C920704%2C912806%2C922403%2C922405%2C929901%2C913605%2C925710%2C929104%2C929110%2C908493%2C920201%2C913302%2C919009%2C911116%2C926403%2C910221%2C901451&ms=au&mv=m&mt=1357996514&cp=U0hUTVBNUF9FUUNONF9IR1RCOk01RjRyaG4wTHdQ&id=afe2819dcace8202&ratebypass=yes&key=yt1&newshard=yes&expire=1358022107&ip=201.52.68.216&ipbits=8&upn=m-kyX9-4Tgc&sparams=cp%2Cid%2Cip%2Cipbits%2Citag%2Cratebypass%2Csource%2Cupn%2Cexpire&itag=44&sver=3&source=youtube,quality=large&signature=A1E7E91DD087067ED59101EF2AE421A3503C7FED.87CBE6AE7FB8D9E2B67FEFA9449D0FA769AEA739
I'm afraid it's not that easy do get the right link for the video resource.
The link you got, http://www.youtube.com/v/r-KBncrOggI?version=3&autohide=1, points to the player rather than the video itself. There is one Perl utility, youtube-download, which is well-maintained and does the trick. This is how to get the HQ version (magic fmt=18) of that video:
stas#Stanislaws-MacBook-Pro:~$ youtube-download -o "{title}.{suffix}" --fmt 18 r-KBncrOggI
--> Working on r-KBncrOggI
Downloading `Sourav Ganguly in Farhan Akhtar's Show - Oye! It's Friday!.mp4`
75161060/75161060 (100.00%)
Download successful!
stas#Stanislaws-MacBook-Pro:~$
There might be better command-line YouTube Downloaders around. But sorry, one doesn't simply download a video using Firebug and wget any more :(
The only way I know to capture that URL manually is by watching the active downloads of the browser:
That largest data chunks are video data, so you can copy its URL:
http://s.youtube.com/s?lact=111116&uga=m30&volume=4.513679238953965&sd=BBE62AA4AHH1357937949850490&rendering=accelerated&fs=0&decoding=software&nsivbblmax=679542.000&hcbt=105.345&sendtmp=1&fmt=35&w=640&vtmp=1&referrer=None&hl=en_US&nsivbblmin=486355.000&nsivbblmean=603805.166&md=1&plid=AATTCZEEeM825vCx&ns=yt&ptk=youtube_none&csipt=watch7&rt=110.904&tsphab=1&nsiabblmax=129097.000&tspne=0&tpmt=110&nsiabblmin=123113.000&tspfdt=436&hbd=30900552&et=110.146&hbt=30.770&st=70.213&cfps=25&cr=BR&h=480&screenw=1440&nsiabblmean=125949.872&cpn=JlqV9j_oE1jzk7Zc&nsivbblc=343&nsiabblc=343&docid=r-KBncrOggI&len=1302.676&screenh=900&abd=1&pixel_ratio=1&bc=26131333&playerw=854&idpj=0&hcbd=25408143&playerh=510&ldpj=0&fexp=920704,919009,922403,916709,912806,929110,928008,920201,901451,909708,913605,925710,916623,929104,913302,910221,911116,914093,922405,929901&scoville=1&el=detailpage&bd=6676317&nsidf=1&vid=Yfg8gnutZoTD4G5SVKCxpsPvirbqG7pvR&bt=40.333&mos=0&vq=auto
However, for a large video, this will only return a part of the stream unless you figure out the URL query parameter responsible for stream range to be downloaded and adjust it.
A bonus: everything changes periodically as YouTube is constantly evolving. So, don't do that manually unless you carve pain.

Tracking Upload Progress of File to S3 Using Ruby aws-sdk

Firstly, I am aware that there are quite a few questions that are similar to this one in SO. I have read most, if not all of them, over the past week. But I still can't make this work for me.
I am developing a Ruby on Rails app that allows users to upload mp3 files to Amazon S3. The upload itself works perfectly, but a progress bar would greatly improve user experience on the website.
I am using the aws-sdk gem which is the official one from Amazon. I have looked everywhere in its documentation for callbacks during the upload process, but I couldn't find anything.
The files are uploaded one at a time directly to S3 so it doesn't need to load it into memory. No multiple file upload necessary either.
I figured that I may need to use JQuery to make this work and I am fine with that.
I found this that looked very promising: https://github.com/blueimp/jQuery-File-Upload
And I even tried following the example here: https://github.com/ncri/s3_uploader_example
But I just could not make it work for me.
The documentation for aws-sdk also BRIEFLY describes streaming uploads with a block:
obj.write do |buffer, bytes|
# writing fewer than the requested number of bytes to the buffer
# will cause write to stop yielding to the block
end
But this is barely helpful. How does one "write to the buffer"? I tried a few intuitive options that would always result in timeouts. And how would I even update the browser based on the buffering?
Is there a better or simpler solution to this?
Thank you in advance.
I would appreciate any help on this subject.
The "buffer" object yielded when passing a block to #write is an instance of StringIO. You can write to the buffer using #write or #<<. Here is an example that uses the block form to upload a file.
file = File.open('/path/to/file', 'r')
obj = s3.buckets['my-bucket'].objects['object-key']
obj.write(:content_length => file.size) do |buffer, bytes|
buffer.write(file.read(bytes))
# you could do some interesting things here to track progress
end
file.close
After read the source code of the AWS gem, I've adapted (or mostly copy) the multipart upload method to yield the current progress based on how many chunks have been uploaded
s3 = AWS::S3.new.buckets['your_bucket']
file = File.open(filepath, 'r', encoding: 'BINARY')
file_to_upload = "#{s3_dir}/#{filename}"
upload_progress = 0
opts = {
content_type: mime_type,
cache_control: 'max-age=31536000',
estimated_content_length: file.size,
}
part_size = self.compute_part_size(opts)
parts_number = (file.size.to_f / part_size).ceil.to_i
obj = s3.objects[file_to_upload]
begin
obj.multipart_upload(opts) do |upload|
until file.eof? do
break if (abort_upload = upload.aborted?)
upload.add_part(file.read(part_size))
upload_progress += 1.0/parts_number
# Yields the Float progress and the String filepath from the
# current file that's being uploaded
yield(upload_progress, upload) if block_given?
end
end
end
The compute_part_size method is defined here and I've modified it to this:
def compute_part_size options
max_parts = 10000
min_size = 5242880 #5 MB
estimated_size = options[:estimated_content_length]
[(estimated_size.to_f / max_parts).ceil, min_size].max.to_i
end
This code was tested on Ruby 2.0.0p0

Load or Stress Testing Tool with URL Import Functionality

Can someone recommend a load testing tool which allows you to either:
a. replay an IIS (7) log(s) to simulate a real live site daily run;
b. import a CSV or equivalent list of URLS so we can achieve a similar thing as above but at a URL level;
c. .net API so I can create simple tests easily from my list of URLS is also a good way to go.
I do not really want to record my tests.
I think I can do B) with WAPT but need to create an XML file manually, not too much grief, but wondering if any tools cover these scenarios out the box.
Visual Studio Test Edition would require some code to parse the file into a suitable test run.
It is a great load testing solution.
Our load testing service lets you write a very simple script using JavaScript to pull data out of a CSV file and then fetch those URLs. For example, the following code would pluck 10 random URLs from the CSV file and fetch them as part of a single session:
var c = browserMob.openHttpClient();
var csv = browserMob.getCSV("urls.csv");
browserMob.beginTransaction();
for (var i = 0; i < 10; i++) {
browserMob.beginStep("Step 1");
var url = csv.random().get("url");
c.get(url);
browserMob.endStep();
}
browserMob.endTransaction();
The CSV file itself needs to be a normal CSV file with the first row containing a header named "url". This script would be run repeatedly for each virtual user participating in a load test.
We have support for so called 'uri-format' in our open-source tool called Yandex.Tank You simply put all your uris to a file, one uri -- one line, then specify headers in your load.ini like this:
[phantom]
address=example.org
rps_schedule=line(1, 1600, 2m)
headers = [Host: mts-maps.yandex.ru]
[Connection: close] [Bloody: yes]
ammo_file = ammo.uri
ammo.uri:
/
/index.html
/1/example.html
/2/example.html

Resources