How to parse ffmpeg progress real time in ruby - ruby-on-rails

So I was struggling with this problem for a long time last night and finally figured it out, so I wanted to post it here in case someone runs across the same issue.
The goal is to parse the output of FFmpeg while it runs in a Sidekiq worker and save the progress and duration to an ActiveRecord model, so I can get a progress bar in the UI by polling the database.
How do I parse FFmpeg duration and time in real time without waiting for the process to finish?

There are two problems:
How to parse output real-time in ruby?
Since FFmpeg writes progress output on one line, how do you get that output?
For the first one I looked at the ruby Open3 module and this article was particularly helpful. The gist is this:
require 'open3'

cmd = "bash my_long_running_command.sh"

Open3.popen3(cmd) do |stdin, stdout, stderr, thread|
  stdout.each do |line|
    # this prints in real time, as stdout is written to the stream
    puts line
  end
end
Now this works well for many commands, but not with FFmpeg. FFmpeg is weird in that it writes its progress and output to STDERR instead of STDOUT.
Not only that, but as I finally noticed, FFmpeg's line delimiter is not \n as in most bash commands but \r, which is how it is able to rewrite the progress line in place. So the only changes needed are the stream we read from and the delimiter, which can be overridden by passing it to each as the first argument.
require 'open3'

cmd = "ffmpeg -i input.mp4 ... output.mp4"

Open3.popen3(cmd) do |stdin, stdout, stderr, thread|
  duration = 0
  progress = 0

  # FFmpeg writes its progress to stderr, delimited by \r
  stderr.each("\r") do |line|
    next unless line.include?("time=") || line.include?("Duration:")

    # The Duration line appears once, near the beginning of the output
    if duration == 0 && line =~ /Duration:\s+(\d{2}):(\d{2}):(\d{2})\.(\d{1,2})/
      duration = $1.to_i * 3600 + $2.to_i * 60 + $3.to_i
    end

    # The progress lines contain time=HH:MM:SS
    percentage = if line =~ /time=(\d{2}):(\d{2}):(\d{2})/
      progress = $1.to_i * 3600 + $2.to_i * 60 + $3.to_i
      (progress.to_f / duration) * 100
    end

    puts "#{percentage}%"
  end
end
You can also use a gem I found that does something very similar, called streamio-ffmpeg, which I sadly discovered only after I had already implemented my solution. If you would like to do the same thing as above, just do:
require 'streamio-ffmpeg'

movie = FFMPEG::Movie.new('input.mp4')
duration = movie.duration # total length in seconds, handy for saving to the model

movie.transcode('output.mp4', options_as_a_string) do |progress|
  # streamio-ffmpeg yields progress as a fraction between 0.0 and 1.0,
  # so there is no need to divide by the duration here
  percentage = progress * 100
  puts "#{percentage}%"
end

Related

Rufus Scheduler vs Loop with Sleep

I need to run an ActiveJob in my Rails 5 application as often as possible.
Up until now, I was using a cron job that was defined using the whenever gem (used Crono in the past).
This approach boots up the entire Rails app before it does its thing, then shuts down and does it all over again. I wish to avoid that and instead have a persistent "service" that does the job.
I bumped into the Rufus Scheduler gem which seems nice, but then I started wondering if I need it at all.
So, my question is:
Are there any meaningful differences between these two options below:
# With Rufus
scheduler = Rufus::Scheduler.new
scheduler.every('1s') { puts 'hello' }
scheduler.join
# Without Rufus
loop { puts "hello" ; sleep 1 }
Note that either of these scripts will be executed with rails runner my_scheduler.rb as a docker container, and any solution must ensure the job runs once only (never two in parallel).
There are differences.
Rufus-scheduler every will try to run every 1s, so at t0 + 1, t0 + 2, t0 + 3, etc., while the "sleep" variant will run at t0, t0 + wt1 + 1, t0 + wt1 + 1 + wt2 + 1, ... (where wtN is the work time of the Nth call of puts hello).
Rufus-scheduler every uses rufus-scheduler work threads, so if a run overlaps the next scheduled time (its work time exceeds the interval), the next run still happens, in a separate work thread. This behaviour can be changed (see job options).
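For example, a minimal sketch of that option (assuming rufus-scheduler 3.x, where the schedule methods accept an overlap job option):
require 'rufus-scheduler'

scheduler = Rufus::Scheduler.new

# With overlap: false, a run is skipped while the previous one is still
# busy, instead of being started in parallel on another work thread.
scheduler.every('1s', overlap: false) do
  puts 'hello'
end

scheduler.join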
Rufus-scheduler interval is closer to your sleep option:
scheduler.interval('1s') { puts 'hello' }
It places 1 second (or whatever time interval you want) between each run of puts 'hello'.
For a simple do-sleep-loop, I'd go with the sleep option. But if you need more, rufus-scheduler has cron, interval, every and tuning options for them. And you can schedule multiple jobs.
scheduler.cron('0 1 1 * *') do
  # every first of the month at 1am
  flush_archive
end

scheduler.cron('0 8,13,19 * * mon-fri') do
  # every weekday at 8am, 1pm, 7pm
  clean_dishes
end

scheduler.interval('1s') do
  puts 'hello'
end

Read data from csv file with foreach function

I have been reading data from a CSV file. If the file is large, to avoid a timeout (Rack's 12-second timeout) I read only 25 rows per request; after 25 rows the method returns, the client makes another request, and this continues until all the rows have been read.
require 'csv'

def read_csv(offset)
  r_count = 1
  CSV.foreach(file.tempfile, options) do |row|
    if r_count > offset.to_i
      # process
    end
    r_count += 1
  end
end
But this creates a new issue. Say the first request reads 25 rows; when the next request comes with an offset of 25, it still iterates over the first 25 rows before it starts processing from row 26. How can I skip the rows that have already been read? I tried next to skip the iteration but that failed. Is there any other, more efficient way to do this?
Code
def read_csv(fileName)
  lines = `wc -l #{fileName}`.to_i + 1
  lines_processed = 0

  open(fileName) do |csv|
    csv.each_line do |line|
      # process
      lines_processed += 1
    end
  end
end
Pure Ruby - SLOWER
def read_csv(fileName)
  lines = open(fileName).count
  lines_processed = 0

  open(fileName) do |csv|
    csv.each_line do |line|
      # process
      lines_processed += 1
    end
  end
end
Benchmarks
I ran a new benchmark comparing your original method provided and my own. I also included the test file information.
"File Information"
Lines: 1172319
Size: 126M
"django's original method"
Time: 18.58 secs
Memory: 0.45 MB
"OneNeptune's method"
Time: 0.58 secs
Memory: 2.18 MB
"Pure Ruby method"
Time: 0.96 secs
Memory: 2.06 MB
Explanation
NOTE: I added a pure Ruby method, since using wc is sort of cheating and not portable. In most cases it's important to use pure-language solutions.
You can use this method to process a very large CSV file.
~2 MB of memory feels pretty optimal considering the file size; it's a bit of an increase in memory usage, but the time savings seem a fair trade, and this will prevent timeouts.
I did modify the method to take a fileName, but this was just because I was testing many different CSV files to make sure they all worked correctly. You can remove this if you'd like, but it'll likely be helpful.
I also removed the concept of an offset, since you stated you originally included it to try to optimize the parsing yourself, but this is no longer necessary.
Also, I keep track of how many lines are in the file and how many were processed, since you needed that information. Note that the lines count only works on Unix-based systems; it's a trick to avoid loading the entire file into memory. It counts the newlines, and I add 1 to account for the last line. If you're not going to count the header as a line, though, you could remove the +1 and rename lines to "rows" to be more accurate.
Another logistical problem you may run into is figuring out how to handle it if the CSV file has headers.
You could use lazy reading to speed this up: the whole file wouldn't be read, just the part from the beginning up to the chunk you use.
See http://engineering.continuity.net/csv-tricks/ and https://reinteractive.com/posts/154-improving-csv-processing-code-with-laziness for examples.
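A minimal sketch of that lazy approach, keeping the offset idea from the question (path, offset, chunk_size, and large.csv are placeholder names; headers: true is an assumption about the file):
require 'csv'

# Lazily skip `offset` rows and materialize only the next `chunk_size` rows;
# rows beyond the requested chunk are never parsed.
def read_chunk(path, offset, chunk_size = 25)
  CSV.foreach(path, headers: true)
     .lazy
     .drop(offset)
     .first(chunk_size)
end

rows = read_chunk('large.csv', 25)
rows.each do |row|
  # process row
end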
You could also use SmarterCSV to work in chunks like this.
require 'smarter_csv'

SmarterCSV.process(file_path, chunk_size: 1000) do |chunk|
  chunk.each do |row|
    # Do your processing
  end
  do_something_else
end
The way I did this was by streaming the result to the user; if you can see what is happening, the wait doesn't bother you as much, and the timeout you mention won't happen here.
I'm not a Rails user, so I'll give an example in Sinatra; this can be done with Rails as well. See e.g. http://api.rubyonrails.org/classes/ActionController/Streaming.html
require 'sinatra'

get '/' do
  stream :keep_open do |out|
    1.upto(100) do |line| # this would be your CSV file being read line by line
      out << "processing line #{line}<br>"
      # process line
      sleep 1 # simulates the processing delay
    end
  end
end
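For the Rails side linked above, a rough counterpart sketch using ActionController::Live (CsvController and the large.csv path are placeholders, not from the original answer):
require 'csv'

# Streams a progress line to the client as each CSV row is processed.
class CsvController < ApplicationController
  include ActionController::Live

  def show
    response.headers['Content-Type'] = 'text/plain'

    CSV.foreach('large.csv').with_index(1) do |row, i|
      # process the row here
      response.stream.write("processing line #{i}\n") # progress trickles to the client
    end
  ensure
    response.stream.close
  end
end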
A still better but somewhat more complicated solution would be to use websockets: the browser would receive the results from the server once the processing is finished. You will also need some JavaScript on the client to handle this. See https://github.com/websocket-rails/websocket-rails

How to read a file block in Rails without reading it again from the beginning

I have a growing file (log) that I need to read by blocks.
I make a call by Ajax to get a specified number of lines.
I used File.foreach to read the lines I want, but it always reads from the beginning, and I need to return only the lines I want, directly.
Example Pseudocode:
#First call:
File.open and return 0 to 10 lines
#Second call:
File.open and return 11 to 20 lines
#Third call:
File.open and return 21 to 30 lines
#And so on...
Is there any way to do this?
Solution 1: Reading the whole file
The proposed solution here:
https://stackoverflow.com/a/5052929/1433751
...is not an efficient solution in your case, because it requires you to read all the lines from the file for each AJAX request, even if you just need the last 10 lines of the logfile.
That's an enormous waste of time; in computing terms, processing the whole logfile in blocks of size N this way takes total time that grows quadratically with the length of the file, since every request re-reads everything before the block it needs.
Solution 2: Seeking
Since your AJAX calls request sequential lines we can implement a much more efficient approach by seeking to the correct position before reading, using IO.seek and IO.pos.
This requires you to return some extra data (the last file position) back to the AJAX client at the same time you return the requested lines.
The AJAX request then becomes a function call of this form request_lines(position, line_count), which enables the server to IO.seek(position) before reading the requested count of lines.
Here's the pseudocode for the solution:
Client code:
LINE_COUNT = 10
pos = 0

loop {
  data = server.request_lines(pos, LINE_COUNT)
  display_lines(data.lines)
  pos = data.pos
  break if pos == -1 # Reached end of file
}
Server code:
def request_lines(pos, line_count)
  file = File.open('logfile')

  # Seek to the requested position
  file.seek(pos)

  # Read the requested count of lines while checking for EOF
  lines = line_count.times.map { file.readline unless file.eof? }.compact

  # Mark pos with -1 if we reached EOF during reading
  pos = file.eof? ? -1 : file.pos
  file.close

  # Return the lines along with the new position
  { lines: lines, pos: pos }
end

End of File detection in ESPER

I am using ESPER to read events from a CSV file.
How can I make the query output something when reading of the CSV file has finished?
For example, I want to output every 30 minutes or at the end of the file:
SELECT id FROM stream output every 30 min or [ EOF reached ]
Thanks in advance
Regards
The "adapter.start()" finishes when the CSV file is done and the code can send a EOF event into the engine. You could declare a context that ends on that EOF event and there is a "output every 30 minutes and when terminated" option.

How to read a file from bottom to top in Ruby?

I've been working on a log viewer for a Rails app and have found that I need to read around 200 lines of a log file from bottom to top instead of the default top to bottom.
Log files can get quite large, so I've already tried and ruled out the IO.readlines("log_file.log")[-200..-1] method.
Are there any other ways to go about reading a file backwards in Ruby without the need for a plugin or gem?
The only correct way to do this that also works on enormous files is to read n bytes at a time from the end until you have the number of lines that you want. This is essentially how Unix tail works.
An example implementation of IO#tail(n), which returns the last n lines as an Array:
class IO
  TAIL_BUF_LENGTH = 1 << 16

  def tail(n)
    return [] if n < 1

    seek -TAIL_BUF_LENGTH, SEEK_END

    buf = ""
    while buf.count("\n") <= n
      buf = read(TAIL_BUF_LENGTH) + buf
      seek 2 * -TAIL_BUF_LENGTH, SEEK_CUR
    end

    buf.split("\n")[-n..-1]
  end
end
The implementation is a little naive, but a quick benchmark shows what a ridiculous difference this simple implementation can already make (tested with a ~25MB file generated with yes > yes.txt):
                          user     system      total        real
f.readlines[-200..-1] 7.150000   1.150000   8.300000 (  8.297671)
f.tail(200)           0.000000   0.000000   0.000000 (  0.000367)
The benchmark code:
require "benchmark"
FILE = "yes.txt"
Benchmark.bmbm do |b|
b.report "f.readlines[-200..-1]" do
File.open(FILE) do |f|
f.readlines[-200..-1]
end
end
b.report "f.tail(200)" do
File.open(FILE) do |f|
f.tail(200)
end
end
end
Of course, other implementations already exist. I haven't tried any, so I cannot tell you which is best.
There's a module Elif available (a port of Perl's File::ReadBackwards) which does efficient line-by-line backwards reading of files.
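For example, a small sketch (assuming Elif.foreach yields lines from the last to the first, as the gem's synopsis describes; log_file.log is a placeholder path):
require 'elif'

# Collect the last 200 lines (newest first) without reading the whole file
last_lines = []
Elif.foreach("log_file.log") do |line|
  last_lines << line
  break if last_lines.size >= 200
end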
Since I'm too new to comment on molf's awesome answer, I have to post it as a separate answer.
I needed this feature to read log files while they're being written; the last portion of the log contains a string that tells me it's done and I can start parsing it.
Hence handling small files is crucial for me (I might ping the log while it's still tiny).
So I enhanced molf's code:
class IO
  def tail(n)
    return [] if n < 1

    if File.size(self) < (1 << 16)
      tail_buf_length = File.size(self)
      return self.readlines.reverse[0..n - 1]
    else
      tail_buf_length = 1 << 16
    end

    self.seek(-tail_buf_length, IO::SEEK_END)

    out = ""
    count = 0
    while count <= n
      buf = self.read(tail_buf_length)
      count += buf.count("\n")
      # prepend, since we are reading chunks from the end towards the start
      out = buf + out
      # 2 *, since the pointer is at the end of the previous read
      self.seek(2 * -tail_buf_length, IO::SEEK_CUR)
    end

    out.split("\n")[-n..-1]
  end
end
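Usage is the same as with the original version; for example, to grab the last 10 lines of a (hypothetical) production.log:
File.open("production.log") do |log|
  puts log.tail(10)
end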
