I've been working on a log viewer for a Rails app and have found that I need to read around 200 lines of a log file from bottom to top instead of the default top to bottom.
Log files can get quite large, so I've already tried and ruled out the IO.readlines("log_file.log")[-200..-1] method.
Are there any other ways to go about reading a file backwards in Ruby without the need for a plugin or gem?
The only correct way to do this that also works on enormous files is to read n bytes at a time from the end until you have the number of lines that you want. This is essentially how Unix tail works.
An example implementation of IO#tail(n), which returns the last n lines as an Array:
class IO
  TAIL_BUF_LENGTH = 1 << 16

  def tail(n)
    return [] if n < 1

    seek(-TAIL_BUF_LENGTH, SEEK_END)

    buf = ""
    while buf.count("\n") <= n
      buf = read(TAIL_BUF_LENGTH) + buf
      # Step back two buffer lengths: the read above left the file pointer
      # at the end of the chunk we just consumed.
      seek(2 * -TAIL_BUF_LENGTH, SEEK_CUR)
    end

    buf.split("\n")[-n..-1]
  end
end
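For the original question, usage is then a one-liner (the log path here is just an example):

last_lines = File.open("log/production.log") { |f| f.tail(200) }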
The implementation is a little naive, but a quick benchmark shows what a ridiculous difference this simple implementation can already make (tested with a ~25MB file generated with yes > yes.txt):
user system total real
f.readlines[-200..-1] 7.150000 1.150000 8.300000 ( 8.297671)
f.tail(200) 0.000000 0.000000 0.000000 ( 0.000367)
The benchmark code:
require "benchmark"

FILE = "yes.txt"

Benchmark.bmbm do |b|
  b.report "f.readlines[-200..-1]" do
    File.open(FILE) do |f|
      f.readlines[-200..-1]
    end
  end

  b.report "f.tail(200)" do
    File.open(FILE) do |f|
      f.tail(200)
    end
  end
end
Of course, other implementations already exist. I haven't tried any, so I cannot tell you which is best.
There's a module Elif available (a port of Perl's File::ReadBackwards) which does efficient line-by-line backwards reading of files.
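I haven't double-checked Elif's current API, so treat the exact calls below as an assumption, but as far as I remember it mirrors the familiar IO-style interface, just reading from the end of the file:

require "elif"

# Assumed interface: yields lines starting from the last line of the file.
Elif.foreach("log_file.log") do |line|
  puts line
end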
Since I'm too new to comment on molf's excellent answer, I have to post this as a separate answer.
I needed this feature to read log files while they're being written; the last portion of the log contains a string that tells me the job is done and I can start parsing it.
Hence handling small files is crucial for me (I might poll the log while it's still tiny).
So I enhanced molf's code:
class IO
  def tail(n)
    return [] if n < 1

    # Files smaller than one buffer can simply be read in full.
    if File.size(self) < (1 << 16)
      return self.readlines.last(n)
    end

    tail_buf_length = 1 << 16
    self.seek(-tail_buf_length, IO::SEEK_END)

    out = ""
    count = 0
    while count <= n
      buf = self.read(tail_buf_length)
      count += buf.count("\n")
      # Prepend, since we are walking backwards through the file.
      out = buf + out
      # Seek back two buffer lengths: the read above left the pointer
      # at the end of the chunk we just consumed.
      self.seek(2 * -tail_buf_length, IO::SEEK_CUR)
    end

    return out.split("\n")[-n..-1]
  end
end
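For the use case described above (polling a growing log until a completion marker shows up), usage could look roughly like this; the file name, marker string, and line count are made up for illustration:

# Poll the log's tail until the (hypothetical) completion marker appears.
loop do
  done = File.open("job.log") do |f|
    f.tail(50).any? { |line| line.include?("PROCESSING DONE") }
  end
  break if done
  sleep 1
end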
Related
I have a large text file of ~750,000 lines that is updated every few seconds, and I want to be able to monitor the number of lines in real time. I am able to do that, but at a very heavy cost in response time.
function GetFileSize( filename )
  local fp = io.open( filename )
  if fp == nil then
    return nil
  end

  file = {}
  for line in fp:lines() do
    if (file[line] ~= line) then
      table.insert(file, line)
    end
  end
  d(table.size(file))

  local filesize = fp:seek( "end" )
  fp:close()
  return filesize
end
I'm trying to get two things: the size (in bytes) and the number of lines.
However, filling the table with 750,000 lines over and over, constantly reading the file from top to bottom, takes quite a bit of processing time.
Is there a way to get both the file size in bytes and the number of lines without severely hindering my system?
Pretty much I'm guessing I have to create a permanent table outside of the function, read the file, and add the lines to that table. However, I'm not sure how to stop it from duplicating itself every few seconds.
Should I just abandon the line count and stick with the byte size, since that doesn't slow me down at all? Or is there an efficient way to get both?
Thanks!
Try reading the whole file at once and counting the number of lines with gsub. You'll have to test whether this is fast enough for you.
t = f:read("*a")
_,n = t:gsub("\n","")
To get the file size in bytes use Lua Filesystem. For the number of lines you might want to use the io.lines iterator. For better performance of the latter there is a trick described in »Programming in Lua«.
local file = arg[0] -- just use the source file for demo
-- Get the file size
local lfs = assert(require"lfs")
local attr = lfs.attributes(file)
print(attr.size)
-- Get number of lines
local count = 0
for line in io.lines(file) do
  count = count + 1
end
print(count)
I can suggest this solution, which does not require reading the whole large file.
local function char_count(str, ch)
  local n, p = 0
  while true do
    p = string.find(str, ch, p, true)
    if not p then break end
    n, p = n + 1, p + 1
  end
  return n
end

local function file_info(name, chunk_size)
  chunk_size = chunk_size or 4096
  local f, err, no = io.open(name, 'rb')
  if not f then return nil, err, no end

  local lines, size = 0, 0
  while true do
    local chunk = f:read(chunk_size)
    if not chunk then break end
    lines = lines + char_count(chunk, '\n')
    size = size + #chunk
  end

  f:close()

  return size, lines
end
But if you just need to monitor one file and count the lines in it, you may simply use any file-monitoring solution. I use one based on libuv.
I have been reading data from a CSV file. To avoid the timeout (Rack's 12-second timeout) on a large CSV file, I read only 25 rows per request; after 25 rows the method returns and another request is made, and this continues until all the rows have been read.
require "csv"

def read_csv(offset)
  r_count = 1
  CSV.foreach(file.tempfile, options) do |row|
    if r_count > offset.to_i
      # process
    end
    r_count += 1
  end
end
But this creates a new issue. Say the first request reads 25 rows; when the next request comes in with an offset of 25, CSV.foreach still iterates over the first 25 rows before it starts processing from row 26. How can I skip the rows that have already been read? I tried next to skip the iteration, but that fails. Is there any other efficient way to do this?
Code
def read_csv(fileName)
  lines = `wc -l #{fileName}`.to_i + 1
  lines_processed = 0
  open(fileName) do |csv|
    csv.each_line do |line|
      # process
      lines_processed += 1
    end
  end
end
Pure Ruby - SLOWER
def read_csv(fileName)
  lines = open(fileName).count
  lines_processed = 0
  open(fileName) do |csv|
    csv.each_line do |line|
      # process
      lines_processed += 1
    end
  end
end
Benchmarks
I ran a new benchmark comparing the original method you provided and my own. I also included the test file information.
"File Information"
Lines: 1172319
Size: 126M
"django's original method"
Time: 18.58 secs
Memory: 0.45 MB
"OneNeptune's method"
Time: 0.58 secs
Memory: 2.18 MB
"Pure Ruby method"
Time: 0.96 secs
Memory: 2.06 MB
Explanation
NOTE: I added a pure Ruby method, since using wc is sort of cheating and not portable. In most cases it's important to use pure-language solutions.
You can use this method to process a very large CSV file.
~2 MB of memory feels pretty reasonable considering the file size; it's a bit of an increase in memory usage, but the time savings seem to be a fair trade, and this will prevent timeouts.
I did modify the method to take a fileName, but this was just because I was testing many different CSV files to make sure they all worked correctly. You can remove this if you'd like, but it'll likely be helpful.
I also removed the concept of an offset, since you stated you originally included it to try to optimize the parsing yourself, but this is no longer necessary.
Also, I keep track of how many lines are in the file and how many were processed, since you needed that information. Note that the wc -l line count only works on Unix-based systems; it's a trick to avoid loading the entire file into memory. It counts the newlines, and I add 1 to account for the last line. If you're not going to count the header as a line, though, you could remove the +1 and rename lines to "rows" to be more accurate.
Another logistical problem you may run into is figuring out how to handle a CSV file that has a header row.
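If the file does have a header row, one simple way to deal with it (a sketch; the file name and column names are placeholders, not taken from the question) is to let Ruby's CSV library consume the header for you:

require "csv"

# With headers: true, CSV skips the header line and yields CSV::Row objects,
# so fields can be read by column name instead of by position.
CSV.foreach("large.csv", headers: true) do |row|
  name  = row["name"]   # hypothetical column
  email = row["email"]  # hypothetical column
  # process name, email, ...
end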
You could use lazy reading to speed this up: the whole file wouldn't be read, only the portion from the beginning of the file up to the chunk you use.
See http://engineering.continuity.net/csv-tricks/ and https://reinteractive.com/posts/154-improving-csv-processing-code-with-laziness for examples.
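As a rough sketch of the lazy approach (not taken from either article; the file name, offset, and chunk size are placeholders): CSV.foreach without a block returns an Enumerator, which can be made lazy so that only the rows actually requested are read and parsed:

require "csv"

OFFSET = 25 # rows already handled by a previous request
CHUNK  = 25 # rows to handle in this request

# Only the first OFFSET + CHUNK rows of the file are read and parsed;
# everything after them is never touched.
rows = CSV.foreach("large.csv", headers: true).lazy
          .drop(OFFSET)
          .first(CHUNK)

rows.each { |row| puts row.inspect }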
You could also use SmarterCSV to work in chunks like this.
require 'smarter_csv'

SmarterCSV.process(file_path, { :chunk_size => 1000 }) do |chunk|
  chunk.each do |row|
    # Do your processing
  end
  do_something_else
end
The way I did this was by streaming the result to the user; if you can see what is happening, waiting doesn't bother you that much. The timeout you mention won't happen here.
I'm not a Rails user, so I'll give an example using Sinatra; this can be done with Rails as well. See e.g. http://api.rubyonrails.org/classes/ActionController/Streaming.html
require 'sinatra'

get '/' do
  stream :keep_open do |out|
    1.upto(100) do |line| # this would be your CSV file opened
      out << "processing line #{line}<br>"
      # process line
      sleep 1 # for simulating the delay
    end
  end
end
A still better but somewhat more complicated solution would be to use WebSockets: the browser would receive the results from the server once the processing is finished. You will need some JavaScript on the client side as well to handle this. See https://github.com/websocket-rails/websocket-rails
I am trying to solve a HackerEarth problem using Ruby
The problem is provided in the following link:
https://www.hackerearth.com/problem/algorithm/find-product/
My solution for the problem is here :
n = gets.chomp.to_i
a = Array.new
if n <= 1000
  n.times do
    a << gets.chomp.to_i
  end
end

a.each { |m| print m.to_s + " " }
print "\n"

answer = 1
a.each do |m|
  answer = ( answer * m ) % ( (10**9) + 7 )
end
puts "#{answer}"
The code throws a runtime error with a non-zero exit code (NZEC). I am not able to understand the concept of NZEC or what particularly is wrong in this code. Can someone please help me understand NZEC and find a workaround for it?
The NZEC error appears because you read the problem statement a bit too quickly.
The first line of input contains a single integer n, and the second line contains all n elements separated by spaces.
When I run your script, it expects me to press Enter between each entry of the array, so when you test your code on HackerEarth I presume execution fails because no further input arrives after the second line.
There is also a similar problem with your output: you print the full array before displaying the answer, while the problem definition specifies that only the answer should be displayed.
One possible solution could be the following:
n = gets.chomp.to_i
a = gets.chomp.split.map(&:to_i)

answer = 1
a.each do |m|
  answer = (answer * m) % ((10**9) + 7)
end

puts answer
I have a growing file (log) that I need to read by blocks.
I make an Ajax call to get a specified number of lines.
I used File.foreach to read the lines I want, but it always reads from the beginning, and I need to return only the lines I want, directly.
Example Pseudocode:
#First call:
File.open and return 0 to 10 lines
#Second call:
File.open and return 11 to 20 lines
#Third call:
File.open and return 21 to 30 lines
#And so on...
Is there any way to do this?
Solution 1: Reading the whole file
The proposed solution here:
https://stackoverflow.com/a/5052929/1433751
...is not an efficient solution in your case, because it requires you to read all the lines from the file for each AJAX request, even if you just need the last 10 lines of the logfile.
That's an enormous waste of time: the total work grows quadratically with the number of requests, because every request re-reads everything that comes before the block it actually needs.
Solution 2: Seeking
Since your AJAX calls request sequential lines we can implement a much more efficient approach by seeking to the correct position before reading, using IO.seek and IO.pos.
This requires you to return some extra data (the last file position) back to the AJAX client at the same time you return the requested lines.
The AJAX request then becomes a function call of this form request_lines(position, line_count), which enables the server to IO.seek(position) before reading the requested count of lines.
Here's the pseudocode for the solution:
Client code:
LINE_COUNT = 10

pos = 0
loop {
  data = server.request_lines(pos, LINE_COUNT)
  display_lines(data.lines)
  pos = data.pos
  break if pos == -1 # Reached end of file
}
Server code:
def request_lines(pos, line_count)
  file = File.open('logfile')

  # Seek to requested position
  file.seek(pos)

  # Read the requested count of lines while checking for EOF
  lines = line_count.times.map { file.readline unless file.eof? }.compact

  # Mark pos with -1 if we reached EOF during reading
  pos = file.eof? ? -1 : file.pos
  file.close

  # Return data
  data = { lines: lines, pos: pos }
end
I want to find a string inside a bigger string, allowing up to some Levenshtein distance. I have written code for computing the distance between two strings, but I want an efficient implementation for finding a substring within a fixed Levenshtein distance.
module Levenshtein
  def self.distance(a, b)
    a, b = a.downcase, b.downcase
    costs = Array(0..b.length) # i == 0
    (1..a.length).each do |i|
      costs[0], nw = i, i - 1 # j == 0; nw is lev(i-1, j)
      (1..b.length).each do |j|
        costs[j], nw = [costs[j] + 1, costs[j-1] + 1, a[i-1] == b[j-1] ? nw : nw + 1].min, costs[j]
      end
    end
    costs[b.length]
  end

  def self.test
    %w{kitten sitting saturday sunday rosettacode raisethysword}.each_slice(2) do |a, b|
      puts "distance(#{a}, #{b}) = #{distance(a, b)}"
    end
  end
end
Have a look at the TRE library, which does exactly this (in C), and quite efficiently. Now look carefully at the matching function, which is basically 500 lines of unreadable (but necessary) code.
I'd say that, instead of rolling your own version, and provided you don't intend to read all the rather difficult papers on the subject (search for "approximate string matching") and don't have a few free months to study it, you'd be much better off writing a small wrapper around the library itself. Your Ruby version would be inefficient anyway in comparison with what can be obtained in C.
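That said, for reference, here is a minimal pure-Ruby sketch of the standard dynamic-programming variant for approximate substring matching (the first row of the Levenshtein matrix is initialized to zero, so a match may start at any position in the text). It will be far slower than TRE, and the method name and example strings are made up, but it shows the idea:

# Returns the (1-based) end positions in text where pattern matches
# within max_dist edits. The first DP row is all zeros: a free start anywhere.
def approx_substring_positions(pattern, text, max_dist)
  prev = Array.new(text.length + 1, 0)
  (1..pattern.length).each do |i|
    curr = [i] + Array.new(text.length, 0)
    (1..text.length).each do |j|
      cost = pattern[i - 1] == text[j - 1] ? 0 : 1
      curr[j] = [curr[j - 1] + 1, prev[j] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  (1..text.length).select { |j| prev[j] <= max_dist }
end

# "kitchen" is within 2 edits of "kitten", so a position inside
# "a kitchen sink" is reported.
p approx_substring_positions("kitten", "a kitchen sink", 2)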