TarWriter throws Gem::Package::TarWriter::FileOverflow

I want to generate a tar from a bunch of files.
out_file = File.new('some.tar', 'w')
tar = Gem::Package::TarWriter.new out_file
attachments = #Array of attachment objects
attachments.each{|a|
  file = Attachment.new(a).read_file #returns a String
  file.force_encoding('UTF-8')
  tar.add_file_simple( a[:filename], 777, file.length) do |io|
    io.write(file)
  end
}
Gem::Package::TarWriter::FileOverflow - You tried to feed more data
than fits in the file.
Has anyone an idea why this happens and how to fix it?

String#length returns the number of characters in the String. Since a UTF-8 character can take more than one byte, the bytesize of a string can be larger than its length.
The TarWriter expects the file size to be given in bytes. Thus, if your file contains anything other than plain ASCII characters, the data will overflow the declared size.
To solve this, pass file.bytesize to the add_file_simple method instead of file.length.
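A minimal sketch of the difference, assuming a small in-memory string in place of the attachment data (content, archive and the file name note.txt are made up for illustration):

```ruby
require 'rubygems/package'
require 'stringio'

# Hypothetical file content with multi-byte UTF-8 characters
content = "näkymä m³\n"

char_count = content.length    # number of characters
byte_count = content.bytesize  # number of bytes, which is what the tar entry needs

# Declaring the size in characters is too small, so the write overflows
overflowed = false
begin
  Gem::Package::TarWriter.new(StringIO.new(String.new)) do |tar|
    tar.add_file_simple('note.txt', 0644, char_count) { |io| io.write(content) }
  end
rescue Gem::Package::TarWriter::FileOverflow
  overflowed = true
end

# Declaring the size in bytes works
archive = StringIO.new(String.new)
Gem::Package::TarWriter.new(archive) do |tar|
  tar.add_file_simple('note.txt', 0644, byte_count) { |io| io.write(content) }
end
```

The mode here is written as an octal literal (0644). The decimal 777 in the question is a different value (octal 1411), so 0777 is probably what was intended there.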

Related

Problem with attachments' character encoding using gmail gem in ruby/rails

What I am doing:
I am using the gmail gem in a Rails 4 app to get email attachments from a specific account at regular intervals. Here is an extract from the core part (here for simplicity only considering the first email and its first attachment):
require 'gmail'
Gmail.connect(#user_email,#user_password) do |gmail|
  if gmail.logged_in?
    emails = gmail.inbox.emails(:from => #sender_email)
    email = emails[0]
    attachment = email.message.attachments[0]
    File.open("~/temp.csv", 'w') do |file|
      file.write(
        StringIO.new(attachment.decoded.to_s[2..-2].force_encoding("ISO-8859-15").encode!('UTF-8')).read
      )
    end
  end
end
The encoding of the attached file can vary. The particular one that I am currently having issues with is in Finnish. It contains Finnish characters and a superscripted 3 character.
This is what I expect to get when I run the above code. (This is what I get when I download the attachment manually through gmail user interface):
What the problem is:
However, I am getting the following odd results.
From cat temp.csv (Looks good to me):
With nano temp.csv (Here I have no idea what I am looking at):
This is what temp.csv looks like opened in Sublime Text (directly via winscp). First line and small parts look ok but then Chinese/Japanese characters:
This is what temp.csv looks like in Notepad (after download via winscp). Looks ok except a blank space has been inserted between each character and the newlines seem to be missing:
What I have tried:
I have without success tried:
.force_encoding(...) with all the different "ISO-8859-x" character sets
putting the force_encoding("ISO-8859-15").encode!('UTF-8') outside the .read (works but doesn't solve the problem)
encode to UTF-8 without first forcing another encoding but this leads to Encoding::UndefinedConversionError: "\xC4" from ASCII-8BIT to UTF-8
writing as binary with 'wb' and 'w+b' in the File.open() (which oddly doesn't seem to make a difference to the outcome).
searching stackoverflow and the web for other ideas.
Any ideas would be much appreciated!
Not beautiful, but it will work for me now.
After re-encoding, I convert the string to a char array, then remove the chars I do not want and then join the remaining array elements to form a string.
decoded_att = attachment.decoded
data = decoded_att.encode("UTF-8", "ISO-8859-1", invalid: :replace, undef: :replace).gsub("\r\n", "\n")
data_as_array = data.chars
data_as_array = data_as_array.delete_if { |i| i == "\u0000" || i == "ÿ" || i == "þ" }
data = data_as_array.join
# File.write does not expand "~", so expand the path explicitly
File.write(File.expand_path("~/temp.csv"), data)
This will work for me now. However, I have no idea how these characters have ended up in the attachment ("ÿ" and "þ" in the start of the document and "\u0000" between all remaining characters).
It seems like you need to do attachment.body.decoded instead of attachment.decoded
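The symptoms described above ("ÿ" and "þ" at the start, "\u0000" between the remaining characters, a blank between each character in Notepad) are what UTF-16LE text with a byte order mark looks like when its bytes are read as ISO-8859-1: the mark is the bytes FF FE, and every ASCII character is followed by a zero byte. If that is what the attachment contains (an assumption, since the file itself is not shown), decoding it as UTF-16LE avoids the manual filtering. A sketch, with raw as a made-up stand-in for attachment.decoded and utf16le_to_utf8 as a hypothetical helper:

```ruby
# Made-up stand-in for attachment.decoded: UTF-16LE bytes with a BOM and CRLF
raw = "\uFEFFNäkymä;m³\r\n".encode('UTF-16LE').b

# Interpret the bytes as UTF-16LE, convert to UTF-8,
# drop the byte order mark and normalize the line endings
def utf16le_to_utf8(bytes)
  text = bytes.dup.force_encoding('UTF-16LE').encode('UTF-8')
  text.delete_prefix("\uFEFF").gsub("\r\n", "\n")
end

data = utf16le_to_utf8(raw)
```

If the attachment's charset varies, as the question says it can, the encoding would have to be detected first (for example from the first two bytes) rather than hard-coded.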

how to store string with hexadecimal character to environment variable and then retrieve in ruby?

How can I save a string with hexadecimal values to an environment variable and later retrieve it in Ruby?
Currently, when I retrieve it using ENV[], the backslashes come back escaped, so I get "\\x12\\x33". How can I make sure that retrieving the environment variable from Ruby returns the exact same string, "\x12\x33"?
Suppose I have a string with hexadecimal characters such as
s = "\x12\x33"
I appreciate any help! Thanks!
TEST='\x34\x33' ruby -e "
puts ENV['TEST'].split('\\x')[1..-1].map(&:hex).map(&:chr)"
#⇒ 4
# 3
OK, I solved this by first writing the binary data to a file and then reading it back.
Writing
data = "\x12\x33"
File.open("data.bz2", "wb") do |f|
f.write(data)
end
Reading
# The block form closes the file once the read is done
contents = File.open("data.bz2", "rb") { |file| file.read }
print contents
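If the variable has to keep the literal \xNN text, another option is to convert each escape back into its byte after reading it. A rough sketch, assuming every escape has exactly two hex digits; unescape_hex is a made-up helper name, not part of Ruby:

```ruby
# Turn literal "\xNN" sequences (backslash, x, two hex digits) into raw bytes
def unescape_hex(str)
  str.b.gsub(/\\x(\h{2})/) { Regexp.last_match(1).hex.chr }
end

# Single quotes keep the backslashes literal, as the shell would pass them in
ENV['TEST'] = '\x12\x33'
data = unescape_hex(ENV['TEST'])
```

The result is a binary (ASCII-8BIT) string, so data.bytes can be compared against the bytes of the original string.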

Create file Rails

I'm sending an .AAC file from an iPhone to an API in RoR. I read the file byte by byte on the iPhone, convert the byte[] to a Base64 String, send the string to the API, then decode the String back to a byte array and save that byte[] to a file.
The problem is that the file created on the server side is different from the one sent. Even though I checked that the byte[] is the same on the server side, when I save byte by byte I end up with a different file size and the file is unplayable.
This is the code I'm using
File.open("test.aac", 'wb' ) do |output|
  plain.each_byte do | byte |
    output.print byte
    puts byte
    i=i+1
  end
  puts "_______"
  puts i
  puts "_______"
end
I've literally tried everything but I have no idea why it doesn't work.
This is the code that receives
mail=params["mail"]
archivo=params["byte"]
puts mail
puts archivo
plain=Base64.decode64(archivo);
The variable plain has exactly the same bytes as the ones I sent from the iPhone.
This is in Xamarin:
byte[] info = File.ReadAllBytes (audioFilePath.Path.ToString ());
String bytesTo64 = Convert.ToBase64String (info);
Okay, so the problem was writing byte by byte. I changed the code following the approach in
Base64 encoded string to file(Ruby on Rails)
and it worked flawlessly. This is my final code.
File.open('Now.aac', 'wb') { |f|
f.write(Base64.decode64(archivo))
}
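The byte-by-byte version produced a different file because each_byte yields Integers, and print writes an Integer as its decimal digits rather than as a single raw byte. A small sketch of the difference, using a made-up four-byte payload and in-memory buffers instead of the real .aac file:

```ruby
require 'base64'
require 'stringio'

original = "\xFF\xF1\x50\x80".b               # made-up binary payload
archivo  = Base64.strict_encode64(original)   # what the client would send
plain    = Base64.decode64(archivo)

# print writes each Integer as text: "255", "241", "80", "128"
as_text = StringIO.new(String.new)
plain.each_byte { |byte| as_text.print byte }

# write copies the raw bytes unchanged
as_bytes = StringIO.new(String.new)
as_bytes.write(plain)
```

Here as_text ends up holding the 11 characters "25524180128", while as_bytes holds the original 4 bytes.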

Ruby create tar ball in chunks to avoid out of memory error

I'm trying to re-use the following code to create a tar ball:
tarfile = File.open("#{Pathname.new(path).realpath.to_s}.tar","w")
Gem::Package::TarWriter.new(tarfile) do |tar|
  Dir[File.join(path, "**/*")].each do |file|
    mode = File.stat(file).mode
    relative_file = file.sub /^#{Regexp::escape path}\/?/, ''
    if File.directory?(file)
      tar.mkdir relative_file, mode
    else
      tar.add_file relative_file, mode do |tf|
        File.open(file, "rb") { |f| tf.write f.read }
      end
    end
  end
end
tarfile.rewind
tarfile
It works fine as long as only small folders are involved, but anything large fails with the following error:
Error: Your application used more memory than the safety cap
How can I do it in chunks to avoid the memory problems?
It looks like the problem could be in this line:
File.open(file, "rb") { |f| tf.write f.read }
You are "slurping" your input file by doing f.read. Slurping means the entire file is read into memory at once, which doesn't scale to large files; it is what happens when you call read without a length.
Instead, I'd read and write the file in blocks so you have consistent memory usage. This reads in blocks of roughly 1 MB (1,024,000 bytes). You can adjust that for your own needs:
BLOCKSIZE_TO_READ = 1024 * 1000
File.open(file, "rb") do |fi|
  while buffer = fi.read(BLOCKSIZE_TO_READ)
    tf.write buffer
  end
end
Here's what the documentation says about read:
If length is a positive integer, it try to read length bytes without any conversion (binary mode). It returns nil or a string whose length is 1 to length bytes. nil means it met EOF at beginning. The 1 to length-1 bytes string means it met EOF after reading the result. The length bytes string means it doesn’t meet EOF. The resulted string is always ASCII-8BIT encoding.
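The behaviour described in that passage can be seen with a small in-memory source. This sketch uses StringIO and a 1024-byte block size purely for illustration:

```ruby
require 'stringio'

source = StringIO.new('a' * 2500)
chunk_sizes = []

# read(length) returns up to length bytes, then nil once EOF has been reached,
# which ends the loop
while (buffer = source.read(1024))
  chunk_sizes << buffer.bytesize
end
```

Here chunk_sizes ends up as [1024, 1024, 452]: two full blocks, one short block, and then the nil that stops the loop.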
An additional problem is it looks like you're not opening the output file correctly:
tarfile = File.open("#{Pathname.new(path).realpath.to_s}.tar","w")
You're writing it in "text" mode because of "w". Instead, you need to write in binary mode, "wb", because tarballs contain binary data:
tarfile = File.open("#{Pathname.new(path).realpath.to_s}.tar","wb")
Rewriting the original code to be more like I'd want to see it, results in:
require 'pathname'
require 'rubygems/package'
BLOCKSIZE_TO_READ = 1024 * 1000
def create_tarball(path)
  tar_filename = Pathname.new(path).realpath.to_path + '.tar'
  File.open(tar_filename, 'wb') do |tarfile|
    Gem::Package::TarWriter.new(tarfile) do |tar|
      Dir[File.join(path, '**/*')].each do |file|
        mode = File.stat(file).mode
        relative_file = file.sub(/^#{ Regexp.escape(path) }\/?/, '')
        if File.directory?(file)
          tar.mkdir(relative_file, mode)
        else
          tar.add_file(relative_file, mode) do |tf|
            # Copy in fixed-size blocks so memory use stays constant
            File.open(file, 'rb') do |f|
              while buffer = f.read(BLOCKSIZE_TO_READ)
                tf.write buffer
              end
            end
          end
        end
      end
    end
  end
  tar_filename
end
BLOCKSIZE_TO_READ should be at the top of your file since it's a constant and is a "tweakable" - something more likely to be changed than the body of the code.
The method returns the path to the tarball, not an IO handle like the original code. Using the block form of File.open automatically closes the output, so any subsequent open starts reading at the beginning of the file. I much prefer passing around path strings to passing IO handles for files.
I also wrapped some of the method parameters in parentheses. While parentheses aren't required around method parameters in Ruby, and some people eschew them, I think they make the code more maintainable by delimiting where the parameters start and end. They also avoid confusing Ruby when you're passing parameters and a block to a method, a well-known cause of bugs.
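As an alternative to the explicit read loop, IO.copy_stream also copies in chunks and only needs the destination to respond to write. A sketch under that assumption, writing one made-up file into an in-memory archive and reading it back with Gem::Package::TarReader; add_file_streaming is a hypothetical helper, not part of RubyGems:

```ruby
require 'rubygems/package'
require 'stringio'
require 'tmpdir'

# Hypothetical helper: stream one file into an open TarWriter
def add_file_streaming(tar, source_path, name_in_tar)
  mode = File.stat(source_path).mode
  tar.add_file(name_in_tar, mode) do |tf|
    File.open(source_path, 'rb') { |f| IO.copy_stream(f, tf) }
  end
end

archive = StringIO.new(String.new)
payload = 'x' * 100_000   # made-up file content

Dir.mktmpdir do |dir|
  source = File.join(dir, 'big.bin')
  File.binwrite(source, payload)
  Gem::Package::TarWriter.new(archive) do |tar|
    add_file_streaming(tar, source, 'big.bin')
  end
end

# Read the archive back to check the entry
archive.rewind
entries = {}
Gem::Package::TarReader.new(archive).each do |entry|
  entries[entry.full_name] = entry.read
end
```

The StringIO archive is only there to keep the example self-contained; for large inputs the archive should be a File opened with 'wb', as in the method above.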
minitar looks like it writes to a stream so I don't think memory will be a problem. Here is the comment and definition of the pack method (as of May 21, 2013):
# A convenience method to pack files specified by +src+ into +dest+. If
# +src+ is an Array, then each file detailed therein will be packed into
# the resulting Archive::Tar::Minitar::Output stream; if +recurse_dirs+
# is true, then directories will be recursed.
#
# If +src+ is an Array, it will be treated as the argument to Find.find;
# all files matching will be packed.
def pack(src, dest, recurse_dirs = true, &block)
  Output.open(dest) do |outp|
    if src.kind_of?(Array)
      src.each do |entry|
        pack_file(entry, outp, &block)
        if dir?(entry) and recurse_dirs
          Dir["#{entry}/**/**"].each do |ee|
            pack_file(ee, outp, &block)
          end
        end
      end
    else
      Find.find(src) do |entry|
        pack_file(entry, outp, &block)
      end
    end
  end
end
Example from the README to write a tar:
# Packs everything that matches Find.find('tests')
File.open('test.tar', 'wb') { |tar| Minitar.pack('tests', tar) }
Example from the README to write a gzipped tar:
tgz = Zlib::GzipWriter.new(File.open('test.tgz', 'wb'))
# Warning: tgz will be closed!
Minitar.pack('tests', tgz)

Extract text from document in memory using docsplit

With the docsplit gem I can extract the text from a PDF or any other file type. For example, with the line:
Docsplit.extract_pages('doc.pdf')
I can have the text content of a PDF file.
I'm currently using Rails, and the PDF is sent through a request and lives in memory. Looking in the API and in the source code I couldn't find a way to extract the text from memory, only from a file.
Is there a way to get the text of this PDF avoiding the creation of a temporary file?
I'm using attachment_fu if it matters.
Use a temporary directory:
require 'docsplit'
require 'tmpdir'
def pdf_to_text(pdf_filename)
  # Docsplit writes <basename>.txt into the output directory
  Docsplit.extract_text([pdf_filename], ocr: false, output: Dir.tmpdir)
  txt_file = File.basename(pdf_filename, File.extname(pdf_filename)) + '.txt'
  txt_filename = File.join(Dir.tmpdir, txt_file)
  extracted_text = File.read(txt_filename)
  File.delete(txt_filename)
  extracted_text
end
pdf_to_text('doc.pdf')
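Since the method above is given a file name, PDF data that only exists in memory would first have to be written somewhere. One way to keep that short-lived is Tempfile.create, which removes the file when its block finishes. A sketch, where with_temp_pdf and the sample bytes are made up for illustration:

```ruby
require 'tempfile'

# Hypothetical helper: expose in-memory bytes as a temporary file path
# for the duration of the block, then delete the file
def with_temp_pdf(bytes)
  Tempfile.create(['upload', '.pdf']) do |f|
    f.binmode
    f.write(bytes)
    f.flush
    yield f.path
  end
end

pdf_bytes = "%PDF-1.4 made-up content".b
path_seen = nil
size_on_disk = with_temp_pdf(pdf_bytes) do |path|
  path_seen = path
  File.size(path)
end
```

Inside the block, the path could be handed to something like the pdf_to_text method above.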
If you have the content in a string, use StringIO to create a File-like object that IO can read. In StringIO, it doesn't matter if the content is true text, or binary, it's all the same.
Look at either of:
new(string=""[, mode])
Creates new StringIO instance from with string and mode.
open(string=""[, mode]) {|strio| ...}
Equivalent to ::new except that when it is called with a block, it yields with the new instance and closes it, and returns the result which returned from the block.
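A short sketch of both forms, assuming the code consuming the data accepts an IO-like object rather than a file name (the sample bytes are made up):

```ruby
require 'stringio'

pdf_bytes = "%PDF-1.4 made-up content".b

# ::new wraps the string; it can then be read like a file
io = StringIO.new(pdf_bytes)
header = io.read(5)
io.rewind

# ::open with a block yields the instance, closes it afterwards,
# and returns the block's result
total = StringIO.open(pdf_bytes) { |strio| strio.read.bytesize }
```

This only helps with APIs that read from an IO object; code that needs a path on disk still requires an actual file.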
