Extract text from document in memory using docsplit - ruby-on-rails

With the docsplit gem I can extract the text from a PDF or any other file type. For example, with the line:
Docsplit.extract_text('doc.pdf')
I get the text content of a PDF file.
I'm currently using Rails, and the PDF is sent through a request and lives in memory. Looking in the API and in the source code I couldn't find a way to extract the text from memory, only from a file.
Is there a way to get the text of this PDF avoiding the creation of a temporary file?
I'm using attachment_fu if it matters.

Use a temporary directory:
require 'docsplit'
require 'tmpdir' # provides Dir.tmpdir

def pdf_to_text(pdf_filename)
  # Docsplit writes <basename>.txt into the output directory.
  Docsplit.extract_text([pdf_filename], ocr: false, output: Dir.tmpdir)
  txt_file = File.basename(pdf_filename, File.extname(pdf_filename)) + '.txt'
  txt_filename = File.join(Dir.tmpdir, txt_file)
  extracted_text = File.read(txt_filename)
  File.delete(txt_filename)
  extracted_text
end

pdf_to_text('doc.pdf')
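
If the PDF only exists in memory, one workaround is to keep the temporary file but make it short-lived. The sketch below assumes Docsplit needs a path on disk (as the question found). `with_temp_pdf` is a hypothetical helper name, and the Docsplit call appears only in a comment.

```ruby
require 'tempfile'

# Hypothetical helper: writes in-memory bytes to a short-lived file and
# yields its path, so path-only APIs can consume it.
def with_temp_pdf(bytes)
  Tempfile.create(['upload', '.pdf'], binmode: true) do |file|
    file.write(bytes)
    file.flush # make sure the bytes are on disk before another tool reads them
    yield file.path
  end # Tempfile.create removes the file when the block exits
end

# Usage (assumes the docsplit gem and the pdf_to_text method above):
#   text = with_temp_pdf(uploaded_bytes) { |path| pdf_to_text(path) }
```

The block form returns whatever the inner block returns, so the extracted text comes straight back to the caller while the file is cleaned up.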

If you have the content in a string, use StringIO to create a File-like object that IO can read. In StringIO, it doesn't matter if the content is true text, or binary, it's all the same.
Look at either of:
new(string=""[, mode])
Creates a new StringIO instance from the given string and mode.
open(string=""[, mode]) {|strio| ...}
Equivalent to ::new, except that when it is called with a block, it yields the new instance, closes it afterwards, and returns the block's result.
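
As a minimal sketch of the StringIO approach, the snippet below wraps made-up bytes (standing in for the request body) and reads them back the way a File would be read. The sample data and variable names are assumptions for illustration.

```ruby
require 'stringio'

# Bytes as they might arrive in a request body (made-up sample data).
pdf_bytes = "%PDF-1.4\n\xFF\xFE binary payload".b

io = StringIO.new(pdf_bytes)
header = io.read(8)   # read the first 8 bytes, like a File would
io.rewind             # back to the start for the next consumer
everything = io.read

puts header                    # prints "%PDF-1.4"
puts everything == pdf_bytes   # prints "true"
```

This only helps when the consumer accepts an IO object. A library that insists on a filename still needs a real file.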

Related

Ruby: Is there a way to specify your encoding in File.write?

TL;DR
How would I specify the mode of encoding on File.write, or how would one save image binary to a file in a similar fashion?
More Details
I'm trying to download an image from a Trello card and then upload that image to S3 so it has an accessible URL. I have been able to download the image from Trello as binary (I believe it is some form of binary), but I have been having issues saving this as a .jpeg using File.write. Every time I attempt that, it gives me this error in my Rails console:
Encoding::UndefinedConversionError: "\xFF" from ASCII-8BIT to UTF-8
from /app/app/services/customer_order_status_notifier/card.rb:181:in `write'
And here is the code that triggers that:
def trello_pics
  @trello_pics ||=
    card.attachments.last(config_pics_number)&.map(&:url).map do |url|
      binary = Faraday.get(url, nil, {'Authorization' => "OAuth oauth_consumer_key=\"#{ENV['TRELLO_PUBLIC_KEY']}\", oauth_token=\"#{ENV['TRELLO_TOKEN']}\""}).body
      File.write(FILE_LOCATION, binary) # doesn't work
      run_me
    end
end
So I figure this must be an issue with the way that File.write converts the input into a file. Is there a way to specify encoding?
AFAIK you can't do it at the time of performing the write, but you can do it at the time of creating the File object; here is an example with UTF-8 encoding:
File.open(FILE_LOCATION, "w:UTF-8") do |f|
  f.write(....)
end
Another possibility would be to use the external_encoding option:
File.open(FILE_LOCATION, "w", external_encoding: Encoding::UTF_8)
Of course this assumes that the data being written is a String. If you have (packed) binary data, you would use "wb" for opening the file, and syswrite instead of write to write the data to the file.
UPDATE: As engineersmnky points out in a comment, the encoding arguments can also be passed as parameters to the write method itself, for instance
IO::write(FILE_LOCATION, data_to_write, external_encoding: Encoding::UTF_8)
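
For the binary image case in the question, a minimal sketch is to write in binary mode so that no transcoding is attempted. The bytes and the target path below are made-up stand-ins, not the asker's actual data.

```ruby
require 'tmpdir'

# Stand-in for the downloaded body: the first bytes of a JPEG (sample data).
binary = "\xFF\xD8\xFF\xE0".b

Dir.mktmpdir do |dir|
  path = File.join(dir, 'picture.jpeg') # hypothetical target path

  File.binwrite(path, binary)           # opens in binary mode, no transcoding
  File.write(path, binary, mode: 'wb')  # same effect, mode passed as an option

  puts File.binread(path) == binary     # prints "true"
end
```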

How to open base64 spreadsheet on Ruby

I've been trying to manipulate a base64-encoded file that I'm receiving from my client.
I'm currently using https://github.com/zdavatz/spreadsheet/blob/master/GUIDE.md to manipulate it. However, there doesn't appear to be any way to open a file directly from the base64 blob. Should I write it to disk and then read from it? Couldn't that be a potential security threat for the server?
For example, if I receive a file:
file = params[:file] with contents:
data:application/vnd.ms-excel;base64,0M8R4KGxGuEAAAAAAAAAAAAAAAAAAAAAOwADAP7
(should I remove the data:application/vnd.ms-excel;base64, ?)
I'd like to open it with this:
Spreadsheet.client_encoding = 'UTF-8'
book = Spreadsheet.open "#{Rails.root}/app/assets/spreadsheet/event.xls"
(or with a blob or temp file)
Sorry if it's pretty obvious, been looking for hours and there's not much info about it available, tried creating a temp file first but I don't think that's supported and there's not much I can get from the docs.
Shot in the dark: Maybe decode it, write to binary-enabled tempfile, and then feed that to Spreadsheet?
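# Open in binary mode via the option so that tmpfile stays a Tempfile.
tmpfile = Tempfile.new(['event', '.xls'], binmode: true)
# Drop the "data:...;base64," prefix before decoding, if present.
tmpfile << Base64.decode64(params[:file].sub(/\Adata:[^,]*,/, ''))
tmpfile.rewind
book = Spreadsheet.open(tmpfile)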
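
The decoding step can also be done fully in memory. The sketch below strips the data-URI prefix, decodes the payload, and wraps it in a StringIO. `data_uri_to_io` is a hypothetical helper, and whether Spreadsheet.open accepts such an IO object should be checked against the gem's documentation.

```ruby
require 'base64'
require 'stringio'

# Hypothetical helper: strips an optional "data:<type>;base64," prefix,
# decodes the rest, and wraps the bytes in an in-memory IO.
def data_uri_to_io(data_uri)
  payload = data_uri.sub(/\Adata:[^,]*;base64,/, '')
  StringIO.new(Base64.decode64(payload))
end

# Sample input built here so the example is self-contained.
sample = 'data:application/vnd.ms-excel;base64,' +
         Base64.strict_encode64("\xD0\xCF\x11\xE0".b)
io = data_uri_to_io(sample)
puts io.read.bytes.map { |b| format('%02X', b) }.join(' ') # prints "D0 CF 11 E0"
```

Removing the prefix matters because its letters are themselves valid base64 characters and would otherwise be decoded into junk bytes at the start of the file.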

How to write a Tempfile as binary

When trying to write a string / unzipped file to a Tempfile by doing:
temp_file = Tempfile.new([name, extension])
temp_file.write(unzipped_io.read)
Which throws the following error when I do this with an image:
Encoding::UndefinedConversionError - "\xFF" from ASCII-8BIT to UTF-8
While researching it, I found out that this happens because Ruby writes files with an encoding by default (UTF-8). But the file should be written as binary, so that no encoding conversion is applied.
With a regular File you could do this as follows:
File.open('/tmp/test.jpg', 'wb') do |file|
  file.write(unzipped_io.read)
end
How can I do this with Tempfile?
Tempfile.new passes options to File.open which accepts the options from IO.new, in particular:
:binmode
If the value is truth value, same as “b” in argument mode.
So to open a tempfile in binary mode, you'd use:
temp_file = Tempfile.new([name, extension], binmode: true)
temp_file.binmode? #=> true
temp_file.external_encoding #=> #<Encoding:ASCII-8BIT>
In addition, you might want to use Tempfile.create which takes a block and automatically closes and removes the file afterwards:
Tempfile.create([name, extension], binmode: true) do |temp_file|
  temp_file.write(unzipped_io.read)
  # ...
end
I have encountered the solution in an old Ruby forum post, so I thought I would share it here, making it easier for people to find:
https://www.ruby-forum.com/t/ruby-binary-temp-file/116791
Apparently Tempfile has a binmode method (it is delegated to the underlying File, so it does not appear in Tempfile's own documentation), which switches the file to binary mode and thus sidesteps any encoding issues:
temp_file = Tempfile.new([name, extension])
temp_file.binmode
temp_file.write(unzipped_io.read)
Thanks unknown person who mentioned it on ruby-forums.com in 2007!
Another alternative is IO.binwrite(path, file_content)
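
A minimal, self-contained sketch of the binmode approach with explicit cleanup follows. The sample bytes stand in for `unzipped_io.read` and are an assumption for illustration.

```ruby
require 'tempfile'

# Made-up stand-in for unzipped_io.read: JPEG-like leading bytes.
image_bytes = "\xFF\xD8\xFF\xE0 sample".b

temp_file = Tempfile.new(['picture', '.jpg'])
begin
  temp_file.binmode             # switch to binary before writing
  temp_file.write(image_bytes)
  temp_file.flush               # push buffered bytes to disk
  puts File.binread(temp_file.path) == image_bytes # prints "true"
ensure
  temp_file.close
  temp_file.unlink              # remove the file explicitly
end
```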

File object from URL

I'd like to create a file object from an image located at a specific url. I'm downloading the file with Net Http:
img = Net::HTTP.get_response(URI.parse('https://prium-solutions.com/wp-content/uploads/2016/11/rails-1.png'))
file = File.read(img.body)
However, I get ArgumentError: string contains null byte when trying to read the file and store it in the file variable.
How can I do this without having to store it locally?
Since File deals with reading from storage, it's really not applicable here. The read method is expecting you to hand it a location to read from, and you're passing in binary data.
If you have a situation where you need to interface with a library that expects an object that is streaming, you can wrap the string body in a StringIO object:
file = StringIO.new(img.body)
# you can now call file.read, file.seek, file.rewind, etc.
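
Put together, the download and the wrapping step might look like the sketch below. `fetch_as_io` is a hypothetical helper name, and the status check is an added assumption rather than part of the original answer.

```ruby
require 'net/http'
require 'stringio'

# Hypothetical helper: downloads a URL and returns the body as an in-memory IO.
def fetch_as_io(url)
  response = Net::HTTP.get_response(URI.parse(url))
  raise "download failed: #{response.code}" unless response.is_a?(Net::HTTPSuccess)

  StringIO.new(response.body) # wrap the body string, not the response object
end

# The wrapping step on its own, with made-up bytes in place of a real download:
io = StringIO.new("\x89PNG\r\n\x00 sample".b)
puts io.read(4).bytes.inspect # prints "[137, 80, 78, 71]"
```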

Converting PDFs to PNGs with Dragonfly

I have a Dragonfly processor which should take a given PDF and return a PNG of the first page of the document.
When I run this processor via the console, I get back the PNG as expected, however, when in the context of Rails, I'm getting it as a PDF.
My code is roughly similar to this:
def to_pdf_thumbnail(temp_object)
  tempfile = new_tempfile('png')
  args = "'#{temp_object.path}[0]' '#{tempfile.path}'"
  full_command = "convert #{args}"
  result = `#{full_command}`
  tempfile
end

def new_tempfile(ext=nil)
  tempfile = ext ? Tempfile.new(['dragonfly', ".#{ext}"]) : Tempfile.new('dragonfly')
  tempfile.binmode
  tempfile.close
  tempfile
end
Now, tempfile is definitely creating a .png file, but the convert is generating a PDF (when run from within Rails 3).
Any ideas as to what the issue might be here? Is something getting confused about the content type?
I should add that both this and a standard conversion (asset.png.url) both yield a PDF with the PDF content as a small block in the middle of the (A4) image.
An approach I’m using for this is to generate the thumbnail PNG on the fly via the thumb method from Dragonfly’s ImageMagick plugin:
<%= image_tag rails_model.file.thumb('100x100#', format: 'png', frame: 0).url %>
So long as Ghostscript is installed, ImageMagick/Dragonfly will honour the format / frame (i.e. page of the PDF) settings. If file is an image rather than a PDF, it will be converted to a PNG, and the frame number ignored (unless it’s a GIF).
Try this:
def to_pdf_thumbnail(temp_object)
  tempfile = new_tempfile('png')
  # Pass each argument separately; "[0]" selects the first page of the PDF.
  system('convert', "#{temp_object.path}[0]", tempfile.path)
  File.binread(tempfile.path)
end
The problem is that you are likely handing convert ONE argument, not two.
Doesn't convert rely on the extension to determine the type? Are you sure the tempfiles have the proper extensions?
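
One way to take the extension out of the picture is to name the output format explicitly with ImageMagick's `png:` prefix and then check the leading bytes of the result. The sketch below uses two hypothetical helpers, `thumbnail_command` and `sniff_format`, and does not invoke convert itself.

```ruby
# Hypothetical helper: builds the argument list for ImageMagick's convert,
# one array element per argument so no shell quoting is involved.
def thumbnail_command(pdf_path, png_path)
  ['convert', "#{pdf_path}[0]", "png:#{png_path}"]
end

# Hypothetical helper: identifies output by its leading bytes
# rather than by its file name.
def sniff_format(bytes)
  head = bytes.b
  return :png if head.start_with?("\x89PNG\r\n\x1A\n".b)
  return :pdf if head.start_with?('%PDF'.b)

  :unknown
end

p thumbnail_command('/tmp/in.pdf', '/tmp/out.png')
# Usage: system(*thumbnail_command(src, dst)), then sniff_format(File.binread(dst))
```

If `sniff_format` reports `:pdf` for the generated file, the conversion itself is producing the wrong format, which rules out a content-type mix-up further up in Rails.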
