how to convert pdf file into xlsx file in ruby on rails - ruby-on-rails

I have uploaded 1 PDF then convert it to xlsx file. I have tried different ways but not getting actual output.pdf2xls only displays single line format not whole file data. I want whole PDF file data to display on xlsx file.
i have one method convert PDF to xlsx but not display proper format.
def do_excel_to_pdf
#user=User.create!(pdf: params[:pdf])
#path_in = #user.pdf.path
temp1 = #user.pdf.path
#path_out = #user.pdf.path.slice(0..#user.pdf.path.rindex(/\//))
query = "libreoffice --headless --invisible --convert-to pdf " + #path_in + " --outdir " + #path_out
system(query)
file = #path_out+#user.pdf.original_filename.slice(0..#user.pdf.original_filename.rindex('.')-1)+".pdf"
send_file file, :type=>"application/msexcel", :x_sendfile=>true
end
if any one use please help me, any gem any script.

I would start with reading from the PDF, inserting the data in the XLSX is easy, if you have problems with that ask another question and specify which gem you use and what you tried for that part.
You use libreoffice to read the PDF but according to the FAQ your PDF needs to be hybrid, perhaps that is the problem.
As an alternative you could try to use some conversion tool for ebooks like the one in Calibre but I'm afraid you will lose too much formatting to recover the data you need.
All depends on how the data in your PDF is structured, if regular text without much formatting and positioning it can be as easy as using the gem pdf-reader
I used it in the past and my data had a lot of formatting - you would be surprised to know how complicated the PDF structure is - so I had to specify for each field at which location exactly which data had to be read, not for the faint of heart.
Here a simple example.
require 'pdf/reader' # gem install pdf-reader
reader = PDF::Reader.new("my.pdf")
reader.pages.each do |page|
# puts page.text
page.page_object.each do |e|
p e.first.contents
end
end

not able to find options to convert from PDF to xsls but API Options available for converting PDF to Image and PDF to powerpoint(Link Given Below)
Not sure u can change the requirement to show results in other formats!!
http://www.convertapi.com/

Related

Convert PDF/DOC/DOCx to Images in Ruby on Rails

Part 1: I have PDFs, Docs,Docx stored in my S3. When I download them I want them to first be converted to images (png or jpg) and then only download (as images or a thumbnail of images).
How to achieve this ?
Part 2: I have used mini-magick to convert pdf to image and its somewhat working like this:
require "mini_magick"
im=MiniMagick::Image.open("path/to_my_pdf.pdf")
im.format("png", 0)
im.write("some_thumbnail.png")
The problem here is a pdf can have multiple pages and I need each and every page to be converted into image format (may be an array of images) and I am not able to achieve it. I am only able to convert any one of the page of the pdf. Stuck here. Kindly help.
Answer any part of the question as you like. !!
You can do that using RMagick gem as following.
require 'RMagick'
pdf = Magick::ImageList.new("path/to_my_pdf.pdf")
pdf.each_with_index do |page, i|
page.write "#{i}_thumbnail.png"
end

How to extract only highlight data from pdf file in rails

I want to take out all the highlighted text from a pdf in rails does anyone have any idea I am not able to figure it out.Sample data
You can either use the pdf-reader gem (the example/text.rb example is simple and worked for me): https://github.com/yob/pdf-reader
Or the command-line utility pdftotext.
# Extract all text from a single PDF
require 'rubygems'
require 'pdf/reader'
filename = File.expand_path(File.dirname(__FILE__)) + "/../spec/data/cairo-unicode.pdf"
PDF::Reader.open(filename) do |reader|
reader.pages.each do |page|
puts page.text
end
end
This is a basic text extraction using the gem mentioned above. This should get you a nice head start. You can grab all the text from the document then figure out how to grab those specific sections based on the data you receive back.

Inserting external PDF into Prawn generated document

How can I insert an existing PDF into a Prawn generated document? I am generating a pdf for a bill (as a view), and that bill can have many attachments (png, jpg, or pdf). How can I insert/embed/include those external pdf attachments in my generated document? I've read the manual, looked over the source code, and searched online, but no luck so far.
The closest hint I've found is to use ImageMagick or something similar to convert the pdf to another format, but since I don't need to resize/manipulate the document, that seems wasteful. The old way to do it seems to be through templates, but my understanding is that the code for templating is unstable.
Does anyone know how to include PDF pages in a Prawn generated PDF? If Prawn won't do this, do you know of any supplementary gems that will? If someone can point me towards something like prawn-templates but more reliable, that would be awesome.
Edit: I am using prawnto and prawn to render PDF views in Rails 4.2.0 with Ruby 2.2.0.
Strategies that I've found but that seem inapplicable/too messy:
Create a jpg preview of a PDF on upload, include that in the generated document (downsides: no text selection/searching, expensive). This is currently my favorite option, but I don't like it.
prawn-templates (downside: unstable, unmaintained codebase; this is a business-critical application)
Merge PDFs through a gem like 'combine-pdf'–I can't figure out how to make this work for rendering a view with the external PDFs inserted at specific places (the generated pdf is a collection of bills, and I need them to follow the bill they're attached to)
You're right about the lack of existing documentation for this - I found only this issue from 2010 which uses the outdated methods you describe. I also found this SO answer which does not work now since Prawn dropped support for templates.
However, the good news is that there is a way to do what you want with Ruby! What you will be doing is merging the PDFs together, not "inserting" PDFs into the original PDF.
I would recommend this library, combine_pdf, to do so. The documentation is good, so doing what you want would be as simple as:
my_prawn_pdf = CombinePDF.new
my_prawn_pdf << CombinePDF.new("my_bill_pdf.pdf")
my_prawn_pdf << CombinePDF.new("attachment.pdf")
my_prawn_pdf.save "combined.pdf"
Edit
In response to your questions:
I'm using Prawn to render a pdf view in Rails, which means that I don't think I get that kind of post-processing
You do! If you look at the documentation for combine_pdf, you'll see that loading from memory is the fastest way to use the gem - the documentation even explicitly says that Prawn can be used as input.
I'm not just tacking the PDFs to the end: a bill attachment must directly follow the generated page(s) for a bill
The combine_pdf gem isn't just for adding pages on the end. As the documentation shows, you can cycle through a PDF adding pages when you want to, for example:
my_pdf # previously defined
new_pdf = CombinePDF.new
my_pdf.pages.each.do |page|
i += 1
new_pdf << my_pdf if i == bill_number # or however you want to handle this logic
end
new_pdf.save "new_pdf.pdf"

what are the other setting need to see a html table into excel sheet format in open office org?

I have generated a html table from my web application and save the table into .xls format(in a single word i am generating a .xls sheet from my web application ).
What other setting I have to show it in table form.
You are not producing an XLS file, you are producing a mal-formed HTML file with a name that ends in .xls.
Indeed, you aren't even doing that since there aren't files on the web (there are streams that may or may not end up in files).
Different versions of Open Office, with different settings, will differ in terms of how they deal with stuff that is wrong. The version on one of the machines you are doing is saying "eh, this isn't XLS, oh! it's HTML with a table, I know what to do", while the other is getting as far as "eh, this isn't XLS, it's a bunch of text with strange less-than and greater-than characters all over the place, what do I do".
What you want to do is to produce an actual stream that Open Office and other spreadsheets can deal with. XLS is possible, but pretty hard. Go for CSV instead.
If your table was going to be:
<table>
<tr>
<th>1 heading</th><th>2 & last heading</th>
</tr>
<tr>
<td>1st cell</td><td>This is the "ultimate" cell</td>
</tr>
</table>
Then it sould become:
"1 heading","2 & last heading"
"1st cell","This is the ""ultimate"" cell"
In otherwords newlines to indicate rows, commas to indicate cells, no HTML encoding, quotes around everything and quotes in your actual content doubled-up. (You don't need to always have quotes on your content, but it's never wrong so that's simpler than working out when you do need them).
Now, make your content type "text/csv".
You are now outputting a CSV stream that can be saved as a CSV file. Your spreadsheet software will have a much better idea about what to do with this (it may still ask about character ecodings on opening, but the preview will show you a spreadsheet of data, not a bunch of HTML source all over the place.
It's not really saving as a .xls file -- it appears to be saving as the HTML, but with a .xls extension. How are you generating the .xls? On the server-side, you can provide a button to generate .xls directly (different methods depending on your server platform -- using perl there is the Spreadsheet::WriteExcel module that writes .xls directly, using Java there is JExcel (http://jexcelapi.sourceforge.net/ and POI (http://poi.apache.org/)), other platforms will have their methods.
Okay Subodh, If you want to generate .xls or .csv files, You can't just change the extension of the file and have it open up correctly in that program.
2 Options you have at this point, both involve creating the file with the data on the server and then sending it to the user to download it.
.csv
CSV files are easier to generate from the server side. In a very basic way you can think of them as regular text files with commas(not necessarily only commas) separating individual cells that can be read by spreadsheet programs. For PHP there is an article Here that explains how to generate CSV files.
.xls
xls files are not as simple as simple to generate as CSV files. On the server-side you will need a solution to generate these. For PHP there is a resource Here.
Using xls over CSV has obvious advantage that you can specify formatting and can control visual representation of your data.
Edit :
Upon closely looking at the image you posted, I can see what you are trying to do. If you just want to get that file to open correctly in a spreadsheet program, then don't save it either as CSV or xls
hello.html
<table>
<tr><td>Hi</td><td>Hi</td><td>Hi</td><td>Hi</td></tr>
<tr><td>2</td><td>2</td><td>131</td><td>11312</td></tr>
</table>
Saved as an HTML file will open up correctly(as a proper table) in any spreadsheet program.
To narrow down the problem:
1) Are you opening the same .xls file on both machines?
- what version of OpenOffice is on Machine 1?
- what version of OpenOffice is on Machine 2?
2) How are you creating your .xls file?
- are you just using the response object to change the content-type, or some proprietary software?
- can you include a code sample?
3) Have you tried a pure HTML format?

In Ruby on Rails, how can I convert html to word?

how can I convert html to word
thanks.
I have created a Ruby html to word gem that should help you do just that. You can check it out at https://github.com/nickfrandsen/htmltoword - You simply pass it a html string and it will create a corresponding word docx file.
def show
respond_to do |format|
format.docx do
file = Htmltoword::Document.create params[:docx_html_source], "file_name.docx"
send_file file.path, :disposition => "attachment"
end
end
end
Hope you find it helpful.
I am not aware of any solution which does this, i.e. convert HTML to Word format. If you literally mean that, you will have to parse the HTML document first using something like Nokogiri. If you mean you want to output data persisted in your model objects, there is obviously no need to parse HTML! As far as outputting to Word, I'm afraid it looks as if you will have to directly interface with a running instance of Microsoft Word via OLE!
A quick google search for win32ole ruby word will get you started:
http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/241606
Good luck!
I agree with CodeJoust that it is better to generate a PDF. However, if you really need to generate a Word document then you can do the following:
If your server is a Windows machine, you can install Office in it and use ruby's OLE binding to generate the Word document into the public folder and then deliver the file in the response.
To use ruby's OLE binding, see the "Programming Ruby" ebook that comes with the one-click ruby installer for Windows. You may have to use custom logic to convert from HTML to Word unless you can find a function in the OLE api of Word to do that.
http://prawn.majesticseacreature.com/
You could allow the user to download a PDF or a .html file, but there aren't any helpful ruby libraries to do that. You're better off generating a 'printable and downloadable' version, without much styling, and/or a pdf version using a library like prawn.
You could always generate a simple .rtf file, I think word'll be pretty happy reading that...

Resources