I was just wondering if anyone knew of any good libraries for parsing .doc files (and similar formats, like .odt) to extract text, yet also keep formatting information where possible for display on a website.
The ability to do the same for PDFs would be a bonus, but that's less of a priority.
This is for a Rails project, if that helps at all.
Thanks in advance!
Apache's POI is a very popular way to access Word and Excel documents. There's a Ruby POI binding that might be worth investigating, but it looks like you'll have to build it yourself. The API doesn't seem very Ruby-like, since it's virtually a direct port of the Java code, and it seems to have been tested only against Ruby 1.8.2.
Update:
After looking more closely into the issue, I think I misunderstood the problem. Since an EPUB is essentially a zip archive, I have to generate files at some point.
The actual question is how to do this efficiently in production if the number and size of the files I need to generate become large.
The ebook content will be generated from entries in the database as html files. I am thinking about storing those files with Amazon S3 but I am not sure if that's the best option out there.
Original Question
I am trying to create a web-based epub generation application with Ruby On Rails.
Currently I am looking into the eeepub gem: https://github.com/jugyo/eeepub.
I am wondering if there is a way to feed the epub content from database without declaring files as shown in the example.
files [File.join(dir, 'foo.html'), File.join(dir, 'bar.html')]
There is an open issue regarding this from years ago: https://github.com/jugyo/eeepub/issues/17
I know the gem is very old and does not seem to be active at all. I have looked through the source code and still don't see a solution. If anyone has any pointers on how to achieve this through eeepub or a better tool, please help me out! Thanks in advance.
Hi #voidwalker. You can check the best gems for e-publishing on Ruby-toolbox, where you can compare gems by their popularity and activity.
From this list, I think git-scribe is the best fit for your requirements. Please try it and let me know if it helps.
Thanks
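If the tool you end up with wants file paths (as eeepub's `files` does in the question), one option is to write the database content to a temporary directory just before building the archive. This is only a minimal sketch: `Chapter` is a hypothetical stand-in for your database records, `write_chapter_files` is a made-up helper (not part of eeepub), and the actual gem call is left as a comment.

```ruby
require "tmpdir"
require "erb"

# Hypothetical stand-in for rows loaded from the database.
Chapter = Struct.new(:slug, :title, :body_html)

# Writes one HTML file per chapter into dir and returns the file paths
# in chapter order. Hypothetical helper, not part of eeepub.
def write_chapter_files(chapters, dir)
  chapters.map do |chapter|
    path = File.join(dir, "#{chapter.slug}.html")
    title = ERB::Util.html_escape(chapter.title)
    html = "<html><head><title>#{title}</title></head>" \
           "<body>#{chapter.body_html}</body></html>"
    File.write(path, html)
    path
  end
end

chapters = [
  Chapter.new("foo", "Foo", "<p>First chapter</p>"),
  Chapter.new("bar", "Bar", "<p>Second chapter</p>")
]

# The block form of Dir.mktmpdir deletes the directory when the block ends,
# so the EPUB has to be built (and stored or sent) inside the block.
Dir.mktmpdir("epub") do |dir|
  paths = write_chapter_files(chapters, dir)
  # paths can now be passed to the `files` call from the question's example.
  puts paths.map { |p| File.basename(p) }.inspect
end
```

Since the temporary files disappear after the block, only the finished EPUB would need to go to something like S3, which may be cheaper than storing every intermediate HTML file.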
I'm aware of pdf-stamper, but I'm trying to avoid switching everything to jruby right now.
I just need to "stamp" an image that I generate within the rails app (a PDF417 barcode) into a form field in the PDF document (there's an FDF; it's a document template kinda thing).
I'm filling out the text-based fields by just shelling out to pdftk, so if there's a way to do it using pdftk, I'd be fine with that, but I've looked high and low for one without any luck.
How about using a barcode font? There are some alternatives too. I haven't used that one, but there may be others available.
I know I'm late to the party, but the PDF417 Rubygem should do what you need. https://rubygems.org/gems/pdf417 will generate it and if you have chunky_png installed you can easily write out PNGs to a file.
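For the pdftk side, here is a rough sketch of shelling out without going through a shell string. `fill_form` is the operation the question already relies on; `stamp` is another pdftk operation that overlays a one-page PDF on every page of the input. Whether `stamp` can land the barcode exactly where the form field sits is an assumption I haven't tested, and the barcode image would first have to be placed on a PDF page of its own. All file names and helper names below are hypothetical.

```ruby
# Builds the argument list for pdftk's fill_form operation.
def pdftk_fill_form_args(template_pdf, fdf_file, output_pdf)
  ["pdftk", template_pdf, "fill_form", fdf_file, "output", output_pdf]
end

# Builds the argument list for pdftk's stamp operation, which overlays
# stamp_pdf on top of each page of input_pdf.
def pdftk_stamp_args(input_pdf, stamp_pdf, output_pdf)
  ["pdftk", input_pdf, "stamp", stamp_pdf, "output", output_pdf]
end

# Runs the command with an argument list (no shell interpolation) and
# raises if it fails or the executable cannot be found.
def run_command(args)
  system(*args) or raise "command failed: #{args.join(' ')}"
end

# Example (requires pdftk on the PATH and the files to exist):
# run_command(pdftk_fill_form_args("template.pdf", "data.fdf", "filled.pdf"))
# run_command(pdftk_stamp_args("filled.pdf", "barcode.pdf", "final.pdf"))
```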
I'm testing it, and Nokogiri does not seem to respect the robots.txt file. Is there some way to make it respect it? It seems like a common question, but I could not find any answer online.
Nokogiri parses the HTML or webpage that you give it. It does not know anything about the robots.txt file for the domain where the page you happen to have requested resides.
I presume that you want to ignore in-site links that are in robots.txt?
Since you've tagged this Rails, I'll assume you use Ruby. In that case you can use the Mechanize library which has the facility to use the robots.txt file.
There is also the original Perl version and other language ports if you prefer those.
After several Google searches, it appears that the way to create PDFs in Rails from HTML and CSS (versus a new markup language) is to use Prince.
With licensing at $3800 for my non-big-commercial app, I'm wondering whether this is, in fact, the consensus, or whether people have alternatives and can share what they use and how.
You may check out prawn too. Tutorial can be found on railscasts.com.
This may fit the bill: http://code.google.com/p/wkhtmltopdf/
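In case it helps, a rough sketch of calling it from Ruby, assuming the `wkhtmltopdf` binary is installed and on the PATH. Its basic invocation takes an input HTML file (or URL) and an output PDF path; the helper names and file names here are hypothetical.

```ruby
require "open3"

# Builds the argument list for a basic wkhtmltopdf run.
def wkhtmltopdf_args(html_path, pdf_path)
  ["wkhtmltopdf", html_path, pdf_path]
end

# Converts html_path to pdf_path and returns pdf_path.
# Raises if wkhtmltopdf exits with a non-zero status.
def html_to_pdf(html_path, pdf_path)
  _out, err, status = Open3.capture3(*wkhtmltopdf_args(html_path, pdf_path))
  raise "wkhtmltopdf failed: #{err}" unless status.success?
  pdf_path
end

# Example (requires wkhtmltopdf and an existing invoice.html):
# html_to_pdf("invoice.html", "invoice.pdf")
```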
We tried two solutions:
Using LaTeX to generate the PDF; there is a Ruby gem for this called rtex.
Using the Java library iText. For this you may need rjb, which lets you use Java libraries directly from Ruby code, much like JRuby, but without having to build your whole application on JRuby.
I create tons of different PDF files on the fly from various data sources using Rails, with very fine layout control. I need them for presenting products to customers.
Having tried all the tools mentioned above, I find Prince to be the best tool for this task.
Prince's rendering quality & CSS support (better than some browsers) is its main selling point. If you're only generating documents with simple layouts, stick with Prawn.
Can anyone recommend a way of creating a view where users can upload images to my app through a WYSIWYG editor?
I've tried solving this using CKEditor and Paperclip but am having lots of trouble... Maybe I'm going about this the wrong way.
If someone's done this before, I'd really like to know how! I don't have an editor or file storage mechanism preference, so fire away...
This is all dependent on the WYSIWYG's file upload API. From there, just build an ImagesController to handle requests from that API, use whatever system (Paperclip is good) to handle those files internally, and you should be good to go. You won't find a plug-and-play solution; you'll have to hand-roll it.
Turns out that, with more targeted Google searching, you can find a preexisting solution. Here's one for TinyMCE and Rails. You may, however, end up finding that it doesn't meet your needs, in which case I would not be surprised to find that creating your own solution would be simpler than you expect :)
You could try Bootsy. It's a WYSIWYG editor with image upload capability. Includes a (rather simple) image manager as well.
https://github.com/volmer/bootsy
There is another solution for Rails out there:
https://github.com/spohlenz/tinymce-rails
You can load it as a gem and configure it via a YAML file. It also comes with an extra language gem.