Opening .doc files in Ruby - ruby-on-rails

Can I open a .doc file and get that file's contents using Ruby?

If you only need the plain text content, you might want to have a look at Yomu. It's a gem which acts as a wrapper for Apache Tika, and it supports a variety of document formats, including the following:
Microsoft Office OLE 2 and Office Open XML Formats (.doc, .docx, .xls, .xlsx, .ppt, .pptx)
OpenOffice.org OpenDocument Formats (.odt, .ods, .odp)
Apple iWorks Formats
Rich Text Format (.rtf)
Portable Document Format (.pdf)
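For example, a minimal usage sketch (assuming the yomu gem is installed and a Java runtime is available, since Yomu shells out to the bundled Tika jar; the file name is just an illustration):
require 'yomu'
# Extract the plain text of a .doc file; the same call works for .docx, .pdf, etc.
yomu = Yomu.new('test.doc')
puts yomu.text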

The docx gem is very simple to use:
require 'docx'
puts Docx::Document.open('test.docx')
or
d = Docx::Document.open('test.docx')
d.each_paragraph do |p|
  puts p
end
You can find it at https://github.com/chrahunt/docx and install it with gem install docx.
docx, however, doesn't support .doc files (the legacy binary format that was the default up to Word 2003); for those, on Windows, you can use WIN32OLE like this:
require 'win32ole'
begin
  word = WIN32OLE.connect('Word.Application')
  doc = word.ActiveDocument
rescue
  word = WIN32OLE.new('word.application')
  path_open = 'C:\Users\...\test.doc' # yes: backslashes on Windows
  doc = word.Documents.Open(path_open)
end
word.visible = true
doc.Sentences.each { |x| puts x.text }

Yes and No
In Ruby you can do something like:
thedoc = `externalProgram some_file`
And so what you need is a good externalProgram.
You could look at the software library wv or the (apparently not recently updated) program antiword. I imagine there are others. OpenOffice can read doc files and export text files, so driving OO via the CLI will probably also work.
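For instance, here is a rough sketch of shelling out to antiword (assuming it is installed and on the PATH; the helper name is made up for illustration):
require 'shellwords'

# Hypothetical helper: run antiword and return the document's plain text.
def doc_to_text(path)
  text = `antiword #{Shellwords.escape(path)}` # antiword prints the text to stdout
  raise "antiword failed for #{path}" unless $?.success?
  text
end

puts doc_to_text('some_file.doc')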

If you're on Windows, this will work: http://www.ruby-doc.org/stdlib/libdoc/win32ole/rdoc/classes/WIN32OLE.html

I recently dealt with this in a project and found that I wanted a lighter-weight library to get the text from .doc, .docx and .pdf files. DocRipper uses a combination of the Antiword, grep and Poppler/pdftotext command-line tools to grab the text contents from files and return them as a UTF-8 string.
dr = DocRipper::TextRipper.new('/path/to/file')
dr.text
=> "Document's text"

Related

How to open base64 spreadsheet on Ruby

I've been trying to manipulate a file that's base64 encoded that I'm receiving from my client.
I'm currently using https://github.com/zdavatz/spreadsheet/blob/master/GUIDE.md to manipulate it; however, there doesn't appear to be any way to open a file directly from the base64 blob. Should I write it to disk and then read from it? Couldn't that be a potential security threat for the server?
For example, if I receive a file:
file = params[:file] with contents:
data:application/vnd.ms-excel;base64,0M8R4KGxGuEAAAAAAAAAAAAAAAAAAAAAOwADAP7
(should I remove the data:application/vnd.ms-excel;base64, ?)
I'd like to open it with this:
Spreadsheet.client_encoding = 'UTF-8'
book = Spreadsheet.open "#{Rails.root}/app/assets/spreadsheet/event.xls"
(or with a blob or temp file)
Sorry if it's pretty obvious; I've been looking for hours and there's not much info about it available. I tried creating a temp file first, but I don't think that's supported and there's not much I can get from the docs.
Shot in the dark: maybe strip the data-URI prefix, decode it, write it to a binary-mode tempfile, and then feed that to Spreadsheet?
require 'base64'
require 'tempfile'
tmpfile = Tempfile.new('spreadsheet')
tmpfile.binmode
tmpfile << Base64.decode64(params[:file].sub(/\Adata:.*;base64,/, ''))
tmpfile.rewind
book = Spreadsheet.open(tmpfile)

open office crashes after some time giving garbled font in converted PDF

We are converting Word to PDF using OpenOffice (version 3.4.1) in Java with JODConverter.
Below is the code used:
OpenOfficeConnection connection = new SocketOpenOfficeConnection(2100);
try {
    connection.connect();
    DocumentConverter converter = new OpenOfficeDocumentConverter(connection);
    converter.convert(inputFile, outputFile);
    connection.disconnect();
    return "Success " + DestinationPath + DestinationFileName;
}
catch (Exception localException1) {
}
The problem is that after a random number of days the converted PDF contains garbled fonts,
like # # ! $ $ " % &
The only solution we have so far is to restart the server. The system guys are saying that the problem is with OpenOffice.
We are using OpenOffice to convert the documents since it converts .doc files exactly, including all the formatting and table structure.
What could be the solution to this?
So OpenOffice can be a little temperamental when running on a server, especially as it isn't multi-threaded, and you end up having to run a pool of OpenOffice processes - see How can I use OpenOffice in server mode as a multithreaded service?.
Added to that, the rendering is often off when converting to PDF - see https://forum.openoffice.org/en/forum/viewtopic.php?f=7&t=68865 - which is why you may want to consider using a conversion service to automate the conversion tasks for you.
For complete transparency, I work for Zamzar (an online file conversion service); we have recently released a developer API - https://developers.zamzar.com/ - that allows you to convert between a multitude of file types. It is specifically applicable here in that we support both doc and docx to pdf with little or no loss in the way the PDF is rendered. It may be worth a look to see if this is a better alternative to trying to run your own solution through OpenOffice on a server.

Rails CSV file upload character encoding issues

I know there are a lot of threads about this already, but none of the suggested solutions seem to work for me for some reason...
I am using:
Ruby 1.9.2
Rails 2.3.8
My users author CSV files in MS Excel and then need to upload these files to the web application. My web application and the database backend use UTF-8, and all special characters, such as the £ sign, get corrupted on upload.
I am reading in the file like this:
@file = params[:import_file][:uploaded_data]
Then I get the encoding of the file using:
source_encoding = "UTF-8"
if @file.external_encoding
  source_encoding = @file.external_encoding.name
end
For my test file the source encoding value is ASCII-8BIT.
Then I try to do:
@file.each { |line|
  print "#{line.force_encoding(source_encoding).encode!("UTF-8")}\n"
}
in order to see if all the text displays OK. However, this gives me an error like this:
"\xA3" from ASCII-8BIT to UTF-8
If I try to read the CSV with:
dataArray = CSV.read(@file, encoding: source_encoding)
there are no errors this time, but all special characters come out as ? characters.
Any pointers on where I might be going wrong, or is importing a CSV file authored with MS Excel just mission impossible?
Regards,
Olli
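One direction to try (this is a guess, not from the original thread): CSV files saved from Excel on Windows are usually Windows-1252 rather than UTF-8, so tagging the data with that encoding before converting may help. A minimal sketch:
require 'csv'

# Assumption: the uploaded file is really Windows-1252 (typical for
# Excel-authored CSVs), so label it as such before converting to UTF-8.
raw  = @file.read
utf8 = raw.force_encoding('Windows-1252').encode('UTF-8')
rows = CSV.parse(utf8)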

How to read .docx file using F#

How can I read a .docx file using F#? If I use
System.IO.File.ReadAllText("D:/test.docx")
It is returning me some garbage output with beep sounds.
Here is an F# snippet that may give you a jump-start. It successfully extracts all the text contents of a Word 2010-created .docx file as a string of concatenated lines:
open System
open System.IO
open System.IO.Packaging
open System.Xml

let getDocxContent (path: string) =
    use package = Package.Open(path, FileMode.Open)
    let stream = package.GetPart(new Uri("/word/document.xml", UriKind.Relative)).GetStream()
    stream.Seek(0L, SeekOrigin.Begin) |> ignore
    let xmlDoc = new XmlDocument()
    xmlDoc.Load(stream)
    xmlDoc.DocumentElement.InnerText

printfn "%s" (getDocxContent @"..\..\test.docx")
In order to make it work, do not forget to reference WindowsBase.dll in your VS project.
.docx files follow the Open Packaging Conventions specification. At the lowest level, they are .ZIP files. To read them programmatically, see the examples here:
A New Standard For Packaging Your Data
Packages and Parts
Using F#, it's the same story: you'll have to use the classes in the System.IO.Packaging namespace.
System.IO.File.ReadAllText has the type string -> string.
Because a .docx file is a binary file, it's probable that some of the chars in the resulting string are the bell character. Rather than ReadAllText, look into Word automation, the Packaging APIs, or the OpenXML APIs.
Try using the OpenXML SDK from Microsoft.
Also on the linked page is the Microsoft tool that you can use to decompile Office 2007 files. The decompiled code can be quite lengthy even for simple documents, though, so be warned. There is a big learning curve associated with the OpenXML SDK; I'm finding it quite difficult to use.

convert jruby 1.8 string to windows encoding?

I want to export some data from my JRuby on Rails webapp to Excel, so I create a CSV string and send it as a download to the client using:
send_data(text, :filename => "file.csv", :type => "text/csv; charset=CP1252", :encoding => "CP1252")
The file seems to be in UTF-8, which Excel cannot read correctly. I googled the problem and found that iconv can convert encodings. I try to do that with:
ic = Iconv.new('CP1252', 'UTF-8')
text = ic.iconv(text)
but when I send the converted text it does not make any difference. It is still UTF-8 and Excel cannot read the special characters. There are several solutions using iconv, so this seems to work for others. When I convert the file manually with iconv on the Linux shell, it works.
What am I doing wrong? Is there a better way?
I'm using:
- jruby 1.3.1 (ruby 1.8.6p287) (2009-06-15 2fd6c3d) (Java HotSpot(TM) Client VM 1.6.0_19) [i386-java]
- Debian Lenny
- Glassfish app server
- Iceweasel 3.0.6
Edit:
Do I have to include some gem to use iconv?
Solution:
S.Mark pointed out this solution:
You have to use UTF-16LE encoding to make Excel understand it, like this (Iconv.iconv returns an array of converted strings, hence the .first):
text = Iconv.iconv('UTF-16LE', 'UTF-8', text).first
Thanks, S.Mark for that answer.
In my experience, Excel cannot handle UTF-8 CSV files properly. Try UTF-16 instead.
Note: Excel's Import Text Wizard appears to work with UTF-8 too
Edit: A search on Stack Overflow gives me this page; please take a look at that.
According to that, adding a BOM (Byte Order Mark) signature to the CSV will pop up Excel's Text Import Wizard, so you could use it as a workaround.
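For example, a rough sketch of that BOM workaround applied to the send_data call above (reusing the same text variable; the exact charset label is an assumption):
# Prepend a UTF-8 BOM so Excel recognises the encoding of the CSV download.
bom = "\xEF\xBB\xBF"
send_data(bom + text, :filename => "file.csv", :type => "text/csv; charset=UTF-8")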
Do you get the same result with the following?
cp1252 = Iconv.conv("CP1252", "UTF8", text)
