Opening a Word (.doc) Document from a stream

I have a bunch of Word documents (.doc) stored in my SQL database that I need to open, strip of properties such as Title, Subject, etc., and then save back to the database.
Is it even possible to open a ".doc" file from a stream?

Word is not able to open .doc files from a stream in memory. To open the file you would have to save the document to a temporary location first.
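The temporary-file round trip described above might look roughly like the following Node.js sketch. It only covers moving the bytes between the database and disk; the step that opens the file in Word and edits the properties is left out. The helper names writeTempDoc and readBackAndCleanUp, the 'worddoc-' prefix and the file name document.doc are made up for illustration.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helper: writes the bytes read from the database into a
// freshly created temporary directory and returns the path of the .doc file.
function writeTempDoc(docBytes) {
  const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'worddoc-'));
  const filePath = path.join(dir, 'document.doc');
  fs.writeFileSync(filePath, docBytes);
  return filePath;
}

// Hypothetical helper: reads the (possibly edited) file back into memory
// and removes the temporary directory that held it.
function readBackAndCleanUp(filePath) {
  const bytes = fs.readFileSync(filePath);
  fs.rmSync(path.dirname(filePath), { recursive: true, force: true });
  return bytes;
}
```

Whatever opens the document would be pointed at the path returned by writeTempDoc, and the buffer returned by readBackAndCleanUp would be written back to the database.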
However, Word's little-known RTF converter interface can be used to load documents from streams in RTF format. If using RTF instead of the binary format is an option for you [1], you might want to have a look at the WinWord Converter SDK:
How to Obtain the WinWord Converter SDK (GC1039)
For an import converter you would have to implement the ForeignToRtf method that will be called by Word to receive the RTF input.
[1] Actually, you can still save the files in the .doc format; however, you would have to convert the .doc file to RTF first using the SDK and then open the RTF stream in Word. The conversion from the binary format to RTF and vice versa should be mostly lossless, as the RTF format has been developed in sync with the binary format. However, it should be borne in mind that using the RTF converter interface will not allow you to use any of the new features introduced with OpenXML/Office 2010.

I'm pretty sure the Word Document object implements IPersistStream (the COM interface). I know for certain that it implements IPersistFile.
It's not the easiest thing to work with, and since it's COM, it doesn't interoperate well with .NET streams, but I believe it would be doable using IPersistStream.

Related

Node.js - Download .docx file exported as html from onedrive using microsoftgraph api call

When making a call like this example from here
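const fs = require('fs');

// `client` is an already initialized Microsoft Graph client
client
  .api('/me/drive/root/children/Doc.docx/content')
  .getStream((err, downloadStream) => {
    if (err) return console.log(err);
    // Pipe the downloaded bytes into a local file
    const writeStream = fs.createWriteStream('Mydoc.docx');
    downloadStream.pipe(writeStream).on('error', console.log);
  });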
It works as expected. What I want is to get the .docx file as HTML. Is there any way to download it in HTML format, or do I have to save the file and then try to export it to HTML? Thanks
Word documents (.docx) do not use HTML; they use Office Open XML (OOXML). Technically, they are a zipped package that contains several elements along with the raw OOXML of the document.
OneDrive itself does not provide any document conversion tools; it is just the cloud storage the document is stored in.
In order to convert a document from one format to another (OOXML to HTML, for example), you'll need to use a third-party tool or service for that purpose. I'd suggest taking a look at Aspose. They offer a slew of file format conversion tools, including one for Word. I've had a number of developers report good results using their Aspose Cloud services as well.
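To make the "zipped package" point concrete, here is a small sketch that looks at the first bytes of a downloaded file. ZIP-based packages such as .docx start with the bytes "PK\x03\x04", while legacy .doc files are OLE2 compound files with a different signature. The function name sniffContainer is made up for this example, and the check only identifies the container, not whether the content is really a Word document.

```javascript
// Leading bytes of a ZIP archive ("PK\x03\x04"), used by .docx packages.
const ZIP_SIGNATURE = Buffer.from([0x50, 0x4b, 0x03, 0x04]);
// Leading bytes of an OLE2 compound file, used by legacy .doc files.
const OLE2_SIGNATURE = Buffer.from([0xd0, 0xcf, 0x11, 0xe0, 0xa1, 0xb1, 0x1a, 0xe1]);

// Hypothetical helper: returns 'zip', 'ole2' or 'unknown' based on the
// first bytes of the given Buffer.
function sniffContainer(bytes) {
  const startsWith = (sig) =>
    bytes.length >= sig.length &&
    Buffer.compare(bytes.subarray(0, sig.length), sig) === 0;
  if (startsWith(ZIP_SIGNATURE)) return 'zip';
  if (startsWith(OLE2_SIGNATURE)) return 'ole2';
  return 'unknown';
}
```

A file that really came back as HTML would fall into the 'unknown' case, since it would start with markup rather than either signature.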
You can add the query parameter format=html to download the file in HTML format, but supposedly you have to use the beta endpoint.
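As a rough sketch of that suggestion, the request URL could be assembled like this. The path is the one from the question; whether the service actually honours format=html for this file is taken from the answer above and not verified here. The helper name buildConvertedContentUrl is made up.

```javascript
// Root of the Microsoft Graph beta endpoint.
const GRAPH_BETA_ROOT = 'https://graph.microsoft.com/beta';

// Hypothetical helper: appends a `format` query parameter to a Graph
// content path and returns the absolute URL as a string.
function buildConvertedContentUrl(apiPath, format) {
  const url = new URL(GRAPH_BETA_ROOT + apiPath);
  url.searchParams.set('format', format);
  return url.toString();
}

const htmlUrl = buildConvertedContentUrl(
  '/me/drive/root/children/Doc.docx/content',
  'html'
);
console.log(htmlUrl);
// prints https://graph.microsoft.com/beta/me/drive/root/children/Doc.docx/content?format=html
```

The resulting URL would then be requested with the same access token the Graph client uses; the download code from the question stays the same apart from the target file name.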

Can't get the F# Word typeprovider to work for custom Word documents

I've downloaded the FSharp 3 Sample Pack and tried out the sample for the Word documents type provider, which works fine in the TestScript.fsx file when using the provided sample document (AA.docx). But when I try using it with a different Word document, it doesn't work, i.e. no properties are generated on the type provider instance (Person, MyCompany, etc.). Even if I create a new document and copy the contents of AA.docx to it (keeping source formatting), it doesn't work. What could be the issue?
The Word type provider uses the Open XML API. The same Word content can have different XML representations behind the scenes. I'd suggest downloading the Open XML SDK and using its tool to visualize the document's content.

Convert doc to pdf programmatically without using Word / third-party tools

Is it possible to convert a doc file to a pdf file programmatically, without using the Word application or third-party tools? Preferably in Delphi XE4. If so, how?
Yes, you can convert .doc/.docx files to .pdf without Word and without third-party controls. The specifications are publicly available - [simply] read and parse the .doc/.docx file according to the specification and generate the content according to the .pdf specification.
Here is the specification for the MS-DOC (.doc) file format:
MS-DOC Specification (622 pages) -- Word97 through 2007
MS-DOCX Extensions Specification (105 pages) -- Word2010 through 2013
See also - Open Document and OpenXML Format
And the specification for the .pdf format:
PDF Reference (1310 pages)
Really, I think you'll find you probably want to use a third party component...

Apache Tika Office to PDF conversion

I am trying to convert Office files to PDF using POI and iText. I am able to do the basic conversion, where I read the Word file using WordExtractor and write the contents to a PDF file using the PDF writer.
However, this does not retain the structure (tables, styles, etc.). I have read on this forum that you can retain the formatting using Tika. Are there any working examples for this?

Grabbing text from various document formats in Ruby on Rails

I'm new to Rails but am developing a web app that requires taking text from a large database of text files and displaying the text in HTML. The files are in .doc, .docx, .wps, and .pages, and are currently just sitting on a hard drive. There are a small enough number of files in .wps and .pages that I could convert these to .doc manually, but the question remains: how do I get to the text inside a .doc or .docx file so that I can save it into a SQLite database for later use?
Thanks!
Take a look at Yomu. It's a gem that acts as a wrapper for Apache Tika, and it supports a variety of document formats, including the following:
Microsoft Office OLE 2 and Office Open XML Formats (.doc, .docx, .xls, .xlsx, .ppt, .pptx)
OpenOffice.org OpenDocument Formats (.odt, .ods, .odp)
Apple iWork Formats
Rich Text Format (.rtf)
Portable Document Format (.pdf)
It's a long, roundabout way, but OpenOffice can convert files, and there are programmatic ways to do that: http://railstech.com/2010/08/convert-open-office-document-to-another-open-office-format/
That may not be the best way, but maybe it will grease the wheels a bit.

Resources