OpenXml: get the page number to which each paragraph belongs in a .docx file - openxml-sdk

I have a Word docx file and I want to retrieve all the paragraphs with OpenXml in C#.
I need to know:
1. The number of pages in the document.
2. The page number to which each paragraph belongs.
Can you show an example where the paragraphs of the document are read?

Unfortunately, as the answers to "Why only some page numbers stored in XML of docx file?" explain, a docx file does not contain reliable page numbers. The XML parts carry no page numbers until Microsoft Word opens the file and renders it dynamically. Reading the OpenXml documentation, such as https://learn.microsoft.com/en-us/dotnet/api/documentformat.openxml.wordprocessing.pagenumber?view=openxml-2.8.1 , does not change that.
You can unzip some docx files and search for "page" or "pg" to see this for yourself. I did this on several different kinds of docx files, and they all told me the same thing. Glad if this helps.
A few months ago, I reworked a Python package called docx2python to do something similar: I reproduced a structured (with levels) XML file from a docx file. As far as I know, a paragraph contains several Runs, and each Run contains a single piece of text. You can read this document to see how to do it: https://learn.microsoft.com/en-us/dotnet/api/documentformat.openxml.wordprocessing.paragraph?view=openxml-2.8.1 . Plain paragraphs are not hard to read. Glad if this helps.
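To make the point about paragraphs and missing page numbers concrete, here is a minimal sketch in Python (not C#, and not the docx2python code mentioned above) that uses only the standard library. It assumes the main body lives in word/document.xml and treats each w:lastRenderedPageBreak marker as a page hint. Word only writes those markers when it last saved the file, so they can be missing or stale; the function name read_paragraphs is made up for this example.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace, in ElementTree's "{uri}" tag notation.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def read_paragraphs(docx_file):
    """Return a list of (page_hint, text) tuples, one per paragraph.

    page_hint is only a guess: it counts the w:lastRenderedPageBreak
    markers that Word cached the last time it saved the file.
    """
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    page = 1
    paragraphs = []
    for para in root.iter(W + "p"):
        start_page = None
        parts = []
        for node in para.iter():
            if node.tag == W + "lastRenderedPageBreak":
                page += 1
            elif node.tag == W + "t":
                # Remember the page on which the first text of the paragraph sits.
                if start_page is None:
                    start_page = page
                parts.append(node.text or "")
        paragraphs.append((start_page or page, "".join(parts)))
    return paragraphs
```

Calling read_paragraphs("report.docx") (a hypothetical file name) gives the paragraph texts reliably, while the page hints are only as good as the markers Word left behind. A file produced by another tool may contain none at all, in which case every paragraph reports page 1.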

Related

DOCBOOK to EPUB File Size

Some eBook reading devices (like older Kindles) perform better with OEBPS/Text file sizes in the 350KB range. When you go over that, page load and scrolling can be a miserable user experience.
Question: If you have a large text, 4 MB for example, will the DocBook to EPUB publishing flow put that into OEBPS/Text as a monolithic 4 MB file, or will it split it into smaller files for you?
If it splits the file, does it repair the anchor IDs to reflect the new file name?
I couldn't find the answer to this at docbook.org.
Question: If you have a large text, 4 MB for example, will the DocBook to EPUB publishing flow put that into OEBPS/Text as a monolithic 4 MB file, or will it split it into smaller files for you?
The "DocBook to EPUB publishing flow" (DocBook XSL) will split the input XML into smaller output files.
This process is called "chunking" and is described in detail here: http://www.sagehill.net/docbookxsl/Chunking.html (this is a section from the book DocBook XSL: The Complete Guide).
If it splits the file, does it repair the anchor IDs to reflect the new file name?
I am not completely sure what you mean by "repair the anchor IDs", but the chunking process does ensure that cross-references and the entries that go into the *.opf and *.ncx files are correct.
EPUB is one of many output formats that can be created from DocBook sources. If you have never used DocBook XSL before, you should read "DocBook XSL: The Complete Guide" (see link above). This book does not cover EPUB output specifically (it was written before the EPUB stylesheets had been developed).
DocBook XSL provides stylesheets for both EPUB 2 and EPUB 3 (most of the effort goes into EPUB 3 these days):
README for EPUB 2: http://sourceforge.net/p/docbook/code/HEAD/tree/trunk/xsl/epub/README.
README for EPUB 3: http://sourceforge.net/p/docbook/code/HEAD/tree/trunk/xsl/epub3/README.
Best practices are to create separate HTML files for chapters (and sometimes sections).
As long as your file separates things into one of these elements, you can use chunking to produce the results you want.
All the anchor IDs will work like a charm. Even the indices will work!

Export data to XLS (not via CSV) on iOS

I need to export some data to an .XLS file, pdf, and print.
I already tried the simple solution: exporting it to .CSV with CHCSVWriter. It works for printing and saving to PDF (I open the CSV in a UIWebView and get the PDF or print from there). However, opening the CSV in Excel has two main problems:
1 - First, as the name says, the values in a CSV are separated by commas, and some versions of Excel require the user to split them into cells 'manually'.
2 - I have Hebrew characters, and I already tried all the string encodings, but I can't get both Hebrew and Latin characters to display.
So, after days of trying to use CSV to solve the issues above, I gave up. How can I export my data to XLS?
The LibXL library provides this functionality for both xls and xlsx formats. There is no iOS version, but people say the iOS version is coming. You may want to contact LibXL support to confirm this.
EDIT:
The iOS version is available now.
This article explains how to programmatically create an Excel (.xls) file without using any external library. It just opens a file stream and it writes XML contents straight to it.
It is written in C#, but the core information coming out of it is the XML formatting used to create nodes and fill attributes for corresponding cell values and formatting.
Please note that I have not tried this myself; I found it while searching. Feel free to ask if some of the C# bits are not clear. HTH
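The idea of writing the XML straight to a file can be sketched as follows. This is not the code from the article: it is a small Python illustration, assuming the XML in question is the Excel 2003 "XML Spreadsheet" format, and the function name to_spreadsheetml is made up for this example. Each value becomes a Cell/Data node whose ss:Type attribute says whether it is a string or a number, and the text is written as UTF-8 so that Hebrew and Latin characters can sit side by side.

```python
from xml.sax.saxutils import escape, quoteattr

# Namespace used by the Excel 2003 "XML Spreadsheet" format.
SS_NS = "urn:schemas-microsoft-com:office:spreadsheet"


def to_spreadsheetml(rows, sheet_name="Sheet1"):
    """Build an XML Spreadsheet document (as a str) from a list of rows."""
    lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<?mso-application progid="Excel.Sheet"?>',
        f'<Workbook xmlns="{SS_NS}" xmlns:ss="{SS_NS}">',
        f' <Worksheet ss:Name={quoteattr(sheet_name)}>',
        '  <Table>',
    ]
    for row in rows:
        lines.append('   <Row>')
        for value in row:
            # Numbers get ss:Type="Number"; everything else is stored as text.
            is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
            cell_type = "Number" if is_number else "String"
            lines.append(
                f'    <Cell><Data ss:Type="{cell_type}">{escape(str(value))}</Data></Cell>'
            )
        lines.append('   </Row>')
    lines += ['  </Table>', ' </Worksheet>', '</Workbook>']
    return "\n".join(lines)
```

To save the result, write it with an explicit encoding, for example open(path, "w", encoding="utf-8").write(to_spreadsheetml(rows)). The same node structure can be produced from Objective-C or Swift on iOS with plain string building.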

Grabbing text from various document formats in Ruby on Rails

I'm new to Rails but am developing a web app that requires taking text from a large database of text files and displaying the text in HTML. The files are in .doc, .docx, .wps, and .pages, and are currently just sitting on a hard drive. There are a small enough number of files in .wps and .pages that I could convert these to .doc manually, but the question remains: how do I get to the text inside a .doc or .docx file so that I can save it into a SQLite database for later use?
Thanks!
Take a look at Yomu. It's a gem that acts as a wrapper for Apache Tika, and it supports a variety of document formats, including the following:
Microsoft Office OLE 2 and Office Open XML Formats (.doc, .docx, .xls, .xlsx, .ppt, .pptx)
OpenOffice.org OpenDocument Formats (.odt, .ods, .odp)
Apple iWorks Formats
Rich Text Format (.rtf)
Portable Document Format (.pdf)
It's a long, roundabout way, but OpenOffice can convert files, and there are programmatic ways to do that: http://railstech.com/2010/08/convert-open-office-document-to-another-open-office-format/
That may not be the best way yet, but maybe it will grease the wheels a bit.

Creating Microsoft Word (.docx) documents in Ruby

Is there an easy way to create Word documents (.docx) in a Ruby application? Actually, in my case it's a Rails application served from a Linux server.
A gem similar to Prawn but for DOCX instead of PDF would be great!
As has been noted, there don't appear to be any libraries to manipulate Open XML documents in Ruby, but OpenXML Developer has complete documentation on the format of Open XML documents.
If what you want is to send a copy of a standard document (like a form letter) customized for each user, it should be fairly simple given that a DOCX is a ZIP file that contains various parts in a directory hierarchy. Have a DOCX "template" that contains all the parts and tree structure that you want to send to all users (with no real content), then simply create new (or modify existing) pieces that contain the user-specific content you want and inject it into the ZIP (DOCX file) before sending it to the user.
For example: You could have document-template.xml that contains Dear [USER-PLACEHOLDER]:. When a user requests the document, you replace [USER-PLACEHOLDER] with the user's name, then add the resulting document.xml to the your-template.docx ZIP file (which would contain all the images and other parts you want in the Word document) and send that resulting document to the user.
Note that if you rename a .docx file to .zip it is trivial to explore the structure and format of the parts inside. You can remove or replace images or other parts very easily with any ZIP manipulation tools or programmatically with code.
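The template approach described above can be sketched in a few lines. The example is in Python rather than Ruby and uses only the standard library; the function name fill_docx_template is hypothetical. It copies every part of the template ZIP unchanged, except word/document.xml, where it substitutes the placeholders. One caveat to keep in mind: Word sometimes splits a piece of text across several runs, so a placeholder typed in Word may not appear as one contiguous string in the XML.

```python
import zipfile
from xml.sax.saxutils import escape


def fill_docx_template(template, output, replacements):
    """Copy a .docx template, replacing placeholders in word/document.xml.

    template and output may be file paths or file-like objects.
    replacements maps placeholder strings to the text to insert.
    """
    with zipfile.ZipFile(template) as src, \
            zipfile.ZipFile(output, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == "word/document.xml":
                text = data.decode("utf-8")
                for placeholder, value in replacements.items():
                    # Escape &, < and > so the result stays well-formed XML.
                    text = text.replace(placeholder, escape(value))
                data = text.encode("utf-8")
            # Images and all other parts are copied through untouched.
            dst.writestr(item, data)
```

A call such as fill_docx_template("your-template.docx", "letter.docx", {"[USER-PLACEHOLDER]": "Alice"}) would then produce the per-user document, with the file names here being placeholders themselves.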
Generating a brand new Word document with completely custom content from raw XML would be very difficult without access to an API to make the job easier. If you really need to do that, you might consider installing Mono, then use VB.NET, C# or IronRuby to create your Open XML documents using the Open XML Format SDK 1.0. Since you would just be using the Microsoft.Office.DocumentFormat.OpenXml.Packaging Namespace to manipulate Open XML documents, it should work okay in Mono, which seems to support everything the SDK requires.
Maybe this gem is interesting for you:
https://github.com/trade-informatics/caracal/
It is like Prawn, but for docx.
You can use Apache POI. It is written in Java, but integrates with Ruby as an extension.
This is an old question but there's a new answer. If you'd like to turn an HTML doc into a Word (docx) doc, just use the 'htmltoword' gem:
https://github.com/karnov/htmltoword
I'm not sure why there was answer creep and everyone started posting templating solutions, but this answers the OP's question. Just like Prawn, except Word instead of PDF.
UPDATE:
There's also pandoc and an API wrapper for pandoc called docverter. Both have slightly complicated installs since pandoc is a Haskell library.
I know if you serve a HTML document as a word document with the .doc extension, it will open in Word just fine. Just don't do anything fancy.
Edit: Here is an example using classic ASP. http://www.aspdev.org/asp/asp-export-word/
Using a technique very similar to that suggested by Grant Wagner I have created a Ruby html to word gem that should allow you to easily output Word docx files from your Ruby app. You can check it out at http://github.com/nickfrandsen/htmltoword - Simply pass it an HTML string and it will create a corresponding Word docx file.
def show
  respond_to do |format|
    format.docx do
      file = Htmltoword::Document.create params[:docx_html_source], "file_name.docx"
      send_file file.path, :disposition => "attachment"
    end
  end
end
Hope you find it useful. If you have any problems with it feel free to open a github issue.
Disclosure: I'm the leader of the docxtemplater project.
I know you're looking for a ruby solution, but because all other solutions only tell you how to do it globally, without giving you a library that does exactly what you want, here's a solution based on JS or NodeJS (works in both)
DocxTemplater Library
Demo of the library
You can also use it in the commandline:
npm install docxtemplater -g
docxtemplater <configFile>
----config.docxFile: The input file in docx format
----config.outputFile: The outputfile of the document
Doccy (doccyapp.com) has an API that does just that. It supports docx, odt and pages, and can convert to PDF as well if you like.
Further to Grant's answer, you can also send Word a "Flat OPC" file, which is essentially the docx unzipped and concatenated to create a single xml file. This way, you can replace [USER-PLACEHOLDER] in one file and be done with it (ie no zipping or unzipping).
If anyone is still looking at this, this post explains how to use an XML data source. This works nicely for me.
http://seroter.wordpress.com/2009/12/23/populating-word-2007-templates-through-open-xml/
Check out this github repo: https://github.com/jawspeak/ruby-docx-templater
It allows you to create a document from a word template.
If you're running on Windows, of course, it's a matter of WIN32OLE and some pain with the Word COM objects.
Chances are that you're serving from a *nix environment, though. Word 2007 uses the "Microsoft Office Open XML" format (*.docx), which can be opened using the appropriate compatibility pack from Microsoft.
Some of the more recent Office apps (2002/XP and 2003 at least) had their own XML formats which may also be useable.
I'm not aware of any Ruby tools to make the process easier, sadly.
If it can be made acceptable, I think I'd be inclined to go down the renamed-html file route. I just saved a document as HTML from WordXP, renamed it to a .doc and opened it without problem.
I encountered the same problem. Unfortunately I could not manipulate the XML, because my clients need to fill in the templates themselves, and that is not always possible (for example, Office for Mac does not allow this).
As a solution to this problem, I made a simple gem that lets you use an RTF document as a template with embedded Ruby: https://github.com/eicca/rtf-templater
I tested it and it works OK for filling reports and documents. However, formatting displays badly with complex loops and conditions.

Search Words in pdf files

Is it possible to search for "words" in PDF files with Delphi?
I have code that can search in many other file types (exe, dll, txt), but it doesn't work with PDF files.
It depends on the structure of the specific PDF.
If the PDF is made of images (scanned pages) then you have to OCR each image and build a full-text index inside the PDF. (To see if it's image-based, open it with Notepad and look for obj tags full of random chars.) There are a few utilities and apps that do this kind of work for you; CVision PDF Compressor is one that I have used before.
If the pdf is a standard PDF, then you should be able to open it like any other text file and search for the words.
Here is a page that details some of the structure of a PDF. This is an SO post on the same topic.
The components/libraries mentioned in the answer to this question should do what you need.
I'm just working on a project that does this. The method I use is to convert the PDF file to plain text (with pdftotext.exe) and create an index on the resulting text. We do the same with word and other office files, works pretty good!
Searching directly into pdf files from Delphi (without external app) is more difficult I think. If you find anything, please update here as I would also be very interested in that!
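The convert-then-index method described above can be sketched as follows. This is a Python illustration rather than Delphi, and it is not the poster's actual code: it assumes a pdftotext executable is on the PATH and that it separates pages with a form feed character in its output. The function names pdf_to_text and build_index are made up for this example.

```python
import re
import subprocess
from collections import defaultdict


def pdf_to_text(pdf_path):
    """Run the external pdftotext tool and return the extracted text.

    Assumes pdftotext is installed; "-" asks it to write to stdout.
    """
    result = subprocess.run(
        ["pdftotext", "-enc", "UTF-8", pdf_path, "-"],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def build_index(text):
    """Map each lower-cased word to the set of page numbers it appears on.

    Pages are assumed to be separated by form feed characters.
    """
    index = defaultdict(set)
    for page_no, page in enumerate(text.split("\f"), start=1):
        for word in re.findall(r"\w+", page.lower()):
            index[word].add(page_no)
    return index
```

With this split, the search itself never touches the PDF: build_index(pdf_to_text("manual.pdf")) (a hypothetical file name) is built once, and a lookup such as index["delphi"] returns the pages to show. The same two steps translate directly to a Delphi program that shells out to pdftotext.exe.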
One option I have used is Microsoft's IFilter technology, which is used by Windows Desktop Search and many other products such as SharePoint and SQL Server full-text search.
It supports almost any office/office-like file format, even dwg, msg, pdf, and files in zip/rar archives.
The easiest way to use it is to run FiltDump.exe on any files you have, and index the text output.
To know about the filters installed on your PC, you can use ifilter explorer.
Wikipedia has some links on its ifilters page.
Quick PDF Library's GetPageText function can give you the words from a PDF as well as the page number and the co-ordinates of those words - sometimes useful for highlighting.
PDF is not just a binary representation. Think of it as a tree of objects, where an object node has some metadata and some content information. Some of these objects have string data, some don't. Some of these are even encrypted, and some are compressed. So, there's very little chance your string finder will work on any arbitrary PDF.
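A small Python sketch shows why the compression point matters. It assumes the simplest case only: unencrypted streams compressed with the Flate (zlib) filter and text stored as a plain literal string. Real PDFs also use other filters, font-specific encodings and text split across several operators, so this is an illustration of the problem, not a working PDF search. Both function names are hypothetical.

```python
import re
import zlib


def naive_contains(pdf_bytes, word):
    """What a plain byte search over the file does."""
    return word.encode("latin-1") in pdf_bytes


def search_flate_streams(pdf_bytes, word):
    """Inflate each stream that happens to be zlib data and search inside it."""
    needle = word.encode("latin-1")
    pattern = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)
    for match in pattern.finditer(pdf_bytes):
        try:
            # decompressobj tolerates the end-of-line bytes left after the data.
            data = zlib.decompressobj().decompress(match.group(1))
        except zlib.error:
            continue  # not Flate data (or encrypted): skip this stream
        if needle in data:
            return True
    return False
```

On a file whose page content is Flate-compressed, naive_contains typically misses a word that search_flate_streams finds, which matches the behaviour described in the question: the string finder works on exe, dll and txt files but not on PDFs.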
