How can I read a docx file word by word, with styles? I want to compare two docx files word by word and, based on the differences, write to another docx file (using C# and OOXML).
I have tried DocumentFormat.OpenXml.Extensions.dll, OpenXMLdiff.dll and ICSharpCode.SharpZipLib.dll, but none of them gives me a way to read word by word with styles. (ICSharpCode.SharpZipLib does give the text word by word, but not the style associated with each word.)
Any help on this will be very useful.
This MSDN article shows how to reliably retrieve the exact text of a document, paragraph by paragraph.
http://msdn.microsoft.com/en-us/library/ff686712.aspx
At the same time, you can determine the style for each paragraph. That is pretty easy. The following blog post shows how to retrieve the style and text for each paragraph:
http://blogs.msdn.com/b/ericwhite/archive/2009/02/16/finding-paragraphs-by-style-name-or-content-in-an-open-xml-word-processing-document.aspx
Comparing the two? It depends on the exact semantics you want. One approach would be to create an XML document for each file that contains its paragraphs and styles, then compare the two XML documents. Such an XML document might look something like this:
<Root>
  <Para>
    <Style>Normal</Style>
    <Text>This is the text of the paragraph.</Text>
  </Para>
  <Para>
    <Style>Heading1</Style>
    <Text>Overview of the Process</Text>
  </Para>
</Root>
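Once each document has been reduced to a list of paragraphs with a style and a text, as in the XML above, the word-by-word comparison itself can be sketched as follows. This is a minimal Ruby sketch of the logic only (the question targets C#, but the algorithm carries over), not code from either linked article. The names `word_diff` and `compare_paragraphs` are hypothetical, words are split on whitespace, and paragraphs are assumed to pair up one-to-one by position.

```ruby
# Longest-common-subsequence diff over two word arrays.
# Returns [op, word] pairs where op is :same, :removed or :added.
def word_diff(old_words, new_words)
  rows = old_words.length
  cols = new_words.length
  # lcs[i][j] = LCS length of old_words[i..-1] and new_words[j..-1]
  lcs = Array.new(rows + 1) { Array.new(cols + 1, 0) }
  (rows - 1).downto(0) do |i|
    (cols - 1).downto(0) do |j|
      lcs[i][j] = if old_words[i] == new_words[j]
                    lcs[i + 1][j + 1] + 1
                  else
                    [lcs[i + 1][j], lcs[i][j + 1]].max
                  end
    end
  end

  # Walk the table from the start, emitting one operation per word.
  result = []
  i = 0
  j = 0
  while i < rows && j < cols
    if old_words[i] == new_words[j]
      result << [:same, old_words[i]]
      i += 1
      j += 1
    elsif lcs[i + 1][j] >= lcs[i][j + 1]
      result << [:removed, old_words[i]]
      i += 1
    else
      result << [:added, new_words[j]]
      j += 1
    end
  end
  old_words[i..-1].each { |word| result << [:removed, word] }
  new_words[j..-1].each { |word| result << [:added, word] }
  result
end

# Compares paragraphs pairwise by position. Each paragraph is a hash
# with :style and :text keys, mirroring the <Para> elements above.
def compare_paragraphs(old_paras, new_paras)
  count = [old_paras.length, new_paras.length].max
  (0...count).map do |idx|
    old_para = old_paras[idx] || { style: nil, text: "" }
    new_para = new_paras[idx] || { style: nil, text: "" }
    {
      style_changed: old_para[:style] != new_para[:style],
      words: word_diff(old_para[:text].split, new_para[:text].split)
    }
  end
end

old_doc = [{ style: "Normal", text: "This is the text of the paragraph." }]
new_doc = [{ style: "Heading1", text: "This is the new text of a paragraph." }]
report = compare_paragraphs(old_doc, new_doc)
```

Each entry in `report` says whether the paragraph style changed and lists the words as kept, removed, or added, which is the information needed to write the result document. Comparing styles per word rather than per paragraph would require extracting runs instead of paragraphs, which this sketch does not attempt.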
The easiest way is to unzip the DOCX file using your favorite ZIP library and then compare the extracted XML files as plain text.
When I use scholar.google.com to get the full reference code (BibTeX) such as
@article{li2018design,
title={Design and implementation of building structure monitoring system based on radio frequency identification (RFID)},
author={Li, Hongwei and Ren, Yilei},
journal={International Journal of RF Technologies},
volume={9},
number={1-2},
pages={37--49},
year={2018},
publisher={IOS Press}
}
I then want to copy this reference from scholar.google.com and paste it into the LaTeX journal template file without any modification.
The template file looks like this:
Unfortunately, the paste in the template file does NOT work.
Here is the error message:
Can you help me do this and make the file compile correctly?
The format given by Google Scholar is BibTeX. To use it, store the entries of your bibliography in a separate file named <mydoc>.bib and compile, in order, with:
LaTeX (or pdfLaTeX)
BibTeX
LaTeX (or pdfLaTeX)
LaTeX (or pdfLaTeX)
Your template instead uses the thebibliography environment, in which you write each entry by hand as a \bibitem. That approach is simpler, though I recommend it only when you have a small number of bibliography entries. If you have to use it, you must rewrite the reference retrieved from Google Scholar in the format needed by your document.
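If the template does require thebibliography, the rewrite can be done by hand or scripted. Here is a minimal Ruby sketch, assuming simple entries with one `name={value}` field per line, as Google Scholar produces them. The helper `bibtex_to_bibitem` is hypothetical, and the layout of the generated entry is only illustrative; check the journal's instructions for the citation style it actually requires.

```ruby
# Converts one simple BibTeX entry into a \bibitem line.
# Assumes each field sits on its own line as `name={value}`.
def bibtex_to_bibitem(entry)
  key = entry[/@\w+\{\s*([^,\s]+)\s*,/, 1]
  raise ArgumentError, "no citation key found" if key.nil?

  fields = {}
  entry.each_line do |line|
    match = line.match(/^\s*(\w+)\s*=\s*\{(.*)\}\s*,?\s*$/)
    fields[match[1].downcase] = match[2] if match
  end

  # "A and B" in BibTeX becomes "A; B" in the printed entry.
  authors = fields.fetch("author", "").split(/\s+and\s+/).join("; ")
  citation = "#{authors}. #{fields['title']}. \\emph{#{fields['journal']}}, " \
             "#{fields['volume']}(#{fields['number']}):#{fields['pages']}, " \
             "#{fields['year']}."
  "\\bibitem{#{key}} #{citation}"
end

scholar_entry = <<~'BIB'
  @article{li2018design,
    title={Design and implementation of building structure monitoring system based on radio frequency identification (RFID)},
    author={Li, Hongwei and Ren, Yilei},
    journal={International Journal of RF Technologies},
    volume={9},
    number={1-2},
    pages={37--49},
    year={2018},
    publisher={IOS Press}
  }
BIB

puts bibtex_to_bibitem(scholar_entry)
```

The printed line goes between \begin{thebibliography}{99} and \end{thebibliography}, and the entry is then cited with \cite{li2018design}.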
Have a look here: it explains the differences very well.
I am new to Saxon and XSLT. In our business, a feeder delivers more than one XML file, and an XSLT (generated from Altova) creates one output XML file from them. We have selected Saxon as the transformer.
So far I am able to transform a single XML file.
Does anybody have an example where an XSLT takes more than one XML file as input and is run using Saxon?
Thanks & regards,
Kumar
You haven't told us enough about the requirements, but there are several techniques to be aware of:
You can pass additional documents as stylesheet parameters declared using xsl:param
You can read a document (given its URI) using the doc() or document() functions
You can read a whole collection of documents (e.g. the contents of a directory) using the collection() function
Something that I've had a good hard look for and I have not been able to find, is how to efficiently obtain a hard copy of Javadocs? Obviously, one solution is simply to navigate to each page and execute a browser print, but there's got to be a better way! Do you guys have any ideas?
You can use DocBook Doclet (dbdoclet) to create DocBook XML from your JavaDoc Comments. The DocBook XML can then be transformed to PDF or (Singlepage-)HTML.
You can call the tool from the command line. Point it to your source files and it will generate the DocBook XML. This works similarly to the javadoc command, which generates the JavaDoc HTML. Example:
./dbdoclet -sourcepath ~/my-java-program/src/main/java -subpackages org.example
The result is a DocBook XML file in a dbdoclet subdirectory which can be used to create a PDF or HTML file. This can also be done from the command line; I am using the docbkx-maven-plugin for this.
You can do mass conversions with it, but it would require some time to make it work the way you want.
I need to output a file in the format of a Word document from a Ruby-based web app (Rails/Sinatra) based on some textual content in the app. Is there library support in Ruby for creating and structuring a Word document?
Take a look at WordML, the XML format for Word files.
John Durant's blog has a useful list of WordML and FAQ resources
Walkthrough: Word 2007 XML Format
Useful tool for creating XSLT transforms: Office 2003 Tool: WordprocessingML Transform Inference Tool
These SO posts might also be of interest:
Creating Word or XML document with VBA
Generating WordML Reports Using Templates and XPath using ASP.Net
Convert XHTML to Word ML
XML to WordML using XSLT 1.0 - replace html tags within xml content with wordML formatting tags
How can I convert docx or wordml xml files to xsl-fo?
You don't specify what "a Word document" means exactly. Is it a Word 2003-style doc file? Is it a Word 2007 docx file? Is it just something Word can open that supports styling?
If the latter is what you want, you could use RTF, which is somewhat easier than the doc format. There is a library called Ruby RTF that should do what you want, though I've honestly never used it myself.
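To give an idea of how simple basic RTF is, here is a minimal sketch that writes an RTF document by hand. It does not use the Ruby RTF library or its API; `rtf_escape` and `rtf_document` are hypothetical helpers, the font and size are arbitrary choices, and only plain paragraphs are covered.

```ruby
# Escapes the characters that are special in RTF and encodes
# non-ASCII characters as \uN? Unicode escapes.
def rtf_escape(text)
  text.each_char.map do |ch|
    code = ch.ord
    if ["\\", "{", "}"].include?(ch)
      "\\#{ch}"
    elsif code < 128
      ch
    elsif code <= 0xFFFF
      # \u takes a signed 16-bit number; "?" is the fallback character.
      signed = code > 0x7FFF ? code - 0x10000 : code
      "\\u#{signed}?"
    else
      "?" # characters outside the BMP are not handled in this sketch
    end
  end.join
end

# Wraps the given paragraphs in a minimal RTF document
# (one font, 12 pt text, one \par per paragraph).
def rtf_document(paragraphs)
  body = paragraphs.map { |para| "#{rtf_escape(para)}\\par" }.join("\n")
  "{\\rtf1\\ansi\\deff0{\\fonttbl{\\f0 Arial;}}\\f0\\fs24 #{body}\n}"
end

rtf = rtf_document(["Report for {Q1}", "Caf\u00e9 sales rose."])
# In a Rails or Sinatra app the string would be sent as the response
# body; here it is simply written to a file.
File.write("report.rtf", rtf)
```

Word opens the resulting file directly. For anything beyond plain paragraphs (tables, images, named styles), a library is the more practical route.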
Would it be easier to generate a Word 2003 document: Is there an easier to understand file format for a basic Word 2003 .doc that doesn't require a PhD in XML, etc?
I'm looking for a fast and reliable way to read/parse large PDF files in Ruby (on Linux and OSX).
So far I've found the rather old and simple PDF-toolkit (a pdftotext wrapper) and PDF-reader, which was unable to read most of my files, although both libraries provide exactly the functionality I am looking for.
My question: Have I missed something? Is there a tool that is better suited (faster and more reliable) to solve my problem?
You might find Docsplit useful:
Docsplit is a command-line utility and Ruby library for splitting apart documents into their component parts: searchable UTF-8 plain text, page images or thumbnails in any format, PDFs, single pages, and document metadata (title, author, number of pages...)
After trying different methods, I'm using PDF-Toolkit now. It's quite old, but it's fast, stable and reliable. Besides, it really doesn't need to be new, because it just wraps the xpdf commandline utilities.
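For reference, calling the xpdf pdftotext utility directly takes only a few lines. This is a minimal sketch, assuming pdftotext is installed and on the PATH; `pdftotext_command` and `extract_text` are hypothetical helpers and not part of PDF-Toolkit's API.

```ruby
require "open3"

# Builds the argument list for pdftotext. Passing "-" as the output
# file makes pdftotext write the extracted text to standard output.
def pdftotext_command(pdf_path, first_page: nil, last_page: nil)
  cmd = ["pdftotext", "-enc", "UTF-8"]
  cmd += ["-f", first_page.to_s] if first_page
  cmd += ["-l", last_page.to_s] if last_page
  cmd + [pdf_path, "-"]
end

# Runs pdftotext and returns the extracted text, raising on failure.
def extract_text(pdf_path, **page_range)
  stdout, stderr, status =
    Open3.capture3(*pdftotext_command(pdf_path, **page_range))
  raise "pdftotext failed: #{stderr.strip}" unless status.success?
  stdout
end
```

A call such as `extract_text("large.pdf", first_page: 1, last_page: 10)` then returns the text of the first ten pages as a string. Extracting page ranges separately keeps memory use bounded on large files.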
You could use JRuby and a Java PDF parsing library such as Apache PDFBox (https://www.ohloh.net/p/pdfbox). See also http://java-source.net/open-source/pdf-libraries.
Did you have a look at the CombinePDF library?
It's a pure-Ruby solution that allows some PDF manipulation, such as extracting pages, overlaying one PDF page over another, page numbering, writing basic text and tables, etc.
Here's an example of stamping an existing PDF file with a logo. The example reads a PDF file, extracts one page to use as a stamp, and stamps another PDF file with it.
require 'combine_pdf'
company_logo = CombinePDF.load("company_logo.pdf").pages[0]
pdf = CombinePDF.load "content_file.pdf"
pdf.pages.each {|page| page << company_logo}
pdf.save "content_with_logo.pdf"
You can also stamp text, number pages or write simple tables:
require 'combine_pdf'
pdf = CombinePDF.load "content_file.pdf"
pdf.number_pages #adds page numbers. you can add formatting and placement options.
pdf.pages.each {|page| page.textbox "One Way To Stamp"}
# you can use a shortcut method to stamp pages
pdf.stamp_pages "Another way to stamp"
# you can use the shortcut method for both text and PDF stamps
company_logo = CombinePDF.load("company_logo.pdf").pages[0]
pdf.stamp_pages company_logo
# you can write simple tables
pdf.pages[0].write_table headers: ['first name', 'surname'], table_data: [['John', 'Doe'], ['Mr.', 'Smith']]
pdf.save "content_with_logo.pdf"
It's not meant for complex operations, but it complements most PDF authoring libraries and allows you to use PDF templates instead of writing the whole thing from scratch.
Here are some options:
http://en.wikipedia.org/wiki/List_of_PDF_software
From that link, and from searching SourceForge, there are a couple of command-line utilities that might do what you want, like this one: http://pdftohtml.sourceforge.net/
Depending on your requirements and what the PDFs look like, you could look at using the Google Docs API (uploading the PDF and then downloading it as text), or could also try something like gocr. I've had a lot of luck parsing image text with gocr in the past, and you'd just have to bounce out to the shell to do it, like gocr -i whatever.pdf (I think it works with PDFs).
The downside to all of these is that they're not pure-Ruby implementations, but lots of the good (and free) OCR projects seem to be done that way.
If you just need to get the text content out of a PDF file, pdftohtml at SourceForge is efficient.
It is not suited to dealing with images.