Microsoft Word clipboard HTML documentation - parsing

I could not find any documentation describing the conventions used in the text/html data placed on the clipboard when copying part of a Word document.
Specifically, I want to know which classes, such as MsoNormal, TableGrid313, MsoTableGrid, MsoHeading9 and MsoListParagraph, exist. Or does the styling information of a piece of text always lie in the style attribute of a span element containing that text?

The Word round-trip HTML is undocumented, as it's not an official Word file format.
It was created many years ago to enable round-tripping Word documents for viewing (and some editing) in a browser. Even then, it was not documented, as it was intended for use by internal Microsoft software. Being HTML, anyone could read and produce it, but MS made a conscious decision not to document it (and so not to have to put resources into maintaining that documentation).
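With no reference to consult, one practical option is to survey the markup yourself: copy representative content from Word, dump the text/html clipboard payload, and count which class names and inline style properties actually occur. The following is only a minimal sketch of that idea using Python's standard html.parser. The names WordHtmlInspector, inspect_fragment and SAMPLE are hypothetical, and the sample fragment is made up for illustration rather than captured from Word.

```python
from collections import Counter
from html.parser import HTMLParser


class WordHtmlInspector(HTMLParser):
    """Collect class names and inline style properties seen in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.classes = Counter()      # class name -> number of occurrences
        self.style_props = Counter()  # CSS property name -> number of occurrences

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if not value:
                continue
            if name == "class":
                # A class attribute may hold several space-separated names.
                self.classes.update(value.split())
            elif name == "style":
                # Split "prop: value; prop: value" and keep only the property names.
                for decl in value.split(";"):
                    prop, sep, _ = decl.partition(":")
                    if sep and prop.strip():
                        self.style_props[prop.strip().lower()] += 1


def inspect_fragment(html_text):
    """Parse an HTML fragment and return the populated inspector."""
    inspector = WordHtmlInspector()
    inspector.feed(html_text)
    inspector.close()
    return inspector


# Hypothetical sample; a real clipboard payload will be far longer.
SAMPLE = (
    '<p class=MsoNormal><span style="font-size:12.0pt;color:red">Hello</span></p>'
    '<p class=MsoListParagraph style="margin-left:36.0pt">Item</p>'
)

result = inspect_fragment(SAMPLE)
print(sorted(result.classes))
print(sorted(result.style_props))
```

Running this over several real clipboard dumps should give a rough picture of whether formatting arrives through classes, inline styles, or both, but it is an empirical survey and not a substitute for a specification.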

Related

Resources to simplify W3C standard implementation

At my company we’re looking to implement support for W3C standards, such as SVG 1.1 (2nd edition), in our app. I’ve been doing research on how people approach this problem.
There are various resources available at w3.org, but the standard only appears to be available in HTML files. I would like to be able to parse a single document that gives me all of the concepts in the standard, which I can then generate objects from in a programming language of my choice.
Apart from simply parsing the HTML itself, it seems possible to parse the document type definition file, but this doesn’t include type information for attributes such as “color,” whose constraints are described in EBNF in the html document.
What is the best approach to implementing a W3C standard, such as SVG 1.1 (2nd edition)?
Does it simply come down to manually tracking the different parts of the standard, as defined in the HTML?

Edit PDF documents in Delphi

We have a requirement to add the ability to edit PDF documents within a Delphi application.
I.e. given a PDF document, open it and generate a form with edit boxes on it, which the user can use to update the PDF document.
Can anyone suggest a third-party component that would provide this functionality, or suggest some other way of achieving this?
Thanks
I use QuickPDF. Well documented, lots of examples, good support. However updating text in a PDF is an art, not a science, and unless you have full control over the producer of the PDF you may find it hard to do in the general case. For example: I have seen PDFs where text is formed from individual characters, each inserted at a specific location, so hard to edit as words; and of course in some PDFs the 'text' is actually an image of text, requiring OCR before you can edit it.
You can try Gnostice PDFtoolkit.
DISCLAIMER: I work for Gnostice.
Take a look at Amyuni PDF Creator ActiveX. It is supported in 32-bit and 64-bit applications, so you may find it useful now that Delphi has a 64-bit compiler.
Usual disclaimer applies

Extracting ePub Excerpt

I've read about the ePub format, standard, structure, readers, tools and available developer techniques to manipulate/convert/create ePubs but there is no such thing as a magical function (so far) to extract a particular length of characters to create an excerpt of the book. And that's precisely what I'm looking for: A way to extract the first X words of an ePub.
The first approach I'm considering (not my favorite, btw) is creating a parser to read all the ePub metadata and then parse the XML files in the right order until I have enough words to create the excerpt of a given ePub. (I would appreciate some feedback in this direction.)
The second way (which I can't find so far) is an existing tool, function or parser (in any language) which returns (hopefully) the plain text of the ePub so I can collect the first X words in order to create my excerpt.
Do you know about any tool which can help me achieve the second option?
You should have a look at Apache Tika: http://tika.apache.org/
You can use it from command line, or as a java library or even in server mode to extract text from ePub.
Hope this will help,
F.
Jose,
I'm not aware of any tool to do what you want. Let me comment on your first approach, though. If you do find a tool I hope these comments allow you to evaluate it.
I think your approach is fine and, if you want to do a good job of creating an extract, you may want to own this step anyway. I would suggest that you:
Grab the OPF file and look for a GUIDE section. If a GUIDE section exists, check the types that are given. Some are probably not relevant for an excerpt (cover, title-page, copyright-page). Many books will not have the types explicitly stated, but this should help where they do.
Now go through the files in sequence in the SPINE section, excluding anything that is irrelevant, and read through enough XHTML files to get your excerpt.
While in the OPF file, grab a bunch of metadata if this is relevant for the excerpt (title, creator and date are mandatory, I think, and some authors will also put in a whole bunch of other metadata such as keywords).
If you are creating a mini-EPUB with this excerpt you will need to pick up any CSS, Audio, Video, Image and Custom Font files that get referenced in the XHTML files used to make your excerpt. You may even choose to use the original cover file for the cover file of your excerpt epub.
If you are working with fixed layout books with fun stuff like Read Aloud AND you want to create a mini-EPUB as an excerpt, you may be better off going with a page count rather than a word count. Don't forget to include any SMIL files in your excerpt and, to make it look nice: (i) don't split a two-page spread and (ii) make sure that the first page is an odd-numbered page if it was odd in the original, or even if it was even-numbered in the original. To do this you may need to add a blank filler page (get the odd/even wrong and subsequent two-page spreads won't be facing each other).
I hope that helps.
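The OPF-based steps above can be sketched roughly as follows, using only the Python standard library. This is an illustration under assumptions, not a finished tool: the function name epub_excerpt and the helper TextExtractor are hypothetical, the book is assumed to be an unencrypted EPUB 2 style package with a guide element, and the set of guide types to skip is taken from the list in the answer.

```python
import posixpath
import zipfile
import xml.etree.ElementTree as ET
from html.parser import HTMLParser
from urllib.parse import unquote

CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}
OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}
# Guide types assumed to be irrelevant for an excerpt (taken from the answer above).
SKIP_TYPES = {"cover", "title-page", "copyright-page"}
SKIP_TAGS = {"head", "script", "style"}
BLOCK_TAGS = {"p", "div", "br", "li", "tr", "td", "th",
              "h1", "h2", "h3", "h4", "h5", "h6"}


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring head, script and style content."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.parts.append(" ")  # keep words in adjacent blocks apart

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def epub_excerpt(path_or_file, max_words):
    """Return the first max_words words of an EPUB, following spine order."""
    with zipfile.ZipFile(path_or_file) as zf:
        # container.xml points at the OPF package file.
        container = ET.fromstring(zf.read("META-INF/container.xml"))
        opf_path = container.find(".//c:rootfile", CONTAINER_NS).get("full-path")
        opf_dir = posixpath.dirname(opf_path)
        opf = ET.fromstring(zf.read(opf_path))

        def resolve(href):
            # hrefs are relative to the OPF file and may carry a #fragment.
            return posixpath.normpath(
                posixpath.join(opf_dir, unquote(href.split("#")[0])))

        manifest = {item.get("id"): resolve(item.get("href"))
                    for item in opf.findall("opf:manifest/opf:item", OPF_NS)}
        skipped = {resolve(ref.get("href"))
                   for ref in opf.findall("opf:guide/opf:reference", OPF_NS)
                   if ref.get("type") in SKIP_TYPES}

        words = []
        for itemref in opf.findall("opf:spine/opf:itemref", OPF_NS):
            name = manifest.get(itemref.get("idref"))
            if name is None or name in skipped:
                continue
            extractor = TextExtractor()
            extractor.feed(zf.read(name).decode("utf-8", errors="replace"))
            extractor.close()
            words.extend("".join(extractor.parts).split())
            if len(words) >= max_words:
                break
        return " ".join(words[:max_words])
```

A call such as epub_excerpt("book.epub", 200), where "book.epub" is a placeholder path, would return the first 200 words as plain text. The sketch does not handle the mini-EPUB case (CSS, images, fonts, SMIL) or fixed layout books described above, and it assumes the content documents are UTF-8.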

How do I store and view graphically formatted data?

I have an app (written in D2010) which is similar to a text retrieval app... It has a list of questions, with their corresponding answers. Most answers are strictly text, but some answers have graphics and formatting. My dilemma has to do with the formatted answers. The user should be able to copy such an answer (formatting and graphics) in order to paste it into another app. I have tried using a Word OCX. This is a little problematic: the user has to have Word, it gives random errors when used inside a virtual machine, etc. I am now playing with using a built-in browser component and viewing the data as a PDF. This is nice and easy, but when I copy and paste it, I lose all formatting, and the graphic shows up as a large, totally black box.
I can store the data in whatever format I choose. It is stored as a BLOB in a DB file. I write it to a temp file and then I call some type of viewing routine, so I have flexibility there. My issue is really, what viewer mechanism is simple to implement, and allows copying/pasting, while maintaining text formatting (bullets, indents, etc) and graphics.
Thanks,
GS
TRichEdit (or any of the TRichEdit descendants or similar classes) will allow users to view text formatting and images, and when the content is copied, the RTF representation of the data is placed on the clipboard.
When the clipboard data is pasted into an RTF-compatible text editor (like WordPad or Word), all the formatting, bullets and images are preserved.

What's a solid, full-featured open rich text representation usable on the Web?

I'm looking for an internal representation format for text which would support basic formatting (font face, size, weight, indentation, basic tables) and also support the following features:
Bidirectional input (Hebrew, Arabic, etc.)
Multi-language input (i.e. UTF-8) in the same text field
Anchored footnotes (i.e. a superscript number that's a link to that numbered footnote)
I guess TEI or DocBook are rich enough, but here's the snag -- I want these text buffers to be Web-editable, so I need either an edit control that eats TEI or DocBook, or reliable and two-way conversion between one of them and whatever the edit control can eat.
UPDATE: The edit control I'm thinking of is something like TinyMCE, but AFAICT, TinyMCE lacks footnotes, and I'm not sure about its scalability (how about editing 1 or 2 megabytes of text?)
Any pointers much appreciated!
FCKeditor has a great API, supports several programming languages (considering it is JavaScript, this isn't hard to achieve), and can be loaded through HTML or instantiated in code. Most of all, it allows easy access to the underlying form field, so having a jQuery or Prototype Ajax buffer shouldn't be terribly difficult to achieve.
The load time is very quick compared to previous versions. I'd give it a whirl.
In my experience a two-way conversion between HTML and XML formats like TEI or DocBook is very hard to make 100% reliable.
You could use Xopus (demo) to have your users directly edit TEI or DocBook XML. Xopus is a commercial browser based XML editor designed specifically for non-technical users. It supports bidi and UTF-8. The WYSIWYG view is rendered using XSLT, so that gives you sufficient control to render footnotes the way you describe.
As TEI and DocBook don't have means to store styling information, those formats will not allow your users to change font face, size and weight. But I think that is a good thing: users should insert headers and emphasis, designers should pick font face and size.
Xopus has a powerful table editor and indentation is handled by nesting sections or lists and XSLT reacting to that.
Unfortunately Xopus 3 will only scale to about 200KB of XML, but we're working on that.
I can't really decide on one of them. IMHO they are all not very good and complete. They all have their advantages and clear disadvantages. If TinyMCE is your favorite then afaik, it also does tables.
This list will probably come in handy: WysiwygEditorComparision.
I've also used FCKEditor and it performed well and was easy to integrate into my project. It's worth checking out.
Small correction to laurens' answer above: as of now (May 2012), Xopus supports UTF-8, but not BiDi editing. Right-to-left text is displayed fine if it came from another source, but it cannot be edited correctly.
Source: I was recently asked to evaluate this, so have been testing it.
