JAXB: How to keep consecutive spaces as they are in source XML during unmarshalling

I am using JAXB to unmarshal an XML message. It seems to replace multiple consecutive spaces with a single space.
<testfield>this is a    test</testfield>
(several spaces between "a" and "test")
Upon unmarshalling, the above becomes:
this is a test
How do I keep consecutive spaces as they are in source XML?

From the MSDN page:
Document authors can use the xml:space attribute to identify portions
of documents where white space is considered important. Style sheets
can also use the xml:space attribute as a hook to preserve white space
in presentation. However, because many XML applications do not
understand the xml:space attribute, its use is considered advisory.
You can try adding xml:space="preserve" so that the spaces are not replaced:
<poem xml:space="default">
  <author xml:space="default">
    <givenName xml:space="default">Alix</givenName>
    <familyName xml:space="default">Krakowski</familyName>
  </author>
  <verse xml:space="preserve">
    <line xml:space="preserve">Roses are red,</line>
    <line xml:space="preserve">Violets are blue.</line>
    <signature xml:space="default">-Alix</signature>
  </verse>
</poem>
http://msdn.microsoft.com/en-us/library/ms256097%28v=vs.110%29.aspx
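Whether those spaces survive JAXB itself depends mostly on how the field is bound: a plain String binding hands you the text content verbatim, while schema types like xs:token collapse whitespace. A minimal round-trip sketch, assuming the classic javax.xml.bind API (the TestField class here is illustrative, not from the question):

import java.io.StringReader;

import javax.xml.bind.JAXBContext;
import javax.xml.bind.Unmarshaller;
import javax.xml.bind.annotation.XmlRootElement;
import javax.xml.bind.annotation.XmlValue;

public class PreserveSpacesDemo {

    // illustrative binding for the <testfield> element from the question
    @XmlRootElement(name = "testfield")
    public static class TestField {
        @XmlValue
        public String value; // a plain String binding keeps text content verbatim
    }

    public static void main(String[] args) throws Exception {
        // xml:space="preserve" is advisory; in JAXB the binding type decides
        String xml = "<testfield xml:space=\"preserve\">this is a    test</testfield>";

        Unmarshaller u = JAXBContext.newInstance(TestField.class).createUnmarshaller();
        TestField tf = (TestField) u.unmarshal(new StringReader(xml));

        System.out.println("[" + tf.value + "]"); // prints [this is a    test]
    }
}

If your classes are generated from a schema, check the element's simple type: xs:token and types derived from it carry a whitespace facet that collapses runs of spaces, which matches the symptom in the question.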

Related

Parse PDF file and output single character locations

I'm trying to extract text information from a (digital) PDF by identifying the content and location of each character and each word. For words, pdftotext --bbox from xpdf/poppler works quite well, but I cannot find an easy way to extract character locations.
What I've tried
The solution I currently have is to convert the PDF to SVG (via pdf2svg), and then parse the resulting SVG to extract single character (= glyph) locations. In a third step, the resulting boxes are compared: each character is assigned to a word, and hopefully the numbers match.
Problems
While the above works for most "basic" fonts, there are two (main) situations where this approach fails:
In script fonts (or some extreme italic fonts), bounding boxes are way larger than their content; as a result, words overlap significantly, and it can well happen that a character is entirely contained in two words. In this case, the mapping fails, because once I translate to svg I have no information on what character is contained in which glyph.
In many fonts multiple characters can be ligated, giving rise to a single glyph. In this case, the count of character boxes does not match the number of characters in the word, and matching each letter to a box is again problematic.
The second point (which is the main one for me) has a partial workaround by identifying the common ligatures and (if the counts don't match) splitting the corresponding bounding boxes into multiple pieces; but that cannot always work, because for example "ffi" is sometimes ligated to a single glyph, sometimes in two glyphs "ff" + "i", and sometimes in two glyphs "f" + "fi", depending on the font.
What I would hope
It is my understanding that PDFs actually contain glyph information, not words. If so, all the programs that extract text from PDFs (like pdftotext) must first extract and locate the individual characters, and only then group them into words/lines; so I am a bit surprised that I could not find an option to output the location of each single character. Converting to SVG essentially gives me that, but in that conversion all information about the content (i.e. the glyph-to-character mapping, or glyph-to-characters, if there was a ligature) is lost, because there is no font anymore. And redoing the effort of matching each glyph to a character by looking at the font again feels like rewriting a PDF parser...
I would therefore be very grateful for any idea of how to solve this. The top answer here suggests that this might be doable with TET, but it's a paid option, and replacing my whole infrastructure to handle just one edge case seems like overkill...
A PDF file doesn't necessarily specify the position of each character explicitly. Typically, it breaks a text into runs of characters (all using the same font, anything up to a line, I think) and then for each run, specifies the position of the bounding box that should contain the glyphs for those characters. So the exact position of each glyph will depend on metrics (mostly glyph-widths) of the font used to render it.
The Python package pdfminer has a script pdf2txt.py. Try invoking it with -t xml. The docs just say "XML format. Provides the most information.", but my notes indicate that it will apply the font metrics and give you a <text> element for every single glyph, with font and bounding-box info.
There are various versions in various places (e.g. PyPI and github). If you need Python 3 support, look for pdfminer.six.
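If you would rather stay on the JVM, Apache PDFBox exposes the same per-glyph information through PDFTextStripper. A sketch, assuming PDFBox 2.x ("input.pdf" is a placeholder path): it prints one line per glyph, and getUnicode() returning several characters for one glyph is exactly the ligature case from the question.

import java.io.File;
import java.io.IOException;
import java.util.List;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;

public class CharLocations {
    public static void main(String[] args) throws IOException {
        try (PDDocument doc = PDDocument.load(new File("input.pdf"))) {
            PDFTextStripper stripper = new PDFTextStripper() {
                @Override
                protected void writeString(String text, List<TextPosition> positions) {
                    // one TextPosition per glyph; getUnicode() may hold several
                    // characters when a ligature was rendered as a single glyph
                    for (TextPosition p : positions) {
                        System.out.printf("%s x=%.2f y=%.2f w=%.2f h=%.2f%n",
                                p.getUnicode(), p.getXDirAdj(), p.getYDirAdj(),
                                p.getWidthDirAdj(), p.getHeightDir());
                    }
                }
            };
            stripper.getText(doc); // walking the document fires writeString per text run
        }
    }
}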

JAXB unmarshalling empty tags [duplicate]

I changed the datatype of some parameters within my XSD file from string to their real type, in this case double. Now I'm facing the problem that around here the comma is used as the decimal separator, not the point defined by the W3C (http://www.w3.org/TR/xmlschema-2/#double), causing errors all over during deserialization (C#, VS2008).
My question: can I use the W3C schema for validation but with a different separator for double values?
Thanks for your help.
You cannot do that if you want to continue to use the XML Schema simple types. decimal and the types derived from it are locked down to using a period. As you say, the spec is here.
If you want to use a comma as a separator, you need to define your own simple type, for example:
<xs:simpleType name="MyDecimal">
  <xs:restriction base="xs:string">
    <xs:pattern value="\d+(,\d+)?"/>
  </xs:restriction>
</xs:simpleType>
Before you do that, though, be careful: XML is a data storage format, not a presentation format. You might want to think about whether to sort this out after loading, during an XSLT transformation, etc.
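The question is C#, but to make the "sort it out after loading" option concrete in JAXB terms: a custom XmlAdapter can keep the comma form on the wire (matching the MyDecimal pattern above) and convert at the binding boundary. A sketch; the class name is illustrative:

import java.math.BigDecimal;

import javax.xml.bind.annotation.adapters.XmlAdapter;

// Converts between the comma-separated wire form ("3,14") that the
// MyDecimal pattern validates and a numeric type on the Java side.
public class CommaDecimalAdapter extends XmlAdapter<String, BigDecimal> {

    @Override
    public BigDecimal unmarshal(String v) {
        // swap the comma for the period the numeric parser expects
        return new BigDecimal(v.replace(',', '.'));
    }

    @Override
    public String marshal(BigDecimal v) {
        return v.toPlainString().replace('.', ',');
    }
}

A field declared with the MyDecimal type can then be bound through @XmlJavaTypeAdapter(CommaDecimalAdapter.class).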

Can xsd be ambiguous?

I always considered XSD a way to specify an XML file's grammar. Now I stumbled upon something like this in a real-world XSD spec:
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:xlink="http://www.w3.org/1999/xlink"
           elementFormDefault="qualified"
           attributeFormDefault="unqualified">
  <xs:element name="parser-killer">
    <xs:complexType>
      <xs:sequence minOccurs="0" maxOccurs="unbounded">
        <xs:element name="element" type="xs:string" minOccurs="0"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
The problem here is that one cannot decide, for an empty input such as
<parser-killer>
</parser-killer>
whether the XML contains an empty sequence or a sequence of empty contents.
This may not be a problem for the human eye, but if one tries to generate a parser from this file, it may end up looping forever (collecting infinitely many empty elements).
Is that simply abuse of XSD or is it required to "sanitize" any given XSD before code generation?
XML Schema validators are in no way required to "generate a parser". It's allowed, of course, but it's an implementation detail. In fact, many don't: they build up an internal representation of the schema and interpret it to validate the document against the schema.
To find out whether your example input document with the empty <parser-killer> element is valid, a validator does not need to know whether this element should be interpreted as:
zero or any number of sequences of zero children named <element>,
an infinite number of sequences of zero children named <element>, or
zero sequences of a hundred children named <element>.
It's not relevant: the validator just needs to know whether the content of <parser-killer> matches at least one path through the model group.
There's an easy rule-of-thumb that you could apply in a validator: if there are no element children and there is a sequence (or choice, or all) with minOccurs="0", then this is assumed to be the reason why the content of the parent is valid. That will stop the validator from looping infinitely in this situation.
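To make that concrete: the check the validator needs is the classic "nullable" computation on the content model, which is structural and always terminates, no matter how many occurrences the model permits. A sketch in Java; the model classes are illustrative, not any real schema API:

import java.util.ArrayList;
import java.util.List;

// Decides whether a content model can match empty content ("is nullable").
abstract class Particle {
    int minOccurs = 1;

    // Can a single occurrence of this particle match empty input?
    abstract boolean nullableOnce();

    // A particle matches empty input if it may occur zero times,
    // or if one occurrence can itself match empty input.
    boolean nullable() { return minOccurs == 0 || nullableOnce(); }
}

class ElementDecl extends Particle {
    @Override boolean nullableOnce() { return false; } // an element consumes input
}

class Sequence extends Particle {
    final List<Particle> children = new ArrayList<>();
    @Override boolean nullableOnce() {
        return children.stream().allMatch(Particle::nullable);
    }
}

class NullableDemo {
    public static void main(String[] args) {
        // The <parser-killer> content model: an unbounded optional sequence
        // holding an optional <element>.
        ElementDecl element = new ElementDecl();
        element.minOccurs = 0;
        Sequence model = new Sequence();
        model.minOccurs = 0;
        model.children.add(element);

        // One structural check; no enumeration of occurrences, no looping.
        System.out.println(model.nullable()); // true
    }
}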
Unique Particle Attribution
To be clear, the reason there is no problem is that there is no ambiguity about how the schema validated an element, because there is no element. It's okay to have an ambiguity like this between a model group and a particle declaration; it's not okay to have an ambiguity between particle declarations: it must be clear by which unique path through the model group the validator arrived at a particle.
But that is not the problem in the situation that you describe.

Parsing documents including angle brackets using TouchXML on iOS

I'm trying to parse an XML document using TouchXML for iOS. Normally this works great, but the current document I'm trying to parse contains angle brackets within the actual data. For example:
<reference>
  <title>Title < 5</title>
</reference>
This fails due to an "invalid startTag" error. Is there anything I can do in TouchXML to get around this, or do I need to fix this in the source material?
Not an ideal solution, but I ended up basically pre-processing the XML document before passing it to TouchXML. I used regular expressions to search for multiple angle brackets in a row (e.g. <<, or <...<, or <...<...<), and replaced the additional ones with &lt; or &gt;. I then replaced these entities with the original angle brackets when parsing the individual nodes for data.
There may be some way to tell TouchXML to ignore errors, but I couldn't find it.
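For illustration, a pre-processing pass along those lines might escape any "<" that cannot start a tag. A heuristic sketch in Java: it only handles stray "<", and it would misread data that happens to look like a tag, so escaping at the source is still the real fix.

public class AngleBracketFixer {

    // Escape a "<" not followed by a plausible tag start (letter, "/", "!" or "?").
    public static String escapeStrayBrackets(String xml) {
        return xml.replaceAll("<(?![a-zA-Z/!?])", "&lt;");
    }

    public static void main(String[] args) {
        String broken = "<reference><title>Title < 5</title></reference>";
        System.out.println(escapeStrayBrackets(broken));
        // prints: <reference><title>Title &lt; 5</title></reference>
    }
}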

Sanitize pasted text from MS-Word

Here's my wild and wacky pseudo-code. Anyone know how to make this real?
Background:
This dynamic content comes from a CKEditor, and a lot of folks paste Microsoft Word content into it. No worries: if I just render the attribute untouched, it loads pretty. But the catch is that I want it abbreviated to just 125 characters. When I add truncation, all of the Microsoft Word junk starts popping up. Then I added simple_format, and sanitize, and truncate, and even made my controller gsub out specific strings that MS Word produces. But there are too many of them, and it seems like an awfully messy way to accomplish this. Realizing that the text by itself is clean, I thought: why not just slice it? However, the Microsoft Word junk becomes blank but still holds its numbered position in the string. So I came up with this (probably awful) solution below.
It's in three steps.
1. When the text is parsed, it doesn't display any of the MS Word junk, but that junk still occupies positions in the string. So I want to use a regexp to find the first actual character.
2. Take that character and find out its numbered position in the total string.
3. Use a slice statement to cut from that position.
def about_us_truncated
  # position of the first real (alphabetic) character, 0 if none found
  y = about_us.index(/[a-zA-Z]/) || 0
  # take 125 characters starting there (a range like [y..125] would treat
  # 125 as an end index, not a length)
  about_us[y, 125]
end
The only other idea I've got is a regex statement that explicitly slices only actual characters, something like:
about_us([a-zA-Z][0..125]), but that is definitely not how it is written.
Here is some sample text of MS Word junk :
&Lt;! [If Gte Mso 9]>&Lt;Xml>&Lt;Br /> &Lt;O:Office Document Settings>&Lt;Br /> &Lt;O:Allow Png/>&Lt;Br /> &Lt;/O:Off...
You haven't provided much information to go off of, but don't be too leery of trying to build this regex on your own before you seek help...
Take your sample text and paste it in Rubular in the test string area and start building your regex. It has a great quick reference at the bottom.
Stumbled across this: http://gist.github.com/139987
It looks like it requires the sanitize gem.
This is technically not a straight answer, but it seems like the best option available.
To prevent MS Word junk, you should use CKEditor's built-in MS Word sanitizer, because writing a regex for it can get very complicated, and you can easily break tags in half and destroy your site with it.
What I did as a workaround is force paste-as-plain-text in CKEditor.
