I need suggestions for merging multiple elements and their sibling text nodes into a single element. See the xref elements in the sample below.
Input: <section><p>These pages are all about XSLT, an XML-based language <xref ref-type="bibr" rid="r1">1</xref><xref ref-type="bibr" rid="r2"/>--<xref ref-type="bibr" rid="r3">3</xref> for translating one set of XML into another set of XML, <xref ref-type="bibr" rid="r3">3</xref>, <xref ref-type="bibr" rid="r5">5</xref><xref ref-type="bibr" rid="r6"/>--<xref ref-type="bibr" rid="r7">7</xref> or into HTML. Of course, there are all sorts of other pages <xref ref-type="bibr" rid="r1">7</xref>, <xref ref-type="bibr" rid="r3">8</xref> around that cover XSLT. <xref ref-type="bibr" rid="r12">12</xref>, <xref ref-type="bibr" rid="r15">15</xref><xref ref-type="bibr" rid="r16"/><xref ref-type="bibr" rid="r17"/><xref ref-type="bibr" rid="r18"/><xref ref-type="bibr" rid="r19"/>--<xref ref-type="bibr" rid="r20">20</xref></p></section>
Output: <section><p>These pages are all about XSLT, an XML-based language <xref ref-type="bibr" rid="r1 r2 r3">1--3</xref> for translating one set of XML into another set of XML, <xref ref-type="bibr" rid="r3 r5 r6 r7">3, 5--7</xref> or into HTML. Of course, there are all sorts of other pages <xref ref-type="bibr" rid="r7 r8">7, 8</xref> around that cover XSLT. <xref ref-type="bibr" rid="r12 r15 r16 r17 r18 r19 r20">12, 15--20</xref></p></section>
The merge should happen only if one of the following appears between xref elements: a comma followed by a space (, ), a single space ( ), two hyphens (--), or an empty xref element (<xref ref-type="bibr" rid="r2"/>).
E.g.
Input content: <xref ref-type="bibr" rid="r3">3</xref>, <xref ref-type="bibr" rid="r5">5</xref><xref ref-type="bibr" rid="r6"/>--<xref ref-type="bibr" rid="r7">7</xref>
Expected output: <xref ref-type="bibr" rid="r3 r5 r6 r7">3, 5--7</xref>
Thanks and Regards
Bala
Using XSLT 2.0, you can find adjacent nodes using for-each-group select="node()" group-adjacent="boolean(self::xref | self::text()[matches(., $pattern)])", so you can use an approach like
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xs="http://www.w3.org/2001/XMLSchema" exclude-result-prefixes="xs" version="2.0">
  <!-- Identity template: copy everything that is not handled below. -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>
  <xsl:template match="*[xref]">
    <xsl:copy>
      <xsl:apply-templates select="@*"/>
      <!-- Group runs of xref elements and separator-only text nodes. -->
      <xsl:for-each-group select="node()"
        group-adjacent="boolean(self::xref | self::text()[matches(., '^[\s\p{P}]+$')])">
        <xsl:choose>
          <xsl:when test="current-grouping-key()">
            <xsl:copy>
              <xsl:copy-of select="@* except @rid"/>
              <xsl:attribute name="rid" select="current-group()/@rid"/>
              <!-- separator="" avoids the default single space between items. -->
              <xsl:value-of select="current-group()" separator=""/>
            </xsl:copy>
          </xsl:when>
          <xsl:otherwise>
            <xsl:apply-templates select="current-group()"/>
          </xsl:otherwise>
        </xsl:choose>
      </xsl:for-each-group>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
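For readers who want to see the grouping logic outside XSLT, here is a minimal Python sketch of the same idea. It is not the stylesheet above: the flat tuple model of the child nodes, the names mergeable and merge_adjacent, and the simplified separator pattern (whitespace, comma, hyphen instead of \p{P}) are all assumptions made for illustration.

```python
import re
from itertools import groupby

# Hypothetical flat model of the children of <p>:
#   ("xref", rid, text)  or  ("text", value)
# Simplified stand-in for the XSLT pattern '^[\s\p{P}]+$' (assumption).
SEPARATOR = re.compile(r"^[\s,\-]+$")

def mergeable(node):
    """True for xref nodes and for text nodes made only of separators."""
    if node[0] == "xref":
        return True
    return bool(SEPARATOR.match(node[1]))

def merge_adjacent(nodes):
    """Collapse each run of mergeable nodes into one xref, like group-adjacent."""
    result = []
    for key, run in groupby(nodes, key=mergeable):
        run = list(run)
        # Leave non-mergeable runs, and runs without any xref, unchanged.
        if not key or not any(n[0] == "xref" for n in run):
            result.extend(run)
            continue
        rids = " ".join(n[1] for n in run if n[0] == "xref")
        text = "".join(n[2] if n[0] == "xref" else n[1] for n in run)
        result.append(("xref", rids, text))
    return result

nodes = [
    ("text", "an XML-based language "),
    ("xref", "r1", "1"),
    ("xref", "r2", ""),
    ("text", "--"),
    ("xref", "r3", "3"),
    ("text", " for translating"),
]
merged = merge_adjacent(nodes)
print(merged)
```

The boolean grouping key plays the role of current-grouping-key(): consecutive nodes with the same key land in the same run, and only the runs keyed True are collapsed.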
Let's say I have the following EBNF defined for a simpler two-term adder:
<expression> ::= <number> <plus> <number>
<number> ::= [0-9]+
<plus> ::= "+"
Shown here.
What would be the proper way to allow any amount of whitespace except a newline/return between the terms? For example to allow:
1 + 2
1 <tab> + 2
1 + 2
etc.
For example, doing something like the following fails:
<whitespace>::= " " | \t
Furthermore, it seems (almost) every term would be preceded and followed by an optional space. Something like:
<plus> ::= <whitespace>? "+" <whitespace>?
How would that be properly addressed?
The XML standard, as an example, uses the following production for whitespace:
S ::= (#x20 | #x9 | #xD | #xA)+
You could omit CR (#xD) and LF (#xA) if you don't want those.
Regarding your observation that grammars could become overwhelmed by whitespace non-terminals, note that whitespace handling can be done in lexical analysis rather than in parsing. See EBNF Grammar for list of words separated by a space.
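As a rough illustration of handling whitespace in the lexer rather than in the grammar, here is a small Python sketch for the two-term adder. The names TOKEN, tokenize and parse_expression are made up for this example, and the choice to reject newlines while skipping spaces and tabs follows the question's requirement.

```python
import re

# Spaces and tabs before a token are matched and discarded; newlines are not
# part of the skipped set, so they cause a syntax error.
TOKEN = re.compile(r"[ \t]*(?:(?P<number>[0-9]+)|(?P<plus>\+))")

def tokenize(text):
    """Yield (kind, value) pairs, skipping spaces and tabs between tokens."""
    pos = 0
    end = len(text.rstrip(" \t"))  # ignore trailing spaces and tabs
    while pos < end:
        m = TOKEN.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected input at position {pos}")
        yield m.lastgroup, m.group(m.lastgroup)
        pos = m.end()

def parse_expression(text):
    """<expression> ::= <number> <plus> <number>, with no whitespace rules."""
    tokens = list(tokenize(text))
    kinds = [kind for kind, _ in tokens]
    if kinds != ["number", "plus", "number"]:
        raise SyntaxError(f"expected number + number, got {kinds}")
    return int(tokens[0][1]) + int(tokens[2][1])

print(parse_expression("1 \t+ 2"))
```

With this split, the grammar productions stay exactly as written in the question; only the tokenizer knows that spaces and tabs exist.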
I used this tool to generate the SLR(1) parsing table for this LL(1)/LR(1) grammar (which generates a small subset of XML):
document ::= element EOF
element ::= < elementPrefix
elementPrefix ::= NAME attribute elementSuffix
attribute ::= NAME = STRING attribute
attribute ::= EPSILON
elementSuffix ::= > elementOrData endTag
elementSuffix ::= />
elementOrData ::= < elementPrefix elementOrData
elementOrData ::= DATA elementOrData
elementOrData ::= EPSILON
endTag ::= </ NAME >
The tool correctly generates the table and associated automaton, which suggests that the grammar is SLR(1). Is that really the case? I understand that every LR(0) grammar is also SLR(1), but I was not sure how that relates to LL(1)/LR(1) grammars.
LL(1) and SLR(1) are both subsets of LR(1). They don't have a simple relationship to each other.
Considering the following grammar for propositional logic:
<A> ::= <B> <-> <A> | <B>
<B> ::= <C> -> <B> | <C>
<C> ::= <D> \/ <C> | <D>
<D> ::= <E> /\ <D> | <E>
<E> ::= <F> | -<F>
<F> ::= <G> | <H>
<G> ::= (<A>)
<H> ::= p | q | r | ... | z
Precedence for connectives is: -, /\, \/, ->, <->.
Associativity is also considered; for example, p\/q\/r should be the same as p\/(q\/r). The same goes for the other connectives.
I intend to write a predictive top-down parser in Java. I don't see any ambiguity or direct left recursion here, but I'm not sure whether that's all I need to consider this an LL(1) grammar. Maybe indirect left recursion?
If this is not an LL(1) grammar, what steps would be required to transform it for my purposes?
It's not LL(1). Here's why:
The first rule of an LL(1) grammar is:
A grammar G is LL(1) if and only if whenever A --> C | D are two distinct productions of G, the following conditions hold:
For no terminal a do both C and D derive strings beginning with a.
This rule exists so that there are no conflicts while parsing. When the parser encounters a (, it won't know which production to use.
Your grammar violates this first rule. In the productions for A, B, C and D, both alternatives on the right-hand side (the Cs and Ds of the rule above) start with the same non-terminal and eventually reduce to G and H, so both derive at least one string beginning with (.
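The conflict can be checked mechanically. Below is a minimal Python sketch that computes FIRST sets for the grammar from the question and reports which non-terminals have alternatives with overlapping FIRST sets. The dictionary encoding and the function names are assumptions for illustration; only the explicitly listed terminals p, q, r and z are included, and the shortcut of looking only at the first symbol of each alternative is valid here because the grammar has no empty productions.

```python
# Each non-terminal maps to its alternatives; each alternative is a symbol list.
GRAMMAR = {
    "A": [["B", "<->", "A"], ["B"]],
    "B": [["C", "->", "B"], ["C"]],
    "C": [["D", "\\/", "C"], ["D"]],
    "D": [["E", "/\\", "D"], ["E"]],
    "E": [["F"], ["-", "F"]],
    "F": [["G"], ["H"]],
    "G": [["(", "A", ")"]],
    "H": [["p"], ["q"], ["r"], ["z"]],  # elided letters are left out
}

def first(symbol, seen=frozenset()):
    """FIRST set of a symbol, assuming no production derives the empty string."""
    if symbol not in GRAMMAR:
        return {symbol}  # terminal
    if symbol in seen:
        return set()  # guard against left recursion
    result = set()
    for alternative in GRAMMAR[symbol]:
        result |= first(alternative[0], seen | {symbol})
    return result

def conflicts(nonterminal):
    """Pairs of alternatives whose FIRST sets share at least one terminal."""
    alternatives = GRAMMAR[nonterminal]
    found = []
    for i in range(len(alternatives)):
        for j in range(i + 1, len(alternatives)):
            common = first(alternatives[i][0]) & first(alternatives[j][0])
            if common:
                found.append((alternatives[i], alternatives[j], common))
    return found

conflicting = [nt for nt in GRAMMAR if conflicts(nt)]
print(conflicting)
```

In this sketch the rules for E, F, G and H come out conflict-free, which suggests the problem is confined to the four binary-connective rules.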
This XPath expression:
for $n in 1 to 5 return $n
Returns
1 2 3 4 5
Is it possible to do something similar with alphabetic characters?
Yep:
for $n in 65 to 69 return fn:codepoints-to-string($n)
returns:
A
B
C
D
E
In ascii/iso-8859-1 at least.
for $n in fn:string-to-codepoints('A') to fn:string-to-codepoints('E')
return fn:codepoints-to-string($n)
should work in any locale.
Or, in XPath 3.0 (XSLT 3.0):
((32 to 127) ! codepoints-to-string(.))[matches(., '[A-Z]')]
Here we don't know whether or not the wanted characters have adjacent character codes (and in many real cases they wouldn't).
A complete XSLT 3.0 transformation using this XPath 3.0 expression:
<xsl:stylesheet version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:template match="/">
<xsl:sequence select=
"((32 to 127) ! codepoints-to-string(.))[matches(., '[A-Z]')]
"/>
</xsl:template>
</xsl:stylesheet>
When this transformation is applied (I am using Saxon-EE 9.4.0.6J) on any XML document (not used), the wanted, correct result is produced:
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
In case we know the wanted result characters have all-adjacent character codes, then:
(string-to-codepoints('A') to string-to-codepoints('Z')) ! codepoints-to-string(.)
Explanation:
Use of the new XPath 3.0 simple map operator !.
I have a large XML file like the one below:
:
:
<CN>222</CN>
<CT>Raam</CT>
:
:
I would like to merge these two elements into
<CN>222 Raam</CN>
and then convert that to
<div>222 Raam</div>
which is the final output.
Well, if all you need is to merge the two consecutive elements into a div (I don't understand what the intermediate CN is for), then use
<!-- Identity template: copy everything that is not handled below. -->
<xsl:template match="@* | node()">
  <xsl:copy>
    <xsl:apply-templates select="@* | node()"/>
  </xsl:copy>
</xsl:template>
<!-- A CN immediately followed by a CT becomes a div holding both values. -->
<xsl:template match="CN[following-sibling::*[1][self::CT]]">
  <div>
    <xsl:value-of select="concat(., ' ', following-sibling::*[1][self::CT])"/>
  </div>
</xsl:template>
<!-- Suppress the CT that was already merged into the preceding CN. -->
<xsl:template match="CT[preceding-sibling::*[1][self::CN]]"/>