Creating Generic XSLT to parse Text file to XML - xslt-2.0

I have been assigned a task to use XSLT to convert a large number of text files into different XML formats. As I am extremely new to this, I took a shot at it myself. Here is what I found: 1) XSLT has supported this since v2.0, and 2) unparsed-text() is the way to go.
I finally got it working for one file, but the method I used requires one XSLT file per text file (1 to 1) because of the hardcoding needed for instructions like xsl:analyze-string. I am now trying to find a way to parse all of my files with a single generic XSLT. Please note that the text files might contain different patterns, but if I can find a way to generically parse more than one file (with similar patterns) I will be happy.
Here are 2 sample files that I have:
***********
* Sample1
***********
SET: <block>
NAME: Name1 <string> /* some words/words */
!---end--- </block>
SET: <block>
NAME: Name2 <string>
NESTEDSET: <block>
VALUE1: FIRST <string>
---end---
NESTEDSET: <block>
VALUE1: SECOND <string>
VALUE2: ANYVALUE <string>
---end---
!---end--- </block>
Below is the second sample file:
**********
* Sample2
**********
NEW_SET: <block>
NAME: Set1 <string>
* Col1 Col2 Col3
ENTRY: 1 Win 0.2 <integer,string,floating>
ENTRY: 2 Win 0.3 <integer,string,floating>
ENTRY: 3 Lost 0.4 <integer,string,floating>
!--- end of block --- </block>
Here is the XSLT that I created for Sample1:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
exclude-result-prefixes="xs"
version="2.0">
<xsl:output indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:variable name="sourcefile"
select="unparsed-text('file:///C:/Users/aUser/Desktop/Sample1.txt')"/>
<xsl:template name="text2xml">
<xsl:analyze-string select="translate(normalize-space($sourcefile), ' ',',')"
regex="SET:.*?!---end---" flags="s">
<xsl:matching-substring>
<SET>
<xsl:analyze-string select="." regex="(NAME):,([^,]*)">
<xsl:matching-substring>
<xsl:element name="{(regex-group(1))}">
<xsl:value-of select="(regex-group(2))"/>
</xsl:element>
</xsl:matching-substring>
</xsl:analyze-string>
<xsl:analyze-string select="."
regex="NESTEDSET.*?,---end---" flags="s">
<xsl:matching-substring>
<NESTEDSET>
<xsl:analyze-string select="."
regex="(VALUE1|VALUE2):,([^,]*)">
<xsl:matching-substring>
<xsl:element name="{(regex-group(1))}">
<xsl:value-of select="(regex-group(2))"/>
</xsl:element>
</xsl:matching-substring>
</xsl:analyze-string>
</NESTEDSET>
</xsl:matching-substring>
</xsl:analyze-string>
</SET>
</xsl:matching-substring>
</xsl:analyze-string>
</xsl:template>
<xsl:template match="/">
<xsl:call-template name="text2xml"/>
</xsl:template>
</xsl:stylesheet>
The XML output of this XSLT is:
<?xml version="1.0" encoding="UTF-8"?>
<SET>
<NAME>Name1</NAME>
</SET>
<SET>
<NAME>Name2</NAME>
<NESTEDSET>
<VALUE1>FIRST</VALUE1>
</NESTEDSET>
<NESTEDSET>
<VALUE1>SECOND</VALUE1>
<VALUE2>ANYVALUE</VALUE2>
</NESTEDSET>
</SET>
Is there a better way to do this, one that avoids hardcoding and yields something generic that works for other files containing the same pattern?
As I understand it, I am using unparsed-text() to read the text file, normalize-space() to force it onto a single line, and then hardcoded strings (in the regex) to tell it where to start and stop looking. I would like to find a better way.
Thank you all for any suggestions and feedback.
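One possible direction, as a minimal sketch rather than a complete solution: split the text into lines with tokenize() and let a single regex pick up every "KEY: value <type>" line, taking the element name from the matched key instead of hardcoding it. The parameter name input-uri, its default value and the wrapper element fields are hypothetical, and the sketch assumes that keys are valid XML element names. It only handles leaf lines; rebuilding the block nesting (SET, NESTEDSET, NEW_SET) would need additional logic.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:xs="http://www.w3.org/2001/XMLSchema"
                exclude-result-prefixes="xs"
                version="2.0">
  <xsl:output indent="yes"/>

  <!-- Hypothetical parameter: URI of the text file to parse,
       resolved relative to the stylesheet when it is relative -->
  <xsl:param name="input-uri" as="xs:string" select="'Sample1.txt'"/>

  <xsl:template name="text2xml">
    <!-- Hypothetical wrapper element so the result has a single root -->
    <fields>
      <!-- Process the file line by line instead of as one long string -->
      <xsl:for-each select="tokenize(unparsed-text($input-uri), '\r?\n')">
        <!-- group 1: key before the colon; group 2: value before the <type> marker.
             Lines such as "SET: <block>" or "---end---" do not match and are skipped. -->
        <xsl:analyze-string select="."
            regex="^\s*([A-Za-z_][A-Za-z0-9_]*):\s+(\S.*?)\s*&lt;[a-z,]+&gt;">
          <xsl:matching-substring>
            <xsl:element name="{regex-group(1)}">
              <xsl:value-of select="regex-group(2)"/>
            </xsl:element>
          </xsl:matching-substring>
        </xsl:analyze-string>
      </xsl:for-each>
    </fields>
  </xsl:template>

  <xsl:template match="/">
    <xsl:call-template name="text2xml"/>
  </xsl:template>
</xsl:stylesheet>
```

Because the file name is a stylesheet parameter, the same stylesheet can be run against Sample1.txt and Sample2.txt without editing it; for Sample2 the ENTRY lines would come out as ENTRY elements holding the whole row as text.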

Related

Looking for saxon:evaluate() example code

I have a transform.xsl file which will process an input.xml. There is also an additional config.xml file which defines additional clauses. For example, this is the content of config.xml:
<Location >
<DisplayName>
<Attribute1>ABC</Attribute1>
<Attribute2>XYZ</Attribute2>
<action>concat($Attribute1,$Attribute2)</action>
</DisplayName>
</Location >
So when transform.xsl encounters the DisplayName variable within input.xml, it will form the value from the RESULT of the action expression defined in the config.xml file. transform.xsl will consult config.xml just to get the result. (The action can be modified by the end user, which is why it is placed outside the XSL file, in config.xml.)
We are using the Saxon XML processor version 9 and XSLT 2.0, so we need to use saxon:evaluate(). I tried to find more examples of saxon:evaluate() but couldn't find many. Can anyone show me some examples of how to use it?
Thanks in advance.
***** This is an edited query to highlight the need of saxon:evaluate *****
Here is an example to use an XSLT 3 processor supporting xsl:evaluate (https://www.w3.org/TR/xslt-30/#dynamic-xpath) (i.e. Saxon 9.8 or later with the commercial PE or EE editions or Altova 2017 or later) to process your "config" file:
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:map="http://www.w3.org/2005/xpath-functions/map"
xmlns:mf="http://example.com/mf"
exclude-result-prefixes="#all"
version="3.0">
<xsl:param name="config-url" as="xs:string">test2018121301.xml</xsl:param>
<xsl:param name="config-doc" select="doc($config-url)"/>
<xsl:mode on-no-match="shallow-copy"/>
<xsl:key name="element" match="*" use="node-name()"/>
<xsl:function name="mf:config-evaluation" as="item()*">
<xsl:param name="config-doc" as="document-node()"/>
<xsl:param name="element-name" as="xs:QName"/>
<xsl:variable name="display" select="key('element', $element-name, $config-doc)/DisplayName"/>
<xsl:evaluate xpath="$display/regex" with-params="map:merge($display!(* except regex)!map { QName('', local-name()) : string() })"/>
</xsl:function>
<xsl:template match="*[key('element', node-name(), $config-doc)]">
<xsl:copy>
<xsl:value-of select="mf:config-evaluation($config-doc, node-name()), ."/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
So with a config.xml
<Location >
<DisplayName>
<Attribute1>ABC</Attribute1>
<Attribute2>XYZ</Attribute2>
<regex>concat($Attribute1,$Attribute2)</regex>
</DisplayName>
</Location >
this would transform an input sample with e.g.
<Root>
<Items>
<Item>
<Data>data 1</Data>
<Location>location 1</Location>
</Item>
<Item>
<Data>data 2</Data>
<Location>location 2</Location>
</Item>
</Items>
</Root>
into
<Root>
<Items>
<Item>
<Data>data 1</Data>
<Location>ABCXYZ location 1</Location>
</Item>
<Item>
<Data>data 2</Data>
<Location>ABCXYZ location 2</Location>
</Item>
</Items>
</Root>
That gives you great flexibility to allow XPath expressions in the configuration files but, as pointed out in https://www.w3.org/TR/xslt-30/#evaluate-effect, it is also a security problem: "Stylesheet authors need to be aware of the security risks associated with the use of xsl:evaluate. The instruction should not be used to execute code from an untrusted source.".
As for using the saxon:evaluate function supported in older versions of Saxon not supporting the XSLT 3 xsl:evaluate instruction, a simple example is
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:saxon="http://saxon.sf.net/"
exclude-result-prefixes="#all"
version="2.0">
<xsl:template match="@* | node()">
<xsl:copy>
<xsl:apply-templates select="@* | node()"/>
</xsl:copy>
</xsl:template>
<xsl:template match="example">
<xsl:copy>
<xsl:value-of select="saxon:evaluate(@expression, @foo, @bar)"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
which transforms the input
<root>
<example expression="concat($p1, $p2)" foo="This is " bar="an example."/>
<example expression="replace(., $p1, $p2)" foo="\p{L}" bar="X">This is example 2.</example>
</root>
into the result
<root>
<example>This is an example.</example>
<example>XXXX XX XXXXXXX 2.</example>
</root>
Try checking the xsl:attribute element along with xsl:value-of. If I understand what you're asking for, you could probably read config.xml from transform.xsl (or from a second XSL that produces an intermediate file) to set the text inside the regex tag to correspond to the value of a tag attribute within the XSL.
https://www.w3schools.com/xml/ref_xsl_el_attribute.asp
Also, check this tutorial for regex in XSLT 2, it may help:
https://www.xml.com/pub/a/2003/06/04/tr.html

Changing values that are the same from different nodes

I need to locate values within siblings that are the same. If they are the same I need to alter them.
I think I need to use following-sibling and preceding-sibling and group-by in some way: first group by the value I am looking for, so that equal values end up next to each other, then use the sibling axes to find out whether they are equal.
Sample:
<programs>
<event>
<start>2018-11-25T13:55:00</start>
</event>
<event>
<start>2018-11-27T17:00:00</start>
</event>
<event>
<start>2018-11-25T13:55:00</start>
</event>
<event>
<start>2018-11-25T13:55:00</start>
</event>
</programs>
Code:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
exclude-result-prefixes="xs"
version="2.0">
<xsl:template match="/">
<output>
<xsl:for-each select="/programs/event">
<xsl:variable name="starttime" select="./start"/>
<startOfProgram><xsl:value-of select="$starttime"/></startOfProgram>
</xsl:for-each>
</output>
</xsl:template>
</xsl:stylesheet>
Desired results:
<output>
<startOfProgram>2018-11-25T13:55:00</startOfProgram>
<startOfProgram>2018-11-25T13:56:00</startOfProgram>
<startOfProgram>2018-11-25T13:57:00</startOfProgram>
<startOfProgram>2018-11-27T17:00:00</startOfProgram>
</output>
I know this is a long shot so if anyone could point me in the right direction or help me with one part of the problem I'd be very grateful.
There are lots of other elements in the sample that I have taken out and that are also carried through to the output. If it matters I can include a variety of them.
PS. Note that the value could just as easily be 2018-11-25T18:30:00, which would then need to stay 2018-11-25T18:30:00 for the first occurrence and become 2018-11-25T18:31:00 for the next one, and so on if there are more of the same.
The result you have shown looks as if you want to group the values as xs:dateTime values and then simply add one minute to each item in the group depending on the position:
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
exclude-result-prefixes="#all"
expand-text="yes"
version="3.0">
<xsl:mode on-no-match="shallow-copy"/>
<xsl:output method="xml" indent="yes"/>
<xsl:template match="programs">
<output>
<xsl:for-each-group select="event/start/xs:dateTime(.)" group-by=".">
<xsl:for-each select="current-group()">
<startOfProgram>{. + (position() - 1) * xs:dayTimeDuration('PT1M')}</startOfProgram>
</xsl:for-each>
</xsl:for-each-group>
</output>
</xsl:template>
</xsl:stylesheet>
See https://xsltfiddle.liberty-development.net/pPqsHUv/1. The above is XSLT 3, but for an XSLT 2 processor I think you only need to change the text value template I have used to an xsl:value-of:
<startOfProgram><xsl:value-of select=". + (position() - 1) * xs:dayTimeDuration('PT1M')"/></startOfProgram>
See http://xsltransform.hikmatu.com/6qVRKvJ

Merging and inheriting parameters

Using xslt version 3.0 (saxon):
I have something like the following
<root>
<template ID='1'>
<params>
<a>1</a>
<b>1</b>
</params>
</template>
<document1 templateID='1'>
<params>
<b>4</b>
<c>5</c>
</params>
</document1>
</root>
Basically I need to convert this into something like
<root>
<document1 templateID='1'>
<params>
<a>1</a>
<b>4</b>
<c>5</c>
</params>
</document1>
</root>
In the example, parameter a is inherited from the template, parameter b is overwritten by the document itself, and parameter c is not known or set in the template. It is akin to inheritance or how CSS works. I hope you get the idea. Before starting the task I thought this should not be too difficult (and I am still hoping I'm just overlooking something).
I have tried concatenating the two node sets (using nodeset1, nodeset2 to preserve the order) and filtering by name with preceding-sibling, but this strategy does not seem to work because the nodes are not actual siblings. Could this be done with a clever group-by? Can it be done at all? (I think it can.)
I am using xslt version 3.0 (saxon)
I think you want to group or merge, merging in XSLT 3 would be
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
exclude-result-prefixes="#all"
version="3.0">
<xsl:output indent="yes"/>
<xsl:mode on-no-match="shallow-copy"/>
<xsl:key name="template-by-id" match="template" use="@ID"/>
<xsl:template match="template"/>
<xsl:template match="*[@templateID]/params">
<xsl:copy>
<xsl:merge>
<xsl:merge-source name="template" select="key('template-by-id', ../@templateID)/params/*">
<xsl:merge-key select="string(node-name())"/>
</xsl:merge-source>
<xsl:merge-source name="doc" select="*">
<xsl:merge-key select="string(node-name())"/>
</xsl:merge-source>
<xsl:merge-action>
<xsl:copy-of select="(current-merge-group('doc'), current-merge-group('template'))[1]"/>
</xsl:merge-action>
</xsl:merge>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
https://xsltfiddle.liberty-development.net/jyH9rN8/
grouping would be
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
exclude-result-prefixes="#all"
version="3.0">
<xsl:output indent="yes"/>
<xsl:mode on-no-match="shallow-copy"/>
<xsl:key name="template-by-id" match="template" use="@ID"/>
<xsl:template match="template"/>
<xsl:template match="*[@templateID]/params">
<xsl:copy>
<xsl:for-each-group select="key('template-by-id', ../@templateID)/params/*, *" group-by="node-name()">
<xsl:copy-of select="head((current-group()[2], .))"/>
</xsl:for-each-group>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
https://xsltfiddle.liberty-development.net/jyH9rN8/1
I think, as xsl:merge requires input to be sorted on any merge key or to sort the input first, the grouping above is easier and more reliable, unless your params child elements are really named with sorted letters or words from the alphabet.
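If the merge variant is preferred even though the params children are not guaranteed to be in key order, XSLT 3 lets each xsl:merge-source sort its input first through the sort-before-merge attribute. The following is a sketch of the merge stylesheet above with that attribute added, assuming the same input structure; note that the merged children then come out ordered by element name rather than in document order.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  exclude-result-prefixes="#all"
  version="3.0">
  <xsl:output indent="yes"/>
  <xsl:mode on-no-match="shallow-copy"/>
  <xsl:key name="template-by-id" match="template" use="@ID"/>
  <!-- Drop the template elements from the output -->
  <xsl:template match="template"/>
  <xsl:template match="*[@templateID]/params">
    <xsl:copy>
      <xsl:merge>
        <!-- sort-before-merge="yes" sorts each source on its merge key first,
             so unsorted params children no longer violate the merge precondition -->
        <xsl:merge-source name="template"
            select="key('template-by-id', ../@templateID)/params/*"
            sort-before-merge="yes">
          <xsl:merge-key select="string(node-name())"/>
        </xsl:merge-source>
        <xsl:merge-source name="doc" select="*" sort-before-merge="yes">
          <xsl:merge-key select="string(node-name())"/>
        </xsl:merge-source>
        <xsl:merge-action>
          <!-- Prefer the document's own parameter, fall back to the template's -->
          <xsl:copy-of select="(current-merge-group('doc'), current-merge-group('template'))[1]"/>
        </xsl:merge-action>
      </xsl:merge>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
```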

Exponent or power calculation in saxon9HE

Please suggest how to do math functions like power in XSLT2 with Saxon 9HE.
Getting following error:
Cannot find a matching 2-argument function named {http://exslt.org/math}power()
XML:
<root><num>12.3</num></root>
XSLT 2.0:
<xsl:stylesheet version="2.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:exsl="http://exslt.org/common"
xmlns:math="http://exslt.org/math"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
extension-element-prefixes="exsl math">
<xsl:output method="text" encoding="UTF-8"/>
<xsl:strip-space elements="*"/>
<!--1/(1+e^(-t))--><!-- this is required formula -->
<xsl:template match="num">
<xsl:variable name="varE"><xsl:value-of select="."/></xsl:variable>
<xsl:variable name="varT"><xsl:text>0.718</xsl:text></xsl:variable>
<xsl:variable name="varPower">
<xsl:value-of select="1 div (1 + math:power(number($varE), number(-$varT)))"/>
</xsl:variable>
<xsl:value-of select="$varPower"/>
</xsl:template>
</xsl:stylesheet>
There are standardized XPath math functions such as math:pow, e.g. math:pow(2, 4), in the namespace http://www.w3.org/2005/xpath-functions/math, i.e. with the namespace declaration xmlns:math="http://www.w3.org/2005/xpath-functions/math". They are available in all editions of Saxon, at least with 9.8, but I think it also works with earlier versions like 9.7 and 9.6 (the documentation at http://saxonica.com/html/documentation9.6/functions/math/pow.html says since 9.6 in all editions).
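As a sketch of how the stylesheet from the question might look with that namespace, keeping the question's variable names and values and assuming a Saxon release that exposes these functions to an XSLT 2.0 stylesheet:

```xml
<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:math="http://www.w3.org/2005/xpath-functions/math"
  exclude-result-prefixes="math">
  <xsl:output method="text" encoding="UTF-8"/>
  <xsl:strip-space elements="*"/>
  <xsl:template match="num">
    <!-- Same values as in the question, but held as numbers rather than result tree fragments -->
    <xsl:variable name="varE" select="number(.)"/>
    <xsl:variable name="varT" select="0.718"/>
    <!-- math:pow replaces the unavailable EXSLT math:power -->
    <xsl:value-of select="1 div (1 + math:pow($varE, -$varT))"/>
  </xsl:template>
</xsl:stylesheet>
```

If t in the formula 1/(1+e^(-t)) is meant to be the value of num, then math:exp(-number(.)) from the same namespace would compute e^(-t) directly; that reading of the formula is an assumption.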

Identify values that dont match in all Nodes and Attributes: XSLT2.0

I need to go over all the XML attributes and text nodes, check them for the characters in a list, and output the characters that did not match.
I am able to check the text() nodes but I am not able to perform the check on attributes.
<xsl:template match="@*|node()">
<xsl:variable name="getDelimitersToUseNodes" select="('$' ,'#' ,'*' ,'~')[not(contains(current(),.))]"/>
<xsl:variable name="getDelimitersToUseAttr" select="string-join(('$','#','*','~')[not(contains(@*/,.))],',')"/>
<xsl:variable name="getDelimitersToUse" select="concat(string-join($getDelimitersToUseNodes,','),',',string-join($getDelimitersToUseAttr,','))"/>
<!--xsl:variable name="delim" select="distinct-values($getDelimitersToUse,',')"/-->
<xsl:value-of select="$getDelimitersToUse"/>
</xsl:template>
My mocked up sample file is below
<?xml version="1.0"?>
<sample>
<test1 name="#theGoofy">My$#test</test1>
<test2 value="$##">description test2*</test2>
</sample>
You could process all those text and attribute nodes and make the same check as before. You haven't really said which output format you want; assuming text, you could use
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xsl:param name="characters" as="xs:string*" select="'$' ,'#' ,'*' ,'~'"/>
<xsl:output method="text"/>
<xsl:strip-space elements="*"/>
<xsl:template match="/">
<xsl:apply-templates select="//text() | //@*"/>
</xsl:template>
<xsl:template match="text() | @*">
<xsl:value-of select="'Text', ., 'does not contain', $characters[not(contains(current(), .))], '&#10;'"/>
</xsl:template>
</xsl:stylesheet>
to get a result like
Text #theGoofy does not contain $ * ~
Text My$#test does not contain * ~
Text $## does not contain * ~
Text description test2* does not contain $ # ~
If you simply want to check all characters not contained in all text nodes and attribute nodes then an approach like
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xsl:param name="characters" as="xs:string*" select="'$' ,'#' ,'*' ,'~'"/>
<xsl:output method="text"/>
<xsl:strip-space elements="*"/>
<xsl:variable name="nodes-to-inspect" as="node()*" select="//text() | //@*"/>
<xsl:template match="/">
<xsl:value-of select="for $c in $characters return $c[not($nodes-to-inspect[contains(., $c)])]"/>
</xsl:template>
</xsl:stylesheet>
should do.

Resources