Use xmlstarlet to convert XML containing repeated and missing fields to tab-delimited

I have a large, complex XML file containing a pattern like the following:
<?xml version="1.0" encoding="UTF-8"?>
<records>
<record>
<field1>field1</field1>
<field2>field2</field2>
<field2>field2</field2>
<field3>field3</field3>
<field4>field4</field4>
<field4>field4</field4>
</record>
<record>
<field1>field1</field1>
<field1>field1</field1>
<field3>field3</field3>
<field4>field4</field4>
<field4>field4</field4>
</record>
</records>
I would like to use xmlstarlet to convert it to tab-delimited format, with repeated fields sub-delimited by a semicolon, e.g.
field1\tfield2;field2\tfield3\tfield4;field4
field1;field1\t\tfield3\tfield4;field4
I can do what I need by collapsing repeated fields with a string processing routine before feeding the file to xmlstarlet but that feels hacky. Is there an elegant way to do it all in xmlstarlet?

It's been a while since you asked. Nevertheless...
Using xmlstarlet version 1.6.1 to extract and sort field names
to determine field order, and then produce tab-separated values:
xmlstarlet sel \
    -N set="http://exslt.org/sets" -N str="http://exslt.org/strings" \
    -T -t \
    --var fldsep="'$(printf "\t")'" \
    --var subsep="';'" \
    --var allfld \
      -m '*/record/*' \
        -v 'name()' --nl \
      -b \
    -b \
    --var uniqfld='set:distinct(str:split(normalize-space($allfld)))' \
    -m '*/record' \
      --var rec='.' \
      -m '$uniqfld' \
        --sort 'A:T:-' '.' \
        -v 'substring($fldsep,1,position()!=1)' \
        -m '$rec/*[local-name() = current()]' \
          -v 'substring($subsep,1,position()!=1)' \
          -v '.' \
        -b \
      -b \
    --nl < file.xml
EDIT: --sort moved from $allfld to -m $uniqfld.
where:
all field names in the input are collected in the $allfld variable
the EXSLT functions set:distinct and str:split are used to create a nodeset of unique field names from $allfld
the $uniqfld nodeset determines the field output order
repeated fields are output in document order here, but -m '$rec/*[…]' accepts a --sort clause
the substring expression emits the separator for every field but the first (see the short demo after this list)
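The separator idiom works because position()!=1 is a boolean, which XPath converts to 1 or 0 when used as the length argument of substring(), yielding the whole one-character separator for all nodes but the first. A minimal standalone illustration, with a made-up three-item input:

printf '<r><i>a</i><i>b</i><i>c</i></r>' |
xmlstarlet sel -T -t \
    -m '/r/i' \
      -v "substring(',',1,position()!=1)" -v '.' \
    -b \
    --nl
# prints: a,b,c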
Given the following input, which is different from OP's,
<?xml version="1.0" encoding="UTF-8"?>
<records>
<record>
<field2>r1-f2A</field2>
<field2>r1-f2B</field2>
<field3>r1-f3</field3>
<field4>r1-f4A</field4>
<field4>r1-f4B</field4>
<field6>r1-f6</field6>
</record>
<record>
<field1>r2-f1A</field1>
<field1>r2-f1B</field1>
<field3/>
<field4>r2-f4A</field4>
<field4>r2-f4B</field4>
<field5>r2-f5</field5>
<field5>r2-f5</field5>
</record>
<record>
<field6>r3-f6</field6>
<field4>r3-f4</field4>
<field2>r3-f2B</field2>
<field2>r3-f2A</field2>
</record>
</records>
output becomes:
r1-f2A;r1-f2B r1-f3 r1-f4A;r1-f4B r1-f6
r2-f1A;r2-f1B r2-f4A;r2-f4B r2-f5;r2-f5
r3-f2B;r3-f2A r3-f4 r3-f6
or, piped through sed -n l to show non-printables,
\tr1-f2A;r1-f2B\tr1-f3\tr1-f4A;r1-f4B\t\tr1-f6$
r2-f1A;r2-f1B\t\t\tr2-f4A;r2-f4B\tr2-f5;r2-f5\t$
\tr3-f2B;r3-f2A\t\tr3-f4\t\tr3-f6$
Using a custom field output order, things can be reduced to:
xmlstarlet sel -T -t \
    --var fldsep="'$(printf "\t")'" \
    --var subsep="';'" \
    -m '*/record' \
      --var rec='.' \
      -m 'str:split("field6 field4 field2 field1 field3 field5")' \
        -v 'substring($fldsep,1,position()!=1)' \
        -m '$rec/*[local-name() = current()]' \
          -v 'substring($subsep,1,position()!=1)' \
          -v '.' \
        -b \
      -b \
    --nl < file.xml
again emitting repeated fields in document order in the absence of a --sort clause.
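Applied to the sample input above, the expected output for this custom order (non-printables again shown via sed -n l) is:

r1-f6\tr1-f4A;r1-f4B\tr1-f2A;r1-f2B\t\tr1-f3\t$
\tr2-f4A;r2-f4B\t\tr2-f1A;r2-f1B\t\tr2-f5;r2-f5$
r3-f6\tr3-f4\tr3-f2B;r3-f2A\t\t\t$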
Note that using an EXSLT function in a --var clause requires the
corresponding namespace to be declared explicitly with -N to avoid
a runtime error (why?).
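For instance, this minimal sketch only works with the -N declaration in place (the input file is irrelevant here; any well-formed XML will do):

xmlstarlet sel -N str="http://exslt.org/strings" -T -t \
    --var words='str:split("a b")' \
    -m '$words' -v '.' --nl \
    < file.xml
# prints "a" and "b" on separate lines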
To list the generated XSLT 1.0 / EXSLT code add -C before the -t option.
To make one-liners of the formatted xmlstarlet commands above -
stripping line continuation chars and leading whitespace - pipe
them through this sed command:
sed -e ':1' -e 's/^[[:blank:]]*//;/\\$/!b;$b;N;s/\\\n[[:blank:]]*//;b1'
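For example, flattening a two-line continuation:

printf 'echo \\\n    hello\n' |
sed -e ':1' -e 's/^[[:blank:]]*//;/\\$/!b;$b;N;s/\\\n[[:blank:]]*//;b1'
# prints: echo hello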
To list the elements in the input file, xmlstarlet's el command can be
useful (-u for unique):
xmlstarlet el -u file.xml
Output:
records
records/record
records/record/field1
records/record/field2
records/record/field3
records/record/field4
records/record/field5
records/record/field6

You can use the following xmlstarlet command:
xmlstarlet sel -t -m "/records/record" \
    -m "*[starts-with(local-name(),'field')]" \
    -i "position()=1" -v "." \
    --else -i "local-name() = local-name(preceding-sibling::*[1])" \
    -v "concat(';',.)" \
    --else -v "concat('\t',.)" \
    -b -b -b -n input.xml
In pseudo-code, it represents something like this:
For each /records/record
    For each child element whose name starts with "field"
        If it is the first element, output the item
        Else
            If the current element name equals the preceding one
                output ;item
            Else
                output \titem
-b means "break" out of the enclosing loop or if clause
-n outputs a newline
Its output is
field1\tfield2;field2\tfield3\tfield4;field4
field1;field1\tfield3\tfield4;field4
Note that XPath string literals have no escape sequences, so concat('\t',.) emits a literal backslash-t rather than a tab character; to get real tabs, splice a tab in from the shell, e.g. with $(printf '\t') as the first answer does. Also, a missing field (field2 in the second record) produces no empty column here, unlike the output requested in the question.
The above code differentiates based on element names. If you want to differentiate based on element values instead, use the following command:
xmlstarlet sel -t -m "/records/record" \
    -m "*[starts-with(local-name(),'field')]" \
    -i "position()=1" -v "." \
    --else -i ". = preceding-sibling::*[1]" \
    -v "concat(';',.)" \
    --else -v "concat('\t',.)" \
    -b -b -b -n input.xml
With your example XML the output is the same (because field names and field values are identical).
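The two commands diverge on inputs where adjacent elements have different names but identical values; a hypothetical example:

<record><fieldA>x</fieldA><fieldB>x</fieldB></record>

The name-based command prints x\tx (the names differ), while the value-based one prints x;x (the values are equal).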

Related

Regular expression to fetch the Jenkins node workDir

I am trying to fetch the secret and workDir of a Jenkins node in a script:
curl -L -s -u user:password -X GET https://cje.example.com/computer/jnlpAgentTest/slave-agent.jnlp
This gives
<jnlp codebase="https://cje.example.com/computer/jnlpAgentTest/" spec="1.0+">
<information>
<title>Agent for jnlpAgentTest</title>
<vendor>Jenkins project</vendor>
<homepage href="https://jenkins-ci.org/"/>
</information>
<security>
<all-permissions/>
</security>
<resources>
<j2se version="1.8+"/>
<jar href="https://cje.example.com/jnlpJars/remoting.jar"/>
</resources>
<application-desc main-class="hudson.remoting.jnlp.Main">
<argument>b8c80148ce36de10c9358384fac9e28fbba941055a9a6ab2277e75ddc29a8744</argument>
<argument>jnlpAgentTest</argument>
<argument>-workDir</argument>
<argument>/tmp/jnlpAgenttest</argument>
<argument>-internalDir</argument>
<argument>remoting</argument>
<argument>-url</argument>
<argument>https://cje.example.com/</argument>
</application-desc>
</jnlp>
Here I want to fetch the secret and the workDir. The secret can be fetched using this command:
curl -L -s -u admin:password -H "Jenkins-Crumb:dac7ce5614f8cb32a6ce75153aaf2398" -X GET https://owood.ci.cloudbees.com/computer/jnlpAgentTest/slave-agent.jnlp | sed "s/.*<application-desc main-class=\"hudson.remoting.jnlp.Main\"><argument>\([a-z0-9]*\).*/\1/"
b8c80148ce36de10c9358384fac9e28fbba941055a9a6ab2277e75ddc29a8744
But I couldn't find a way to fetch the workDir value, here /tmp/jnlpAgenttest, which appears in the <argument> element immediately after -workDir.
You can use
sed -n '/-workDir/{n;s/.*>\([^<>]*\)<.*/\1/p}'
or, with the block split into separate -e expressions so it also works in non-GNU sed:
sed -n -e '/-workDir/{n' -e 's/.*>\([^<>]*\)<.*/\1/p' -e '}'
Details:
/-workDir/ - finds the line containing -workDir
n - reads the next line into the pattern space, discarding the previous one
s/.*>\([^<>]*\)<.*/\1/ - matches any text up to >, captures into Group 1 zero or more chars other than < and >, then matches < and the rest of the string, replacing the whole line with the Group 1 value
p - prints the result of the substitution
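Putting it together with the curl command from the question (this assumes the pretty-printed, one-argument-per-line form shown above):

curl -L -s -u user:password https://cje.example.com/computer/jnlpAgentTest/slave-agent.jnlp |
sed -n '/-workDir/{n;s/.*>\([^<>]*\)<.*/\1/p}'
# prints: /tmp/jnlpAgenttest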

Exporting encrypted SNMPv3 traps to JSON with TShark

I have a pcap file with recordings of encrypted SNMPv3 traps from Wireshark (Version 3.2.2). For analyzing the traps, I want to export the protocol data to JSON using tshark.
tshark.exe -T ek -Y "snmp" -P -V -x -r input.pcap > output.json
Currently, I supply the information needed to decrypt the packets via the "snmp_users" file in C:\Users\developer\AppData\Roaming\Wireshark.
# This file is automatically generated, DO NOT MODIFY.
,"snmp_user","SHA1","xxxxxx","AES","yyyyyyy"
Is it possible to supply the options via commandline?
I have tried:
tshark.exe -T ek -Y "snmp" -P -V -x -o "snmp.users_table.username:snmp_user" ...
But that causes an error:
tshark: -o flag "snmp.users_table.username:snmp_user" specifies unknown preference
Update 16.09.2020:
Option -Y used instead of -J:
-Y|--display-filter
Cause the specified filter (which uses the syntax of read/display
filters, rather than that of capture filters) to be applied before
printing a decoded form of packets or writing packets to a file.
You need to specify the option as a User Access Table or uat, with the specific table being the name of the file, namely snmp_users. So, for example:
On Windows:
tshark.exe -o "uat:snmp_users:\"\",\"snmp_user\",\"SHA1\",\"xxxxxx\",\"AES\",\"yyyyyyy\"" -T ek -J "snmp" -P -V -x -r input.pcap > output.json
And on *nix:
tshark -o 'uat:snmp_users:"","snmp_user","SHA1","xxxxxx","AES","yyyyyyy"' -T ek -J "snmp" -P -V -x -r input.pcap > output.json
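For what it's worth, the uat record appears to mirror one line of the snmp_users file, field for field; presumably engine ID (left empty here), username, authentication model, authentication password, privacy protocol, and privacy password:

# line from the generated snmp_users file:
,"snmp_user","SHA1","xxxxxx","AES","yyyyyyy"
# the equivalent command-line option:
-o 'uat:snmp_users:"","snmp_user","SHA1","xxxxxx","AES","yyyyyyy"'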
Unfortunately, the Wireshark documentation currently falls short in describing the uat option. There is a Google Summer of Code project underway, though, in which Wireshark is participating, so perhaps the documentation will be improved here.

Lex compilation error in iOS example of OpenCASCADE library in Xcode

I'm trying to build an iOS example of the OpenCASCADE library on macOS. Xcode versions used: 10.2, 10.3, 11.1. Right now I'm getting the following types of errors:
../occt_lib/src/BRepFeat/BRepFeat_MakeCylindricalHole.lxx:60: bad character: =
../occt_lib/src/BRepFeat/BRepFeat_MakeCylindricalHole.lxx:60: bad character: =
../occt_lib/src/BRepFeat/BRepFeat_MakeCylindricalHole.lxx:60: bad character: =
../occt_lib/src/BRepFeat/BRepFeat_MakeCylindricalHole.lxx:60: bad character: =
../occt_lib/src/BRepFeat/BRepFeat_MakeCylindricalHole.lxx:60: bad character: =
../occt_lib/src/BRepFeat/BRepFeat_MakeCylindricalHole.lxx:60: bad character: =
../occt_lib/src/BRepFeat/BRepFeat_MakeCylindricalHole.lxx:60: bad character: =
../occt_lib/src/BRepFeat/BRepFeat_MakeCylindricalHole.lxx:60: bad character: =
../occt_lib/src/BRepFeat/BRepFeat_MakeCylindricalHole.lxx:62: name defined twice
../occt_lib/src/BRepFeat/BRepFeat_MakeCylindricalHole.lxx:63: bad character: {
../occt_lib/src/BRepFeat/BRepFeat_MakeCylindricalHole.lxx:65: bad character: }
../occt_lib/src/BRepFeat/BRepFeat_MakeCylindricalHole.lxx:66: premature EOF
flex: error deleting output file ../project.build/DerivedSources/BRepFeat_MakeCylindricalHole.yy.cxx
Command ../XcodeDefault.xctoolchain/usr/bin/lex failed with exit code 1
Possible reasons in my opinion:
1) I don't have all of the files in the project (I've checked, so this shouldn't be the reason)
2) Xcode doesn't treat .lxx files in the proper way.
Within OCCT file-name conventions, .lxx is the extension for inline C++ header files, included by co-named .hxx header files. The BRepFeat package has no .yacc/.lex files at all, so BRepFeat_MakeCylindricalHole.yy.cxx should not exist; see the quick check sketched below.
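Using the source layout from the error messages, this should confirm the package ships no lex/yacc inputs and the generated .yy.cxx file is spurious:

find occt_lib/src/BRepFeat -name '*.lex' -o -name '*.yacc'
# no output expected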
It looks like the issue is somewhere within the build routine (CMake or Tcl script) generating the Xcode project / Makefile. It is unclear from the question whether the issue happens while building OCCT itself (and which steps have been taken) or while building the iOS sample (is it the one coming with OCCT, or one written from scratch?).
A CMake build of OCCT can be configured via the following cross-compilation toolchain and pseudo-bash script:
https://github.com/leetal/ios-cmake
export IPHONEOS_DEPLOYMENT_TARGET=8.0
aFreeType=$HOME/freetype-2.7.1-ios
cmake -G "Unix Makefiles" \
-D CMAKE_TOOLCHAIN_FILE:FILEPATH="$HOME/ios-cmake.git/ios.toolchain.cmake" \
-D PLATFORM:STRING="OS64" \
-D ARCHS:STRING="arm64" \
-D IOS_DEPLOYMENT_TARGET:STRING="$IPHONEOS_DEPLOYMENT_TARGET" \
-D ENABLE_VISIBILITY:BOOL="TRUE" \
-D CMAKE_C_USE_RESPONSE_FILE_FOR_OBJECTS:BOOL="OFF" \
-D CMAKE_CXX_USE_RESPONSE_FILE_FOR_OBJECTS:BOOL="OFF" \
-D CMAKE_BUILD_TYPE:STRING="Release" \
-D BUILD_LIBRARY_TYPE:STRING="Static" \
-D INSTALL_DIR:PATH="work/occt-ios-install" \
-D INSTALL_DIR_INCLUDE:STRING="inc" \
-D INSTALL_DIR_LIB:STRING="lib" \
-D INSTALL_DIR_RESOURCE:STRING="src" \
-D INSTALL_NAME_DIR:STRING="#executable_path/../Frameworks" \
-D 3RDPARTY_FREETYPE_DIR:PATH="$aFreeType" \
-D 3RDPARTY_FREETYPE_INCLUDE_DIR_freetype2:FILEPATH="$aFreeType/include" \
-D 3RDPARTY_FREETYPE_INCLUDE_DIR_ft2build:FILEPATH="$aFreeType/include" \
-D 3RDPARTY_FREETYPE_LIBRARY_DIR:PATH="$aFreeType/lib" \
-D USE_FREEIMAGE:BOOL="OFF" \
-D BUILD_MODULE_Draw:BOOL="OFF" \
"/Path/To/OcctSourceCode"
aNbJobs="$(getconf _NPROCESSORS_ONLN)"
make -j $aNbJobs
make install

Grep sorted files in S3 folder

I'd like to sort files in S3 folders and then check if files contain a certain string.
When I usually want to grep a file I do the following:
aws s3 cp s3://s3bucket/location/file.csv.gz - | zcat | grep 'string_to_find'
I see I can sort files like this:
aws s3api list-objects-v2 \
--bucket s3bucket \
--prefix location \
--query 'reverse(sort_by(Contents,&LastModified))'
I tried something like this so far, but got a broken pipe:
aws s3api list-objects-v2 \
--bucket s3bucket \
--prefix location \
--query 'reverse(sort_by(Contents,&LastModified))' | cp - | zcat | grep 'string_to_find'
You can specify which fields to output and force text-only output:
aws s3api list-objects-v2 \
--bucket s3bucket \
--prefix location \
--query 'reverse(sort_by(Contents,&LastModified))[].[Key]' \
--output text
Basically, the sort_by and reverse output the Contents array, and this extracts the Key element. I put [Key] in square brackets to force each result onto its own line.
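A sketch combining this listing with the grep pipeline from the question (assuming gzipped CSV objects, as in the original command):

aws s3api list-objects-v2 \
    --bucket s3bucket \
    --prefix location \
    --query 'reverse(sort_by(Contents,&LastModified))[].[Key]' \
    --output text |
while read -r key; do
    aws s3 cp "s3://s3bucket/$key" - | zcat | grep 'string_to_find'
done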

Run cvb in Mahout 0.8

The current Mahout 0.8-SNAPSHOT includes a Collapsed Variational Bayes (cvb) version for topic modeling and removed the Latent Dirichlet Allocation (lda) approach, because cvb can be parallelized much better. Unfortunately there is only documentation for lda on how to run an example and generate meaningful output.
Thus, I want to:
preprocess some texts correctly
run the cvb0_local version of cvb
inspect the results by looking at the top n words in each of the generated topics
So here are the Mahout commands I had to call, in sequence, in a Linux shell to do it.
$MAHOUT_HOME points to my mahout/bin folder.
$MAHOUT_HOME/mahout seqdirectory \
-i path/to/directory/with/texts \
-o out/sequenced
$MAHOUT_HOME/mahout seq2sparse -i out/sequenced \
-o out/sparseVectors \
--namedVector \
-wt tf
$MAHOUT_HOME/mahout rowid \
-i out/sparseVectors/tf-vectors/ \
-o out/matrix
$MAHOUT_HOME/mahout cvb0_local \
-i out/matrix/matrix \
-d out/sparseVectors/dictionary.file-0 \
-a 0.5 \
-top 4 -do out/cvb/do_out \
-to out/cvb/to_out
Inspect the output by showing the top 10 words of each topic:
$MAHOUT_HOME/mahout vectordump \
-i out/cvb/to_out \
--dictionary out/sparseVectors/dictionary.file-0 \
--dictionaryType sequencefile \
--vectorSize 10 \
-sort out/cvb/to_out
Thanks to JoKnopp for the detailed commands.
If you get:
Exception in thread "main" java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.String
you need to add the command line option "maxIterations":
--maxIterations (-m) maxIterations
I use -m 20 and it works
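For reference, this is the cvb0_local invocation from above with the iteration cap added:

$MAHOUT_HOME/mahout cvb0_local \
    -i out/matrix/matrix \
    -d out/sparseVectors/dictionary.file-0 \
    -a 0.5 \
    -top 4 \
    -m 20 \
    -do out/cvb/do_out \
    -to out/cvb/to_out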
refer to:
https://issues.apache.org/jira/browse/MAHOUT-1141
