xmllint extract value without xpath - xml-parsing

I need to verify the text value of an element in my XML, as shown below. So far I can get the value of every dmdindex:field, but I only need the last one, drep.rightsFacet. Keep in mind that the xmllint version I have does not support --xpath, so I have to resort to xmllint --shell. Any help is appreciated.
here's the xml file snippet:
<?xml version="1.0" encoding="UTF-8"?>
<dmdindex:dmdindex xmlns:dmdindex="http://www.example.com/example/dmdindex/"
xmlns:functx="http://www.functx.com"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<dmdindex:record>
<dmdindex:field name="drep.yearStart">1881</dmdindex:field>
<dmdindex:field name="drep.pubConcat">[Detroit : Parke, Davis &amp; Co., 1881?]</dmdindex:field>
<dmdindex:field name="drep.rights">blahblahblah</dmdindex:field>
<dmdindex:field name="drep.rightsLink">https://creativecommons.org/publicdomain/mark/1.0/</dmdindex:field>
<dmdindex:field name="drep.rightsFacet">Public domain</dmdindex:field>
</dmdindex:record>
</dmdindex:dmdindex>
here's the command I used
echo "cat //*[local-name()='dmdindex']/*[local-name()='record']/*[local-name()='field']" | xmllint --shell example.xml | sed '/^\/ >/d' | sed 's/<[^>]*.//g'
which returned
-------
Philadelphia : Sunshine Press, [1897]
-------
blahblahblah
-------
https://creativecommons.org/publicdomain/mark/1.0/
-------
Public domain
I only need to grab the value of "Public domain" and ignore the rest. thanks!

You can add a predicate to your xpath to test the value of the name attribute...
echo "cat //*[local-name()='dmdindex']/*[local-name()='record']/*[local-name()='field'][@name='drep.rightsFacet']/text()" | xmllint --shell example.xml | sed '/^\/ >/d'

Related

Grep: Count the number of times a string occurs if another string does not occur

I have a set of many .json.gz files. In each file, there are entries such as this:
{"type":"e1","public":true, "login":"username1", "org":{"dict","of":"lots_of_things"}}
{"type":"e2","public":true, "login":"username2"}
No matter where in each nested dict "login" appears, I want to be able to detect it and take the username, only if the key "org" does not exist anywhere in the nested dict. I also want to count the number of times each username appears in the files.
My final output should be a file of dicts that looks like this:
{'username2': 1}
because of course username1 wouldn't be counted: the key "org" appears in its dict.
I'm looking for something like:
zgrep -Rv "org" . | zgrep -o 'login":"[^"]*"' /path/to/files/* | cut -d'"' -f3 | sort | uniq -c | sed '1i{
s/\s*\([0-9]*\)\s*\(.*\)/"\2": \1,/;$a}' > outputfile.txt
I'm not sure about this part:
zgrep -Rv "org" . |
The rest successfully creates the type of file I'm looking for. I'm just unsure about the order of operations here.
EDIT
I should have been more clear, I apologize. There are also often multiple instances of the key "login" per main dict object. For example (using "k" for any key that is not login and not org, and using "v" for a value):
{"k":"v","k":{"k":{"k":"v","login":"username1"},"k":"v"},"k":{"k":"v","login":"username2"}}
{"k":{"k":"v","k":"v"},"k":{"org":{"k":"v","k":"v","login":"username3"},"k":"v"},"k":{"k":"v","login":"username4"}}
{"k":{"k":"v"},"k":{"k":{"k":"v","login":"username1"},"login":"username2"}}
Since the key org appears in the second dict, I want to exclude usernames 3 and 4 from the dict I make and save to a file.
For example, I want this in a file:
{'username1': 2}
{'username2': 2}
An AWK solution, replacing zgrep -R with the more reliable find:
find . -type f -name "*.json.gz" -print0 | xargs -0 zgrep -v -h '"org"' | awk '{ if ( match($0,/"login":"[^"]+"/) ) logins[substr($0,RSTART+8,RLENGTH-8)]++; } END { for ( i in logins ) print("{" i ":" logins[i] "}"); }'
Example output:
{"username2":1}
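The awk command above records only the first "login" on each line, while the EDIT in the question shows several per line. A minimal sketch of one way to count every occurrence, assuming uncompressed sample data held in a hypothetical variable named sample (modelled on the question's example) and assuming any line that contains "org" should be dropped entirely:

```shell
# Hypothetical sample data, modelled on the EDIT in the question.
sample='{"k":"v","k":{"k":{"k":"v","login":"username1"},"k":"v"},"k":{"k":"v","login":"username2"}}
{"k":{"org":{"k":"v","login":"username3"},"k":"v"},"k":{"k":"v","login":"username4"}}
{"k":{"k":"v"},"k":{"k":{"k":"v","login":"username1"},"login":"username2"}}'

counts=$(printf '%s\n' "$sample" |
  grep -v '"org"' |
  awk '{
    line = $0
    # Consume every login key on the line, not just the first one.
    while (match(line, /"login":"[^"]+"/)) {
      # Skip the 9 characters of the key prefix and drop the closing quote.
      name = substr(line, RSTART + 9, RLENGTH - 10)
      logins[name]++
      line = substr(line, RSTART + RLENGTH)
    }
  }
  END { for (name in logins) printf "{\"%s\": %d}\n", name, logins[name] }' |
  sort)

printf '%s\n' "$counts"
```

For the real files, the printf would be replaced by the find and zgrep stage from the answer above; the trailing sort is only there because awk's for-in order is unspecified.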
Not grep, but a job for GNU sed with a small script (your data is in the file 'a'):
i=
for e in $(sed -nE '/.*\borg\b.*/!s/.*"login":"(\w+)".*/{\1:}/p' a)
{
let i++;echo ${e/:/:$i}
}
Add '> file' after the closing brace to save the output to a file.
If pcregrep is installed, its richer regex support does the job as well:
pcregrep -io '(?!.*\borg\b.*)(?<="login":")\w+(?=".*)' a
It can replace the sed command in the script above, with the printout adjusted slightly.
This worked:
zgrep -v "org" *.json.gz | zgrep -o 'login":"[^"]*"' | cut -d'"' -f3 | sort | uniq -c | sed '1i{
s/\s*\([0-9]*\)\s*\(.*\)/"\2": \1,/;$a}' > usernames_2011.txt
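The difference from the attempt in the question is that the second grep no longer receives file arguments, so it reads the filtered stream from the pipe. A small sketch of that effect, using plain grep on a hypothetical temporary file instead of zgrep on .json.gz files:

```shell
# Hypothetical two-line input: one record with "org", one without.
tmpfile=$(mktemp)
printf '%s\n' '{"login":"username1","org":{}}' '{"login":"username2"}' > "$tmpfile"

# With a file argument, the second grep ignores the pipe and rereads the file.
ignored=$(grep -v '"org"' "$tmpfile" | grep -o 'login":"[^"]*"' "$tmpfile" | cut -d'"' -f3)

# Without a file argument, the second grep reads the already filtered lines.
filtered=$(grep -v '"org"' "$tmpfile" | grep -o 'login":"[^"]*"' | cut -d'"' -f3)

rm -f "$tmpfile"
printf 'file argument given: %s\n' $ignored
printf 'reading from pipe:   %s\n' $filtered
```

Only the second form drops username1, which is why the order of operations in the pipeline matters.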

Restrict grep command to print file name only once

I am interested in finding a string within a specific file type.
The command below serves my purpose.
find /any/path -type f -name "*.log" | xargs grep -B2 -A2 'SUMMARY'
It gives the following output:
--
/path/to/file.log-line1
/path/to/file.log-line2
/path/to/file.log:text SUMMARY text
/path/to/file.log-line1
/path/to/file.log-line2
--
I would like the file name not to be prepended to each line. Is it possible to have the output as below?
--
/path/to/file.log
line1
line2
text SUMMARY text
line1
line2
--
If you're running this under Linux with bash, you could use a bash script like this:
#!/bin/bash
# List each file containing SUMMARY once, then print its matches with context.
grep --include='*.log' -lr 'SUMMARY' /any/path | while IFS= read -r fn; do
    echo "$fn"
    grep -A2 -B2 'SUMMARY' "$fn"
done
This will find all files containing the word "SUMMARY" in a recursive manner starting from the directory "/any/path". All matched files are then printed by name and the matched portion is printed with the second grep line.
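A self-contained sketch of the same two-pass idea, assuming a throwaway directory with two hypothetical log files and using find with -exec to list the matching files:

```shell
# Build a scratch directory with hypothetical log files.
tmpdir=$(mktemp -d)
printf 'a\nb\ntext SUMMARY text\nc\nd\n' > "$tmpdir/one.log"
printf 'nothing to see here\n' > "$tmpdir/two.log"

# Pass 1: grep -l prints each matching file name once.
# Pass 2: grep on a single file prints context lines without a file name prefix.
report=$(find "$tmpdir" -type f -name '*.log' -exec grep -l 'SUMMARY' {} + |
  while IFS= read -r fn; do
    echo "$fn"
    grep -B2 -A2 'SUMMARY' "$fn"
  done)

printf '%s\n' "$report"
rm -rf "$tmpdir"
```

Reading names line by line keeps paths with spaces intact, though names containing newlines would still need a null-delimited variant.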

Grep: Capture just number

I am trying to use grep to capture just a number in a string, but I am having difficulty.
echo "There are <strong>54</strong> cities" | grep -o "([0-9]+)"
How am I supposed to have it return just "54"? I have tried the above grep command and it doesn't work.
echo "You have <strong>54</strong>" | grep -o '[0-9]' sort of works, but it prints
5
4
instead of 54
Don't parse HTML with regex, use a proper parser :
$ echo "There are <strong>54</strong> cities " |
xmllint --html --xpath '//strong/text()' -
OUTPUT:
54
Check RegEx match open tags except XHTML self-contained tags
You need to use the "E" option for extended regex support (or use egrep). On my Mac OSX:
$ echo "There are <strong>54</strong> cities" | grep -Eo "[0-9]+"
54
You also need to think if there are going to be more than one occurrence of numbers in the line. What should be the behavior then?
EDIT 1: since you have now specified the requirement to be a number between <strong> tags, I would recommend using sed. On my platform, grep does not have the "P" option for perl style regexes. On my other box, the version of grep specifies that this is an experimental feature so I would go with sed in this case.
$ echo "There are <strong>54</strong> 12 cities" | sed -rn 's/^.*<strong>\s*([0-9]+)\s*<\/strong>.*$/\1/p'
54
Here "r" is for extended regex.
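To make the difference between the two approaches concrete, here is a small sketch on a line that contains a second number. It assumes a grep that supports -o and a sed that accepts -E for extended regexes (-E is swapped in here for the -r used above); the variable names are illustrative only:

```shell
html='There are <strong>54</strong> 12 cities'

# grep -Eo prints every run of digits, one per line.
all_numbers=$(printf '%s\n' "$html" | grep -Eo '[0-9]+')

# sed keeps only the digits enclosed by the <strong> tags.
tagged=$(printf '%s\n' "$html" |
  sed -En 's/.*<strong>[[:space:]]*([0-9]+)[[:space:]]*<\/strong>.*/\1/p')

printf 'all numbers:\n%s\n' "$all_numbers"
printf 'inside <strong>: %s\n' "$tagged"
```

The grep form returns both 54 and 12, while the sed form returns only 54.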
EDIT 2: If you have the "PCRE" option in your version of grep, you could also utilize the following with positive lookbehinds and lookaheads.
$ echo "There are <strong>54 </strong> 12 cities" | grep -o -P "(?<=<strong>)\s*([0-9]+)\s*(?=<\/strong>)"
54

Use awk to parse and modify every CSV field

I need to parse and modify each field of a CSV header line for a dynamic sqlite create table statement. Below is what works from the command line, with the appropriate output:
echo ",header1,header2,header3"| awk 'BEGIN {FS=","}; {for(i=2;i<=NF;i++){printf ",%s text ", $i}; printf "\n"}'
,header1 text ,header2 text ,header3 text
Well, it breaks when it is run from within a bash shell script. I got it to work by writing the output to a file like below:
echo $optionalHeaders | awk 'BEGIN {FS=","}; {for(i=2;i<=NF;i++){printf ",%s text ", $i}; printf "\n"}' > optionalHeaders.txt
This sucks! There are a lot of examples that show how to parse/modify specific Nth fields. This issue requires each field to be modified. Is there a more concise and elegant Awk one liner that can store its contents to a variable rather than writing to a file?
sed is usually the right tool for simple substitutions on a single line. Take your pick:
$ echo ",header1,header2,header3" | sed 's/[^,][^,]*/& text/g'
,header1 text,header2 text,header3 text
$ echo ",header1,header2,header3" | sed -r 's/[^,]+/& text/g'
,header1 text,header2 text,header3 text
The last one above requires GNU sed (the -r option) to use EREs instead of BREs. You can do the same in awk using gsub() if you prefer:
$ echo ",header1,header2,header3" | awk '{gsub(/[^,]+/,"& text")}1'
,header1 text,header2 text,header3 text
I found the problem and it was me... I forgot to echo the contents of the variable to the Awk command. Brianadams' comment was so simple that it forced me to re-look at my code and find the problem! Thanks!
I am ok with resolving this but if anyone wants to propose a more concise and elegant Awk one liner - that would be cool.
You can try the following:
#! /bin/bash
header=",header1,header2,header3"
newhead=$(awk 'BEGIN {FS=OFS=","}; {for(i=2;i<=NF;i++) $i=$i" text"}1' <<<"$header")
echo "$newhead"
with output:
,header1 text,header2 text,header3 text
Instead of modifying fields one by one, another option is with a simple substitution:
echo ",header1,header2,header3" | awk '{gsub(/[^,]+/, "& text", $0); print}'
That is, replace a sequence of non-comma characters with text appended.
Another alternative would be replacing the commas, but due to the irregularities of your header line (first comma must be left alone, no comma at the end), that's a bit less easy:
echo ",header1,header2,header3" | awk '{gsub(/,/, " text,", $0); sub(/^ text,/, "", $0); print $0 " text"}'
Btw, the rough equivalent of the two commands in sed:
echo ",header1,header2,header3" | sed -e 's/[^,]\{1,\}/& text/g'
echo ",header1,header2,header3" | sed -e 's/\(.\),/\1 text,/g' -e 's/$/ text/'
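Putting the pieces together for the original goal, here is a minimal sketch that captures the modified header in a variable and splices it into a statement. The table name example and the leading id integer column are hypothetical placeholders, not taken from the question:

```shell
optionalHeaders=',header1,header2,header3'

# Command substitution stores awk's output in a variable instead of a file.
columns=$(printf '%s\n' "$optionalHeaders" | awk '{ gsub(/[^,]+/, "& text") } 1')

# Hypothetical statement: table name and first column are placeholders.
create_sql="CREATE TABLE example (id integer${columns});"

printf '%s\n' "$create_sql"
```

Quoting "$optionalHeaders" matters here, since an unquoted expansion would be subject to word splitting before awk ever sees it.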

Groovy Pretty Print XML assertion fails

I'm writing a unit test that verifies whether the XML is formatted correctly, but it is failing and I can't figure out why.
So I decided to try the code from this blog post in the Grails console, and it also fails.
import groovy.xml.*
def prettyXml = '''\
<?xml version="1.0" encoding="UTF-8"?>
<languages>
<language id="1">Groovy</language>
<language id="2">Java</language>
<language id="3">Scala</language>
</languages>
'''
// Pretty print a non-formatted XML String.
def xmlString = '<languages><language id="1">Groovy</language><language id="2">Java</language><language id="3">Scala</language></languages>'
assert XmlUtil.serialize(xmlString) == prettyXml
Assertion fails with:
Assertion failed:
assert XmlUtil.serialize(xmlString) == prettyXml
| | | |
| | | <?xml version="1.0" encoding="UTF-8"?>
| | | <languages>
| | | <language id="1">Groovy</language>
| | | <language id="2">Java</language>
| | | <language id="3">Scala</language>
| | | </languages>
| | false
| <languages><language id="1">Groovy</language><language id="2">Java</language><language id="3">Scala</language></languages>
<?xml version="1.0" encoding="UTF-8"?>
<languages>
<language id="1">Groovy</language>
<language id="2">Java</language>
<language id="3">Scala</language>
</languages>
I'm using Grails 2.2.1, that uses Groovy 2.0.7, on Windows 7.
Maybe it is something related to the OS line separator?
EDIT
I saved both strings to file, and checked with Notepad++
The parsed XML (from XmlUtil) has CR+LF as its line separator, but prettyXml has only LF. I also tested using \n instead of a multi-line declaration, with the same result!
Shouldn't Groovy always use CR+LF, since this is the Windows line separator?
In the Groovy String/GString docs, it says in relation to multi-line literals:
There[sic] are always represented by the character '\n', regardless of
the line-termination conventions of the host system.
They don't really say why, unfortunately.
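The mismatch itself is easy to reproduce outside Groovy. The sketch below is in shell rather than Groovy, and only illustrates the general idea of normalizing line endings before comparing; the strings are shortened stand-ins for the XML in the question:

```shell
# The same text with LF and with CR+LF line endings.
lf_text=$(printf '<languages>\n  <language id="1">Groovy</language>\n</languages>\n')
crlf_text=$(printf '<languages>\r\n  <language id="1">Groovy</language>\r\n</languages>\r\n')

# A byte-for-byte comparison fails because of the carriage returns.
if [ "$lf_text" = "$crlf_text" ]; then raw_equal=yes; else raw_equal=no; fi

# Dropping the carriage returns first makes the two strings identical.
normalized=$(printf '%s' "$crlf_text" | tr -d '\r')
if [ "$lf_text" = "$normalized" ]; then normalized_equal=yes; else normalized_equal=no; fi

printf 'raw: %s, normalized: %s\n' "$raw_equal" "$normalized_equal"
```

The analogous step in the test would be to normalize the line separators of one side before the assertion.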
