XElement Parse error when trying to parse string - xml-parsing

I am getting an XML parse error while trying to parse a string (with CDATA within CDATA):
var cont = "<op><![CDATA[someData<p><![CDATA[someotherData]]></p></op>";
XElement.Parse(cont);
Error:
The 'op' start tag on line 1 position 2 does not match the end tag of 'p'. Line 1, position 52.
Can we have CDATA within CDATA? If we can, why am I getting the error?
The code below works fine (it does not contain CDATA within CDATA):
var cont = "<op><![CDATA[someData]]></op>";
XElement.Parse(cont);

1 <op>
2 <![CDATA[
3 someData
4 <p>
5 <![CDATA[someotherData]]>
6 </p>
7 </op>
When the XML parser encounters the ]]> on line 5, it terminates the first <![CDATA[ it met on line 2. As a result, you can never nest a CDATA section inside another CDATA section.
CDATA is not designed to hold XML elements, but to hold character data that might contain characters such as < and >. It lets us avoid escaping them as &lt; and &gt; respectively, so they can be written and displayed cleanly.
So the content between <![CDATA[ and ]]> is treated as plain text, with no further processing, even if it looks like it has a hierarchy. In other words, it is a plain string. Let's take your code as an example:
var cont = "<op><![CDATA[ <foo><bar></bar></foo> ]]></op>";
var xml=XElement.Parse(cont);
Here the FirstNode of xml is the plain text " <foo><bar></bar></foo> ", and that text node has no child nodes of its own.
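The same behaviour can be checked outside .NET. The following is a minimal sketch in Python, using the standard xml.etree.ElementTree parser instead of XElement (an assumption made purely for illustration). It shows that CDATA content comes back as text with no child elements, and that the "nested" string from the question is rejected.

```python
import xml.etree.ElementTree as ET

# CDATA content comes back as plain text: no child elements are created.
root = ET.fromstring("<op><![CDATA[ <foo><bar></bar></foo> ]]></op>")
cdata_text = root.text
child_count = len(root)

# The "nested" CDATA from the question: the first ]]> closes the only
# CDATA section, so </p> is then seen as a stray end tag.
nested = "<op><![CDATA[someData<p><![CDATA[someotherData]]></p></op>"
try:
    ET.fromstring(nested)
    nested_parsed = True
except ET.ParseError:
    nested_parsed = False
```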
Since the parser always treats the data between <![CDATA[ and ]]> as a plain string, there is no "standard" valid way to represent nested CDATA. Just encode the data and decode it afterwards. For example, we can URL-encode the data:
string xmlstr = @"<op><![CDATA[
<helloworld/>
someData%0A%3Cp%3E%0A%3C!%5BCDATA%5BsomeotherData%5D%5D%3E%0A%3C%2Fp%3E
]]></op>";
var xml = XElement.Parse(xmlstr);
var subxmlString=System.Web.HttpUtility.UrlDecode(xml.Value);
// wrap the decoded fragment in a root element so it has a single root
var subxml = XElement.Parse($"<root>{subxmlString}</root>");
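The encode-then-decode round trip can be sketched as follows. This is an illustrative Python version, not the .NET code: urllib.parse.quote and unquote stand in for HttpUtility, and the helper names wrap_encoded and unwrap_decoded are hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote, unquote

def wrap_encoded(fragment: str) -> str:
    """URL-encode an XML fragment so it can sit safely inside one CDATA section."""
    return "<op><![CDATA[" + quote(fragment) + "]]></op>"

def unwrap_decoded(document: str) -> ET.Element:
    """Read the CDATA text, URL-decode it, and parse it under a synthetic root."""
    encoded = ET.fromstring(document).text
    return ET.fromstring("<root>" + unquote(encoded) + "</root>")

inner = "someData<p><![CDATA[someotherData]]></p>"
document = wrap_encoded(inner)
sub = unwrap_decoded(document)
```

Because the inner ]]> is percent-encoded, the outer document contains only one CDATA terminator, and the inner CDATA section is parsed only after decoding.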

Related

Jmeter ForEach Controller failing to write variables to file in order retrieved

Jmeter ForEach Controller failing to write variables in original order correctly
I am executing a http request retrieving a json payload with an array of employees. For each record (employee) I need to parse the record for specific fields e.g. firstName, lastName, PersonId and write to a single csv file, incrementing a new row per record.
Unfortunately, the file created has two issues. The PersonId never gets written and secondly the sequence of the values is not consistent with the returned original values. Sometimes I get the same record for lastName with the wrong firstName and vice versa. Not sure if the two issues are related, I suspect my regular expression extract is wrong for a number.
JMeter setup (5.2.1):
Thread group
+ HTTP Request
++ JSON JMESPath Extractor
+ ForEach Controller
++ Regular Expression Extractor: PersonId
++ Regular Expression Extractor: firstName
++ Regular Expression Extractor: lastName
++ BeanShell PostProcessor
getWorker returns the following payload, which the JSON JMESPath Extractor handles:
{
"items" : [
{
"PersonId" : 398378,
"firstName" : "Sam",
"lastName" : "Shed"
},
{
"PersonId" : 398379,
"firstName" : "Bob",
"lastName" : "House"
}
],
"count" : 2,
"hasMore" : true,
"limit" : 2,
"offset" : 0,
"links" : [
{
"rel" : "self",
"href" : "https://a.site.on.the.internet.com/employees",
"name" : "employees",
"kind" : "collection"
}
]
}
JSON JMESPath Extractor Configuration
Name of created variables: items
JMESPath expressions: items
Match No. -1
Default Values: Not Found
ForEach Controller
ForEach Controller Configuration
Input variable prefix: items
Start Index: Empty
End Index: Empty
Output variable name: items
Add "_"? Checked
Each of the Regular Expression Extractors follows the same pattern as below.
Extract PersonId with Regular Expression
Apply to: Main Sample Only
Field to check: Body
Name of created variable: PersonId
Regular Expression: "PersonId":"(.+?)"
Template: $1$
Match No. Empty
Default Value: PersonId
The final step in the thread is where I write out the parsed results.
BeanShell PostProcessor
PersonNumber = vars.get("PersonNumber");
DisplayName = vars.get("DisplayName");
f = new FileOutputStream("/Applications/apache-jmeter-5.2.1/bin/scripts/getWorker/responses/myText.csv", true);
p = new PrintStream(f);
this.interpreter.setOut(p);
print(PersonId+", "+ PersonNumber+ ", " + DisplayName);
f.close();
I am new to this and looking either for someone to tell me where I screwed up or direct me to a place I can read up on the appropriate topics. (Both are fine). Thank you.
The ForEach Controller doesn't know the structure of the items variable, since it is in JSON format. It only understands an array of variables and traverses through them. I would suggest moving away from the ForEach Controller in your case and using the JSON Extractor itself for all the values, as below:
(JSON Extractor configurations for Person ID, First Name and Last Name.)
Beanshell Sampler Code
import java.io.FileOutputStream; // stream used to append to the CSV file
import java.io.PrintStream; // adds println on top of the stream
int matchNr = Integer.parseInt(vars.get("personId_C_matchNr"));
log.info("Match number is "+matchNr);
f = new FileOutputStream("myText.csv", true);
p = new PrintStream(f);
for (int i=1; i<=matchNr; i++){
PersonId = vars.get("personId_C_"+i);
FirstName = vars.get("firstName_C_"+i);
LastName = vars.get("lastName_C_"+i);
log.info("Iteration is "+i);
log.info("Person ID is "+PersonId);
log.info("First Name is "+FirstName);
log.info("Last Name is "+LastName);
p.println(PersonId+", "+FirstName+", "+LastName);
}
p.close();
f.close();
Output File
HOW THE ABOVE ACTUALLY WORKS
When you extract values using the matchNr, it goes in a sequential order in which the response has arrived. For example, in your case, Sam & Shed appear as first occurrences and Bob & House appear as subsequent occurrences. Hence JMeter captures them with the corresponding match and stores them as 1st First Name = Sam, 2nd First Name = Bob and so on.
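The pairing by match number can be sketched in Python. This is only a simulation: the dictionary below is a hypothetical stand-in for JMeter's vars object, shaped like what an extractor with Match No. -1 produces, and rows_from_indexed_vars is an illustrative helper name.

```python
import csv
import io

def rows_from_indexed_vars(jmeter_vars: dict, prefixes: list) -> list:
    """Rebuild one row per match from name_1..name_N style variables."""
    count = int(jmeter_vars[prefixes[0] + "_matchNr"])
    return [
        [jmeter_vars[f"{prefix}_{i}"] for prefix in prefixes]
        for i in range(1, count + 1)
    ]

# Hypothetical variable map, mimicking the extractor output for the sample payload.
jmeter_vars = {
    "personId_C_matchNr": "2",
    "personId_C_1": "398378", "personId_C_2": "398379",
    "firstName_C_1": "Sam", "firstName_C_2": "Bob",
    "lastName_C_1": "Shed", "lastName_C_2": "House",
}
rows = rows_from_indexed_vars(jmeter_vars, ["personId_C", "firstName_C", "lastName_C"])

# Write the rows as CSV into an in-memory buffer instead of a real file.
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(rows)
csv_text = buffer.getvalue()
```

Because every field is read with the same index i, the values of one record always stay together in one row.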
GENERIC STUFF
The regex expression for capturing Person ID which you have used seems to be inaccurate. The appropriate one would be
"PersonId" :(.+?),
and not
"PersonId":"(.+?)"
Move to JSR223 processors instead of Beanshell as they are more performant. Source: Which one is efficient : Java Request, JSR223 or BeanShell Sampler for my script. The migration is pretty simple: just copy the code you have in Beanshell and paste it into JSR223.
Close any stream or writer that is open; otherwise it might cause issues when other threads try to write to the file during a load test.
If you plan to use this file as a subsequent input within JMeter, note that there is a space between the comma and the next element. For example, it is "Sam, Shed" and not "Sam,Shed". JMeter by default does not trim spaces and will use the value as is, so decide deliberately whether you want that space.
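The regex point above can be checked quickly with Python's re module, used here only as a stand-in for JMeter's regex engine. The third pattern is my own assumption of a tighter variant that captures digits only; it is not part of the answer.

```python
import re

# Two lines of the response body, formatted as in the payload above.
body = '"PersonId" : 398378,\n"firstName" : "Sam",'

# Pattern from the question: expects quotes around the value and no spaces.
original = re.search(r'"PersonId":"(.+?)"', body)
# Pattern suggested in the answer: matches, but the capture keeps a leading space.
suggested = re.search(r'"PersonId" :(.+?),', body)
# Assumed tighter variant: tolerate whitespace and capture digits only.
tighter = re.search(r'"PersonId"\s*:\s*(\d+)', body)
```

The original pattern finds nothing because the number is not quoted, which is consistent with PersonId never being written to the file.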
Hope this helps!
Since JMeter 3.1 you shouldn't be using Beanshell, go for JSR223 Test Elements and Groovy language for scripting.
Given Groovy has built-in JSON support you shouldn't need any extractors, you can write the data into a file in a single shot like:
new groovy.json.JsonSlurper().parse(prev.getResponseData()).items.each { item ->
new File('myText.csv') << item.get('PersonId') << ',' << item.get('firstName') << ',' << item.get('lastName') << System.getProperty('line.separator')
}
More information: Apache Groovy - Why and How You Should Use It
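For comparison, here is the same "parse once, write every record" idea as a minimal Python sketch. The function name employees_to_csv_lines is hypothetical and the response body is a shortened copy of the payload from the question.

```python
import json

def employees_to_csv_lines(response_body: str) -> list:
    """Parse the JSON once and build one comma-separated line per employee."""
    items = json.loads(response_body)["items"]
    return [f'{item["PersonId"]},{item["firstName"]},{item["lastName"]}' for item in items]

response_body = '''
{"items": [
  {"PersonId": 398378, "firstName": "Sam", "lastName": "Shed"},
  {"PersonId": 398379, "firstName": "Bob", "lastName": "House"}
], "count": 2}
'''
lines = employees_to_csv_lines(response_body)
```

Each line is built from a single parsed item, so the fields of different employees cannot get mixed up.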

How to remove non-ascii char from MQ messages with ESQL

CONCLUSION:
For some reason the flow wouldn't let me convert the incoming message to a BLOB by changing the Message Domain property of the Input Node so I added a Reset Content Descriptor node before the Compute Node with the code from the accepted answer. On the line that parses the XML and creates the XMLNSC Child for the message I was getting a 'CHARACTER:Invalid wire format received' error so I took that line out and added another Reset Content Descriptor node after the Compute Node instead. Now it parses and replaces the Unicode characters with spaces. So now it doesn't crash.
Here is the code for the added Compute Node:
CREATE FUNCTION Main() RETURNS BOOLEAN
BEGIN
DECLARE NonPrintable BLOB X'0001020304050607080B0C0E0F101112131415161718191A1B1C1D1E1F7F808182838485868788898A8B8C8D8E8F909192939495969798999A9B9C9D9E9FA0A1A2A3A4A5A6A7A8A9AAABACADAEAFB0B1B2B3B4B5B6B7B8B9BABBBCBDBEBFC0C1C2C3C4C5C6C7C8C9CACBCCCDCECFD0D1D2D3D4D5D6D7D8D9DADBDCDDDEDFE0E1E2E3E4E5E6E7E8E9EAEBECEDEEEFF1F2F3F4F5F6F7F8F9FAFBFCFDFEFF';
DECLARE Printable BLOB X'20202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020';
DECLARE Fixed BLOB TRANSLATE(InputRoot.BLOB.BLOB, NonPrintable, Printable);
SET OutputRoot = InputRoot;
SET OutputRoot.BLOB.BLOB = Fixed;
RETURN TRUE;
END;
UPDATE:
The message is being parsed as XML using XMLNSC. Thought that would cause a problem, but it does not appear to be.
Now I'm using PHP. I've created a node to plug into the legacy flow. Here's the relevant code:
class fixIncompetence {
function evaluate ($output_assembly,$input_assembly) {
$output_assembly->MRM = $input_assembly->MRM;
$output_assembly->MQMD = $input_assembly->MQMD;
$tmp = htmlentities($input_assembly->MRM->VALUE_TO_FIX, ENT_HTML5|ENT_SUBSTITUTE,'UTF-8');
if (!empty($tmp)) {
$output_assembly->MRM->VALUE_TO_FIX = $tmp;
}
// Ensure there are no null MRM fields. MessageBroker is strict.
foreach ($output_assembly->MRM as $key => $val) {
if (empty($val)) {
$output_assembly->MRM->$key = '';
}
}
}
}
Right now I'm getting a vague error about read only messages, but before that it wasn't working either.
Original Question:
For some reason I am unable to impress upon the senders of our MQ
messages that smart quotes, endashes, emdashes, and such crash our XML
parser.
I managed to make a working solution with SQL queries, but it wasted
too many resources. Here's the last thing I tried, but it didn't work
either:
CREATE FUNCTION CLEAN(IN STR CHAR) RETURNS CHAR BEGIN
SET STR = REPLACE('–',STR,'&ndash;');
SET STR = REPLACE('—',STR,'&mdash;');
SET STR = REPLACE('·',STR,'&middot;');
SET STR = REPLACE('“',STR,'&ldquo;');
SET STR = REPLACE('”',STR,'&rdquo;');
SET STR = REPLACE('‘',STR,'&lsqo;');
SET STR = REPLACE('’',STR,'&rsquo;');
SET STR = REPLACE('•',STR,'&bull;');
SET STR = REPLACE('°',STR,'&deg;');
RETURN STR;
END;
As you can see I'm not very good at this. I have tried reading about
various ESQL string functions without much success.
In ESQL you can use the TRANSLATE function.
The following is a snippet I use to clean up a BLOB containing non-ASCII low hex values so that it can then be cast into a usable character string.
You should be able to modify it to change your undesired characters into something more benign. Basically, each hex value in NonPrintable gets translated into its positional equivalent in Printable, in this case always a full stop, i.e. X'2E' in ASCII. You'll need to make your BLOBs long enough to cover the desired range of hex values.
DECLARE NonPrintable BLOB X'000102030405060708090A0B0C0D0E0F101112131415161718191A1B1C1D1E1F202122232425262728292A2B2C2D2E2F303132333435363738393A3B3C3D3E3F';
DECLARE Printable BLOB X'2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E';
SET WorkBlob = TRANSLATE(WorkBlob, NonPrintable, Printable);
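The positional mapping is easier to see in a small sketch. The Python below is only an analogy to the ESQL TRANSLATE call, using bytes.maketrans and bytes.translate; the choice of bytes to replace (C0 control characters except tab, LF and CR) is an assumption for the example.

```python
# Bytes to replace: C0 control characters except tab (09), LF (0A) and CR (0D).
NON_PRINTABLE = bytes(b for b in range(0x20) if b not in (0x09, 0x0A, 0x0D))
# Same length as NON_PRINTABLE: every position maps to a full stop (0x2E).
PRINTABLE = b"." * len(NON_PRINTABLE)
# Byte i of NON_PRINTABLE is translated to byte i of PRINTABLE.
TABLE = bytes.maketrans(NON_PRINTABLE, PRINTABLE)

def clean(blob: bytes) -> bytes:
    """Replace each listed byte with its positional counterpart."""
    return blob.translate(TABLE)

cleaned = clean(b"AB\x00C\x1fD\n")
```

As in the ESQL version, the two byte strings must be the same length so that every source byte has a counterpart.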
BTW, if messages with invalid characters only come in every now and then, I'd probably specify BLOB on the input node and then use something similar to the following to invoke the XMLNSC parser:
CREATE LASTCHILD OF OutputRoot DOMAIN 'XMLNSC'
PARSE(InputRoot.BLOB.BLOB CCSID InputRoot.Properties.CodedCharSetId ENCODING InputRoot.Properties.Encoding);
With the exception terminal wired up, you can then correct the BLOBs of any messages containing parser-breaking invalid characters before attempting to reparse.
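The "parse first, clean only on failure" flow can be sketched like this. It is a Python illustration under stated assumptions: xml.etree.ElementTree stands in for the XMLNSC parser, a caught ParseError stands in for the exception terminal, and parse_with_fallback is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Control bytes that are not allowed in XML 1.0 content, mapped to spaces.
CONTROL = bytes(b for b in range(0x20) if b not in (0x09, 0x0A, 0x0D))
TABLE = bytes.maketrans(CONTROL, b" " * len(CONTROL))

def parse_with_fallback(blob: bytes):
    """Return (root, was_cleaned); clean the bytes only if the first parse fails."""
    try:
        return ET.fromstring(blob), False
    except ET.ParseError:
        return ET.fromstring(blob.translate(TABLE)), True

good_root, good_cleaned = parse_with_fallback(b"<msg>fine</msg>")
bad_root, bad_cleaned = parse_with_fallback(b"<msg>bro\x01ken</msg>")
```

Well-formed messages take the cheap path untouched, and only the occasional bad message pays for the cleanup and the second parse.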
Finally, my best wishes, as I've had a number of battles over the years with being forced to correct invalid message content in the "Integration Layer"; after all, that's what it's meant to do.

Groovy - searching for and extracting XML code from a log file

My log file contains a lot of text, but sometimes a response is logged as XML, and I have to cut this XML out and move it to another file.
For example:
sThread1....dsadasdsadsadasdasdasdas.......dasdasdasdadasdasdasdadadsada
important xml code to cut and move to other file: <response><important> 1 </import...></response>
important xml code to other file: <response><important> 2 </important...></response>
sThread2....dsadasdsadsadasdasdasdas.......dasdasdasdadasdasdasdadadsada
Difficulty: the XML does not always start at the same character position in the line.
Please help me find a way to locate the XML within the text.
So far I have tried the substring() method, but the XML does not always start at the same position :(
EDIT:
I found what I wanted: the function I was looking for is indexOf().
I needed the character position at which the string "Response is : " ends, so I used:
int positionOfXmlInLine = lineTxt.indexOf("<response")
After this I can cut the string from that position to the end of the line:
def cuttedText = lineTxt.substring(positionOfXmlInLine);
So now I have only the XML text from the log file.
The next step is parsing the XML values, as BDKosher wrote below.
Hopefully that will help someone.
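The find-then-cut approach from the EDIT can be sketched as below. This is a Python version in which str.find plays the role of indexOf; the function name extract_xml and the sample lines are illustrative. The sketch adds a check for lines without XML, where the search returns -1.

```python
def extract_xml(line: str, marker: str = "<response"):
    """Return the text from the first marker to the end of the line, or None."""
    position = line.find(marker)  # -1 when the marker is absent
    if position == -1:
        return None
    return line[position:]

with_xml = extract_xml("Response is : <response><important>1</important></response>")
without_xml = extract_xml("sThread1....dsadasdsadsadasd")
```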
You might be able to leverage XmlSlurper for this, assuming your XML is valid enough. The code below will take each line of the log, wrap it in a root element, and parse it. Once parsed, it extracts and prints out the value of the <important> element's value attribute, but instead you could do whatever you need to do with the data:
def input = '''
sThread1..sdadassda..sdadasdsada....sdadasdas...
important code to cut and move to other file: <response><important value="1"></important></response>
important code to other file: <response><important value="3"></important></response>
sThread2..dsadasd.s.da.das.d.as.das.d.as.da.sd.a.
'''
def parser = new XmlSlurper()
input.eachLine { line, lineNo ->
def output = parser.parseText("<wrapper>$line</wrapper>")
if (!output.response.isEmpty()) {
println "Line $lineNo is of importance ${output.response.important.@value.text()}"
}
}
This prints out:
Line 2 is of importance 1
Line 3 is of importance 3
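The wrap-and-parse idea can also be sketched in Python, with xml.etree.ElementTree standing in for XmlSlurper (an assumption for illustration only). The try/except is an addition of mine: it skips lines that contain stray markup characters and are therefore not parseable.

```python
import xml.etree.ElementTree as ET

log = '''
sThread1..sdadassda..sdadasdsada....sdadasdas...
important code to cut and move to other file: <response><important value="1"></important></response>
important code to other file: <response><important value="3"></important></response>
sThread2..dsadasd.s.da.das.d.as.das.d.as.da.sd.a.
'''

def important_values(text: str) -> list:
    """Wrap each line in a root element, parse it, and collect the value attributes."""
    found = []
    for line_no, line in enumerate(text.splitlines()):
        try:
            wrapper = ET.fromstring("<wrapper>" + line + "</wrapper>")
        except ET.ParseError:
            continue  # the line is not well-formed once wrapped; skip it
        node = wrapper.find("response/important")
        if node is not None:
            found.append((line_no, node.get("value")))
    return found

results = important_values(log)
```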

Preprocessing Scala parser Reader input

I have a file containing a text representation of an object. I have written a combinator parser grammar that parses the text and returns the object. In the text, "#" is a comment delimiter: everything from that character to the end of the line is ignored. Blank lines are also ignored. I want to process text one line at a time, so that I can handle very large files.
I don't want to clutter up my parser grammar with generic comment and blank line logic. I'd like to remove these as a preprocessing step. By converting the file to an iterator over lines, I can do something like this:
Source.fromFile("file.txt").getLines.map(_.replaceAll("#.*", "").trim).filter(!_.isEmpty)
How can I pass the output of an expression like that into a combinator parser? I can't figure out how to create a Reader object out of a filtered expression like this. The Java FileReader interface doesn't work that way.
Is there a way to do this, or should I put my comment and blank line logic in the parser grammar? If the latter, is there some util.parsing package that already does this for me?
The simplest way to do this is to use the fromLines method on PagedSeq:
import scala.collection.immutable.PagedSeq
import scala.io.Source
import scala.util.parsing.input.PagedSeqReader
val lines = Source.fromFile("file.txt").getLines.map(
_.replaceAll("#.*", "").trim
).filterNot(_.isEmpty)
val reader = new PagedSeqReader(PagedSeq.fromLines(lines))
And now you've got a scala.util.parsing.input.Reader that you can plug into your parser. This is essentially what happens when you parse a java.io.Reader, anyway—it immediately gets wrapped in a PagedSeqReader.
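The lazy preprocessing step itself is language-independent. As a rough sketch in Python (not Scala, and not using PagedSeq), a generator can strip comments and blank lines one line at a time, so the whole file never has to be in memory; strip_comments is a hypothetical name and the in-memory StringIO stands in for a real file.

```python
import io
import re

def strip_comments(lines):
    """Lazily drop '#' comments and blank lines from an iterable of lines."""
    for line in lines:
        cleaned = re.sub(r"#.*", "", line).strip()
        if cleaned:
            yield cleaned

# In-memory stand-in for a file opened for reading.
source = io.StringIO("name = a  # trailing comment\n\n# full-line comment\nsize = 3\n")
cleaned_lines = list(strip_comments(source))
```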
Not the prettiest code you'll ever write, but you could go through a new Source as follows:
val SEP = System.getProperty("line.separator")
def lineMap(fileName : String, trans : String=>String) : Source = {
Source.fromIterable(
Source.fromFile(fileName).getLines.flatMap(
line => trans(line) + SEP
).toIterable
)
}
Explanation: flatMap will produce an iterator on characters, which you can turn into an Iterable, which you can use to build a new Source. You need the extra SEP because getLines removes it by default (using \n may not work as Source will not properly separate the lines).
If you want to apply filtering too, i.e. remove some of the lines, you could for instance try:
// whenever `trans` returns `None`, the line is dropped.
def lineMapFilter(fileName : String, trans : String=>Option[String]) : Source = {
Source.fromIterable(
Source.fromFile(fileName).getLines.flatMap(
line => trans(line).map(_ + SEP).getOrElse("")
).toIterable
)
}
As an example:
lineMapFilter("in.txt", line => if(line.isEmpty) None else Some(line.reverse))
...will remove empty lines and reverse non-empty ones.
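The map-and-filter contract of lineMapFilter (drop a line when the transformation returns nothing, otherwise re-append the separator) can be sketched in Python as follows. The name line_map_filter and the use of None in place of Scala's Option are assumptions of this sketch.

```python
from typing import Callable, Iterable, Iterator, Optional

def line_map_filter(lines: Iterable[str],
                    trans: Callable[[str], Optional[str]]) -> Iterator[str]:
    """Apply trans to each line; drop the line when trans returns None."""
    for line in lines:
        result = trans(line.rstrip("\n"))
        if result is not None:
            yield result + "\n"  # restore the separator removed above

# Remove empty lines and reverse the non-empty ones, as in the example above.
output = "".join(line_map_filter(["abc\n", "\n", "de\n"],
                                 lambda s: s[::-1] if s else None))
```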

XML Parsing - node.text method removing trailing spaces

I have a big XML file which I'm parsing using JScript. I have used the following code to load the XML:
var xmlDoc = Sys.OleObject("Msxml2.DOMDocument.6.0");
xmlDoc.async = false;
// Load xml data from a file
xmlDoc.load(this._studyDocPath);
Now if I use the following code
var node = this.xmlDoc.selectSingleNode(xPath);
var text = node.text;
the text variable holds the inner text of a particular tag. But if I have a tag like this
<Text>ABCD </Text>
then node.text returns only the value 'ABCD', i.e. it automatically trims the space. But I don't want any trailing spaces trimmed; I need the text as it is. How can I achieve that?
Looking forward to your response.
Thanks in advance.
We can use node.firstChild.nodeValue with a null check on node.firstChild
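The firstChild.nodeValue pattern with its null check can be sketched as below. This uses Python's xml.dom.minidom as a stand-in for MSXML, so it illustrates the DOM pattern only and says nothing about MSXML's own whitespace handling; raw_text is a hypothetical helper name.

```python
from xml.dom import minidom

def raw_text(node):
    """Return the first child's value untouched, or '' when there is no child."""
    first = node.firstChild
    return first.nodeValue if first is not None else ""

doc = minidom.parseString("<Root><Text>ABCD </Text><Empty/></Root>")
text_value = raw_text(doc.getElementsByTagName("Text")[0])
empty_value = raw_text(doc.getElementsByTagName("Empty")[0])
```

The check matters for empty elements, which have no text child at all.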
