COBOL accurate LENGTH OF (XML-TEXT) when ampersands escaped? - xml-parsing

I'm using the intrinsic function XML-PARSE with XML that looks like this:
<MSGBODYTXT>
<LN>One & two</LN>
</MSGBODYTXT>
By my count, the following string is 13 bytes long.
"One & two"
But when I take $
LENGTH OF(XML-TEXT) $
I get only 9 bytes.
What can I do to get the correct, 13-byte length?

The problem is that XML PARSE translates & into the character it represents, an ampersand. If
you look at the CONTENT-CHARACTERS associated with the <LN> tag you will see: One & two which
is 9 characters long, just as the LENGTH OF operator on XML-TEXT says it is.
Note that if you were to use XML GENERATE on data item LN having the value One & two
it will generate as <LN>One & two</LN> which is a symetrical operation.

Related

is there a way to convert an integer to be always a 4 digit hex number using Lua

I'm creating a Lua script which will calculate a temperature value then format this value as a 4 digit hex number which must always be 4 digits. Having the answer as a string is fine.
Previously in C I have been able to use
data_hex=string.format('%h04x', -21)
which would return ffeb
however the 'h' string formatter is not available to me in Lua
dropping the 'h' doesn't cater for negative answers i.e
data_hex=string.format('%04x', -21)
print(data_hex)
which returns ffffffeb
data_hex=string.format('%04x', 21)
print(data_hex)
which returns 0015
Is there a convenient and portable equivalent to the 'h' string formatter?
I suggest you try using a bitwise AND to truncate any leading hex digits for the value being printed.
If you have a variable temp that you are going to print then you would use something like data_hex=string.format("%04x",temp & 0xffff) which would remove the leading hex digits leaving only the least significant 4 hex digits.
I like this approach as there is less string manipulation and it is congruent with the actual data type of a signed 16 bit number. Whether reducing string manipulation is a concern would depend on the rate at which the temperature is polled.
For further information on the format function see The String Library article.

How do you define the length of a parameter in ESC/POS?

I need to be able to print Hebrew characters on my Epson TM-T20ii. I am trying to get my printer to switch to character code page 36(PC862) using
ESC t36
for some reason the printer is switching to code page 3 and then printing the number 6.
Is there a way to let the printer know that the 6 is part of my command?
If you know of a different workaround please comment below.
Thanks
You are making a mistake, you aren't meant to replace n with an actual number.
The proper syntax in your case would be ←t$
Explanation: the manual says "ESC t n", n referring to the page sheet, however you don't replace n with a number rather with the ASCII character n, so in your example 36 = $ because $ is the 36th character on the ASCII table.

Character Encoding not resolved

I have a text file with unknown character formatting, below is a snapshot
\216\175\217\133\217\136\216\185 \216\167\217\132\217\133\216\177\216\163\216\169 \216\163\217\130\217\136\217\137 \217\134\217\129\217\136\216\176\216\167\217\139 \217\133\217\134 \216\167\217\132\217\130\217\136\216\167\217\134\217\138\217\134
Anyone has an idea how can I convert it to normal text?
This is apparently how Lua stores strings. Each \nnn represents a single byte where nnn is the byte's value in decimal. (A similar notation is commonly used for octal, which threw me off for longer than I would like to admit. I should have noticed that there were digits 8 and 9 in the data!) This particular string is just plain old UTF-8.
$ perl -ple 's/\\(\d{3})/chr($1)/ge' <<<'\216\175\217\133\217\136\216\185 \216\167\217\132\217\133\216\177\216\163\216\169 \216\163\217\130\217\136\217\137 \217\134\217\129\217\136\216\176\216\167\217\139 \217\133\217\134 \216\167\217\132\217\130\217\136\216\167\217\134\217\138\217\134'
دموع المرأة أقوى نفوذاً من القوانين
You would obviously get a similar result simply by printing the string from Lua, though I'm not familiar enough with the language to tell you how exactly to do that.
Post scriptum: I had to look this up for other reasons, so here's how to execute Lua from the command line.
lua -e 'print("\216\175\217\133\217\136\216\185 \216\167\217\132\217\133\216\177\216\163\216\169 \216\163\217\130\217\136\217\137 \217\134\217\129\217\136\216\176\216\167\217\139 \217\133\217\134 \216\167\217\132\217\130\217\136\216\167\217\134\217\138\217\134")'

Regex for clearly single line words with wildcards in Swift

I'm attempting to construct a regex string in Swift 4 that gets characters at the start of a line where some are known and others aren't.
Let's say I've got a text file with line breaks for each word that reads as follows:
pucker
tuckered
duckerdinger
sucker punch
I'd like to get every word that contains "cker" in it that's 1 to 8 characters long.
I'm attempting to use this statement ^..cker..{1,8} as my RegEx string. All I'm getting is a partial match in Patterns (a Mac App), but Regex101.com's saying no match, and most importantly, Xcode says I'm using an invalid regex. I've also tried ^(..cker..) and a bazillion other variations.
What am I screwing up and how do I fix it? What I'm trying to do seems like it would be super simple, but I've wasted more time than I care to admit fiddling with it.
Update:
This has been the best I've been able to get so far...
"\\b..cker..", but I'm only able to get words that are exactly 8 characters long. I'd like to capture words that contain "cker" that are the 3rd, 4th, 5th, and 6th letters while capturing words up to 8 characters long.
Try this regex:
\b(?=.*cker)[a-zA-Z]{1,8}\b
Click for Demo
Explanation:
\b - matches a word boundary
(?=.*cker) - Positive Lookahead to make sure our string should contain the character sequence cker
[a-zA-Z]{1,8} - Matches 1 to 8 occurrences of a letter
\b - matches a word boundary

list of garbage characters like ’

I am using librets to retrieve data form my RETS Server. Somehow librets Encoding method is not working and I am receiving some weird characters in my output. I noticed characters like '’' is replaced with ’. I am unable to find a fix for librets so i decided to replace such garbage characeters with actual values after downloading data. What I need is a list of such garbage string and their equivalent characters. I googled for this but not found any resource. Can anyone point me to the list of such garbage letters and their actual values or a piece of code which can generate such letter.
thanx
Search for the term "UTF-8", because that's what you're seeing.
UTF-8 is a way of representing Unicode characters as a sequence of bytes. ("Unicode characters" are the full range of letters and symbols used all in human languages.) Typically, one Unicode character becomes 1, 2, or 3 bytes in UTF-8. When those bytes (numbers from 0 to 255) are displayed using the character set normally used by Windows, they appear as "garbage" -- in this case, 3 "garbage letters" which are really the 3 bytes of a UTF-8 encoding.
In your example, you started with the smart quote character ’. Its representation in Unicode is the number 8217, or U+2019 (2019 is the hexadecimal for 8217). (Search for "Unicode" for a complete list of Unicode characters and their numbers.) The UTF-8 representation of the number 8217 is the three byte sequence 226, 128, 153. And when you display those three bytes as characters, using the Windows "CP-1252" character encoding (the ordinary way of displaying text on Windows in the USA), they appear as ’. (Search for "CP-1252" to see a table of bytes and characters.)
I don't have any list for you. But you could make one if you wrote a program in a language that has built-in support for Unicode and UTF-8. All I can do is explain what you are seeing.
If there is a way to tell librets to use UTF-8 when downloading, that might automatically solve your problem. I don't know anything about librets, but now that you know the term "UTF-8" you might be able to make progress.
Question reminder:
"...I noticed characters like '’' is replaced with ’... i decided to
replace such garbage characeters with actual values after downloading
data. What I need is a list of such garbage string and their
equivalent characters."
Strictly dealing with this part:
"What I need is a list of such garbage string and their equivalent
characters."
Using php, you can generate these characters and their equivalence. Working with all 1,111,998 Unicode points or 109,449 Utf8 symbols is impractical. You may use the ASCII range in the following loop between &#128 and &#258 or another range that is more relevant to your context.
<?php
for ($i=128; $i<258; $i++)
$tmp1 .= "<tr><td>".htmlentities("&#$i;")."</td><td>".html_entity_decode("&#".$i.";",ENT_NOQUOTES,"utf-8")."</td><td>&#".$i.";</td></tr>";
echo "<table border=1>
<tr><td>&#</td><td>"Garbage"</td><td>symbol</td></tr>";
echo $tmp1;
echo "</table>";
?>
From experience, in an ASCII context, most "garbage" symbols originate in the range &#128 to &#257 + (seldom) &#8129 to &#8246.
In order for the "garbage" symbols to display, the html page charset must be set to iso-1 or whichever other charset that caused the problem in the first place. They will not show if the charset is set to utf-8.
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
.
"i decided to replace such garbage characeters with actual values
after downloading data"
You CANNOT undo the "garbage" with php utf8_decode(), which would actually create more "garbage" on already "garbage". But, you may use the simple and fast search and replace php str_replace() function.
First, generate 2 arrays for each set of "garbage" symbols you wish to replace. The first array is the Search term:
<?php
//ISO 8859-1 (Latin-1) special chars are found in the range 128 to 257
$tmp1 = "\$SearchArr = array(";
for ($i=128; $i<258; $i++)
$tmp1 .= "\"".html_entity_decode("&#".$i.";",ENT_NOQUOTES,"utf-8")."\", ";
$tmp1 = substr($tmp1,0,strlen($tmp1)-2);//erases last comma
$tmp1 .= ");";
$tmp1 = htmlentities($tmp1,ENT_NOQUOTES,"utf-8");
?>
The second array is the replace term:
<?php
//Adapt for your relevant range.
$tmp2 = "\$ReplaceArr = array(\n";
for ($i=128; $i<258; $i++)
$tmp2 .= "\"&#".$i.";\", ";
$tmp2 = substr($tmp2,0,strlen($tmp2)-2);//erases last comma
$tmp2 .= ");";
echo $tmp1."\n<br><br>\n";
echo $tmp2."\n";
?>
Now, you've got 2 arrays that you can copy and paste to use and reuse to clean any of your infected strings like this:
$InfectedString = str_replace($SearchArr,$ReplaceArr,$InfectedString);
Note: utf8_decode() is of no help for cleaning up "garbage" symbols. But, it can be used to prevent further contamination. Alternatively a mb_ function can be useful.

Resources