format-number() with picture string '#.##' - xslt-2.0

I've been using the picture string '#.##' in XSLT 1.0 code for many years to format numbers with decimals when they have a fractional part, and without decimals when they don't.
Saxon 6.5.5.
format-number(123.456, '#.##') returns 123.46
format-number(123.000, '#.##') returns 123
But when using XSLT 2.0 or 3.0, I'm getting different behavior.
Saxon 9.9.1.5J
format-number(123.456, '#.##') returns 123.46
format-number(123.000, '#.##') returns 123.0
I haven't been able to figure out why the last case has the fractional part. https://www.w3.org/TR/xslt20/#dt-picture-string states:
The minimum-fractional-part-size is set to the number of
zero-digit-sign characters found in the fractional part of the
sub-picture.
https://www.w3.org/TR/xpath-functions-30/#syntax-of-picture-string states:
The minimum-fractional-part-size is set to the number of
decimal-digit-family characters found in the fractional part of the
sub-picture.
And with no zero-digit-sign and no decimal-digit-family characters, the fractional size should be zero, like it is in the XSLT 1.0 case.
Am I missing something?

Saxon is applying the rule:
If (after making the above adjustments) the minimum-integer-part-size
and the minimum-fractional-part-size are both zero, then the
minimum-fractional-part-size is set to 1 (one).
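To see how that adjustment plays out for '#.##' (and why a zero digit anywhere in the picture stops it firing), here is a rough Python sketch of just the size computation; it is an illustration, not Saxon's code, and the results in the comments follow from the quoted rules rather than from testing a processor:
def min_sizes(subpicture):
    # count zero-digit / decimal-digit-family characters on each side of the '.'
    integer_part, _, fractional_part = subpicture.partition(".")
    min_int = sum(ch.isdigit() for ch in integer_part)
    min_frac = sum(ch.isdigit() for ch in fractional_part)
    if min_int == 0 and min_frac == 0:
        min_frac = 1  # the late-added rule that Saxon applies in XSLT 2.0/3.0
    return min_int, min_frac
print(min_sizes("#.##"))   # (0, 1): 123.000 keeps one fractional digit -> "123.0"
print(min_sizes("0.##"))   # (1, 0): the rule no longer fires -> "123"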
The XSLT 1.0 spec for format-number() was defined by reference to the JDK 1.1 specification, which was a bad mistake because (a) it was always grossly underspecified in the JDK, and (b) that version of the JDK specification is now almost impossible to get hold of. So we spent many happy hours writing a new spec from scratch that tried to reproduce the JDK spec in cases where it was clear and unbuggy, but filled in the gaps. We took into account the way that different XSLT 1.0 vendors had interpreted the spec, but didn't regard any existing implementations (or the JDK) as definitive.
Your test case is similar to test case numberformat228 at https://github.com/w3c/qt3tests/blob/master/fn/format-number.xml, and the notes for this test case refer to bug 29164. The bug entry at https://www.w3.org/Bugs/Public/show_bug.cgi?id=29164 reveals that the rule cited above was a fairly late addition/change to the spec (Sept 2015 - long after XPath 2.0).
The discussion in the bug entry suggests that the primary consideration was to prevent zero being formatted as an empty string. Perhaps, with hindsight, we should have treated that case specially.

Related

Add to zero...What is it for?

Why is such code used in some applications instead of a MOVE?
add 16 to ZERO giving SOME-RESULT
I spotted this in professionally written code at several spots.
Source is on this page.
Without seeing more of the code, it appears that it could be a translation of IBM Assembler to COBOL. In particular, the ZAP (Zero and Add Packed) instruction may be literally translated to the above instruction, particularly if SOME-RESULT is COMP-3. Thus, someone checking the translation could see that the ZAP instruction was faithfully translated.
Or, it could be an assembler programmer's idea of a joke.
Having seen the code, I also note the use of
subtract some-data-item from some-data-item
which is used instead of
move zero to some-data-item
This is consistent with operations used with packed decimal fields in IBM Assembler, where there are no other instructions to accomplish "flexible" moves. By flexible, I mean that the packed decimal instructions contain a length field, so that specific-size MVC instructions need not be used.
This particular style, being unusual, may be related to catching copyright violations.
From my experience, I'm pretty sure I know the reason why the programmer would have done this. It has something to do with the binary representation of the number.
I bet SOME-RESULT is a packed-decimal (or COMP-3) format number. Let's assume the field is defined like this
05 SOME-RESULT PIC S9(5) COMP-3.
This results in a 3-byte field with a hex representation like this
x'00016C'
The decimal number is encoded as binary-coded decimal (BCD, one decimal digit per half-byte), and the last half-byte holds the sign.
Let's take a look at how the sign is defined:
if it is one of x'C', x'A', x'F', x'E' (café), then the number is positive
if it is one of x'B', x'D', then the number is negative
any of x'0'..x'9' are not valid signs, so we can distinguish signed packed decimals from unsigned ones.
However, a zoned number (PIC 9(5) DISPLAY) - as in the source code - looks like this:
x'F0F0F0F1F6'
As you can see, each decimal digit is an EBCDIC character with the 'zone' part (the first half-byte) always being x'F'.
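You can reproduce that zoned representation with any EBCDIC codec; for example, in Python (cp037 is one EBCDIC code page):
print("00016".encode("cp037").hex().upper())   # F0F0F0F1F6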
Now we get closer to your question!
What happens when we use
MOVE 16 TO SOME-RESULT
If you just MOVE a number to such a field, this is compiled into a PACK instruction at the machine code level.
PACK SOME-RESULT,=C'16'
A PACK instruction takes a zoned number and packs it by picking only the second half-byte of each byte and storing it in the half-bytes of the packed number - with one exception! When it comes to the last byte, it simply swaps the two half-bytes and stores them in the last byte of the packed number.
This means that the zone of the last byte of the zoned decimal becomes the sign in the packed decimal:
x'00016F'
So now we have an x'F' as the sign – which is a valid positive sign.
However, what happens if we use this COBOL statement instead
ADD 16 TO ZERO GIVING SOME-RESULT
This compiles into multiple machine level instructions
PACK SOME_RESULT,=C'0'
PACK TEMP,=C'16'
AP SOME_RESULT,TEMP
(or similar - the key point is that it needs an AP somewhere)
This makes a slight difference in the result, because the AP (add packed) instruction always sets the resulting sign to either x'C' for a positive or x'D' for a negative result.
So the difference lies in the sign
x'00016C'
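To make the two code paths concrete, here is a small Python sketch; pack and add_packed are made-up helpers that only mimic what PACK and AP leave in the sign nibble of this one field size.
def pack(digits):
    # PACK keeps the zone of the last byte as the sign half-byte;
    # for unsigned DISPLAY data that zone is x'F'
    return digits.rjust(5, "0") + "F"
def add_packed(a, b):
    # AP always sets the result sign to x'C' (positive) or x'D' (negative)
    total = int(a[:-1]) + int(b[:-1])
    sign = "C" if total >= 0 else "D"
    return str(abs(total)).rjust(5, "0") + sign
print(pack("16"))                         # 00016F <- MOVE 16 TO SOME-RESULT
print(add_packed(pack("0"), pack("16")))  # 00016C <- ADD 16 TO ZERO GIVING SOME-RESULT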
Finally, the question is why would one make this difference? After all, both x'F' and x'C' are valid positive signs. So why care?
There is one situation when this slight difference can cause big problems: When the packed decimal is part of an index key, then we would not get a match, even though the numbers are semantically identical!
Because this situation occurred quite often in older databases like VSAM and DL/I (later: IMS/DB), it became good practice to "normalize" packed decimals if they were part of an index key.
However, some programmers adopted the practice without knowing why, so you may come across code that uses this "normalization" even though the data are not used for index keys.
You might also wonder why a compiler does not optimize out the ADD 16 TO ZERO. I'm pretty sure it once did, but that broke a lot of applications, so this specific optimization was removed again or at least made a non-default option with warnings.
Additional useful info
Note that at least the Enterprise COBOL for z/OS compiler allows you to see exactly the machine code that is produced from your source code if you use the LIST compile option (see this example output). I recommend always compiling with the options LIST, MAP, OFFSET, and XREF, because they enable you to find the exact problem in your COBOL source even when you only have a program dump from an abend.
Anyway, good programming practice is not to care about the compiler or the machine code, but about the other programmers who will have to maintain, and thus read and understand the code. Good practice would be to always prefer simple and readable instructions, and to document the reasons (right in the code) when deviating from this rule.
Some programmers like to do things "just because they can". I have a feeling that is what you are seeing here. It makes about as much sense as doing
a := 0 + b
would in Go.

This regex matches in BBEdit and on regex101.com, but not on iOS - why?

I am trying to "highlight" references to law statutes in some text I'm displaying. These references are of the form <number>-<number>-<number>(char)(char), where:
"number" may be a whole number such as 18 or a decimal number such as 12.5;
the parenthetical terms are entirely optional: there may be zero, one, or more of them;
if a parenthetical term does exist, there may or may not be a space between the last number and the first parenthesis, as in 18-1.3-401(8)(g) or 18-3-402 (2).
I am using the regex
((\d+(\.\d+)*-){2}(\d+(\.\d+)*))( ?(\([0-9a-zA-Z]+\))*)
to find the ranges of these strings and then highlight them in my text. This expression works perfectly, 100% of the time, in all of the cases I've tried (dozens), in BBEdit, and on regex101.com and regexr.com.
However, when I use that exact same expression in my code, on iOS 12.2, it is extremely hit-or-miss as to whether a string matching the regex is actually found. So hit-or-miss, in fact, that a string of the exact same form as two other matches in a specific bit of text is NOT found. E.g., in this one paragraph I have, there are five instances of xxx-x-xxx; the first and the last are matched, but the middle three are not matched. This makes no sense to me.
I'm using the String method func range(of:options:range:locale:) with options of .regularExpression (and nil locale) to do the matching. I see that iOS uses ICU-compatible regexes, whereas these other tools use PCRE (I think). But, from what I can tell, my expression should be compatible and valid for my case with the ICU parsing. But, something is definitely different, and I cannot figure out what it is.
Anyone? (I'm going to give NSRegularExpression a go and see if it behaves differently, but I'd still like to figure out what's going on here.)
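For reference, the pattern does find the sample citations in a PCRE-style engine; here is a quick check in Python using strings taken from above:
import re
pattern = re.compile(r"((\d+(\.\d+)*-){2}(\d+(\.\d+)*))( ?(\([0-9a-zA-Z]+\))*)")
for text in ["18-1.3-401(8)(g)", "18-3-402 (2)", "see 18-3-402 for details"]:
    m = pattern.search(text)
    print(text, "->", m.group(0) if m else "no match")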

CommonMark Parsing ***

Let's say I want to parse the string ***cat*** as Markdown using the CommonMark standard. The standard says (http://spec.commonmark.org/0.28/#phase-2-inline-structure):
....
If one is found:
Figure out whether we have emphasis or strong emphasis: if both closer
and opener spans have length >= 2, we have strong, otherwise regular.
Insert an emph or strong emph node accordingly, after the text node
corresponding to the opener.
Remove any delimiters between the opener and closer from the delimiter
stack.
Remove 1 (for regular emph) or 2 (for strong emph) delimiters from the
opening and closing text nodes. If they become empty as a result,
remove them and remove the corresponding element of the delimiter
stack. If the closing node is removed, reset current_position to the
next element in the stack.
....
Based on my reading of this the result should be <em><strong>cat</strong></em> since first the <strong> is added, THEN the <em>. However, all online markdown editors I have tried this in output <strong><em>cat</em></strong>. What am I missing?
Here is a visual representation of what I think should be happening
TextNode[***] TextNode[cat] TextNode[***]
TextNode[*] StrongEmphasis TextNode[cat] TextNode[*]
TextNode[] Emphasis StrongEmphasis TextNode[cat] TextNode[]
Emphasis StrongEmphasis TextNode[cat]
It's important to remember that Commonmark and Markdown are not necessarily the same thing. Commonmark is a recent variant of Markdown. Most Markdown parsers existed and established their behavior long before the Commonmark spec was even started.
While the original Markdown rules make no comment on whether the <em> or <strong> tag should be first in the given example, the reference implementation's (markdown.pl) actual behavior was to list the <strong> tag before the <em> tag in the output. In fact, the MarkdownTest package, which was created by the author of Markdown and markdown.pl, explicitly required that output (the original is no longer available online that I know of, but mdtest is a faithful copy with its history showing no modifications of that test since the initial import from MarkdownTest). AFAICT, every (non-Commonmark) Markdown parser has followed that behavior exactly.
The Commonmark spec took a different route. The spec specifically states in Rule 14 of Section 6.4 (Emphasis and strong emphasis):
An interpretation <em><strong>...</strong></em> is always preferred to <strong><em>...</em></strong>.
... and backs it up with example 444:
***foo***
<p><em><strong>foo</strong></em></p>
In fact, you can see that that is exactly the behavior of the reference implementation of Commonmark.
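For example, assuming the Python commonmark package (a port of the reference implementation) is installed, a quick check gives the spec's output:
import commonmark
print(commonmark.commonmark("***cat***"))   # <p><em><strong>cat</strong></em></p>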
As an aside, the original question quotes from the Appendix to the spec which recommends how to implement a parser. While potentially useful to a parser creator, I would not recommend using that section to determine proper syntax handling and/or output. The actual rules should be consulted instead; and in fact, they clearly provide the expected output in this instance. But this question is about an apparent disparity between implementations and the spec, not interpretation of the spec.
For a more complete comparison, see Babelmark. With the exception of a few (completely) broken implementations, every "classic" Markdown parser follows markdown.pl, while every Commonmark parser follows the Commonmark spec. Therefore, there is no actual disparity between the spec and implementations. The disparity is between Markdown and Commonmark.
Why the Commonmark authors chose a different route in this regard, and why they insist on calling Commonmark "Markdown" when it is clearly different, are off topic here and better asked of the authors themselves.

When to use V instead of a decimal in Cobol Pic Clauses

Studying for a test right now and can't seem to wrap my head around when to use "V" for a decimal instead of an actual decimal in PIC clauses. I've done some research but can't find anything I understand. Only been learning cobol for about a week, so is there like a rule of thumb here? Thanks for your time.
You use an actual decimal point when you want to "output" a value which has decimal places: a report line, a position on a screen, or an item in an output file which is going to a "different" system which doesn't understand the format with an implied decimal place.
That's what the V is, it is an implied decimal place. It tells the compiler where to align results from calculations, MOVEs, whatever. Computer chips, and the machine instructions they support, don't know about actual decimal points for their internal processing.
COBOL is a language with fixed-length fields. The machine instructions don't need to know where the decimal point is (effectively it can deal with everything as integer values) but the compiler does, and the compiler has to do the correct scaling and alignment of results.
For data stored on your own files, use V, the implied decimal place.
For data which is to be "human readable", or read by a system which cannot understand your character set or cannot scale what looks like an integer, use an actual decimal point, . (for computer-readable stuff, you can sometimes use a separate scaling factor instead, if that is more convenient for the receiving system).
Basically, V for internal, . for external, should be a rule of thumb to get you there.
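A rough illustration of the difference in Python terms (not COBOL, just to show an implied versus an actual decimal point for something like PIC 9(3)V99 versus PIC 9(3).99):
value = 123.45
implied = f"{round(value * 100):05d}"   # "12345"  - what a V field holds internally (scaled)
explicit = f"{value:06.2f}"             # "123.45" - what a field with a real '.' holds
print(implied, explicit)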
Which COBOL are you using? I'm surprised it is not covered in your documentation.

Determine Cobol coding style

I'm developing an application that parses COBOL programs. Some of these programs respect the traditional coding style (program text from column 8 to 72), and some are newer and don't follow this style.
In my application I need to determine the coding style in order to know if I should parse content after column 72.
I've been able to determine whether the program starts at column 1 or 8, but programs that start at column 1 can also follow the rule of comments after column 72.
So I'm trying to find rules that will allow me to determine whether the text after column 72 is a comment or valid code.
I've found some, but it's hard to tell whether they will work every time:
a dot after column 72 marks the end of a sentence, but I fear a dot can appear in comments too;
finding the closing character of a statement after column 72: " ' ) };
looking at the characters in columns 71, 72 and 73: if there is no space, find the whole word and check whether it's a keyword or a variable. Problem: it can be a variable from a COPY or a replacement, etc.
I'd like to know what you think of these rules, and whether you have any ideas to help me determine the coding style of a COBOL program.
I don't need an API or anything, just solid rules that I can rely on.
I think you need to know the COBOL compiler for each program. Its documentation should tell you what conventions/configurations/switches it uses to decide if the source code ends at column 72 or not.
So.... which compiler(s)?
And if you think the column 72 issue is a pain, wait till you get around to actually parsing the COBOL itself. If you are not well prepared to handle the lexical issues of the language, you are probably very badly prepared to handle the syntactic ones.
There is no absolutely reliable way to determine if a COBOL program is in fixed or free format based only on the source code. Heck, it is sometimes difficult to identify the programming language based only on source code. Check out this classic polyglot - it is valid under 8 different language compilers. That said, you could try a few heuristics that might yield the correct answer more often than not.
Compiler directives embedded in source code
Watch for certain compiler directives that determine code format. Unfortunately, every compiler vendor uses their own flavour of directive. For example, Micro Focus COBOL uses the SOURCEFORMAT directive. This directive will appear near the top of the program, so a short pre-scan could be used to find it. On the other hand, OpenCOBOL uses >>SOURCE FORMAT IS FREE and >>SOURCE FORMAT IS FIXED to toggle between free and fixed format; different parts of the same program could be formatted differently!
The bottom line here is that you will have to support the conventions of multiple COBOL compilers.
Compiler switches
Source code format can also be specified using a compiler switch. In this case, there are no concrete clues to go on. However, you can be reasonably sure that the entire source program will be either fixed or free. All you can do here is guess. Unless the programmer is out to "mess with your head" (and some will), a program in free format will have the keywords IDENTIFICATION DIVISION or ID DIVISION starting before column 8. Every COBOL program will begin with these keywords, so you can use them as the anchor point for determining code format in the absence of embedded compiler directives.
Warning - this is far from foolproof, but might be a good start.
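A rough Python sketch of those two heuristics (guess_source_format is a hypothetical helper, it only handles >>SOURCE-style directives, and as the warning says the result is only a guess):
import re
def guess_source_format(lines):
    for line in lines[:50]:                  # embedded directives appear near the top
        upper = line.upper()
        if ">>SOURCE" in upper:
            if "FREE" in upper:
                return "free"
            if "FIXED" in upper:
                return "fixed"
    for line in lines:
        m = re.search(r"\b(IDENTIFICATION|ID)\s+DIVISION\b", line.upper())
        if m:
            # 1-based column; fixed-format code starts in column 8 or later
            return "free" if m.start() + 1 < 8 else "fixed"
    return "unknown"
print(guess_source_format(open("program.cob").readlines()))   # hypothetical file name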
There won't be an algorithm to do this with 100% certainty, because if comments can be anything, they can also be compilable COBOL code. So you could theoretically write a program that means one thing if the comments are ignored, and something else entirely if the comments are treated as part of the COBOL.
But that's extremely unlikely. What's most likely to happen is that if you try to compile the code under the wrong convention, it will simply fail. So the only accurate way to do this is to try compiling/parsing the program one way, and if you come to a line that can't make sense, switch to the other style. You could also support passing an argument to the compiler when the style is already known.
You can try using heuristics like what you've described, but that will never be totally accurate. The most they can give you is a probability that the code is one or the other style, which will increase as they examine more and more lines of code. They could be useful for helping you guess the style before you start compiling, or for figuring out when the problem is really just a typo in the code.
EDIT:
Regarding ideas for heuristics, it's hard to say. If there were a standard comment sigil like // or # in other languages, this would be a lot easier (actually, there is, but it sounds like your code doesn't follow this convention). The only thing I can think of would be to check whether every line (or maybe 99% of lines, and not counting empty lines or lines commented with *) has a period somewhere before position 72.
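A sketch of that last heuristic in Python (period_before_72_ratio is a hypothetical helper; a ratio close to 1.0 would suggest fixed format):
def period_before_72_ratio(lines):
    # ignore blank lines and fixed-format comment lines ('*' in column 7)
    candidates = [l for l in lines if l.strip() and not (len(l) >= 7 and l[6] == "*")]
    if not candidates:
        return 0.0
    return sum("." in l[:72] for l in candidates) / len(candidates)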
One thing you DON'T want to do is apply any heuristics to the part after position 72. That is, you don't want to be checking the comments to see if they're valid COBOL. You want to check what you know is COBOL first, and see if that works by itself. There are several reasons for this:
Comments written in English are likely to have periods and quotes in them, so your first and second bullet points are out.
Natural languages are WAY harder to parse than something like COBOL.
The comments could easily have COBOL in them (maybe someone commented out the previous version of the line).
An important rule for comments is that they should never affect what the program does. If changing the comments can change how the program is compiled, you violate that.
All that in mind, my opinion is that you shouldn't use heuristics at all. You should always try to compile the program under both conventions unless one is explicitly specified. There's a chance that code will compile successfully under both conventions, and then you'll have two different programs and no way to tell which one is correct.
If that happens, you need to compare the two results (perhaps with a hash or something) to see if they're the same program. If they're the same, great, but if not, you'll need to force the user to explicitly choose a convention.
Most COBOL compilers will allow you to generate and analyze the output of the text manipulation (preprocessing) phase.
The text preprocessor output can be seen (using OpenCOBOL for the example)
cobc -E program.cob
The text manipulation processor deals with any COPY ... REPLACING compiler directives, as well as converting SOURCE FORMAT IS FIXED (with line continuations, string literal concatenations, comment line removal, among other things) to the actual free format that the compiler lexical analyzer needs. A lot of the OpenCOBOL toolkits (Cross referencer and Animator, to name two) use source code AFTER the preprocessor pass. I don't think you'll lose any street cred if your parser program relies on post processed source code files.
