I would like to get a better understanding of which aspects of YAML concern the encoding of data and which concern its semantics.
A simple example:
test1: dGVzdDE=
test2: !!binary |
  dGVzdDE=
test3:
  - 116
  - 101
  - 115
  - 116
  - 49
test4: test1
Which of these values (if any) are equivalent?
I would argue that test1 encodes the literal string value dGVzdDE=. test2 and test3 both encode the same byte array, just using different encodings. I am unsure about test4: it contains the same bytes as test2 and test3, but does that make it an equivalent value, or is a string in YAML different from a byte array?
Different tools seem to produce different answers:
https://onlineyamltools.com/convert-yaml-to-json suggests that test2 and test3 are equivalent, but different from test4
https://yaml-online-parser.appspot.com/ suggests that test2 and test4 are equivalent, but different from test1 and test3
To yq, all entries are different (yq < test.yml):
{
  "test1": "dGVzdDE=",
  "test2": "dGVzdDE=\n",
  "test3": [
    116,
    101,
    115,
    116,
    49
  ],
  "test4": "test1"
}
What does the YAML spec intend?
Equality
You're asking for equivalence, but that is not a term in the spec and therefore cannot be discussed (at least not without a definition). I'll discuss equality instead, which the spec defines as follows:
Two scalars are equal only when their tags and canonical forms are equal character-by-character. Equality of collections is defined recursively.
One node in your example has the tag !!binary but the others do not have tags. So we must check what the spec says about tags of nodes that don't have explicit tags:
Tags and Schemas
The YAML spec says that every node is to have a tag. Any node that does not have an explicit tag gets a non-specific tag assigned. Nodes are divided into scalars (which get created from textual content) and collections (sequences and mappings). Every non-plain scalar node (i.e. every scalar in quotes or given via | or >) that does not have an explicit tag gets the non-specific tag !; every other node without an explicit tag gets the non-specific tag ?.
During loading, the spec says that non-specific tags are to be resolved to specific tags by means of a schema. The specification describes some schemas, but does not require an implementation to support any particular one.
The failsafe schema, which is designed to be the most basic schema, resolves non-specific tags as follows:
on scalars to !!str
on sequences to !!seq
on mappings to !!map
and that's it.
A schema is allowed to derive a specific tag from a non-specific one by considering the kind of non-specific tag, the node's position in the document, and the node's content. For example, the JSON schema will give a scalar true the tag !!bool due to its content.
The spec says that the non-specific tag ! should only be resolved to !!str for scalars, !!seq for sequences, and !!map for mappings, but does not require this. This is what most implementations do, and it means that if you quote your scalar, you will get a string. This is important so that you can give the scalar "true" quoted to avoid getting a boolean value.
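For illustration, here is how one implementation (PyYAML) resolves plain vs. quoted scalars; this is a sketch of common behavior, not something every implementation must do:

import yaml

# A plain scalar gets the non-specific tag ? and is resolved by content.
print(yaml.safe_load("flag: true"))    # {'flag': True}

# A quoted scalar gets the non-specific tag !, which resolves to !!str.
print(yaml.safe_load('flag: "true"'))  # {'flag': 'true'}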
By the way, the spec does not say that every step defined there must be implemented exactly as described; it is more a logical description. A lot of implementations do not actually transition from non-specific tags to specific tags, but instead directly choose native types for the YAML data they load according to the schema rules.
Applying Equality
Now that we know how tags are assigned to nodes, let's go over your example:
test1: dGVzdDE=
test2: !!binary |
  dGVzdDE=
The two values are immediately not equal because, even apart from the tag, their content differs: literal block scalars (introduced with |) keep the final linebreak, so the value of test2 is "dGVzdDE=\n" and therefore not equal to the test1 value. You can introduce the literal scalar with |- instead to chop the final linebreak, which I suppose is your intent. In that case, the scalar content is identical.
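You can see the chomping difference with, for example, PyYAML (used here only to show the effect; the chomping rules themselves come from the spec):

import yaml

# '|' (clip chomping) keeps the final linebreak; '|-' (strip chomping) removes it.
print(repr(yaml.safe_load("a: |\n  text\n")))   # {'a': 'text\n'}
print(repr(yaml.safe_load("a: |-\n  text\n")))  # {'a': 'text'}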
Now for the tag: the value of test1 is a plain scalar, hence it has the non-specific tag ?. The question is now: will this be resolved to !!binary? There could be a schema that does this, but the spec doesn't define one. And think about it: a schema that assigns the tag !!binary to every scalar that looks like base64-encoded data would be a very special-purpose one.
As for the other values: the test3 value is a sequence, so it is obviously not equal to any of the other values. The test4 value contains content not present anywhere else, and is therefore also not equal.
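To make this concrete, here is what one implementation (PyYAML) loads for each value. Keep in mind that the Python-native types it chooses are an implementation decision, not mandated by the spec:

import yaml

doc = """
test1: dGVzdDE=
test2: !!binary |
  dGVzdDE=
test3:
  - 116
  - 101
  - 115
  - 116
  - 49
test4: test1
"""

data = yaml.safe_load(doc)
print(repr(data["test1"]))  # 'dGVzdDE='               -- plain scalar, resolved to str
print(repr(data["test2"]))  # b'test1'                 -- !!binary decoded to bytes
print(repr(data["test3"]))  # [116, 101, 115, 116, 49] -- a sequence of ints
print(repr(data["test4"]))  # 'test1'                  -- plain scalar, resolved to str

# Whether any of these are "equal" now depends on Python, not on YAML:
print(data["test2"] == data["test4"])         # False: bytes != str in Python 3
print(bytes(data["test3"]) == data["test2"])  # True, but only after a native-type
                                              # conversion the YAML spec knows nothing about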
But yaml-online-parser does things!
Yes. The YAML spec explicitly states that the target of loading YAML data is native data types. Tags are thought of as generic hints that can be mapped to native data types by a specific implementation. So an !!str for example would be resolved to the target language's string type.
How this mapping to native types is done is implementation-defined (and must be, since the spec cannot cater to every language out there). yaml-online-parser uses PyYAML and what it does is to load the YAML into Python's native data types, and then dump it again. In this process, the !!binary will get loaded into a Python binary string. However, during dumping, this binary string will get interpreted as UTF-8 string and then written as plain scalar. You can argue this is a bug, but it certainly doesn't violate the spec (as the spec doesn't know what a Python binary string is and therefore does not define how it is to be represented).
In any case, this shows that as soon as you transition to native types and back again, anything goes and nothing is certain, because native types are outside of the spec. Different implementations will give you different outputs because they are allowed to. !!binary is not a tag defined in the JSON schema, so even translating your input to JSON is not well-defined.
If you want an online tool that shows you canonical YAML representation without loading data into native types and back, you can use the NimYAML testing ground (my work).
Conclusion
The question of whether two YAML inputs are equal is an academic one. Since YAML allows for different schemas, the question can only be definitively answered in the context of a particular schema.
However, you will find very few formal schema definitions outside of the YAML spec. Most applications that use YAML document their input structure in a less formal way, and most of the time without discussing YAML tags. This is fine because, as discussed before, loading YAML does not need to directly implement the logical process described in the spec.
Your answer for practical purposes should come from the documentation of the application consuming the YAML data. If the documentation is very good, it will answer this, but a lot of YAML-consuming applications just use the default settings of the YAML implementation they use without telling you about this.
So the takeaway is: Know your application and know the YAML implementation it uses.
Related
In Section 5.4 of Eelco Dolstra's thesis, on page 108, there is a definition of the hashDrv function in which, in box 69, the inputDrvs are replaced (recursively) with the hashDrv of the parsed contents of the file. I don't understand the motivation for performing this substitution as opposed to just using the inputDrvs file names themselves without any substitution.
The consequence of this substitution appears to be that the output values of store derivations are recursively removed from the computation of all output values. However, since the output values themselves are computed from all the other data that goes into hashDrv, there don't seem to be any positive consequences of this substitution.
Indeed, there appear to be negative consequences: the substitution means that the output hash cannot be computed from the derivation contents alone; instead, you are required to have the entire tree of input derivations to perform the computation (see https://github.com/NixOS/nix/issues/2789#issuecomment-595143352).
While of course the output hash value for the derivation itself needs to be excluded from the computation of its hashDrv, it seems like it would have been better if the output hash values were simply derived from just the other contents of the derivation file.
OK, so first of all: it would be better still to just drop the inputDrvs and use the output paths of those inputs for creating the store path; see the commit message of https://github.com/nixos/nix/commit/1511aa9f488ba0762c2da0bf8ab61b5fde47305d for Eelco saying as much:
Note that this is a privileged operation, because you can construct a
derivation that builds any store path whatsoever. Fixing this will
require changing the hashing scheme (i.e., the output paths should be
computed from the other fields in BasicDerivation, allowing them to be
verified without access to other derivations). However, this would be
quite nice because it would allow .drv-free building (e.g. "nix-env
-i" wouldn't have to write any .drv files to disk).
But the hashDrv given is better than one that just hashes the drv as-is, because of the "modulo fixed-output derivations" part. By ignoring the rest of a fixed-output derivation and just returning a hash based on the fixed output hash (and name) alone, we gain the ability to change how fixed-output derivations produce their data without changing downstream hashes. This is how, in Nixpkgs today, we can for example do https://github.com/NixOS/nixpkgs/pull/82130 and it won't be a mass rebuild.
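To make the "modulo fixed-output derivations" idea concrete, here is a schematic sketch in Python. The dict fields and the serialization are illustrative stand-ins, not Nix's actual ATerm format or API:

import hashlib

def sha256_hex(s: str) -> str:
    return hashlib.sha256(s.encode()).hexdigest()

def hash_drv(drv, all_drvs) -> str:
    if "fixed_output_hash" in drv:
        # Fixed-output: hash only the declared output hash (and name), so
        # the recipe that produces the data can change without changing
        # downstream hashes.
        return sha256_hex("fixed:out:sha256:%s:%s" % (drv["fixed_output_hash"], drv["name"]))
    # Otherwise, recursively substitute each inputDrvs entry with the hash
    # of the parsed input derivation -- the substitution the question asks about.
    inputs = sorted(hash_drv(all_drvs[p], all_drvs) for p in drv["input_drvs"])
    return sha256_hex(repr((drv["name"], drv["builder"], inputs)))

drvs = {
    "src.drv":  {"name": "src", "fixed_output_hash": "abc123"},
    "prog.drv": {"name": "prog", "builder": "gcc", "input_drvs": ["src.drv"]},
}
print(hash_drv(drvs["prog.drv"], drvs))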
Let's say I want to parse the string ***cat*** as Markdown using the CommonMark standard. The standard says (http://spec.commonmark.org/0.28/#phase-2-inline-structure):
....
If one is found:
Figure out whether we have emphasis or strong emphasis: if both closer
and opener spans have length >= 2, we have strong, otherwise regular.
Insert an emph or strong emph node accordingly, after the text node
corresponding to the opener.
Remove any delimiters between the opener and closer from the delimiter
stack.
Remove 1 (for regular emph) or 2 (for strong emph) delimiters from the
opening and closing text nodes. If they become empty as a result,
remove them and remove the corresponding element of the delimiter
stack. If the closing node is removed, reset current_position to the
next element in the stack.
....
Based on my reading of this, the result should be <em><strong>cat</strong></em>, since first the <strong> is added, THEN the <em>. However, all online Markdown editors I have tried output <strong><em>cat</em></strong>. What am I missing?
Here is a visual representation of what I think should be happening
TextNode[***] TextNode[cat] TextNode[***]
TextNode[*] StrongEmphasis TextNode[cat] TextNode[*]
TextNode[] Emphasis StrongEmphasis TextNode[cat] TextNode[]
Emphasis StrongEmphasis TextNode[cat]
It's important to remember that Commonmark and Markdown are not necessarily the same thing. Commonmark is a recent variant of Markdown. Most Markdown parsers existed and established their behavior long before the Commonmark spec was even started.
While the original Markdown rules make no comment on whether the <em> or <strong> tag should come first in the given example, the reference implementation's (markdown.pl's) actual behavior was to list the <strong> tag before the <em> tag in the output. In fact, the MarkdownTest package, which was created by the author of Markdown and markdown.pl, explicitly required that output (the original is no longer available online that I know of, but mdtest is a faithful copy, with its history showing no modifications of that test since the initial import from MarkdownTest). AFAICT, every (non-Commonmark) Markdown parser has followed that behavior exactly.
The Commonmark spec took a different route. The spec specifically states in Rule 14 of Section 6.4 (Emphasis and strong emphasis):
An interpretation <em><strong>...</strong></em> is always preferred to <strong><em>...</em></strong>.
... and backs it up with example 444:
***foo***
<p><em><strong>foo</strong></em></p>
In fact, you can see that this is exactly the behavior of the reference implementation of Commonmark.
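You can verify this yourself with, for example, the commonmark Python package (a port of the reference implementation; assumes pip install commonmark):

import commonmark

print(commonmark.commonmark("***cat***"))
# <p><em><strong>cat</strong></em></p>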
As an aside, the original question quotes from the Appendix to the spec which recommends how to implement a parser. While potentially useful to a parser creator, I would not recommend using that section to determine proper syntax handling and/or output. The actual rules should be consulted instead; and in fact, they clearly provide the expected output in this instance. But this question is about an apparent disparity between implementations and the spec, not interpretation of the spec.
For a more complete comparison, see Babelmark. With the exception of a few (completely) broken implementations, every "classic" Markdown parser follows markdown.pl, while every Commonmark parser follows the Commonmark spec. Therefore, there is no actual disparity between the spec and implementations. The disparity is between Markdown and Commonmark.
As for why the Commonmark authors chose a different route in this regard, or why they insist on calling Commonmark "Markdown" when it is clearly different: those questions are off topic here and are better asked of the authors themselves.
Let's say I have a field which accepts A-Z, a-z, 0-9. If I'm trying to communicate to someone, via documentation or API creation, "what" my code can accept, I HAVE to say:
A-Z,a-z,0-9
Now, in my mind, this is restrictive and error-prone.
Compare that to what I'm proposing.
Suppose A-Z, a-z, 0-9 was allocated the "code" ANSI456.
When I'm communicating that to someone, I can say that my code accepts ANSI456. If someone else was developing a check, there would be no confusion about what my code can or cannot accept.
To those who will suggest just specifying character ranges: please note that what I'm envisioning will handle scenarios where even this is defined as a valid "code":
0-9, +, -, *, /
In fact, if it's done properly, we can have a site generate code automatically in various languages to accommodate the different "codes".
Okay - I KNOW there are ~infinite values, e.g.:
a-z
is different from
a-l,n-z
and these would have two different codes in this "system".
I'm not proposing a HUMAN-moderated system - it can be completely automatic, BUT a systematic way of generating these "codes".
There already is such a standard, although it doesn't have the word "standard" in its name. It is called Perl 5 compatible regular expressions, and it is used in Perl 5, Java, JavaScript, libpcre and many other contexts.
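For example, the "codes" you propose are exactly what regular-expression character classes already provide. A sketch in Python, whose re module is PCRE-like (the names here are illustrative):

import re

# "A-Z, a-z, 0-9" is the character class [A-Za-z0-9];
# "0-9, +, -, *, /" would be [0-9+*/-] (the '-' placed last so it stays literal).
field = re.compile(r"\A[A-Za-z0-9]+\Z")

print(bool(field.match("Field123")))   # True
print(bool(field.match("Field_123")))  # False: '_' is not in the class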
I have what I imagine will be a fairly involved technical challenge: I want to be able to reliably alpha-rename identifiers in multiple languages (as many as possible). This will require special consideration for each language, and I'm asking for advice on how to minimize the amount of work I need to do by sharing code. Something like a unified parsing or abstract syntax framework that already has support for many languages would be great.
For example, here is some python code:
def foo(x):
    def bar(y):
        return x+y
    return bar
An alpha renaming of x to y changes the x to a y and preserves semantics. So it would become:
def foo(y):
    def bar(y1):
        return y+y1
    return bar
See how we needed to rename y to y1 in order to keep from breaking the code? That is why this is a hard problem. It seems like the program would have to have a pretty good knowledge of what constitutes a scope, rather than just doing, say, a string search and replace.
I would also like to preserve as much of the formatting as possible: comments, spacing, indentation. But that is not 100% necessary, it would just be nice.
Any tips?
To do this safely, you need to be able to determine
all the identifiers (and those things that are not, e.g., the middle of a comment) in your code
the scopes of validity for each identifier
the ability to substitute a new identifier for an old one in the text
the ability to determine if renaming an identifier causes another name to be shadowed
To determine identifiers accurately, you need at least a language-accurate lexer. Identifiers in PHP look different than they do in COBOL.
To determine scopes of validity, you have to determine program structure in practice, since most "scopes" are defined by such structure. This means you need a language-accurate parser; scopes in PHP are different than scopes in COBOL.
To determine which names are valid in which scopes, you need to know the language scoping rules. Your language may insist that the identifier X refers to different Xes depending on the context in which X is found (consider object constructors named X with different arguments). Now you need to be able to traverse the scope structures according to the naming rules. Single inheritance, multiple inheritance, overloading and default types will pretty much require you to build a model of the scopes for the program, insert the identifiers and corresponding types into each scope, and then climb from the point of encounter of an identifier in the program text through the various scopes according to the language semantics. You will need symbol tables, inheritance linkages, ASTs, and the ability to navigate all of these. These structures differ between PHP and COBOL, but they share lots of common ideas, so you likely need a library with support for the common concepts.
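To give a feel for the kind of machinery involved, here is a minimal lexical-scope model in Python; it is only a sketch of the common concepts (real front ends also track types, overloads, inheritance links, and so on):

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Scope:
    parent: Optional["Scope"] = None
    symbols: dict = field(default_factory=dict)

    def declare(self, name: str, node) -> None:
        # Record a declaration in this scope.
        self.symbols[name] = node

    def resolve(self, name: str):
        # Climb from the point of encounter outward through the
        # enclosing scopes (plain lexical nesting).
        scope = self
        while scope is not None:
            if name in scope.symbols:
                return scope.symbols[name]
            scope = scope.parent
        return None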
To rename an identifier, you have to modify the text. In a million lines of code, you need to point carefully; modifying an AST node is one way to point carefully. Actually, you need to modify all the identifiers that correspond to the one being renamed; you have to climb over the tree to find them all, or record in the AST where all the references are so they can be found easily. After modifying the tree, you have to regenerate the source text. That's a lot of machinery; see my SO answer on how to prettyprint ASTs preserving all of the stuff you reasonably suggest should be preserved.
(Your other choice is to keep track in the AST of where the text for each identifier is, and then read/patch/write the file.)
Before you update the file, you need to check that you haven't shadowed something. Consider this code:
{ local x;
  x=1;
  { local y;
    y=2;
    { local z;
      z=y;
      print(x);
    }
  }
}
We agree this code prints "1". Now we decide to rename y to x.
We've broken the scoping: the print statement, which conceptually referred to the outer x, now refers to an x captured by the renamed y. The code now prints "2", so our rename broke it. This means one must check all the other identifiers in scopes where the renamed variable might be found, to see if the new name "captures" some name we weren't expecting. (The rename would be legal if the print statement printed z.)
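Here is the same hazard expressed in the question's own language (Python), as a sketch:

def before():
    x = 1
    def inner():
        y = 2
        print(x)   # prints 1: x resolves to the enclosing scope
    inner()

def after_naive_rename():
    # y was renamed to x without checking for capture
    x = 1
    def inner():
        x = 2
        print(x)   # now prints 2: the renamed local captures the reference
    inner()

before()              # 1
after_naive_rename()  # 2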
This is a lot of machinery.
Yes, there is a framework that has almost all of this, as well as a number of robust language front ends. See our DMS Software Reengineering Toolkit. It has parsers producing ASTs, prettyprinting machinery to turn ASTs back into text, generic symbol table management machinery (including support for multiple inheritance), and AST visiting/modification machinery. It has front ends for C, C++, COBOL and Java that implement name and type resolution (e.g. instantiating symbol table scopes and identifier-to-symbol-table-entry mappings); it has front ends for many other languages that don't have scoping implemented yet.
We've just finished an exercise in implementing "rename" for Java (all of the above issues of course appeared). We are about to start one for C++.
You could try to create Xtext-based implementations for the involved languages. The Xtext framework provides reliable infrastructure for cross-language rename refactoring. However, you'll have to provide a grammar and at least a "good enough" scope resolution for each language.
Languages mostly guarantee tokens will be unique, whatever the context. A naive first approach (and this will break many, many pieces of code) would be:
cp file file.orig
sed -i 's/\bnewTokenName\b/TEMPTOKEN/g' file
sed -i 's/\boldTokenName\b/newTokenName/g' file
With GNU sed, this will break on PHP. Rewriting \b to a general token match, like ([^a-zA-Z~$-_][^a-zA-Z0-9~$-_]), would work on most C, Java, PHP, and Python, but not Perl (you would need to add # and % to the token characters). Beyond that, it would require a plugin architecture that works for any language you wanted to add. At some point, there will be two languages whose variable and function naming rules are incompatible, and at that point you'll need to do more and more in the plugin.
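A middle ground between sed and a full parser is a language-accurate lexer. For example, Python's standard tokenize module can rename only real identifier tokens, leaving strings and comments untouched; this sketch is still scope-unaware, so it cannot detect capture:

import io
import tokenize

def rename_token(source: str, old: str, new: str) -> str:
    # Replace NAME tokens only; string literals and comments pass through.
    # (Same-length names preserve layout exactly; longer or shorter names
    # may perturb spacing.)
    result = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME and tok.string == old:
            tok = tok._replace(string=new)
        result.append(tok)
    return tokenize.untokenize(result)

print(rename_token("def foo(x):\n    return x  # x stays here\n", "x", "y"))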
Here's my problem. I have a YAML doc that contains the following pair:
run_ID: 2010_03_31_101
When this gets parsed at
org.yaml.snakeyaml.constructor.SafeConstructor.ConstructYamlInt:159
underscores get stripped and the Constructor returns the Long 20100331101 instead of the unmodified String "2010_03_31_101" that I really need.
QUESTION: How can I disable this behavior and force the parser to use the String constructor instead of the Long one?
OK, I got an answer from their mailing list. Here it is:
Hi, according to the spec (http://yaml.org/type/int.html): Any "_" characters in the number are ignored, allowing a readable representation of large values.
You have a few ways to solve it:
1) do not rely on implicit types, use quotes (single or double): run_ID: '2010_03_31_101'
2) turn off the resolver for integers (as it is done here for floats): link 1 link 2
3) define your own pattern for int: link 3
Please be aware that when you start to deviate from the spec, other recipients may fail to parse your YAML document. Using quotes is safe.
Andrey
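The same YAML 1.1 integer rule can be observed in other implementations, too. For example, with PyYAML (shown only to illustrate the rule; the SnakeYAML fixes are options 2 and 3 above):

import yaml

# Unquoted, the YAML 1.1 int resolver ignores the underscores:
print(yaml.safe_load("run_ID: 2010_03_31_101"))    # {'run_ID': 20100331101}

# Quoted (option 1), the value stays a string:
print(yaml.safe_load("run_ID: '2010_03_31_101'"))  # {'run_ID': '2010_03_31_101'}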