Here's my problem. I have YAML doc that contains the following pair:
run_ID: 2010_03_31_101
When this get's parsed at
org.yaml.snakeyaml.constructor.SafeConstructor.ConstructYamlInt:159
underscores get stripped and Constructor returns Long 20100331101
instead of unmodified String "2010_03_31_101" that I really need.
QUESTION: How
can I disable this behavior and force parser to use String constructor
instead of Long?
OK. Got answer form their mailing list. Here it is
Hi, according to the spec
(http://yaml.org/type/int.html): Any
“_” characters in the number are
ignored, allowing a readable
representation of large values
You have a few ways to solve it. 1) do
not rely on implicit types, use quotes
(single or double) run_ID:
'2010_03_31_101'
2) Turn off resolver for integers (as
it is done here for floats) link
1 link 2
3) Define your own pattern for int
link 3
Please be aware that when you start to
deviate from the spec other recipients
may fail to parse your YAML document.
Using quotes is safe.
Andrey
Related
I would like to get some better understanding about what aspects of YAML refer to the encoding of data vs what aspects refer to semantic.
A simple example:
test1: dGVzdDE=
test2: !!binary |
dGVzdDE=
test3:
- 116
- 101
- 115
- 116
- 49
test4: test1
Which of these values (if any) are equivalent?
I would argue that test1 encodes the literal string value dGVzdDE=. test2 and test3 both encode the same array, just using a different encoding. I am unsure about test4, it contains the same bytes as test2 and test3 but does this make it the equivalent value or is a string in YAML different from a byte array?
Different tools seem to produce different answers:
https://onlineyamltools.com/convert-yaml-to-json suggests that test2 and test3 are equivalent, but different from test4
https://yaml-online-parser.appspot.com/ suggests that test2 and test4 are equivalent, but different from test4
to yq all entries are different yq < test.yml:
{
"test1": "dGVzdDE=",
"test2": "dGVzdDE=\n",
"test3": [
116,
101,
115,
116,
49
],
"test4": "test1"
}
What does the YAML spec intend?
Equality
You're asking for equivalence but that's not a term in the spec and therefore cannot be discussed (at least not without definition). I'll go with discussing equality instead, which is defined by the spec as follows:
Two scalars are equal only when their tags and canonical forms are equal character-by-character. Equality of collections is defined recursively.
One node in your example has the tag !!binary but the others do not have tags. So we must check what the spec says about tags of nodes that don't have explicit tags:
Tags and Schemes
The YAML spec says that every node is to have a tag. Any node that does not have an explicit tag gets a non-specific tag assigned. Nodes are divided into scalars (that get created from textual content) and collections (sequences and mappings). Every non-plain scalar node (i.e. every scalar in quotes or given via | or >) that does not have an explicit tag gets the non-specific tag !, every other node without explicit tag gets the non-specific tag ?.
During loading, the spec defines that non-specific tags are to be resolved to specific tags by means of using a scheme. The specification describes some schemes, but does not require an implementation to support any particular one.
The failsafe scheme, which is designed to be the most basic scheme, will resolve non-specific tags as follows:
on scalars to !!str
on sequences to !!seq
on mappings to !!map
and that's it.
A scheme is allowed to derive a specific tag from a non-specific one by considering the kind of non-specific tag, the node's position in the document, and the node's content. For example, the JSON Scheme will give a scalar true the tag !!bool due to its content.
The spec says that the non-specific tag ! should only be resolved to !!str for scalars, !!seq for sequence, and !!map for mappings, but does not require this. This is what most implementations support and means that if you quote your scalar, you will get a string. This is important so that you can give the scalar "true" quoted to avoid getting a boolean value.
By the way, the spec does not say that every step defined there is to be implemented slavishly as defined in the spec, it is more a logical description. A lot of implementations do not actually transition from non-specific tags to specific tags, but instead directly choose native types for the YAML data they load according to the scheme rules.
Applying Equality
Now that we know how tags are assigned to nodes, let's go over your example:
test1: dGVzdDE=
test2: !!binary |
dGVzdDE=
The two values are immediately not equal because even without the tag, their content differs: Literal block scalars (introduced with |) contain the final linebreak, so the value of test2 is "dGVzdEDE=\n" and therefore not equal to the test1 value. You can introduce the literal scalar with |- instead to chop the final linebreak, which I suppose is your intent. In that case, the scalar content is identical.
Now for the tag: The value of test1 is a plain scalar, hence it has a non-specific tag ?. The question is now: Will this be resolved to !!binary? There could be a scheme that does this, but the spec doesn't define one. But think about it: A scheme that assigns every scalar the tag !!binary if it looks like base64-encoded data would be a very specific one.
As for the other values: The test3 value is a sequence, so obviously not equal to any other value. The test4 value contains content not present anywhere else, therefore also not equal.
But yaml-online-parser does things!
Yes. The YAML spec explicitly states that the target of loading YAML data is native data types. Tags are thought of as generic hints that can be mapped to native data types by a specific implementation. So an !!str for example would be resolved to the target language's string type.
How this mapping to native types is done is implementation-defined (and must be, since the spec cannot cater to every language out there). yaml-online-parser uses PyYAML and what it does is to load the YAML into Python's native data types, and then dump it again. In this process, the !!binary will get loaded into a Python binary string. However, during dumping, this binary string will get interpreted as UTF-8 string and then written as plain scalar. You can argue this is a bug, but it certainly doesn't violate the spec (as the spec doesn't know what a Python binary string is and therefore does not define how it is to be represented).
In any case, this shows that as soon as you transition to native types and back again, everything goes and nothing is certain because native types are outside of the spec. Different implementations will give you different outputs because they are allowed to. !!binary is not a tag defined in the JSON scheme so even translating your input to JSON is not well-defined.
If you want an online tool that shows you canonical YAML representation without loading data into native types and back, you can use the NimYAML testing ground (my work).
Conclusion
The question of whether two YAML inputs are equal is an academic one. Since YAML does allow for different schemes, the question can only be definitely answered in the context of a certain scheme.
However, you will find very few formal scheme definitions outside of the YAML spec. Most applications that do use YAML will document their input structure in a less formal way, and most of the time without discussing YAML tags. This is fine because as discussed before, loading YAML does not need to directly implement the logical process described in the spec.
Your answer for practical purposes should come from the documentation of the application consuming the YAML data. If the documentation is very good, it will answer this, but a lot of YAML-consuming applications just use the default settings of the YAML implementation they use without telling you about this.
So the takeaway is: Know your application and know the YAML implementation it uses.
Assuming I define a variable like this in Lua
local input = "..."
Where the ... comes from a user-provided string. Would that user be able to perform code injection just from a variable definition? Do I need to sanitize the string?
As a general rule, if you ever need to ask yourself if you need to sanitize your inputs, the correct answer is "yes".
As to this particular case, if you just copy/paste the user's string directly into the Lua source file, even in quotes like that, they will be able to execute arbitrary code. It's not even particularly difficult; they can provide some text"; my_code = 20; last = "end of string.
The best way to sanitize this is by using a long-form literal string with [[...]] syntax. But even that can be broken out, so you need to search through the given string for repeated sequences of the = character. Each time you find a sequence, note how many = characters are in that sequence. After searching, insert a number of = characters into your literal string that isn't one of the lengths found in the user string.
Of course, the internal implementation of Lua may have some limits on the length of the = sequence in a long-form literal string. In such a case, an external user could break your code by forcing you to use a longer sequence than the implementation supports. But it won't be able to cause arbitrary code execution; you'll just get a compile error.
Using a parser generator I want to create a parser for "From headers" in email messages. Here is an example of a From header:
From: "John Doe" <john#doe.org>
I think it will be straightforward to implement a parser for that.
However, there is a complication in the "From header" syntax: comments may be inserted just about anywhere. For example, a comment may be inserted within "john":
From: "John Doe" <jo(this is a comment)hn#doe.org>
And comments may be inserted in many other places.
How to handle this complication? Does it require a "2-pass" parser: one pass to remove all comments and a second pass to create the parse tree for the From header? Do modern parser generators support multiple passes on the input? Can it be parsed in a single pass? If yes, would you sketch the approach please?
I'm not convinced that your interpretation of email addresses is correct; my reading of RFC-822 leads me to believe that a comment can only come before or after a "word", and that "word"s in the local-part of an addr-spec need to be separated by dots ("."). Section 3.1.4 gives a pretty good hint on how to parse: you need a lexical analyzer which feeds syntactic symbols into the parser; the lexical analyzer is expected to unfold headers, ignore whitespace, and identify comments, quoted strings, atoms, and special characters.
Of course, RFC-822 has long been obsoleted, and I think that email headers with embedded comments are anachronistic.
Nonetheless, it seems like you could easily achieve the analysis you wish using flex and bison. As indicated, flex would identify the comments. Strictly speaking, you cannot identify comments with a regular expression, since comments nest. But you can recognize simple nested structures using a start condition stack, or even more economically by maintaining a counter (since flex won't return until the outermost parenthesis is found, the counter doesn't need to be global.)
I'm using FitNesse to test web service responses using check to compare the expected to the actual response.
In some cases the check is failing and I can't see what the differences are between the expected and the actual that is causing it to fail.
Here's a screenshot from what it's telling me in a specific instance (of many similar instances):
Feel free to point out the obvious; it's probably staring at me in the face and I'm looking so hard I can't see it!
I would check that the expected and actual strings are both written with the same text encoding. I've seen this error plenty of times when the text comparing failed due to a comma or apostrophe being written in different encodes.
It is possible that your string contains extra spaces in the actual value. FitNesse, being html based, will not respect leading or trailing spaces. It might not handle any extra spaces inside the actual either. So this can cause the result to be different, but not visibly so.
See if you can add some debug messages that would help you see the extra spaces, or at least count the number of characters in both strings.
This question doesn't specify whether Slim or Fit are being used, or which Slim server/plugin if using Slim, but I found the following to be true for me using FitNesse release 20130530 and fitSharp release 2.2:
Non-ASCII characters and { apostrophes / single quote characters } in input arguments/parameters that are strings are HTML encoded. The values in my FitNesse test tables are HTML encoded, but only the required syntax characters and (double) quotes; not the non-ASCII characters (and FitNesse doesn't seem to have any problems storing those values).
EOL characters in the input arguments that are strings consist of a linefeed character only
I imagine that because I'm using .NET, EOLs in my return values consist of carriage return and linefeed characters.
Because of [1], I'm HTML-encoding non-ASCII characters (but not the HTML syntax characters or quotes). Because of [2] and [3], I'm now removing carriage return characters from my fixture return values. Both changes seem to have resolved this issue for me and expected and actual values are now reported as being the same.
Whitespace has troubled me often. The resulting HTML just collapses whitespace, but the compare in code does not.
I now use a fixture to make differences more explicit to me. Example usage: http://fhoeben.github.io/hsac-fitnesse-fixtures/examples-results/HsacExamples.SlimTests.UtilityFixtures.CompareFixtureTest.html
Newer versions of FitNesse (since 20151230) do a diff on the expected and actual result values. Has that helped you at all?
I've been playing with this for an hour or tow and have found myself at a road block with the Lua pattern matching utilities. I am attempting to match all quoted text in a string and replace it if needed.
The pattern I have come up with so far is: (\?[\"\'])(.-)%1
This works in some cases but, not all cases:
Working: "This \"is a\" string of \"text to\" test with"
Not Working: "T\\\"his \"is\' a\" string\" of\' text\" to \"test\" wit\\\"h"
In the not working example I would like it to match to (I made a function that gets the matches I desire, I'm just looking for a pattern to use with gsub and curious if a lua pattern can do this):
string
a" string" of
is' a" string" of' text
test
his "is' a" string" of' text" to "test" wit
I'm going to continue to use my function instead for the time being, but am curious if there is a pattern I could/should be using and i'm just missing something with patterns.
(a few edits b/c I forgot about stackoverflows formating)
(another edit to make a non-html example since it was leading to assumptions that I was attempting to parse html)
Trying to match escaped, quoted text using regular expressions is like trying to remove the daisies (and only the daises) from a field using a lawnmower.
I made a function that gets the matches I desire
This is the correct move.
I'm curious if a lua pattern can do this
From a practical point of view, even if a pattern can do this, you don't want to. From a theoretical point of view, you are trying to find a double quote that is preceded by an even number of backslashes. This is definitely a regular language, and the regular expression you want would be something like the following (Lua quoting conventions)
[[[^\](\\)*"(.-[^\](\\)*)"]]
And the quoted string would be result #2. But Lua patterns are not full regular expressions; in particular, you cannot put a * after a parenthesized pattern.
So my guess is that this problem cannot be solved using Lua patterns, but since Lua patterns are not a standard thing in automata theory, I'm not aware of any body of proof technique that you could use to prove it.
The issue with escaped quotes is that, in general, if there's an odd number of backslashes before the quote, then it's escaped, and if there's an even number, it's not. I do not believe that Lua pattern-matching is powerful enough to represent this condition, so if you need to parse text like this, then you should seek another way. Perhaps you can iterate through the string and parse it, or you could find each quote in turn and read backwards, counting the backslashes until you find a non-backslash character (or the beginning of the string).
If you absolutely must use patterns for some reason, you could try doing this in a multi-step process. First, gsub for all occurrences of two backslashes in a row, and replace them with some sentinel value. This must be a value that does not already occur in the string. You could try something like "\001" if you know this string doesn't contain non-printable characters. Anyway, once you've replaced all sequences of two backslashes in a row, any backslashes left are escaping the following character. Now you can apply your original pattern, and then finally you can replace all instances of your sentinel value with two backslashes again.
Lua's pattern language is adequate for many simple cases. And it has at least one trick you don't find in a typical regular expression package: a way to match balanced parenthesis. But it has its limits as well.
When those limits are exceeded, then I reach for LPeg. LPeg is an implementation of a Parsing Expression Grammer for Lua, and was implemented by one of Lua's original authors so the adaptation to Lua is done quite well. A PEG allows specification of anything from simple patterns through complete language grammars to be written. LPeg compiles the grammar to a bytecode and executes it extremely efficiently.
you should NOT be trying to parse HTML with regular expressions, HTML and XML are NOT regular languages and can not be successfully manipulated with regular expressions. You should use a dedicated HTML parser. Here are lots of explanations why.