margin characters in string with interpolation - rascal

Inside a string literal, the margin character (a single quote) indicates that the initial part of the line, before the margin character, should be ignored.
Is there a complementary "right-margin" character, that indicates that everything after it on the same line should be ignored?
(I realized that this would be useful when writing somewhat lengthy code inside interpolated expressions in a string. In order to make that code more readable, I would like to split it over multiple lines, but the newlines introduced that way should not be part of the string literal. And putting the newlines inside the <...> looks ugly.)

Rascal only supports ignoring left-margin characters before the single quote (') in interpolated strings. Your suggestion for also ignoring right-margin characters is interesting but currently not supported.
Since layout inside interpolations <...> is ignored you can (and have to) place all your additional layout there.

Related

Antlr differentiating a newline from a \n

Let's say I have the following statement:
SELECT "hi\n
there";
Notice there is a literal newline in there, and the escape \n. The string that antlr4 picks up for me is:
String_Literal: "hi\n\nthere"
In other words, not differentiating between the literal newline and the \n one. Is there a way to differentiate the two, or what's the usual process to do that?
My guess is that the output you pasted into your question comes from a call to the Antlr4 runtime method tree.toStringTree(parser) (or equivalent in whatever target language you've chosen).
That function calls escapeWhitespace in the utilities class/module/file, and that function does what it's name suggests: it converts (some) whitespace characters to C-like backslash escape sequences. (Specifically, it handles newline, carriage return, and tab characters.) It does not escape backslash characters, which makes its output ambiguous; there's no way to distinguish between the two character escape sequence \n and the escaped conversion of a newline character in the message.
They are different in the actual character string, because the Antlr4 lexer does not transform the string value of the matched token in any way. That's your responsibility.
In computing, it is very often the case that what you see is not what you got. What you see is just what you see, and a lot of computational power has gone into creating that vision for you. By the same token, nothing guarantees that the vision is an unambiguous, or even useful, representation of the actual values. The best you can say for it is that it's probably more useful than trying to read the data as individual bits. (And, indeed, the individual bits are not physical objects either; despite the common refrain, you could completely disassemble a computer and examine it with an arbitrarily powerful microscope, and you will not see a single 1 or 0.)
That might seem like irrelevant philosophizing, but it has a real consequence: when you're debugging and you see something that makes you think, "that looks wrong", you need to consider two possibilities: maybe the underlying data is incorrect, but may it's the process which rendered the representation which is at fault. In this case, I'd say that the failure of escapeWhitespace to convert backslash characters into pairs of backslashes is a bug, but that's a value judgement on my part. Anyway, the function is not critical to the operation of Antlr4, and you could easily replace it.

RegEx how to properly use OR pipelines

I need to know how to properly use "OR" when it comes to individual characters and whole phrases... For example I have code that is checking for any number of characters OR words that are found in an array...
I want to check for some unicode characters and also some html lines of code.
I'm currently just checking for the characters using this:
([\u200b\u200c\u200d\0\1\2\3\4\5\6\7]*)
(the backslashes are representing the unicode characters u+200b - u+200d and the special characters in my software \0-\7 (They are all individual characters), these are valid escape sequences in Objective-C.)
Now what if I wanted to check for these characters AND check for phrases like <b> or <font color="#FF0000">
I found stuff while doing research that said to use pipelines | but I'm not sure if I put them only in-between the words or also in-between the individual characters and I'm not sure if I put quotes around the words or what not... I need help before I screw this up badly haha!
(p.s., not sure if it will be any different but I'm also doing it for this:
([^\u200b\u200c\u200d\0\1\2\3\4\5\6\7])
it's be someting like
/([^....]|\<b\/\>|\<font color .... \>)/
though, the usual caveats about regexes and html apply here.
As for the confusion about where to put the |, consider this this hackneyed example: You want to find the word color, but also want to accommodate the british spelling, colour:
/(color|colour)/
/(colou?r)/
/(colo(r|ur))/
are all basically equivalent.

FitNesse: Can't see the difference between the expected and the actual result in failed assertion

I'm using FitNesse to test web service responses using check to compare the expected to the actual response.
In some cases the check is failing and I can't see what the differences are between the expected and the actual that is causing it to fail.
Here's a screenshot from what it's telling me in a specific instance (of many similar instances):
Feel free to point out the obvious; it's probably staring at me in the face and I'm looking so hard I can't see it!
I would check that the expected and actual strings are both written with the same text encoding. I've seen this error plenty of times when the text comparing failed due to a comma or apostrophe being written in different encodes.
It is possible that your string contains extra spaces in the actual value. FitNesse, being html based, will not respect leading or trailing spaces. It might not handle any extra spaces inside the actual either. So this can cause the result to be different, but not visibly so.
See if you can add some debug messages that would help you see the extra spaces, or at least count the number of characters in both strings.
This question doesn't specify whether Slim or Fit are being used, or which Slim server/plugin if using Slim, but I found the following to be true for me using FitNesse release 20130530 and fitSharp release 2.2:
Non-ASCII characters and { apostrophes / single quote characters } in input arguments/parameters that are strings are HTML encoded. The values in my FitNesse test tables are HTML encoded, but only the required syntax characters and (double) quotes; not the non-ASCII characters (and FitNesse doesn't seem to have any problems storing those values).
EOL characters in the input arguments that are strings consist of a linefeed character only
I imagine that because I'm using .NET, EOLs in my return values consist of carriage return and linefeed characters.
Because of [1], I'm HTML-encoding non-ASCII characters (but not the HTML syntax characters or quotes). Because of [2] and [3], I'm now removing carriage return characters from my fixture return values. Both changes seem to have resolved this issue for me and expected and actual values are now reported as being the same.
Whitespace has troubled me often. The resulting HTML just collapses whitespace, but the compare in code does not.
I now use a fixture to make differences more explicit to me. Example usage: http://fhoeben.github.io/hsac-fitnesse-fixtures/examples-results/HsacExamples.SlimTests.UtilityFixtures.CompareFixtureTest.html
Newer versions of FitNesse (since 20151230) do a diff on the expected and actual result values. Has that helped you at all?

Yaml gotcha's, are there any issues with storing certain types of characters in Yaml?

I'm very new to yaml, and I just want to know what I can and can't store character wise in yaml?
What are the escape characters for double quotes etc?
Can I span multiple lines?
Basically, you can store everything. Quotes aren't an issue, you can type text out without quotes (and for non-printable characters that you can't incorporate casually, there are the usual escape sequences). That means purely numerical text is considered a number, though - but then again, you can add quotes or an explicit type annotation (and I assume most libraries do that when necessary), e.g. !!str 100. Also, if you want to include the comment sign (#), you have to add quotes.
Another issue is that some strings may look like more complex YAML (e.g. certain uses of exclamation signs look like casts and certain uses of colons look like singleton associative tables). You can avoid these by using "multi-line" strings that just consist of a single line. Multi-line strings exist and come in two flavors, preserving linebreaks (--- |) and ignoring newlines except for blank lines (--- >, much like markdown).

What to do when unescapable character(s) are escaped?

In designing of a (mini)language:
When there are certain characters that should be escaped to lose special meanings (like quotes in some programming languages), what should be done, especially from a security perspective, when characters that are not escapable (e.g. normal characters which never have special meaning) are escaped? Should an error be "error"ed, or should the character be discarded, or should it be in the output the same as if it was not escaped?
Example:
In a simple language where strings are delimited by double-quotes("), and any quotes in a given string are escaped with a back-slash(\): for input "We \said, \"We want Moshiach Now\"" -- what would should be done with the letter s in said which is escaped?
I prefer the lexer to whine when this occurs. A lexer/parser should be tight about syntax; one can always loosen it up later. If you are sloppy, you'll find you can't retract a decision you didn't think you made.
Assume that you initially decide to treat " backslash not-an-escape " as that pair of characters, and the "T" is
not-an-escape today. Sometime later you decide to extend the language, and want "\T" to mean something special, and you change your language.
You'll find an angry mob of programmers storming your design castle,
because for them, "\T" means "\" "T" (or "T" depending on your default decision),
and you just broke their code. You hang your head in shame, retract the decision,
and then realize... oops, there are no more available escape characters!
This lesson goes for any piece of syntax that isn't well defined in your language. If it isn't explicitly legal, it should be implicitly illegal and your compiler should check it. Or you'll never be able to extend your successful language.
If your language isn't going to be successful, you may not care as much.
Well, one way to solve the problem is for the backslash to just mean backslash when it precedes a non-escapable character. That's what Python does:
>>> print "a\tb"
a b
>>> print "a\tb\Rc"
a b\Rc
Obviously, most systems take the escape character to mean "take the next character verbatim", so escaping a "non-escapable" character is usually harmless. The problem later happens when you get to comparisons and such, where the literal text does not represent the actual value (that's where you see a lot of issues securitywise, especially with things like URLs).
So on the one hand, you can only accept a limited number of escaped characters. In that sense, you have an "escape sequence", rather than an escaped character (the \x is the entire sequence rather than a \ followed by an x). That's like the most safe mechanism, and it's not really burdensome to write.
The other option is to ensure that you you "canonicalizing" everything you compare, through some ruleset. This typically means removing all of the escape sequences properly up front, before comparison and comparing only the final values rather than the literals.
Most systems interpret the slash as Will Hartung says, except for alphanumerics which are variously used as aliases for control codes, character classes, word boundaries, the start of hex sequences, case region markers, hex or octal digits, etc. \s in particular often means white-space in perl5 style regexs. JavaScript, which interprets it as 's' in one context and as whitespace in another suffers from subtle bugs because of this choice. Consider /foo\sbar/ vs new RegExp('foo\sbar').

Resources