How to change the F# Interactive newline character

In a .fs file a newline is denoted by \r\n, but in the F# Interactive window it is \n.
In a problem I'm currently trying to solve, the length of a multi-line literal string matters. So a problem arises when I test code in the F# Interactive window: the length of the string differs from what it is in normal execution.
I hope there is an option to change the newline 'character' in F# Interactive to \r\n, but I can't find one. Does anyone know how I can achieve this, or of some other workaround?

You can use conditional compilation to handle this:
#if INTERACTIVE
text.Replace("\n", System.Environment.NewLine)
#endif
I don't know of a way to change it in fsi. Another option would be to remove, or normalize, the newlines regardless of the execution environment. If the exact length is that important, it might be good to do anyway.
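For example, a minimal sketch of the normalization idea (the function name here is just illustrative):
// Normalize newlines so the string length is the same in fsi and in compiled code.
let normalizeNewlines (s: string) = s.Replace("\r\n", "\n")

let text = "first line
second line"

printfn "%d" (normalizeNewlines text).Length  // same value in both environments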
EDIT
If the newlines are only there for readability, you can end each line with a backslash. The backslash, newline, and leading whitespace on the following line are removed at compile time.
let text = "a\
b"
printfn "%s" text //"ab"
This works the same in VS and FSI. I'm assuming you're sending bits of code to FSI via Alt+Enter or Alt+'.

Related

Antlr differentiating a newline from a \n

Let's say I have the following statement:
SELECT "hi\n
there";
Notice there is a literal newline in there, and the escape \n. The string that antlr4 picks up for me is:
String_Literal: "hi\n\nthere"
In other words, not differentiating between the literal newline and the \n one. Is there a way to differentiate the two, or what's the usual process to do that?
My guess is that the output you pasted into your question comes from a call to the Antlr4 runtime method tree.toStringTree(parser) (or equivalent in whatever target language you've chosen).
That function calls escapeWhitespace in the utilities class/module/file, and that function does what its name suggests: it converts (some) whitespace characters to C-like backslash escape sequences. (Specifically, it handles newline, carriage return, and tab characters.) It does not escape backslash characters, which makes its output ambiguous; there's no way to distinguish between the two-character escape sequence \n and the escaped conversion of a newline character in the message.
They are different in the actual character string, because the Antlr4 lexer does not transform the string value of the matched token in any way. That's your responsibility.
In computing, it is very often the case that what you see is not what you got. What you see is just what you see, and a lot of computational power has gone into creating that vision for you. By the same token, nothing guarantees that the vision is an unambiguous, or even useful, representation of the actual values. The best you can say for it is that it's probably more useful than trying to read the data as individual bits. (And, indeed, the individual bits are not physical objects either; despite the common refrain, you could completely disassemble a computer and examine it with an arbitrarily powerful microscope, and you will not see a single 1 or 0.)
That might seem like irrelevant philosophizing, but it has a real consequence: when you're debugging and you see something that makes you think "that looks wrong", you need to consider two possibilities: maybe the underlying data is incorrect, but maybe it's the process which rendered the representation that is at fault. In this case, I'd say that the failure of escapeWhitespace to convert backslash characters into pairs of backslashes is a bug, but that's a value judgement on my part. Anyway, the function is not critical to the operation of Antlr4, and you could easily replace it.
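A replacement could simply escape backslashes as well, which keeps the two cases distinct. A rough sketch (illustrative only, not the ANTLR runtime's actual method):
// Escape whitespace *and* backslashes, so the two-character sequence \n and a
// real newline character render differently.
static String escapeWhitespaceAndBackslashes(String s) {
    StringBuilder out = new StringBuilder();
    for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);
        switch (c) {
            case '\\': out.append("\\\\"); break;
            case '\n': out.append("\\n");  break;
            case '\r': out.append("\\r");  break;
            case '\t': out.append("\\t");  break;
            default:   out.append(c);
        }
    }
    return out.toString();
}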

antlr4 assistance

I'm new to antlr4, and am trying to write code that looks through a .txt file for keywords (set to "PARTY" for testing) and then stores everything after them, stopping at a newline (and excluding the '|' symbol).
I'm running the code in IntelliJ with the antlr4 plugin, and for some reason it reads the first line, builds a parse tree for it, and then stops.
According to your grammar, each line should start with one or more occurrences of the keyword PARTY, but your first and second line don't start with that. That's why it's complaining about the "missing P".
Another problem is that since you're hiding the NWL tokens, you can't use them in the grammar. If you want newlines to be significant in your grammar, you should not hide them. In other words you should remove the {channel=HIDDEN;} bit.

Is parsing a language with an ending delimiter (e.g. ';') more efficient than having none?

I am wondering what effect an ending delimiter has on performance (in a scripting language) and on how easy the language is to parse.
Is it easier to parse a language that has one?
If it is and the language is a scripting language, does it make the language run commands faster?
Any difference is going to be trivial.
This is because languages that don't use such end delimiters do in fact have them, in the form of newline characters. They then use special characters to mark that a statement continues on to the next line.
A few languages do neither, but there the end of the statement is implicit in the definition - e.g. the closing ')' at the end of a parameter list. If there's a comma (or nothing), then more parameters or a closing ')' will be expected on the next line.
In the big scheme of things, these small differences have a negligible effect on processing time. For script languages, you're going to have much bigger differences according to whether (or how much) a particular statement construct can be internally optimized.
Performance-wise, it doesn't matter.
If you think about it, there's always a delimiter: if it's not ';', then it's the end-of-line character (\n).
So you're still parsing the script the same way; the parser doesn't care whether the delimiter is a visible character or an invisible one.
The only thing a visible delimiter is really good for is letting the programmer split a statement across several lines, which can be useful; multi-line statements make for more coherent code in some cases. (The end of line is then simply parsed as whitespace.)
All languages I'm aware of that don't require an explicitly typed-out statement delimiter or terminator use line breaks instead - in a few cases, with the exception that line breaks inside parens, braces, brackets, etc. are ignored (implicit line continuation). A small difference, and its effect on parsing speed will depend heavily on the parser (or parser generator). It may be slightly harder to write a parser if you allow implicit line continuations (because you have to distinguish between newlines and statement separators - but this can be sorted out relatively early in the parsing process unless you invent incredibly complex rules for it).
But: even if there were a huge difference, it wouldn't affect runtime performance. Parsing takes place at most once at startup (today's "scripting" languages mostly compile to bytecode and mostly cache that bytecode between runs) and is then dwarfed by the time the program spends doing its actual work - except for programs where it doesn't matter anyway because they're short and/or I/O-bound, or for especially naive implementations that refuse to build any IR and instead interpret from source as they go, using a poorly written parser. Don't make language-design tradeoffs for parsing speed; or if you do, do it properly, as e.g. C does.
A more interesting issue is explicit block delimiters vs. the offside rule. The tokenizer can already solve this, but most parser generators/libraries don't (easily) give you such a parser. A few do, and in that case the difference is negligible, but it's not as widely supported as "free-form" languages.

What is the proper Lua pattern for quoted text?

I've been playing with this for an hour or two and have found myself at a roadblock with the Lua pattern-matching utilities. I am attempting to match all quoted text in a string and replace it if needed.
The pattern I have come up with so far is: (\?[\"\'])(.-)%1
This works in some cases but, not all cases:
Working: "This \"is a\" string of \"text to\" test with"
Not Working: "T\\\"his \"is\' a\" string\" of\' text\" to \"test\" wit\\\"h"
In the not-working example I would like it to match the following (I made a function that gets the matches I desire; I'm just looking for a pattern to use with gsub, and am curious whether a Lua pattern can do this):
string
a" string" of
is' a" string" of' text
test
his "is' a" string" of' text" to "test" wit
I'm going to continue to use my function for the time being, but am curious whether there is a pattern I could/should be using and I'm just missing something about patterns.
(A few edits because I forgot about Stack Overflow's formatting.)
(Another edit to add a non-HTML example, since the original was leading to assumptions that I was attempting to parse HTML.)
Trying to match escaped, quoted text using regular expressions is like trying to remove the daisies (and only the daisies) from a field using a lawnmower.
I made a function that gets the matches I desire
This is the correct move.
I'm curious if a lua pattern can do this
From a practical point of view, even if a pattern can do this, you don't want to. From a theoretical point of view, you are trying to find a double quote that is preceded by an even number of backslashes. This is definitely a regular language, and the regular expression you want would be something like the following (Lua quoting conventions)
[[[^\](\\)*"(.-[^\](\\)*)"]]
And the quoted string would be result #2. But Lua patterns are not full regular expressions; in particular, you cannot put a * after a parenthesized pattern.
So my guess is that this problem cannot be solved using Lua patterns, but since Lua patterns are not a standard object of study in automata theory, I'm not aware of any body of proof techniques you could use to prove that.
The issue with escaped quotes is that, in general, if there's an odd number of backslashes before the quote, then it's escaped, and if there's an even number, it's not. I do not believe that Lua pattern-matching is powerful enough to represent this condition, so if you need to parse text like this, then you should seek another way. Perhaps you can iterate through the string and parse it, or you could find each quote in turn and read backwards, counting the backslashes until you find a non-backslash character (or the beginning of the string).
If you absolutely must use patterns for some reason, you could try doing this in a multi-step process. First, gsub for all occurrences of two backslashes in a row, and replace them with some sentinel value. This must be a value that does not already occur in the string. You could try something like "\001" if you know this string doesn't contain non-printable characters. Anyway, once you've replaced all sequences of two backslashes in a row, any backslashes left are escaping the following character. Now you can apply your original pattern, and then finally you can replace all instances of your sentinel value with two backslashes again.
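A minimal Lua sketch of those three steps, using the original pattern from the question and "\1" as the sentinel (the function name is made up here, and the input is assumed never to contain that byte):
-- 1. hide doubled backslashes, 2. apply the original pattern, 3. restore them
local function replaceQuoted(text, repl)
  local hidden = text:gsub("\\\\", "\1")
  local out = hidden:gsub("(\\?[\"'])(.-)%1", repl)
  return (out:gsub("\1", "\\\\"))
end

print(replaceQuoted([[This \"is a\" string of \"text to\" test with]], "%2"))
-- This is a string of text to test with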
Lua's pattern language is adequate for many simple cases. And it has at least one trick you don't find in a typical regular expression package: a way to match balanced parenthesis. But it has its limits as well.
When those limits are exceeded, I reach for LPeg. LPeg is an implementation of a Parsing Expression Grammar for Lua, and was implemented by one of Lua's original authors, so the adaptation to Lua is done quite well. A PEG allows anything from simple patterns to complete language grammars to be specified. LPeg compiles the grammar to a bytecode and executes it extremely efficiently.
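For instance, a hedged LPeg sketch of the quoted-text case (my illustration, not code from the answer; it treats a backslash as escaping the character after it, and requires the lpeg library):
local lpeg = require("lpeg")
local P, C, Ct = lpeg.P, lpeg.C, lpeg.Ct

local function quoted(q)
  q = P(q)
  local escaped = P("\\") * P(1)          -- backslash escapes the next character
  return q * C((escaped + (1 - q))^0) * q
end

local anyQuoted = quoted('"') + quoted("'")
local scan = Ct((anyQuoted + P(1))^0)     -- collect every quoted body in the subject

for _, body in ipairs(scan:match([[say "hello \"world\"" and 'bye']])) do
  print(body)  -- prints: hello \"world\"   then: bye
end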
You should NOT be trying to parse HTML with regular expressions. HTML and XML are not regular languages and cannot be reliably manipulated with regular expressions; you should use a dedicated HTML parser. There are plenty of explanations of why elsewhere.

TeX command which affects the next complete word

Is it possible to have a TeX command which will take the whole next word (or the next letters up to but not including the next punctuation symbol) as an argument and not only the next letter or {} group?
I’d like to have a \caps command on certain acronyms but don’t want to type curly brackets over and over.
First of all, create your command, for example:
\def\capsimpl#1{{\sc #1}}% Your main macro
The solution to catch a space or punctuation:
\catcode`\#=11
\def\addtopunct#1{\expandafter\let\csname punct#\meaning#1\endcsname\let}
\addtopunct{ }
\addtopunct{.} \addtopunct{,} \addtopunct{?}
\addtopunct{!} \addtopunct{;} \addtopunct{:}
\newtoks\capsarg
\def\caps{\capsarg{}\futurelet\punctlet\capsx}
\def\capsx{\expandafter\ifx\csname punct#\meaning\punctlet\endcsname\let
\expandafter\capsend
\else \expandafter\continuecaps\fi}
\def\capsend{\expandafter\capsimpl\expandafter{\the\capsarg}}
\def\continuecaps#1{\capsarg=\expandafter{\the\capsarg#1}\futurelet\punctlet\capsx}
\catcode`\#=12
@Debilski - I wrote something similar to your active * code for the acronyms in my thesis. I activated < and then used \def<#1> to print the acronym, as well as the expansion if it's the first time it's encountered. I also went a bit off the deep end by allowing the expansions to be defined in-line and using the .aux files to send the expansions "back in time" if they're used before they're declared, or to report errors if an acronym is never declared.
Overall, it seemed like it would be a good idea at the time - I rarely needed < to be catcode 12 in my actual text (since all my macros were in a separate .sty file), and I made it behave in math mode, so I couldn't foresee any difficulties. But boy was it brittle... I don't know how many times I accidentally broke my build by changing something seemingly unrelated. So all that to say, be very careful activating characters that are even remotely commonly-used.
On the other hand, with XeTeX and higher Unicode characters, it's probably a lot safer, and there are generally easy ways to type these extra characters, such as making a multi (or compose) key (I usually map either numlock or one of the Windows keys to this, so that e.g. multi-!-! produces ¡). Or if you're running in emacs, you can use C-\ to switch into TeX input mode briefly to insert Unicode by typing the TeX command for it (though this is a pain for actually typing TeX documents, since it intercepts your actual \'s, and please please don't try defining your own escape character!).
Regarding whitespace after commands: see the xspace package, and the TeX FAQ item "Commands gobble following space".
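For example (a minimal illustration, not from the original answer):
\usepackage{xspace}
\newcommand{\nasa}{\textsc{nasa}\xspace}
% "\nasa launched" now prints as "nasa launched"; without \xspace the space would be gobbled.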
Now, why this is very difficult: as you noted yourself, things like that can only be done by changing catcodes, it seems. Catcodes are assigned to characters when TeX reads them, and TeX reads one line at a time, so you cannot do anything with other spaces on the same line, IMHO. There might be a way around this, but I do not see it.
Dangerous code below!
This code will do what you want only at the end of the line, so if what you want is more "fluent" typing without brackets, but you are willing to hit 'return' after each acronym (and not run any auto-indent later), you can use this:
\def\caps{\begingroup\catcode`^^20 =11\mcaps}
\def\mcaps#1{\def\next##1 {\sc #1##1\catcode`^^20 =10\endgroup\ }\next}
One solution might be setting another character as active and using this one for escaping. This does not remove the need for a closing character but avoids typing the \caps macro, thus making it overall easier to type.
Therefore under very special circumstances, the following works.
\catcode`\*=\active
\def*#1*{\textsc{\MakeTextLowercase{#1}}}
Now follows an *Acronym*.
Unfortunately, this makes use of \section*{} impossible without additional macro definitions.
In Xetex, it seems to be possible to exploit unicode characters for this, so one could define
\catcode`\•=\active
\def•#1•{\textsc{\MakeTextLowercase{#1}}}
Now follows an •Acronym•.
This should reduce the effects on other commands, but of course the character ‘•’ needs to be mapped to the keyboard somewhere to be of use.