I am trying to parse telnet subnegotiation, and am unclear if I should support nesting or not, and exactly in what way.
If I have input like IAC SB stuff1 IAC SB stuff2 IAC SE stuff3 IAC SE, should it be parsed nested, or through simple lexing?
For example {0xFF, 0xFA, 1, 0xFF, 0xFA, 2, 0xFF, 0xF0, 3, 0xFF, 0xF0} could be parsed as any of the following:
Using nesting, so the entire thing is considered a single subnegotiation with another one inside: Sn(1??2??3)
Everything inside the subnegotiation is considered a simple literal and transcribed exactly, and invalid commands outside it are ignored: Sn(1??2)3
Any commands that appear where they don't belong, including inside the subnegotiation, are ignored as invalid: Sn(12)3
Or even something where all commands are considered valid but fully independent and are issued when complete, so that the inner subnegotiation is extracted from the enclosing one: Sn(2)Sn(13)
etc?
I can't figure out how this is supposed to work from the RFCs or from anywhere on the internet, nor can I find a definitive formal grammar to help me out.
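To make two of the candidate readings above concrete, here is a minimal sketch of a flat (non-nesting) lexer. It relies only on what RFC 855 states: a subnegotiation is framed as IAC SB ... IAC SE, and a data byte 255 inside it is sent doubled (IAC IAC). Everything else is an assumption for illustration: the function name `lex_telnet`, the `keep_invalid` flag, and the decision to ignore all other Telnet commands (WILL, DO, and so on) are hypothetical and not taken from any RFC.

```python
IAC, SB, SE = 0xFF, 0xFA, 0xF0


def lex_telnet(data: bytes, keep_invalid: bool = False):
    """Flat lexer: returns ("data", bytes) and ("sb", bytes) tokens.

    keep_invalid=False drops an IAC command found where it does not belong
    (reading Sn(12)3); keep_invalid=True copies it literally into the
    subnegotiation payload (reading Sn(1??2)3).
    """
    tokens = []
    buf = bytearray()
    in_sb = False
    i = 0
    while i < len(data):
        byte = data[i]
        if byte != IAC:
            buf.append(byte)
            i += 1
            continue
        if i + 1 >= len(data):
            break  # lone trailing IAC: a streaming parser would wait for more input
        cmd = data[i + 1]
        i += 2
        if cmd == IAC:
            buf.append(IAC)  # doubled IAC is a literal 0xFF data byte
        elif cmd == SB and not in_sb:
            if buf:
                tokens.append(("data", bytes(buf)))
            buf = bytearray()
            in_sb = True
        elif cmd == SE and in_sb:
            tokens.append(("sb", bytes(buf)))
            buf = bytearray()
            in_sb = False
        elif keep_invalid and in_sb:
            buf += bytes([IAC, cmd])
        # any other misplaced command is silently dropped
    if buf and not in_sb:
        tokens.append(("data", bytes(buf)))
    return tokens


sample = bytes([0xFF, 0xFA, 1, 0xFF, 0xFA, 2, 0xFF, 0xF0, 3, 0xFF, 0xF0])
print(lex_telnet(sample))                     # [('sb', b'\x01\x02'), ('data', b'\x03')]
print(lex_telnet(sample, keep_invalid=True))  # [('sb', b'\x01\xff\xfa\x02'), ('data', b'\x03')]
```

This sketch discards an unterminated subnegotiation at the end of input and does not treat the first payload byte as an option code; both are simplifications.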
I need to intercept ANSI sequences received by xterm.js, mostly CSI sequences, in order to modify or customize them. What would be the best method to do so?
Is there an xterm.js API that could help me?
Example of what I need to achieve:
<esc>[2m (SGR faint) replaced by a vivid color like (Cyan)
<esc>[36m
<esc>[2J (erase screen) preceded by a sequence of newlines,
because the xterm.js implementation does not push anything that is erased
into the scrollback buffer
Note: I can't modify the software generating the CSI sequences, and I have to change the behavior of the sequences above to match the exact behavior of an old client I'm replacing.
Since I'm using WebSSH as a framework, I considered doing it earlier, on the Python backend, in on_read right after the SSH channel.recv(). That is challenging because the data arrives split over multiple messages, so an ANSI sequence can be split across two messages. Cleanly detecting incomplete ANSI sequences seems difficult.
Moreover, patching this text buffer on the backend feels dirty to me.
That's why I think it would be much more efficient to do it directly in xterm.js, between its ANSI sequence detection and rendering.
I tried the xterm.js CSI hook registerCsiHandler, but it doesn't seem to let me change the whole sequence, let alone add more data. I don't think it's designed for my needs.
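For the backend variant, the chunk-boundary problem described above can be handled by holding back an unfinished escape sequence until the next message arrives. The following is only a sketch under assumptions: it works on already-decoded text, the class name `CsiRewriter` and the `REPLACEMENTS` table are hypothetical, and the 24 newlines stand in for whatever the real screen height is. It is not part of WebSSH or xterm.js.

```python
import re

# Complete CSI sequence: ESC [ parameter bytes (0x30-0x3F),
# intermediate bytes (0x20-0x2F), one final byte (0x40-0x7E).
CSI = re.compile(r"\x1b\[[0-?]*[ -/]*[@-~]")
# A lone ESC or an unfinished CSI sequence at the very end of the buffer.
PARTIAL_TAIL = re.compile(r"\x1b(?:\[[0-?]*[ -/]*)?\Z")

REPLACEMENTS = {
    "\x1b[2m": "\x1b[36m",              # SGR faint -> cyan
    "\x1b[2J": "\n" * 24 + "\x1b[2J",   # assumed 24-row screen
}


class CsiRewriter:
    """Rewrites selected CSI sequences, even when split across chunks."""

    def __init__(self):
        self._pending = ""

    def feed(self, chunk: str) -> str:
        data = self._pending + chunk
        tail = PARTIAL_TAIL.search(data)
        if tail:
            # Keep the unfinished sequence for the next call.
            data, self._pending = data[:tail.start()], data[tail.start():]
        else:
            self._pending = ""
        return CSI.sub(lambda m: REPLACEMENTS.get(m.group(0), m.group(0)), data)


rewriter = CsiRewriter()
print(repr(rewriter.feed("a\x1b[")))  # 'a'
print(repr(rewriter.feed("2mb")))     # '\x1b[36mb'
```

Only CSI sequences are recognised here; other escape sequences pass through untouched, and a multi-byte UTF-8 character split across two recv() calls would need separate handling before decoding.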
tl;dr: I'm struggling to find documentation or examples of text parsers that require lookahead using nom.
Long version
I'm using nom to parse 6502 assembly. I'm struggling with creating a parser that can parse the various addressing modes. Any given opcode will have the following format:
XXX AM
Where XXX is a three-character mnemonic and AM is the operand. The operand can take many forms and is referred to as the "addressing mode." I've defined an enum for the operands, an enum for the addressing modes, and an OpCode tuple struct containing these values, which is ultimately the result returned when parsing.
The operand can be omitted completely, in which case the addressing mode is Implied, or it can have the literal value A, which is the Accumulator addressing mode.
Many of the addressing modes refer to memory locations, and it's these addressing modes I'm struggling to parse. In particular, if an addressing mode specifies a single byte in the form of $00, it is a ZeroPage addressing mode, whereas an operand specifying two bytes in the form of $0000 is an Absolute addressing mode. To complicate the matter, there are indexed variants of these addressing modes in the form of $00,X, $00,Y, $0000,X, etc.
Are there any good examples of existing text parsers that would illustrate the correct way to parse values that all start similarly ($00...) but are differentiated by how they end? The nom documentation is not very comprehensive, and the best example I've found is the INI parser, which isn't doing anything as complex as what I'm trying to accomplish. I've also looked at the syn source code, but it uses a lot of custom macros and is a pretty complex beast, making it hard to learn from.
One way of doing this is with the alt!() macro.
The idea is to have a parser which tries each alternative in sequence. So if you already have parsers for each of the addressing modes separately, you can combine them into a parser for any of them:
// The sub-parsers all return Operand too.
named!(parse_operand<&str, Operand>,
    alt!(parse_absolute_indexed |
         parse_absolute |
         parse_zeropage_indexed |
         parse_zeropage |
         parse_implied));
Some notes:
The order may be important; I've put parse_absolute after parse_absolute_indexed since the former would match the initial part of the operand and return too early.
A variant would be to include the end of line (including comments if applicable) matching into each sub parser. Then it couldn't match early.
If you're parsing to the end of the input without a byte/character which terminates the pattern (such as a newline) then you may need to use alt_complete!() instead of alt!(). The reason for this is that if you try matching ADD $00, the parser which might match ADD $0000 has to assume that it might still match if more input arrives, and alt!() won't then skip to the next case. Using alt_complete!(), or alternatively wrapping the inner matchers in complete!(), is saying that an incomplete match is a non-match.
If the parsers were very complicated it might mean doing extra work (trying each parse in sequence) compared to a parser generated by eg the venerable yacc, but I don't think it's an issue in this case.
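The ordering note above is independent of nom, so here is a language-neutral sketch of it in Python rather than Rust. It is not nom's API: `parse_operand`, `ORDERED`, and `WRONG_ORDER` are hypothetical names, and regular expressions stand in for the sub-parsers. Each alternative is tried in turn and the first one that matches a prefix of the input wins, which is why the longer forms have to come first.

```python
import re

# (addressing mode, pattern) pairs, longest form first.
ORDERED = [
    ("AbsoluteIndexed", re.compile(r"\$([0-9A-Fa-f]{4}),([XY])")),
    ("Absolute",        re.compile(r"\$([0-9A-Fa-f]{4})")),
    ("ZeroPageIndexed", re.compile(r"\$([0-9A-Fa-f]{2}),([XY])")),
    ("ZeroPage",        re.compile(r"\$([0-9A-Fa-f]{2})")),
    ("Implied",         re.compile(r"")),  # always matches, consumes nothing
]

# Deliberately bad ordering: the short form is tried before the long one.
WRONG_ORDER = [ORDERED[3], ORDERED[0]]


def parse_operand(text, alternatives=ORDERED):
    """Return (mode, captured groups, remaining input) for the first match."""
    for mode, pattern in alternatives:
        match = pattern.match(text)  # prefix match, like a parser combinator
        if match:
            return mode, match.groups(), text[match.end():]
    return None


print(parse_operand("$1234,X"))               # ('AbsoluteIndexed', ('1234', 'X'), '')
print(parse_operand("$12"))                   # ('ZeroPage', ('12',), '')
print(parse_operand("$1234,X", WRONG_ORDER))  # ('ZeroPage', ('12',), '34,X')
```

The last call shows the early-return problem: the zero-page alternative accepts the first two hex digits and leaves "34,X" unparsed.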
I have a string that, after running each character through string.format("%02X", char), gives me the following:
74657874000000EDD37001000300
In the end, I'd like that string to look like the following:
t e x t NUL NUL NUL í Ó p SOH NUL ETX NUL (spaces are there just for clarification of characters desired in example).
I've tried to use \x..(hex#), string.char(0x..(hex#)) (where (hex#) is alphanumeric representation of my desired character) and I am still having issues with getting the result I'm looking for. After reading another thread about this topic: what is the way to represent a unichar in lua and the links provided in the answers, I am not fully understanding what I need to do in my final code that is acceptable for this to work.
I'm looking for some help in better understanding an approach that would get me to the desired result shown above.
ETA:
Well I thought that I had fixed it with the following code:
function hexToAscii(input)
    local convString = ""
    for char in input:gmatch("(..)") do
        convString = convString..(string.char("0x"..char))
    end
    return convString
end
It appeared to work, but I didn't think about characters above 127. Rookie mistake. Now I'm unsure how to get the additional characters, up to 255, to display properly.
I did the following to check since I couldn't truly "see" them in the file.
function asciiSub(input)
    input = input:gsub(string.char(0x00), "<NUL>") -- suggested by a coworker
    print(input)
end
I added a few more gsub calls to substitute other characters, and my file comes back with the replacement strings. But once I ran into characters in the extended ASCII range, it all fell apart.
Can anyone assist me in understanding a fix or new approach to this problem? As I've stated before, I read other topics on this and am still confused as to the best approach towards this issue.
The simple way to decode a base16-encoded string is just this:
function unhex( input )
    return (input:gsub( "..", function(c)
        return string.char( tonumber( c, 16 ) )
    end))
end
This is basically what you have, just a bit cleaner. (There's no need to say "(..)", ".." is enough – if you specify no captures, you'll automatically get the whole match. And while it might work if you write string.char( "0x"..c ), it's just evil – you concatenate lots of strings and then trigger the automatic conversion to numbers. Much better to just specify the base when explicitly converting.)
The resulting string should be exactly what went into the hex-dumper, no matter the encoding.
If you cannot correctly display the result, your viewer will also be unable to display the original input. If you used different viewers for the original input and the resulting output (e.g. a text editor and a terminal), try writing the output to a file instead and looking at it with the same viewer you used for the original input, then the two should be exactly the same.
Getting viewers that assume different encodings (e.g. one of the "old" 8-bit code pages or one of the many versions of Unicode) to display the same thing will require conversion between different formats, which tends to be quite complicated or even impossible. As you did not mention what encodings are involved (nor any other information like OS or programs used that might hint at the likely encodings), this could be just about anything, so it's impossible to say anything more specific on that.
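The two claims above, that decoding the hex gives back exactly the original bytes and that what you see depends on the encoding the viewer assumes, can be checked with a short Python sketch (Python rather than Lua, only for illustration). The choice of Latin-1 and CP437 as the two example encodings is an assumption; the question does not say which encoding is involved.

```python
hex_dump = "74657874000000EDD37001000300"

# Same idea as unhex above: two hex digits -> one byte.
raw = bytes(int(hex_dump[i:i + 2], 16) for i in range(0, len(hex_dump), 2))

# The bytes round-trip exactly, independent of any character encoding.
assert raw == bytes.fromhex(hex_dump)
assert raw.hex().upper() == hex_dump

# The *display* depends on the encoding the viewer assumes.
as_latin1 = raw.decode("latin-1")
as_cp437 = raw.decode("cp437")
print(as_latin1[7:9])         # íÓ
print(as_latin1 == as_cp437)  # False
```

Bytes 0xED and 0xD3 come out as í and Ó only under an encoding such as Latin-1; a viewer assuming another code page shows different glyphs for the same bytes.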
You actually have a couple of problems:
First, make sure you know the meaning of the term character encoding, and that you know the difference between characters and bytes. A popular post on the topic is The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Then, what encoding was used for the bytes you just received? You need to know this, otherwise you don't know what byte 234 means. For example it could be ISO-8859-1, in which case it is U+00EA, the character ê.
The characters 0 to 31 are control characters (eg. 0 is NUL). Use a lookup table for these.
Then, displaying the characters on the terminal is the hard part. There is no platform-independent way to display ê on the terminal. It may well be impossible with the standard print function. If you can't figure this step out you can search for a question dealing specifically with how to print Unicode text from Lua.
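As a sketch of the lookup-table idea, here is a Python version (not Lua) that names the control characters and decodes everything else under an assumed encoding. ISO-8859-1 is used only because it was given as an example above; `CONTROL_NAMES` and `describe` are hypothetical names.

```python
# Names for the C0 control characters, indexed by byte value 0..31.
CONTROL_NAMES = [
    "NUL", "SOH", "STX", "ETX", "EOT", "ENQ", "ACK", "BEL",
    "BS", "HT", "LF", "VT", "FF", "CR", "SO", "SI",
    "DLE", "DC1", "DC2", "DC3", "DC4", "NAK", "SYN", "ETB",
    "CAN", "EM", "SUB", "ESC", "FS", "GS", "RS", "US",
]


def describe(raw: bytes, encoding: str = "latin-1") -> str:
    """Render bytes as space-separated characters or control-code names."""
    parts = []
    for byte in raw:
        if byte < 32:
            parts.append(CONTROL_NAMES[byte])
        elif byte == 127:
            parts.append("DEL")
        else:
            # Assumes a single-byte encoding, so each byte decodes on its own.
            parts.append(bytes([byte]).decode(encoding))
    return " ".join(parts)


raw = bytes.fromhex("74657874000000EDD37001000300")
print(describe(raw))  # t e x t NUL NUL NUL í Ó p SOH NUL ETX NUL
```

Whether the terminal can actually show í and Ó is the separate, platform-dependent step mentioned above.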
I can't see what encoding Lua uses for its strings.
I'm using
string.byte (s [, i [, j]])
which has the doc
Returns the internal numerical codes of the characters s[i], s[i+1],
···, s[j]. The default value for i is 1; the default value for j is i.
Note that numerical codes are not necessarily portable across
platforms.
Reading around, people suggest it uses ASCII, which is fine for me, but I don't get the part about codes changing across platforms. I thought the whole point of using a single encoding (like ASCII) is that this wouldn't happen. Or is it just saying this because ASCII does not define anything above 127, and therefore different countries, OEMs, OSs, etc. may be using custom ASCII extensions from decades ago for the upper range?
It's important for me to know that [a-zA-Z] will have the same char values on all platforms I'm running on.
The Lua doc could be a bit more specific here!
Any light anyone can shed on this would be great, thanks.
I'm fairly sure you can safely assume an ASCII-derived encoding. So the minuscule set of characters you're interested in stays the same.
The note about the code changing between platforms likely means that Lua doesn't know anything about the character encoding at all and thus just uses whatever bytes the OS hands out. On Linux this is likely UTF-8, which means you'd have to deal with individual code units when stepping outside ASCII. On Windows I could imagine it being the system's legacy codepage, which means sort-of Latin 1 (CP 1252) in much of the Western world.
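A quick way to convince yourself of this, sketched in Python rather than Lua: the letters a-z and A-Z encode to the same byte values in ASCII, Latin-1, CP 1252, and UTF-8, while a character outside ASCII does not. The list of encodings is just the ones mentioned above.

```python
import string

letters = string.ascii_letters  # a-z followed by A-Z
reference = letters.encode("ascii")

# Every ASCII-derived encoding agrees on the byte values of a-zA-Z.
for encoding in ("latin-1", "cp1252", "utf-8"):
    assert letters.encode(encoding) == reference

# Outside ASCII the byte values differ between encodings.
print("ê".encode("latin-1"))  # b'\xea'
print("ê".encode("utf-8"))    # b'\xc3\xaa'
```

In UTF-8 the single character ê becomes two bytes, which is the "individual code units" situation described above.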
I need to do some metaprogramming on a large Mathematica code base (hundreds of thousands of lines of code) and don't want to have to write a full-blown parser so I was wondering how best to get the code from a Mathematica notebook out in an easily-parsed syntax.
Is it possible to export a Mathematica notebook in FullForm syntax, or to save all definitions in FullForm syntax?
The documentation for Save says that it can only export in the InputForm syntax, which is non-trivial to parse.
The best solution I have so far is to evaluate the notebook and then use DownValues to extract the rewrite rules with arguments (but this misses symbol definitions) as follows:
DVs[_] := {}
DVs[s_Symbol] := DownValues[s]
stream = OpenWrite["FullForm.m"];
WriteString[stream,
  DVs[Symbol[#]] & /@ Names["Global`*"] // Flatten // FullForm];
Close[stream];
I've tried a variety of approaches so far but none are working well. Metaprogramming in Mathematica seems to be extremely difficult because it keeps evaluating things that I want to keep unevaluated. For example, I wanted to get the string name of the infinity symbol using SymbolName[Infinity] but the Infinity gets evaluated into a non-symbol and the call to SymbolName dies with an error. Hence my desire to do the metaprogramming in a more suitable language.
EDIT
The best solution seems to be to save the notebooks as package (.m) files by hand and then translate them using the following code:
stream = OpenWrite["EverythingFullForm.m"];
WriteString[stream, Import["Everything.m", "HeldExpressions"] // FullForm];
Close[stream];
You can certainly do this. Here is one way:
exportCode[fname_String] :=
  Function[code,
    Export[fname, ToString@HoldForm@FullForm@code, "String"],
    HoldAllComplete]
For example:
fn = exportCode["C:\\Temp\\mmacode.m"];
fn[
Clear[getWordsIndices];
getWordsIndices[sym_, words : {__String}] :=
Developer`ToPackedArray[words /. sym["Direct"]];
];
And importing this as a string:
In[623]:= Import["C:\\Temp\\mmacode.m","String"]//InputForm
Out[623]//InputForm=
"CompoundExpression[Clear[getWordsIndices], SetDelayed[getWordsIndices[Pattern[sym, Blank[]], \
Pattern[words, List[BlankSequence[String]]]], Developer`ToPackedArray[ReplaceAll[words, \
sym[\"Direct\"]]]], Null]"
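To show why this output is easy to consume from another language, here is a minimal Python sketch of a reader for such FullForm text. It is an illustration under assumptions, not a complete Mathematica parser: it only handles symbols (including context backticks), string literals, plain decimal numbers, and Head[arg, ...] nesting, and the names `tokenize`, `parse_expr`, and `parse_fullform` are hypothetical.

```python
import re

# Token kinds: string literal, plain number, symbol, punctuation.
TOKEN = re.compile(r'''\s*(?:
      (?P<string>"(?:\\.|[^"\\])*")
    | (?P<number>-?\d+(?:\.\d+)?)
    | (?P<symbol>[A-Za-z$][A-Za-z0-9$`]*)
    | (?P<punct>[\[\],])
)''', re.VERBOSE)


def tokenize(text):
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected input at offset {pos}")
        tokens.append((match.lastgroup, match.group(match.lastgroup)))
        pos = match.end()
    return tokens


def parse_expr(tokens, i=0):
    """Parse one expression at tokens[i]; return (tree, next_index)."""
    kind, value = tokens[i]
    if kind == "punct":
        raise ValueError(f"unexpected {value!r}")
    node, i = value, i + 1
    # A head may be applied more than once, as in f[x][y].
    while i < len(tokens) and tokens[i] == ("punct", "["):
        i += 1
        args = []
        if tokens[i] != ("punct", "]"):
            while True:
                arg, i = parse_expr(tokens, i)
                args.append(arg)
                if tokens[i] != ("punct", ","):
                    break
                i += 1
        if tokens[i] != ("punct", "]"):
            raise ValueError("expected ']'")
        node, i = (node, args), i + 1
    return node, i


def parse_fullform(text):
    """Return atoms as strings and compound expressions as (head, [args])."""
    tokens = tokenize(text)
    tree, end = parse_expr(tokens)
    if end != len(tokens):
        raise ValueError("trailing input")
    return tree


print(parse_fullform('f[x, g[], "a,b"]'))  # ('f', ['x', ('g', []), '"a,b"'])
```

String literals keep their quotes so they stay distinguishable from symbols; truncated input raises IndexError rather than a tidy error, which a real tool would want to improve.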
However, going to another language to do metaprogramming for Mathematica sounds ridiculous to me, given that Mathematica is very well suited for it. There are many techniques available in Mathematica to do meta-programming and avoid premature evaluation. One that comes to mind I described in this answer, but there are many others. Since you can operate on parsed code and use Mathematica's pattern matching, you save a lot of work. You can browse the SO Mathematica tags (past questions) and find lots of examples of meta-programming and evaluation control.
EDIT
To ease your pain with auto-evaluating symbols (there are only a few, actually, Infinity being one of them): if you just need to get the symbol name for a given symbol, then this function will help:
unevaluatedSymbolName = Function[sym, SymbolName@Unevaluated@sym, HoldAllComplete]
You use it as
In[638]:= unevaluatedSymbolName[Infinity]//InputForm
Out[638]//InputForm="Infinity"
Alternatively, you can simply add HoldFirst attribute to SymbolName function via SetAttributes. One way is to do that globally:
SetAttributes[SymbolName,HoldFirst];
SymbolName[Infinity]//InputForm
Modifying built-in functions globally is dangerous, however, since it may have unpredictable effects in a system as large as Mathematica, so undo the change afterwards:
ClearAttributes[SymbolName, HoldFirst];
Here is a macro to use that locally:
ClearAll[withUnevaluatedSymbolName];
SetAttributes[withUnevaluatedSymbolName, HoldFirst];
withUnevaluatedSymbolName[code_] :=
Internal`InheritedBlock[{SymbolName},
SetAttributes[SymbolName, HoldFirst];
code]
Now,
In[649]:=
withUnevaluatedSymbolName[
{#,StringLength[#]}&[SymbolName[Infinity]]]//InputForm
Out[649]//InputForm= {"Infinity", 8}
You may also wish to do some replacements in a piece of code, say, replace a given symbol by its name. Here is some example code (which I wrap in Hold to prevent it from evaluating):
c = Hold[Integrate[Exp[-x^2], {x, -Infinity, Infinity}]]
The general way to do replacements in such cases is using Hold-attributes (see this answer) and replacements inside held expressions (see this question). For the case at hand:
In[652]:=
withUnevaluatedSymbolName[
c/.HoldPattern[Infinity]:>RuleCondition[SymbolName[Infinity],True]
]//InputForm
Out[652]//InputForm=
Hold[Integrate[Exp[-x^2], {x, -"Infinity", "Infinity"}]]
This is not the only way to do it, though. Instead of using the above macro, we can also encode the modification to SymbolName into the rule itself (here I am using a more wordy form of in-place evaluation, the Trott-Strzebonski trick, but you can use RuleCondition as well):
ClearAll[replaceSymbolUnevaluatedRule];
SetAttributes[replaceSymbolUnevaluatedRule, HoldFirst];
replaceSymbolUnevaluatedRule[sym_Symbol] :=
  HoldPattern[sym] :> With[{eval = SymbolName@Unevaluated@sym}, eval /; True];
Now, for example:
In[629]:=
Hold[Integrate[Exp[-x^2],{x,-Infinity,Infinity}]]/.
replaceSymbolUnevaluatedRule[Infinity]//InputForm
Out[629]//InputForm=
Hold[Integrate[Exp[-x^2], {x, -"Infinity", "Infinity"}]]
Actually, this entire answer is a good demonstration of various meta-programming techniques. From my own experience, I can direct you to this, this, this, this and this answer of mine, where meta-programming was essential to solving the problem I was addressing. You can also judge by the fraction of functions in Mathematica that carry Hold-attributes: it is about 10-15 percent, if memory serves me well. All those functions are effectively macros, operating on code. To me, this is a very telling fact: Mathematica builds heavily on its meta-programming facilities.
The full forms of expressions can be extracted from the Code and Input cells of a notebook as follows:
$exprs =
Cases[
Import["mynotebook.nb", "Notebook"]
, Cell[content_, "Code"|"Input", ___] :>
ToExpression[content, StandardForm, HoldComplete]
, Infinity
] //
Flatten[HoldComplete @@ #, 1, HoldComplete] & //
FullForm
$exprs is assigned the expressions read, wrapped in HoldComplete to prevent evaluation. $exprs could then be saved into a text file:
Export["myfile.txt", ToString[$exprs]]
Package files (.m) are slightly easier to read in this way:
Import["mypackage.m", "HeldExpressions"] //
Flatten[HoldComplete @@ #, 1, HoldComplete] &