I'm using pipes to communicate between two Prolog processes, and every time the program reached a read/2 call to read a message from the pipe, it blocked and stayed that way. I couldn't understand why this happened (I tried with extremely simple programs), and in the end I realized three things:
Every time I use write/2 to send a message, the sender process must end that message with .\n. If the message does not end like this, the receiver process will get stuck at the read/2 predicate.
If the sender does not flush the output, the message is not placed in the pipe buffer. It may seem obvious, but it wasn't for me at the beginning.
Although read/2 blocks when the message is not flushed, wait_for_input/3 does not block at all, so there is no need for flush_output/1 in that case.
Examples:
This does not work:
example1 :-
    pipe(R,W),
    write(W,hello),
    read(R,S). % The program is blocked here.
That one won't work either:
example2 :-
    pipe(R,W),
    write(W,'hello.\n'),
    read(R,S). % The program is blocked here.
While these two do work:
example3 :-
    pipe(R,W),
    write(W,'hello.\n'),
    flush_output(W),
    read(R,S).
example4 :-
    pipe(R,W),
    write(W,'hello.\n'),
    wait_for_input([W],L,infinite).
Now my question is: why? Is there a reason why Prolog only "accepts" full lines ended with a period when reading from a pipe (actually, when reading from any stream)? And why does read/2 block while wait_for_input/3 doesn't (when the message is not flushed)?
Thanks!
A valid Prolog read-term always ends with a period, called an end char (* 6.4.8 *). And in 6.4.8 Other tokens, the standard reads:
An end char shall be followed by a layout character or a %.
So this is what the standard demands.
A newline after the period is one possibility to end a read-term, besides space, tab and other layout characters as well as %. However, due to the prevalence of ttys and related buffering, it seems a good convention to just stick with a newline.
The reason why the end char is needed is that Prolog syntax permits infix and postfix operators. Consider as input
f(1) + g(2).
when reading f(1) you might believe that this is already the entire term, but you still must await the period to be sure that there is no infix or postfix operator thereafter.
Also note that you must use writeq/1 or write_canonical/1 to produce output that can be read back. You cannot use write/1.
As an example, consider write([(.)+ .]). First, this is valid syntax: the dots are immediately followed by some other character. Note that the . at the end is commonly called a period, whereas within Prolog text it is called a dot.
write/1 will write this as [. + .]. Note that the first . is now followed by a space, so when this text is read back, only [. will be read.
There are many other ugly examples such as this one; usually they do not hit you. But once you are hit, you are hit...
Related
I'm trying to remove all normal and multi-line comments from a string, but it doesn't remove the entire multi-line comment. I tried
str:gsub("%-%-[^\n\r]+", "")
on this code
print(1)
--a
print(2) --b
--[[
print(4)
]]
output:
print(1)
print(2)
print(4)
]]
expected output:
print(1)
print(2)
The pattern you have provided to gsub, %-%-[^\n\r]+, will only remove "short" comments ("line" comments). It doesn't even attempt to deal with "long" comments and thus just treats their first line as a line comment, removing it.
Thus Piglet is right: you must remove the line comments after removing the long comments, not the other way around, so as not to lose the start of the long comments.
The pattern suggested by Piglet however necessarily fails for some (carefully crafted) long comments or even line comments. Consider
--[this is a line comment]print"Hello World!"
Piglet's pattern would strip the balanced brackets, treating the comment as if it were a long comment and uncommenting the rest of the line! We obtain:
print"Hello World!"
In a similar vein, this may happily consider a second line comment part of a long comment, commenting out your entire code:
--[
-- all my code goes here
print"Hello World!"
-- end of all my code
--]
would be turned into the empty string.
Furthermore, long comments may use multiple equal signs (=) and must be terminated by the same sequence of equal signs (which is not equivalent to matching square ([]) brackets):
--[=[
A long long comment
]] <- not the termination of this long long comment
(poor regular-grammar-based syntax highlighters fail this)
]=]
This would terminate the comment at ]], leaving some syntax errors:
<- not the termination of this long long comment
(poor regular-grammar-based syntax highlighters fail this)
]=]
Considering that Lua 5.1 already deprecates nested long comments (whereas LuaJIT rejects them entirely), there is no need to match balanced brackets here. Rather, you need to find long comment start sequences and then terminate at the next stop sequence. Here's some hacky pattern-based code to do just this:
for equal_signs in str:gmatch"%-%-%[(=*)%[" do
    str = str:gsub("%-%-%["..equal_signs.."%[(.-)%]"..equal_signs.."%]", "", 1)
end
and here's an example string str for it to process, enclosed in a long string literal for easier testing:
local str = [==[
--[[a "long" comment]]
print"hello world"
--[=[another long comment
--[[this does not disrupt it at all
]=]
--]] oops, just a line comment
--[doesn't care about line comments]
]==]
which yields:
print"hello world"
--]]
--[doesn't care about line comments]
retaining the newlines.
Now why is this hacky, despite fixing all of the aforementioned issues? Well, it's inefficient: each time it encounters a long comment, it runs over the entire source, replacing one long comment of that length. For n long comments this means clearly quadratic complexity, O(n²).
You can't trivially optimize this by replacing all long comments of the same length at once (skipping lengths you have already handled), which would reduce the complexity to O(n sqrt n), since there can be at most sqrt(n) distinct long-comment lengths in a source of length n: the gsub must be limited to a single replacement so as not to remove part of a long comment that uses more equal signs:
--[=[another long comment
--[[this does not disrupt it at all
]=]
You could however optimize it by using string.find repeatedly to always find (1) the opening delimiter and (2) the matching closing delimiter, adding all the substrings in between to a rope and finally concatenating it to a string. Assuming linear matching performance (which isn't the case, but - given a better implementation than the current one - could be for simple patterns such as these) this would run in linear time. Implementing this is left as an exercise to the reader, as pattern-based approaches are overall infeasible.
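For illustration, here is a rough sketch of that find-based scan (the function name and rope handling are my own; treat it as an illustration of the idea, not a hardened implementation):
local function strip_long_comments(str)
    local rope = {}   -- chunks of non-comment source
    local pos = 1     -- current scan position
    while true do
        -- (1) find the next long-comment opener --[=*[
        local s, e, equals = str:find("%-%-%[(=*)%[", pos)
        if not s then break end
        rope[#rope + 1] = str:sub(pos, s - 1)
        -- (2) find the matching closer with the same number of =;
        -- a plain (non-pattern) find suffices, since once the number
        -- of equal signs is known the closer is a fixed string
        local _, e2 = str:find("]" .. equals .. "]", e + 1, true)
        if not e2 then pos = #str + 1; break end -- unterminated comment
        pos = e2 + 1
    end
    rope[#rope + 1] = str:sub(pos)
    return table.concat(rope)
end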
Note also that removing comments (to minify code?) may introduce syntax errors, as at the tokenization stage, comment (or whitespace) tokens (which are later suppressed) might be used to separate other tokens. Consider the following pathological case:
do--[[]]print("hello world")end
which would be turned into
doprint("hello world")end
which is an entirely different beast (call to doprint now, syntax error since the end isn't matched by an opening do anymore).
In addition, any pattern-based solution is likely to fail to consider context, removing "comments" inside string literals or - even harder to work around - long string literals. Again, workarounds might be possible (e.g. by replacing strings with placeholders and later substituting them back in), but this gets messy and error-prone. Consider
quoted_string = "--[[this is no comment but rather part of the string]]"
long_string = [=[--[[this is no comment but rather part of the string]]]=]
which would be turned into an empty string by comment removal patterns.
Conclusion
Pattern-based solutions are bound to fall short in myriad edge cases. They will also usually be inefficient.
At least a partial tokenization that distinguishes between comments and "everything else" is needed. This must take care of long strings & long comments properly, counting the number of equals signs. Using a handwritten tokenizer is possible, but I'd recommend using lhf's ltokenp.
Even when using a proper tokenization stage to strip long comments, you might still run into the aforementioned tokenization issue. For that reason you'll have to insert whitespace in place of the comment (if there isn't whitespace there already). To save the most space you could check whether removing the comment alters the tokenization (e.g. removing the comment in if--[[comment]]"str"then end is fine, since the string will still be considered a distinct token from the keyword if).
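To illustrate with the pattern-based loop from above (a tokenizer-based remover would make the same substitution), replace each comment with a single space instead of the empty string:
for equal_signs in str:gmatch"%-%-%[(=*)%[" do
    -- " " instead of "" so that do--[[]]print stays do print
    str = str:gsub("%-%-%["..equal_signs.."%[(.-)%]"..equal_signs.."%]", " ", 1)
end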
What's your root problem here? If you're searching for a Lua minifier, just grab a battle-tested one rather than trying to roll your own (and especially before you try to rename local variables using patterns!).
Why should str:gsub("%-%-[^\n\r]+", "") remove
print(4)
]]
?
This pattern matches -- followed by anything but a linebreak or carriage return.
So it matches --a and --[[.
If there is an opening bracket immediately after -- you need to match anything until and including the corresponding closing bracket.
That would be -- followed by a balanced pair of brackets.
Hence "%-%-%b[]"
Then in a second run remove any short comments.
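Put together, a minimal sketch of this two-run approach (my own assembly of it, subject to the caveats discussed in the other answer) looks like:
local code = [==[
print(1)
--a
print(2) --b
--[[
print(4)
]]
]==]
code = code:gsub("%-%-%b[]", "")     -- first run: long comments
code = code:gsub("%-%-[^\n\r]+", "") -- second run: short comments
print(code) -- print(1) / print(2), plus leftover blank lines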
I am trying to create a simple programming language from scratch (interpreter) but I wonder why I should use a lexer.
For me, it looks like it would be easier to create a parser that directly parses the code. What am I overlooking?
I think you'll agree that most languages (likely including the one you are implementing) have conceptual tokens:
operators, e.g. * (usually multiply), '(', ')', ;
keywords, e.g., "IF", "GOTO"
identifiers, e.g. FOO, count, ...
numbers, e.g. 0, -527.23E-41
comments, e.g., /* this text is ignored in your file */
whitespace, e.g., sequences of blanks, tabs and newlines, that are ignored
As a practical matter, it takes a specific chunk of code to scan for/collect the characters that make each individual token. You'll need such a code chunk for each type of token your language has.
If you write a parser without a lexer, at each point where your parser is trying to decide what comes next, you'll have to have ALL the code that recognizes the tokens that might occur at that point in the parse. At the next parser point, you'll need all the code to recognize the tokens that are possible there. This gives you an immense amount of code duplication; how many times do you want the code for skipping blanks to occur in your parser?
If you think that's not a good way, the obvious cure is to remove all the duplication: place the code for each token in a subroutine for that token, and at each parser point, call the subroutines for the tokens. At this point, in some sense, you already have a lexer: an isolated collection of code to recognize tokens. You can code perfectly fine recursive descent parsers this way.
The next thing you'll discover is that you call the token subroutines for many of the tokens at each parser point. Even that seems like a lot of work and duplication. So, replace all the calls with a single "GetNextToken" call, which itself invokes the token-recognizing code for all tokens and returns an enum that identifies the specific token encountered. Now your parser starts to look reasonable: at each parser point it makes one call to GetNextToken and then branches on the enum returned. This is basically the interface that people have standardized on as a "lexer".
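To make the shape of that interface concrete, here is a toy GetNextToken sketch (written in Lua only because the question names no language; all names are illustrative):
-- A single GetNextToken that recognizes every token kind in one place.
local function make_lexer(src)
    local pos = 1
    local keywords = { ["if"] = true, ["goto"] = true }
    return function() -- GetNextToken
        -- whitespace is skipped here, once, not at every parser point
        local _, ws = src:find("^[ \t\r\n]+", pos)
        if ws then pos = ws + 1 end
        if pos > #src then return "EOF" end
        local s, e, num = src:find("^(%d+)", pos)
        if s then pos = e + 1; return "NUMBER", tonumber(num) end
        local s2, e2, id = src:find("^([%a_][%w_]*)", pos)
        if s2 then
            pos = e2 + 1
            return keywords[id] and "KEYWORD" or "IDENTIFIER", id
        end
        local c = src:sub(pos, pos); pos = pos + 1
        return "OPERATOR", c -- single-character operators: * ( ) ; ...
    end
end
-- The parser then branches on the kind returned:
local next_token = make_lexer("if count then goto 42")
local kind, value = next_token()
while kind ~= "EOF" do
    print(kind, value)
    kind, value = next_token()
end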
One thing you will discover is that the individual token recognizers sometimes have trouble with overlaps; keywords and identifiers usually have this trouble (IF looks like a keyword and like an identifier). It is actually easier to merge all the token recognizers into a single finite state machine, which can then distinguish the tokens more easily. This also turns out to be spectacularly fast when processing the programming language source text. Your toy language may never parse more than 100 lines, but real compilers process millions of lines of code a day, and most of that time is spent doing token recognition ("lexing"), esp. whitespace suppression.
You can code this state machine by hand. This isn't hard, but it is rather tedious. Or, you can use a tool like FLEX to do it for you, that's just a matter of convenience. As the number of different kinds of tokens in your language grows, the FLEX solution gets more and more attractive.
TLDR: Your parser is easier to write, and less bulky, if you use a lexer. In addition, if you compile the individual lexemes into a state machine (by hand or using a "lexer generator"), it will run faster and that's important.
Well, for an intelligently simplified programming language you can get away without either a lexer or a parser :-) Not kidding. Look up Forth. You can start with the tags here on SO (gforth is GNU's) and then go to the Standard's site, which has pointers to a few interpreters, sites and its Glossary.
Then you can check out Win32Forth and that should keep you busy for quite a while :-)
The interpreter also compiles (when you invoke words that switch the system to compilation context). All without a distinct parser. Lookahead is actually lookbehind :-) - not kidding. Only rarely does it absorb one following word (== lookahead is at most 1). The "words" (aka tokens) are at the same time keywords and variable names, and they all live in a Dictionary. There's a whole online book at that site (plus a pdf).
Control structures are also just words (they compile a few addresses and jumps on the fly).
You can find old Journals there as well, covering a wide spectrum from machine code generation to object-oriented extensions. Yes, still without a parser - believe it or not.
There used to be more sophisticated (commercial) Forth systems which reduced words to machine call instructions with immediate addressing (making the engine run 2-4 times faster), but even the plain interpreters were always considered fast. One is apparently still active - SwiftForth - but don't expect any freebies there.
There's one Forth on GitHub, CiForth, which is quite spartan but has builds and releases for Windows, Linux and Mac, 32- and 64-bit, so you can just download and run it. It claims to have a 16-bit build as well :-) For embedded systems, I suppose.
Using string.format("%02X", char) on each character of a string, I've received the following:
74657874000000EDD37001000300
In the end, I'd like that string to look like the following:
t e x t NUL NUL NUL í Ó p SOH NUL ETX NUL (spaces are there just for clarification of characters desired in example).
I've tried to use \x..(hex#) and string.char(0x..(hex#)) (where (hex#) is the hexadecimal representation of my desired character), and I am still having issues getting the result I'm looking for. After reading another thread about this topic, what is the way to represent a unichar in lua, and the links provided in its answers, I am still not fully understanding what I need to do in my final code for this to work.
I'm looking for some help in better understanding an approach that would achieve the desired result shown above.
ETA:
Well I thought that I had fixed it with the following code:
function hexToAscii(input)
    local convString = ""
    for char in input:gmatch("(..)") do
        convString = convString..(string.char("0x"..char))
    end
    return convString
end
It appeared to work, but I hadn't thought about characters above 127. Rookie mistake. Now I'm unsure how to get the additional characters, up to 255, to display their values.
I did the following to check since I couldn't truly "see" them in the file.
function asciiSub(input)
    input = input:gsub(string.char(0x00), "<NUL>") -- suggested by a coworker
    print(input)
end
I did a few gsub calls to substitute in other characters, and my file comes back with the replacement strings. But when I ran into characters from the extended ASCII table, it all fell apart.
Can anyone assist me in understanding a fix or new approach to this problem? As I've stated before, I read other topics on this and am still confused as to the best approach towards this issue.
The simple way to transform a base16-encoded string is just to
function unhex( input )
    return (input:gsub( "..", function(c)
        return string.char( tonumber( c, 16 ) )
    end))
end
This is basically what you have, just a bit cleaner. (There's no need to say "(..)", ".." is enough – if you specify no captures, you'll automatically get the whole match. And while it might work if you write string.char( "0x"..c ), it's just evil – you concatenate lots of strings and then trigger the automatic conversion to numbers. Much better to just specify the base when explicitly converting.)
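For instance, with the hex string from the question (my own quick check):
print(unhex("74657874000000EDD37001000300"))
-- prints "text", three NUL bytes, the bytes 0xED 0xD3, "p", and so on;
-- how those bytes render depends entirely on the viewer's encoding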
The resulting string should be exactly what went into the hex-dumper, no matter the encoding.
If you cannot correctly display the result, your viewer will also be unable to display the original input. If you used different viewers for the original input and the resulting output (e.g. a text editor and a terminal), try writing the output to a file instead and looking at it with the same viewer you used for the original input, then the two should be exactly the same.
Getting viewers that assume different encodings (e.g. one of the "old" 8-bit code pages or one of the many versions of Unicode) to display the same thing will require conversion between different formats, which tends to be quite complicated or even impossible. As you did not mention what encodings are involved (nor any other information like OS or programs used that might hint at the likely encodings), this could be just about anything, so it's impossible to say anything more specific on that.
You actually have a couple of problems:
First, make sure you know the meaning of the term character encoding, and that you know the difference between characters and bytes. A popular post on the topic is The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Then, what encoding was used for the bytes you just received? You need to know this, otherwise you don't know what byte 234 means. For example it could be ISO-8859-1, in which case it is U+00EA, the character ê.
The characters 0 to 31 are control characters (e.g. 0 is NUL). Use a lookup table for these.
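A sketch of such a table (only a few entries shown; the <...> names just follow the usual ASCII abbreviations):
-- Map control bytes to printable names; extend to all of 0..31 as needed.
local control_names = {
    [0] = "<NUL>", [1] = "<SOH>", [2] = "<STX>", [3] = "<ETX>",
    [9] = "<TAB>", [10] = "<LF>", [13] = "<CR>",
}
local function show_controls(s)
    return (s:gsub("%c", function(c)
        return control_names[c:byte()] or string.format("<%d>", c:byte())
    end))
end
print(show_controls("text\0\0\0\237\211p\1\0\3\0"))
-- text<NUL><NUL><NUL>..p<SOH><NUL><ETX><NUL>, where the two bytes
-- before p (0xED, 0xD3) still display however the terminal decodes them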
Then, displaying the characters on the terminal is the hard part. There is no platform-independent way to display ê on the terminal. It may well be impossible with the standard print function. If you can't figure this step out you can search for a question dealing specifically with how to print Unicode text from Lua.
I want to match something like:
var i=1;
So I want to know if var starts at a word boundary.
When it matches this line, I want to know the last character of the previous yytext,
just to be sure that the character before var is really a non-variable character (aka "\b" in regex).
One crude way would be to maintain old_yytext in each rule and also have a default rule ".".
How can I do this?
The only way is to save a copy of the previous token, or at least the last character. Flex's buffer management strategy does not guarantee that the previous token still exists in memory. It is possible that the current token starts at the beginning of flex's buffer.
But doing the work of saving the previous token in every rule would be really silly. You should trust flex to work as advertised, and write appropriate rules. For example, if your identifier pattern looks like this:
[[:alpha:]][[:alnum:]]*
then it is impossible for var to immediately follow an identifier, because it would have been included in the identifier.
There is one common case in a "normal" flex scanner definition where a keyword or identifier might immediately follow an alphanumeric character, which is when the keyword immediately follows a number (123var). This is not usually a problem, because in almost all languages, it will trigger a syntax error (and if it isn't a syntax error, maybe it is ok :-) )
If you really want to trigger a lexical error, you can add a pattern which recognizes a number followed by a letter.
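For example, a sketch of such a rule (the action shown is a placeholder for whatever error reporting your scanner uses):
[[:digit:]]+[[:alpha:]_][[:alnum:]_]*   { fprintf(stderr, "malformed number: %s\n", yytext); }
Since flex always prefers the longest match, this rule beats the separate number and identifier rules whenever a letter immediately follows a digit string.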
I'm trying to interface Haskell with a command line program that has a read-eval-print loop. I'd like to put some text into an input handle, and then read from an output handle until I find a prompt (and then repeat). The reading should block until a prompt is found, but no longer. Instead of coding up my own little state machine that reads one character at a time until it constructs a prompt, it would be nice to use Parsec or Attoparsec. (One issue is that the prompt changes over time, so I can't just check for a constant string of characters.)
What is the best way to read the appropriate amount of data from the output handle and feed it to a parser? I'm confused because most of the handle-reading primitives require me to decide beforehand how much data I want to read. But it's the parser that should decide when to stop.
You seem to have two questions wrapped up in here. One is about incremental parsing, and one is about incremental reading.
Attoparsec supports incremental parsing directly. See the IResult type in Data.Attoparsec.Text. Parsec, alas, doesn't. You can run your parser on what you have, and if it gives an error, add more input and try again, but you really don't know if the error was an unrecoverable parse error or just a need for more input.
In your case, REPLs usually read one line at a time. Hence you can use hGetLine to read a line and pass it to Attoparsec; if it parses, evaluate the result, and if not, get another line.
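A rough sketch of that loop (all names here are mine, and the prompt parser itself is left abstract):
{-# LANGUAGE OverloadedStrings #-}
import Data.Attoparsec.Text (IResult(..), Parser, parse)
import qualified Data.Text.IO as T
import System.IO (Handle)

-- Feed lines from the REPL's output handle to the parser until it
-- either succeeds or fails; Partial means "more input, please".
readUntil :: Handle -> Parser a -> IO (Either String a)
readUntil h p = do
    line <- T.hGetLine h          -- hGetLine drops the '\n' ...
    go (parse p (line <> "\n"))   -- ... so re-add it for the parser
  where
    go (Done _ r)     = return (Right r)
    go (Fail _ _ err) = return (Left err)
    go (Partial k)    = do
        line <- T.hGetLine h
        go (k (line <> "\n"))
Since the prompt changes over time, you can build a fresh Parser argument for each exchange.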
If you want to see all this in action, I do this kind of thing in Plush.Job.Output, but with three small differences: 1) I'm parsing byte streams, not strings. 2) I've set it up to pull as much as is available from the input and parse as many items as I can. 3) I'm reading directly from file descriptors. But the same structure should help you do it in your situation.