How can two Haskell programs exchange an integer value via stdin and stdout without treating the data as text? - parsing

I am interested in learning how to send data efficiently between Haskell programs using standard input and output. Suppose I want to pipe two programs together: "P1" outputs the number 5 to stdout, and "P2" takes an integer from stdin, adds 1, and outputs it to stdout again. Right now, the best way I know to do this involves outputting the data as text from P1, parsing that text back to an integer in P2, and proceeding from there. For example:
P1.hs:
module Main where
main = do
print 5
P2.hs:
module Main where
main = fmap manipulateData getLine >>= print
where
manipulateData = (+ 1) . (read :: String -> Int)
Output:
$ (stack exec p1) | (stack exec p2)
6
I'd like to use standard i/o to send an integer without treating it as text, if possible. I'm assuming this still requires some sort of parsing to work, but I'm hoping it's possible to parse the data as binary and get a faster program.
Does Haskell have any way to make this straightforward? Since I am going from one fundamental Haskell datatype (Int) to the same type again with a pass through standard i/o in the middle, I'm wondering if there is an easy solution that doesn't require writing a custom binary parser (which I don't know how to do). Can anyone provide such a method?

Here is the code that I ended up with:
module Main where
import qualified Data.ByteString.Lazy as BS
import qualified Data.Binary as B
main :: IO ()
main = do
dat <- BS.getContents
print $ (B.decode dat :: Int) + 1
The other program uses similar imports and outputs 5 with the following line:
BS.putStr $ B.encode (5 :: Int)
The resulting programs can be piped together, and the resulting program behaves as required.

Related

How to also get how many characters read in parse?

I'm using Numeric.readDec to parse numbers and reads to parse Strings. But I also need to know how many characters were read.
For example readDec "52 rest" returns [(52," rest")], and read 2 characters. But there isn't a great way that I can find to know that it read 2 characters.
You could check the string length of show 52, but if the input was 052 that would give you the wrong answer (this solution also wouldn't work for the string parsing which has escape characters). You also could use the length of the post parsed string subtracted from the length of the input string. But this is very inefficient for long strings with many parses.
How can this be done correctly and efficiently (preferably without just writing your own parse)?
With just base, instead of readDec, you can use readDecP from Text.Read.Lex, which uses a ReadP parser:
readDecP :: (Eq a, Num a) => ReadP a
The gather combinator in Text.ParserCombinators.ReadP returns the parse result along with the actual characters parsed:
gather :: ReadP a -> ReadP (String, a)
You can run the parser with readP_to_S, which gives back a ReadS parser, which is a function that accepts a string and produces a list of possible parses with the remainder of the string.
readP_to_S :: ReadP a -> ReadS a
type ReadS a = String -> [(a, String)]
An example in GHCi:
> import Text.ParserCombinators.ReadP (gather, readP_to_S)
> import Text.Read.Lex (readDecP)
> readP_to_S (gather readDecP) "52 rest"
[(("52",52)," rest")]
> readP_to_S (gather readDecP) "0644 permissions"
[(("0644",644)," permissions")]
You can simply check that there is only one valid parse if you want the result to be unambiguous, and then take the length of the first component to find the number of Char code points parsed.
These parsers are fairly limited, however; if you want something easier to use, faster, or able to produce more detailed error messages, then you should check out a more fully featured parsing package such as regex-applicative (regular grammars) or megaparsec (context-sensitive grammars).

Append text file to lexicon in Rascal

Is it possible to append terminals retrieved from a text file to a lexicon in Rascal? This would happen at run time, and I see no obvious way to achieve this. I would rather keep the data separate from the Rascal project. For example, if I had read in a list of countries from a text file, how would I add these to a lexicon (using the lexical keyword)?
In the data-dependent version of the Rascal parser this is even easier and faster but we haven't released this yet. For now I'd write a generic rule with a post-parse filter, like so:
rascal>set[str] lexicon = {"aap", "noot", "mies"};
set[str]: {"noot","mies","aap"}
rascal>lexical Word = [a-z]+;
ok
rascal>syntax LexiconWord = word: Word w;
ok
rascal>LexiconWord word(Word w) { // called when the LexiconWord.word rule is use to build a tree
>>>>>>> if ("<w>" notin lexicon)
>>>>>>> filter; // remove this parse tree
>>>>>>> else fail; // just build the tree
>>>>>>>}
rascal>[Sentence] "hello"
|prompt:///|(0,18,<1,0>,<1,18>): ParseError(|prompt:///|(0,18,<1,0>,<1,18>))
at $root$(|prompt:///|(0,64,<1,0>,<1,64>))
rascal>[Sentence] "aap"
Sentence: (Sentence) `aap`
rascal>
Because the filter function removed all possible derivations for hello, the parser eventually returns a parse error on hello. It does not do so for aap which is in the lexicon, so hurray. Of course you can make interestingly complex derivations with this kind of filtering. People sometimes write ambiguous grammars and use filters like so to make it unambiguous.
Parsing and filtering in this way is in cubic worst-case time in terms of the length of the input, if the filtering function is in amortized constant time. If the grammar is linear, then of course the entire process is also linear.
A completely different answer would be to dynamically update the grammar and generate a parser from this. This involves working against the internal grammar representation of Rascal like so:
set[str] lexicon = {"aap", "noot", "mies"};
syntax Word = ; // empty definition
typ = #Word;
grammar = typ.definitions;
grammar[sort("Word")] = { prod(sort("Word"), lit(x), {}) | x <- lexicon };
newTyp = type(sort("Word"), grammar);
This newType is a reified grammar + type for the definition of the lexicon, and which can now be used like so:
import ParseTree;
if (type[Word] staticGrammar := newType) {
parse(staticGrammar, "aap");
}
Now having written al this, two things:
I think this may trigger unknown bugs since we did not test dynamic parser generation, and
For a lexicon with a reasonable size, this will generate an utterly slow parser since the parser is optimized for keywords in programming languages and not large lexicons.

Haskell command system parser

I've written this simple parser that take from the command line [ps auxww | ./myparser] and parses the output of the ps command in order to insert it into the process data structure I created.
I succeed to parse one line of the result String, but now I'm stuck trying to parse the whole string and return a [Process] and not a single Process. The problem is how to implement parsePS. It has to call many times myParser in order to parse every single line and return a list of Process and print it into the terminal.
Can someone help me?
I'm not sure what's failing for you, but I am guessing the spacing is killing you. If so, I have two ideas that might help.
Modify myParser to consume spaces at the end and the many combinator should work.
myParser = do
...
spaces
command <- pCommand
spaces -- CONSUME END OF LINE
return Entry{ ... }
Then many myParser should work.
Alternately, you could split the input into lines separately first and call parse on each.
argLines <- fmap lines getContents
(I take it you mean to burn the first line via getLine before the hGetContents?)
It sounds to me like you're looking for a way to parse each line in sequence and return a list of parsed results. How about mapM from the Prelude?
If myParser :: String -> Parser Process, then mapM myParser :: [String] -> Parser [Process], which seems to be what you're looking for (using generic names for Parsec's Parser types). So if you have a list of lines (call it lns) that you want to parse in sequence, you can use parse (mapM myParser) lns to get what you want.

Haskell/Parsec: How do you use the functions in Text.Parsec.Indent?

I'm having trouble working out how to use any of the functions in the Text.Parsec.Indent module provided by the indents package for Haskell, which is a sort of add-on for Parsec.
What do all these functions do? How are they to be used?
I can understand the brief Haddock description of withBlock, and I've found examples of how to use withBlock, runIndent and the IndentParser type here, here and here. I can also understand the documentation for the four parsers indentBrackets and friends. But many things are still confusing me.
In particular:
What is the difference between withBlock f a p and
do aa <- a
pp <- block p
return f aa pp
Likewise, what's the difference between withBlock' a p and do {a; block p}
In the family of functions indented and friends, what is ‘the level of the reference’? That is, what is ‘the reference’?
Again, with the functions indented and friends, how are they to be used? With the exception of withPos, it looks like they take no arguments and are all of type IParser () (IParser defined like this or this) so I'm guessing that all they can do is to produce an error or not and that they should appear in a do block, but I can't figure out the details.
I did at least find some examples on the usage of withPos in the source code, so I can probably figure that out if I stare at it for long enough.
<+/> comes with the helpful description “<+/> is to indentation sensitive parsers what ap is to monads” which is great if you want to spend several sessions trying to wrap your head around ap and then work out how that's analogous to a parser. The other three combinators are then defined with reference to <+/>, making the whole group unapproachable to a newcomer.
Do I need to use these? Can I just ignore them and use do instead?
The ordinary lexeme combinator and whiteSpace parser from Parsec will happily consume newlines in the middle of a multi-token construct without complaining. But in an indentation-style language, sometimes you want to stop parsing a lexical construct or throw an error if a line is broken and the next line is indented less than it should be. How do I go about doing this in Parsec?
In the language I am trying to parse, ideally the rules for when a lexical structure is allowed to continue on to the next line should depend on what tokens appear at the end of the first line or the beginning of the subsequent line. Is there an easy way to achieve this in Parsec? (If it is difficult then it is not something which I need to concern myself with at this time.)
So, the first hint is to take a look at IndentParser
type IndentParser s u a = ParsecT s u (State SourcePos) a
I.e. it's a ParsecT keeping an extra close watch on SourcePos, an abstract container which can be used to access, among other things, the current column number. So, it's probably storing the current "level of indentation" in SourcePos. That'd be my initial guess as to what "level of reference" means.
In short, indents gives you a new kind of Parsec which is context sensitive—in particular, sensitive to the current indentation. I'll answer your questions out of order.
(2) The "level of reference" is the "belief" referred in the current parser context state of where this indentation level starts. To be more clear, let me give some test cases on (3).
(3) In order to start experimenting with these functions, we'll build a little test runner. It'll run the parser with a string that we give it and then unwrap the inner State part using an initialPos which we get to modify. In code
import Text.Parsec
import Text.Parsec.Pos
import Text.Parsec.Indent
import Control.Monad.State
testParse :: (SourcePos -> SourcePos)
-> IndentParser String () a
-> String -> Either ParseError a
testParse f p src = fst $ flip runState (f $ initialPos "") $ runParserT p () "" src
(Note that this is almost runIndent, except I gave a backdoor to modify the initialPos.)
Now we can take a look at indented. By examining the source, I can tell it does two things. First, it'll fail if the current SourcePos column number is less-than-or-equal-to the "level of reference" stored in the SourcePos stored in the State. Second, it somewhat mysteriously updates the State SourcePos's line counter (not column counter) to be current.
Only the first behavior is important, to my understanding. We can see the difference here.
>>> testParse id indented ""
Left (line 1, column 1): not indented
>>> testParse id (spaces >> indented) " "
Right ()
>>> testParse id (many (char 'x') >> indented) "xxxx"
Right ()
So, in order to have indented succeed, we need to have consumed enough whitespace (or anything else!) to push our column position out past the "reference" column position. Otherwise, it'll fail saying "not indented". Similar behavior exists for the next three functions: same fails unless the current position and reference position are on the same line, sameOrIndented fails if the current column is strictly less than the reference column, unless they are on the same line, and checkIndent fails unless the current and reference columns match.
withPos is slightly different. It's not just a IndentParser, it's an IndentParser-combinator—it transforms the input IndentParser into one that thinks the "reference column" (the SourcePos in the State) is exactly where it was when we called withPos.
This gives us another hint, btw. It lets us know we have the power to change the reference column.
(1) So now let's take a look at how block and withBlock work using our new, lower level reference column operators. withBlock is implemented in terms of block, so we'll start with block.
-- simplified from the actual source
block p = withPos $ many1 (checkIndent >> p)
So, block resets the "reference column" to be whatever the current column is and then consumes at least 1 parses from p so long as each one is indented identically as this newly set "reference column". Now we can take a look at withBlock
withBlock f a p = withPos $ do
r1 <- a
r2 <- option [] (indented >> block p)
return (f r1 r2)
So, it resets the "reference column" to the current column, parses a single a parse, tries to parse an indented block of ps, then combines the results using f. Your implementation is almost correct, except that you need to use withPos to choose the correct "reference column".
Then, once you have withBlock, withBlock' = withBlock (\_ bs -> bs).
(5) So, indented and friends are exactly the tools to doing this: they'll cause a parse to immediately fail if it's indented incorrectly with respect to the "reference position" chosen by withPos.
(4) Yes, don't worry about these guys until you learn how to use Applicative style parsing in base Parsec. It's often a much cleaner, faster, simpler way of specifying parses. Sometimes they're even more powerful, but if you understand Monads then they're almost always completely equivalent.
(6) And this is the crux. The tools mentioned so far can only do indentation failure if you can describe your intended indentation using withPos. Quickly, I don't think it's possible to specify withPos based on the success or failure of other parses... so you'll have to go another level deeper. Fortunately, the mechanism that makes IndentParsers work is obvious—it's just an inner State monad containing SourcePos. You can use lift :: MonadTrans t => m a -> t m a to manipulate this inner state and set the "reference column" however you like.
Cheers!

String splitting problems in Erlang

I've been playing around with the splitting of atoms and have a problem with strings. The input data will always be an atom that consists of some letters and then some numbers, for instance ms444, r64 or min1. Since the function lists:splitwith/2 takes a list the atom is first converted into a list:
24> lists:splitwith(fun (C) -> is_atom(C) end, [m,s,4,4,4]).
{[m,s],[4,4,4]}
25> lists:splitwith(fun (C) -> is_atom(C) end, atom_to_list(ms444)).
{[],"ms444"}
26> atom_to_list(ms444).
"ms444"
I want to separate the letters from the numbers and I've succeeded in doing that when using a list, but since I start out with an atom I get a "string" as result to put into my splitwith function...
Is it interpreting each item in the list as a string or what is going on?
You might want to have a look at the string module documentation:
http://www.erlang.org/doc/man/string.html
The following function might interest you:
tokens(String, SeparatorList) -> Tokens
Since strings in Erlang are just a list() of integer() the test in the fun will be made if the item is an atom() when it is in fact an integer(). If the test is changed to look for letters it works:
29> lists:splitwith(fun (C) -> (C >= $a) and (C =< $Z) end, atom_to_list(ms444)).
{"ms","444"}
An atom in erlang is a named constant and not a variable (or not like a variable is in an imperative language).
You should really not create atoms in dynamic fashion (that is, don't convert things to atoms at runtime)
They are used more in pattern matching and send recive code.
Pid ! {matchthis, X}
recive
{foobar,Y} -> doY(Y);
{matchthis,X} -> doX(X);
Other -> doother(Other)
end
A variable, like X could be set to an atom. For example X=if 1==1 -> ok; true -> fail end. I could suffer from poor imagination but I can't think of a way why you would like to parse atom. You should be in charge of what atoms you write and not use list_to_atom(CharIntegerList).
Can you perhaps give a more overview of what you like to accomplish?
A "string" in Erlang is not a primitive type: it is just a list() of integers(). So if you want to "separate" the letters from the digits, you'll have to do comparison with the integer representation of the characters.

Resources