Sequential Programming: Better nest loops or catch "events" - cobol

This is a question about program design. I see two possibilities and want to know which one is better and why (aspects: readability, performance, extensibility, complexity etc.). If you have another one (sequential only please, no OO stuff xD), feel free to post it and explain its pros and cons.
It is a simple report that loops over entries from a file, does something with them and generates a print file (without a dedicated printer or 3rd-party software, so I need to calculate the pages myself).
Solution NESTED LOOPS in pseudocode
//loop over entries
perform until End-Of-File
    perform page-heading
    //loop over pages
    perform until End-Of-File or End-Of-Page
        perform calculate-entry
        perform print-entry
    end-perform
    perform page-footer
end-perform
Solution EVENT in pseudocode
//loop over entries
perform until End-Of-File
    evaluate line-counter
        case heading-position
            perform page-heading
        case footer-position
            perform reset-line-counter
            perform page-footer
        case other
            perform calculate-entry
            perform print-entry
    end-evaluate
end-perform
My final program will be in COBOL, but I ran into the same question in ABAP, so feel free to argue with any sequential language.
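The nested-loop variant can be sketched in Python as a rough illustration (not COBOL, and not the asker's actual program). The function name `paginate`, the `lines_per_page` parameter, the heading and footer strings, and the use of `upper()` as a stand-in for calculate-entry/print-entry are all assumptions made for this example.

```python
from typing import Iterable, Iterator, List


def paginate(entries: Iterable[str], lines_per_page: int) -> Iterator[List[str]]:
    """Yield one page at a time: heading, up to lines_per_page entries, footer."""
    it = iter(entries)
    entry = next(it, None)            # read-ahead, like a priming READ
    page_no = 0
    while entry is not None:          # outer loop: one pass per page
        page_no += 1
        page = [f"--- heading, page {page_no} ---"]
        count = 0
        # inner loop: entries until end of file or end of page
        while entry is not None and count < lines_per_page:
            page.append(entry.upper())   # stand-in for calculate/print entry
            count += 1
            entry = next(it, None)
        page.append("--- footer ---")
        yield page


pages = list(paginate(["a", "b", "c"], lines_per_page=2))
```

With the read-ahead in place, an empty input produces no pages at all, and the last page gets its footer without any special casing.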

The first one (nested loops) is the more traditional way, and I prefer it, as I was a COBOL programmer last century:) However, you should probably look at the preferred coding style in your organisation in order to decide.
If you are interested in the performance question, here is a similar question, about if vs evaluate, rather than nested loops, but there is a long and very interesting answer. Which is better in terms of performance? "if else" or "Evaluate" statement in COBOL?

Related

PERFORM THRU VARYING : exact behavior?

I am trying to understand the behavior of a COBOL program and I am stumbling upon the following lines:
PERFORM 525-NUMERIC THRU 525-EXIT
VARYING K FROM 1 BY 1 UNTIL K > CAR-L.
I understand the global concept that it is some kind of loop based on the value of K, but I fail to understand the impact of the THRU 525-EXIT words?
A PERFORM can execute a range of paragraphs or SECTIONS serially. This is done by using THRU/THROUGH to name the last paragraph/SECTION of the series, the PERFORM having already named the starting point.
A simple PERFORM of a paragraph:
PERFORM 10-OPEN-INPUT-FILES
This establishes a "range" for the PERFORM starting with 10-OPEN-INPUT-FILES and ending with the last statement of that paragraph.
A PERFORM of multiple paragraphs, one after the other:
PERFORM 10-OPEN-INPUT-FILES
THRU 10-OPEN-INPUT-FILES-EXIT
This establishes a wider range for the PERFORM, starting from 10-OPEN-INPUT-FILES and ending with the last statement of 10-OPEN-INPUT-FILES-EXIT.
This isn't really a good or useful thing, but it gets used a lot in a specific way, and that is perhaps what you have. This is to have an "exit paragraph" associated with each paragraph that is PERFORMed. One original paragraph, followed by one unique exit-paragraph associated (by the second paragraph's position, nothing else) with that original paragraph.
This is never necessary, programs work fine without them. However, since there is an "exit-paragraph" of course there is now a label which can be the target of a GO TO. Code something stupid, or come across something already coded that way, and just jam a GO TO in to get you (but perhaps not the next person) out of trouble. This is not a good thing to do, although it will often be seen as "expedient". And the next time in the same code the expedient route will be taken, and the next time, and then something which should always have been simple has become... complex.
Stunningly (my opinion, for many this is normal practice), more than a few sites have it in their local standards that an exit-paragraph must be included for each PERFORMed paragraph and PERFORM ... THRU ... must be coded.
As well as asking for trouble by inviting the use of GO TO, another problem is that the physical location of the code is now relevant. If you put a new paragraph before the exit-paragraph, then it becomes part of the PERFORM range, whether intended or not. Some people intend, even going so far as to code several paragraphs to be within the range of a PERFORM, with the ever-handy GO TO as their "get out of mess" tool.
A further problem is that when you encounter the PERFORM ... with THRU ... you don't know how many paragraphs (or SECTIONS) are included in the PERFORM without looking at the paragraphs (or SECTIONS) themselves.
You can also PERFORM a SECTION. This is a procedure label followed by the word SECTION and terminated by a full-stop/period.
A SECTION can contain paragraphs. When you PERFORM a SECTION, all paragraphs in the SECTION are within the PERFORM range, until the last statement of the SECTION. It is like a PERFORM ... THRU ... without the THRU ... bit. It has the same problems as the PERFORM ... THRU ... plus the additional one that if unique paragraphs are used as the target of a GO TO, great care must be taken if a SECTION is copied to make a new one. It is perfectly legal to GO TO out of a SECTION (or paragraph) but this would usually be unintended and can cause chaos, as program control wanders off somewhere else. If using a GO TO to an exit-paragraph in a SECTION, the best thing to do is to use identical names for the exit-paragraph. Any GO TO within a SECTION will be automatically "qualified" by the compiler to the paragraph within that SECTION (if there are non-unique paragraph-names referenced in a SECTION, but no paragraph of that name in the SECTION itself, the compiler will spot the error).
With your code, locate the paragraph being PERFORMed, and search sequentially for the paragraph named on the THRU. It is probably just a dumb exit (a paragraph with just an EXIT statement and a full-stop/period). There may be other paragraphs in between those two, if you are very unlucky. Any such paragraphs are included within the range of the PERFORM, without being explicit on the PERFORM statement.
It is worth noting that the EXIT statement itself does nothing. It is a "No Operation" or NOP (or NOOP). At the end of a paragraph which is PERFORMed a number of instructions will be generated to do the exit-processing; this is automatic and does not rely (and never has relied) on the presence of an EXIT statement.
It used to be the case, prior to the 1985 Standard, that EXIT had to be coded in a paragraph on its own. This is no longer the case. You can fill a paragraph with EXIT statements (say 20) and then finish the paragraph with a DISPLAY. PERFORM the paragraph, and you will see the DISPLAY output.
A paragraph which is PERFORMed without a THRU should not contain a GO TO. It is not a compile-error if it does contain a GO TO. It is an accident waiting to happen.
People who feel the need to use GO TO have to use PERFORM ... THRU ... or to use a PERFORM of a SECTION. Unfortunately, even if the original coder does not use GO TO, the use of PERFORM ... THRU ... or PERFORM of a SECTION does make it easy for someone in the future to use GO TO. If there is somewhere for a GO TO to go to, then it is more likely that a GO TO will appear at some point. If there is no existing target for a GO TO, then the next coder will be put off from using GO TO by having to make additional changes.
In the current Standard, from 2014, there are some new versions of EXIT. (The EXIT PROGRAM version of EXIT has been around a long time already, although these days the IBM-inspired GOBACK is more likely to be used to return to the CALLing program).
The new possibilities include EXIT PARAGRAPH and EXIT SECTION. Check the documentation for your compiler to see what variants of EXIT are available. All the variants of EXIT generate executable code, it is just the plain EXIT which does not.
If your compiler does allow EXIT PARAGRAPH and EXIT SECTION, it means you no longer need to have a label to allow use of a (now "secret") GO TO, it just won't be called GO TO, it'll be called EXIT somevariant. Just remember that all those (except EXIT PROGRAM) are GO TOs in disguise, and that code can always be rearranged (and often simplified) to obviate the need for GO TO.
It does require experience to structure for no-GO TO code, so don't be too concerned with it for now. If your site uses GO TO (implied by the use of THRU) then it will be very important that you understand the ramifications of the use of GO TO because existing code will use it.
Then start in a small way to avoid GO TO yourself, increasing the scope of your efforts as you become familiar with techniques to do so. Remember, the aim is to simplify the code, or at least not make it worse, not just to not code GO TO by rote. If you make your code complicated simply to avoid a GO TO, then use GO TO until you know how to do it better.
Many, many, good and experienced COBOL programmers write understandable programs using GO TO. Unfortunately the majority of programmers are not "good and experienced" so are happy to short-circuit something with GO TO and move on.
The VARYING on a PERFORM is a means to repeat the execution, with a starting point, an increment and termination condition. This is the same whether or not THRU is coded.
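As a rough model of the statement in the question (a Python sketch, not COBOL), paragraphs can be treated as an ordered list of named routines. The helper `perform_thru`, the `trace` list and the value 3 for CAR-L are made up for illustration; the paragraph bodies are placeholders, since the real 525-NUMERIC is not shown.

```python
from typing import Callable, List, Tuple

Paragraph = Tuple[str, Callable[[int], None]]


def perform_thru(paragraphs: List[Paragraph], first: str, last: str, k: int) -> None:
    """Run every paragraph from `first` to `last`, in source order."""
    names = [name for name, _ in paragraphs]
    start, end = names.index(first), names.index(last)
    for _, body in paragraphs[start:end + 1]:
        body(k)


trace: List[str] = []
program: List[Paragraph] = [
    ("525-NUMERIC", lambda k: trace.append(f"525-NUMERIC k={k}")),
    ("525-EXIT", lambda k: None),     # a bare EXIT does nothing
]

car_l = 3                 # stands in for CAR-L (assumed value)
k = 1                     # VARYING K FROM 1
while not k > car_l:      # UNTIL K > CAR-L, tested before each pass
    perform_thru(program, "525-NUMERIC", "525-EXIT", k)
    k += 1                # BY 1
```

Because the range is defined purely by position in the list, any paragraph inserted between the two named ones would be executed as well, which is the hazard described above.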
An example of what can happen with an errant GO TO going outside the range of a PERFORM can be found here: https://codegolf.stackexchange.com/a/20889/16411. That is coded with SECTIONs but it could equally be done with PERFORM ... THRU ...
Very good to read this in conjunction with Bruce Martin's answer.
Following on from Bill's answer, I will try and add a more visual answer.
Perform Proc-A thru Proc-D.
...
Proc-A.
....
Proc-B.
....
Proc-C.
....
Proc-D.
....
In the above, the Perform Proc-A thru Proc-D executes procedures Proc-A, Proc-B, Proc-C and Proc-D. It is shorthand for
Perform Proc-A
Perform Proc-B
Perform Proc-C
Perform Proc-D
The Perform Thru syntax has several problems:
It is not always clear what is being executed.
Perform B100-Load-From-DB thru B500-Save-To-Db
I think the following tells you more
Perform B100-Load-From-DB
Perform B200-Adjust-Stock-for-Purchases
Perform B300-Adjust-Stock-for-Sales
Perform B400-Calculate-Markdowns
Perform B500-Save-To-Db
It makes it easy to introduce problems i.e. if you add a procedure in the wrong position, you will be introducing code without realizing it
Proc-B.
....
Proc-in-wrong-position.
....
Proc-C.
....
An error like the above is easy to make but hard to spot
It is one of those COBOL features that seemed like a good idea at the time it was introduced, but should be avoided!
Expounding on both Bill's and Bruce's excellent answers.
Starting with code from Bruce's example.
Perform B100-Load-From-DB thru B500-Save-To-Db
...
B100-Load-From-DB
...
B200-Adjust-Stock-for-Purchases
...
B300-Adjust-Stock-for-Sales
...
Some-Danged-Ol-Thing
...
B400-Calculate-Markdowns
...
B500-Save-To-Db
This still seems reasonably simple enough. Code in each paragraph will be processed in top down order: B100-Load-From-DB, B200-Adjust-Stock-for-Purchases, B300-Adjust-Stock-for-Sales, Some-Danged-Ol-Thing, B400-Calculate-Markdowns, B500-Save-To-Db
As Bill indicated, the addition of GO TO statements to a PERFORM ... THRU block introduces the author of said code to a special quadrant in the Ninth Circle of Hell.
Perform B100-Load-From-DB thru B500-Save-To-Db
...
B100-Load-From-DB
...
GO TO B400-Calculate-Markdowns
...
B200-Adjust-Stock-for-Purchases
...
B300-Adjust-Stock-for-Sales
...
GO TO B200-Adjust-Stock-for-Purchases
...
Some-Danged-Ol-Thing
...
B400-Calculate-Markdowns
...
B500-Save-To-Db
The Hell is exacerbated by yet more "clever thinking" when someone decides to run a portion of the PERFORM ... THRU block.
Perform B200-Adjust-Stock-for-Purchases thru B300-Adjust-Stock-for-Sales
...
Perform B100-Load-From-DB thru B200-Adjust-Stock-for-Purchases
...
Perform B100-Load-From-DB thru Some-Danged-Ol-Thing
There are numerous opportunities to fall through the Looking Glass and continue on into several Alternate Realities. I'm not just pontificating about theoretical possibilities. I'm talking actual experiences in code I've had over the years.
Whilst I was growing up, er I mean learning to code in COBOL, I was all but threatened with being beaten by a large man with a rattan stick if I stepped outta line. As such, I know how to walk on the edge of the precipice and use PERFORM ... THRU with GO TO interspersed throughout and not fall off the edge of the world. However, it makes for very dangerous code that can kill, so more importantly I know how to de-fang code like this and turn it into something civilized.

Good practice to parse data in a custom format

I'm writing a program that takes in input a straight play in a custom format and then performs some analysis on it (like number of lines and words for each character). It's just for fun, and a pretext for learning cool stuff.
The first step in that process is writing a parser for that format. It goes :
####Play
###Act I
##Scene 1
CHARACTER 1. Line 1, he's saying some stuff.
#Comment, stage direction
CHARACTER 2, doing some stuff. Line 2, she's saying some stuff too.
It's quite a simple format. I read extensively about basic parser stuff like CFG, so I am now ready to get some work done.
I have written my grammar in EBNF and started playing with flex/bison but it raises some questions :
Is flex/bison too much for such a simple parser? Should I just write it myself as described here: Is there an alternative for flex/bison that is usable on 8-bit embedded systems?
What is good practice regarding the respective tasks of the tokenizer and the parser itself? There is never a single solution, and for such a simple language they often overlap. This is especially true for flex/bison, where flex can perform some intense stuff with regex matching. For example, should "#" be a token? Should "####" be a token too? Should I create types that carry semantic information so I can directly identify, for example, a character? Or should I just process it with flex the simplest way, then let the grammar defined in bison decide what is what?
With flex/bison, does it make sense to perform the analysis while parsing, or is it more elegant to parse first, then operate on the file again with some other tool?
This got me really confused. I am looking for an elegant, perhaps simple solution. Any guidelines?
By the way, I don't care much about the programming language. For now I am using C because of flex/bison, but feel free to advise me on anything more practical as long as it is a widely used language.
It's very difficult to answer those questions without knowing what your parsing expectations are. That is, an example of a few lines of text does not provide a clear understanding of what the intended parse is; what the lexical and syntactic units are; what relationships you would like to extract; and so on.
However, a rough guess might be that you intend to produce a nested parse, where ##{i} indicates the nesting level (inversely), with i≥1, since a single # is not structural. That violates one principle of language design ("don't make the user count things which the computer could count more accurately"), which might suggest a structure more like:
#play {
#act {
#scene {
#location: Elsinore. A platform before the castle.
#direction: FRANCISCO at his post. Enter to him BERNARDO
BERNARDO: Who's there?
FRANCISCO: Nay, answer me: stand, and unfold yourself.
BERNARDO: Long live the king!
FRANCISCO: Bernardo?
or even something XML-like. But that would be a different language :)
The problem with parsing either of these with a classic scanner/parser combination is that the lexical structure is inconsistent; the first token on a line is special, but most of the file consists of unparsed text. That will almost inevitably lead to spreading syntactic information between the scanner and the parser, because the scanner needs to know the syntactic context in order to decide whether or not it is scanning raw text.
You might be able to avoid that issue. For example, you might require that a continuation line start with whitespace, so that every line not otherwise marked with #'s starts with the name of a character. That would be more reliable than recognizing a dialogue line just because it starts with the name of a character and a period, since it is quite possible for a character's name to be used in dialogue, even at the end of a sentence (which consequently might be the first word in a continuation line.)
If you do intend for dialogue lines to be distinguished by the fact that they start with a character name and some punctuation then you will definitely have to give the scanner access to the character list (as a sort of symbol table), which is a well-known but not particularly respected hack.
Consider the above a reflection about your second question ("What are the roles of the scanner and the parser?"), which does not qualify as an answer but hopefully is at least food for thought. As to your other questions, and recognizing that all of this is opinionated:
Is flex/bison too much for such a simple parser? Should I just write it myself...
The fact that flex and bison are (potentially) more powerful than necessary to parse a particular language is a red herring. C is more powerful than necessary to write a factorial function -- you could easily do it in assembler -- but writing a factorial function is a good exercise in learning C. Similarly, if you want to learn how to write parsers, it's a good idea to start with a simple language; obviously, that's not going to exercise every option in the parser/scanner generators, but it will get you started. The question really is whether the language you're designing is appropriate for this style of parsing, not whether it is too simple.
With flex/bison, does it make sense to perform the analysis while parsing, or is it more elegant to parse first, then operate on the file again with some other tool?
Either can be elegant, or disastrous; elegance has more to do with how you structure your thinking about the problem at hand. Having said that, it is often better to build a semantic structure (commonly referred to as an AST -- abstract syntax tree) during the parse phase and then analyse that structure using other functions.
Rescanning the input file is very unlikely to be either elegant or effective.
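To make the "parse first, then analyse the structure" idea concrete, here is a minimal hand-written sketch in Python rather than flex/bison. It assumes a line-oriented reading of the asker's format: "##" opens a scene, any other "#" line is ignored, and on dialogue lines the speaker ends at the first period, with anything after a comma treated as an inline stage direction. Those rules, and the names `Scene`, `parse_play` and `words_per_character`, are assumptions for illustration, not the asker's grammar.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Scene:
    title: str
    speeches: List[Tuple[str, str]] = field(default_factory=list)  # (character, text)


def parse_play(text: str) -> List[Scene]:
    """Parse phase: build a simple structure, no analysis yet."""
    scenes: List[Scene] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("##") and not line.startswith("###"):
            scenes.append(Scene(title=line.lstrip("#").strip()))
        elif line.startswith("#"):
            continue  # play/act headings and stage directions: skipped here
        else:
            # Assumed rule: speaker ends at the first period.
            speaker, _, speech = line.partition(".")
            name = speaker.split(",")[0].strip()
            if not scenes:
                scenes.append(Scene(title="(no scene)"))
            scenes[-1].speeches.append((name, speech.strip()))
    return scenes


def words_per_character(scenes: List[Scene]) -> Counter:
    """Analysis phase: walks the parsed structure, never the raw file."""
    counts: Counter = Counter()
    for scene in scenes:
        for name, speech in scene.speeches:
            counts[name] += len(speech.split())
    return counts


SAMPLE = """####Play
###Act I
##Scene 1
CHARACTER 1. Line 1, he's saying some stuff.
#Comment, stage direction
CHARACTER 2, doing some stuff. Line 2, she's saying some stuff too.
"""

scenes = parse_play(SAMPLE)
counts = words_per_character(scenes)
```

Further analyses (lines per character, speeches per scene) can then be added as separate functions over the same `scenes` list without touching the parser.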

For the union of 2 conditions, use IF or EVALUATE in COBOL?

"If it works, don't touch it"...I understand. That said, the code I'm extending is studded with blocks like this:
EVALUATE TRUE ALSO TRUE
    WHEN FOO-YES ALSO BAR-YES
        PERFORM ACTION
    WHEN OTHER
        SET ERROR TO TRUE
END-EVALUATE
To my inexperienced eye, IF seems clearer:
IF FOO-YES AND BAR-YES
    PERFORM ACTION
ELSE
    SET ERROR TO TRUE
END-IF
When writing new functionality, is there a reason to prefer EVALUATE over IF?
There is no reason to prefer EVALUATE over IF. I do, but I don't have a good reason for it. I try to code so it is easy for the maintainer to understand, but there are going to be cases where the logic is complex and you end up with constructs like...
EVALUATE TRUE ALSO TRUE ALSO TRUE
    WHEN INITIAL-STATE ALSO ANY ALSO ANY
        PERFORM 0100-INITIALIZE
        PERFORM 1000-DISPLAY-MENU
    WHEN MENU-DISPLAYED ALSO DFHRESP(NORMAL) ALSO UPDATE-REQUESTED
        PERFORM 2000-DO-THE-UPDATE
        EXEC CICS RETURN END-EXEC
    [...and so forth...]
END-EVALUATE
Somewhere there's a quote about how things should be as simple as possible, but no simpler. There are, of course, many ways to code the same logic. Different people find different constructs easier to understand. Lots of arguments occur because of those differences.
You read it correctly. "True ALSO True" is a very odd thing to code.
Often you will see an "Evaluate True", in which case the when conditions all act EXACTLY like IF statements. Using ALSO introduces some possible oddities, as it isn't just like the CASE statements from all the other languages.
A nicer way to write this is:
Evaluate true
    when FOO-YES and BAR-YES
        perform action
    when other
        set error to true
End-Evaluate
Granted, TRUE ALSO TRUE is very easy to understand, but you could have TRUE ALSO WS-BLAH-BLAH and it could get confusing. The COBOL EVALUATE verb is very powerful, and it is sometimes easy to shoot yourself in the foot with it. That said, it will let you do a lot.
It is often the case that convoluted IF's that don't nest cleanly can be fixed up nicely with a well written Evaluate.
I would say EVALUATE makes life easier in certain situations. Say, for instance, we later need to add another condition that PERFORMs a different chunk of code; in that case appending another WHEN is easier than adjusting the conditions on an IF. Here are the points in favour of WHEN compared with adjusting conditions on an IF:
We can provide as many WHENs as we want, one for each paragraph to be performed.
IF conditions are prone to mistakes when combining conditions with OR or AND.
Also, WHEN gives a clear picture of how things flow, whereas we may have to adjust our glasses while analyzing IFs. [This may happen when we have complicated loops, but as efficient programmers we ought to anticipate such scenarios.]
Last but not least, reverting the code to its previous state is easier when there are fewer IFs.
But in your current case, I don't see much difference in clarity between EVALUATE and IF. So, go for the one which makes you happy :)
To me it all comes down to readability (and hence, simpler maintenance later on).
When I look at the 2 code segments in your question, I find the IF block far more readable.
In 22 years of Cobol programming I have never used an ALSO in an EVALUATE.
And using an EVALUATE block for a simple binary true/false test seems a bit obtuse to me.
But don't get me wrong, I love EVALUATEs when used appropriately. When I started programming in Java 6 years ago, EVALUATE was the thing I missed most. Sure, Java has switch, but that can't be used as flexibly as EVALUATE.
Consider the following code segment:
IF ACTION = "START"
    PERFORM START
ELSE IF ACTION = "STOP"
    PERFORM STOP
ELSE IF ACTION = "PROCESS" AND FILE-NAME = "A"
    PERFORM PROCESS-A-FILE
ELSE IF ACTION = "PROCESS" AND FILE-NAME = "B"
    PERFORM PROCESS-B-FILE
ELSE IF ACTION = "RESET"
    PERFORM RESET
ELSE
    PERFORM INVALID-ACTION-ERROR.
This collection of IF statements has some real problems. Without adding 5 END-IFs to the end of it, you have to end the block with a full-stop. But you wouldn't be able to do that if the code was inside a PERFORM UNTIL ... END-PERFORM block, say. While this example is fairly simple, in a more complicated (and longer) example, it could get tricky to figure out which ELSE goes with which IF.
This is crying out to be put in an EVALUATE TRUE block, thus:
EVALUATE TRUE
    WHEN ACTION = "START"
        PERFORM START
    WHEN ACTION = "STOP"
        PERFORM STOP
    WHEN ACTION = "PROCESS" AND FILE-NAME = "A"
        PERFORM PROCESS-A-FILE
    WHEN ACTION = "PROCESS" AND FILE-NAME = "B"
        PERFORM PROCESS-B-FILE
    WHEN ACTION = "RESET"
        PERFORM RESET
    WHEN OTHER
        PERFORM INVALID-ACTION-ERROR
END-EVALUATE
Changing this EVALUATE to have TRUE ALSO TRUE would make it far less readable, and it really isn't necessary.
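For readers more used to other languages, the EVALUATE TRUE block above behaves like an ordered list of conditions in which the first one that holds wins. A loose Python model of that idea follows; the names `RULES` and `evaluate_true` are invented for this sketch, and it returns the paragraph name instead of performing anything.

```python
from typing import Callable, List, Tuple

Rule = Tuple[Callable[[str, str], bool], str]

# Each rule pairs a WHEN condition with the paragraph it would PERFORM.
RULES: List[Rule] = [
    (lambda action, file_name: action == "START", "START"),
    (lambda action, file_name: action == "STOP", "STOP"),
    (lambda action, file_name: action == "PROCESS" and file_name == "A", "PROCESS-A-FILE"),
    (lambda action, file_name: action == "PROCESS" and file_name == "B", "PROCESS-B-FILE"),
    (lambda action, file_name: action == "RESET", "RESET"),
]


def evaluate_true(action: str, file_name: str) -> str:
    """Return the paragraph chosen by the first condition that holds."""
    for condition, paragraph in RULES:
        if condition(action, file_name):
            return paragraph
    return "INVALID-ACTION-ERROR"     # WHEN OTHER
```

Adding a case is then a matter of appending one rule, which mirrors the "just add another WHEN" argument made in the answers above.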

F# tail recursion and why not write a while loop?

I'm learning F# (new to functional programming in general though used functional aspects of C# for years but let's face it, that's pretty different) and one of the things that I've read is that the F# compiler identifies tail recursion and compiles it into a while loop (see http://thevalerios.net/matt/2009/01/recursion-in-f-and-the-tail-recursion-police/).
What I don't understand is why you would write a recursive function instead of a while loop if that's what it's going to turn into anyway. Especially considering that you need to do some extra work to make your function recursive.
I have a feeling someone might say that the while loop is not particularly functional and you want to act all functional and whatnot so you use recursion but then why is it sufficient for the compiler to turn it into a while loop?
Can someone explain this to me?
You could use the same argument for any transformation that the compiler performs. For instance, when you're using C#, do you ever use lambda expressions or anonymous delegates? If the compiler is just going to turn those into classes and (non-anonymous) delegates, then why not just use those constructions yourself? Likewise, do you ever use iterator blocks? If the compiler is just going to turn those into state machines which explicitly implement IEnumerable<T>, then why not just write that code yourself? Or if the C# compiler is just going to emit IL anyway, why bother writing C# instead of IL in the first place? And so on.
One obvious answer to all of these questions is that we want to write code which allows us to express ourselves clearly. Likewise, there are many algorithms which are naturally recursive, and so writing recursive functions will often lead to a clear expression of those algorithms. In particular, it is arguably easier to reason about the termination of a recursive algorithm than a while loop in many cases (e.g. is there a clear base case, and does each recursive call make the problem "smaller"?).
However, since we're writing code and not mathematics papers, it's also nice to have software which meets certain real-world performance criteria (such as the ability to handle large inputs without overflowing the stack). Therefore, the fact that tail recursion is converted into the equivalent of while loops is critical for being able to use recursive formulations of algorithms.
A recursive function is often the most natural way to work with certain data structures (such as trees and F# lists). If the compiler wants to transform my natural, intuitive code into an awkward while loop for performance reasons that's fine, but why would I want to write that myself?
Also, Brian's answer to a related question is relevant here. Higher-order functions can often replace both loops and recursive functions in your code.
The fact that F# performs tail-call optimization is just an implementation detail that allows you to use tail recursion with the same efficiency (and no fear of a stack overflow) as a while loop. But it is just that - an implementation detail - on the surface your algorithm is still recursive and is structured that way, which for many algorithms is the most logical, functional way to represent it.
The same applies to some of the list handling internals as well in F# - internally mutation is used for a more efficient implementation of list manipulation, but this fact is hidden from the programmer.
What it comes down to is how the language allows you to describe and implement your algorithm, not what mechanics are used under the hood to make it happen.
A while loop is imperative by its nature. Most of the time, when using while loops, you will find yourself writing code like this:
let mutable x = ...
...
while someCond do
    ...
    x <- ...
This pattern is common in imperative languages like C, C++ or C#, but not so common in functional languages.
As the other posters have said some data structures, more exactly recursive data structures, lend themselves to recursive processing. Since the most common data structure in functional languages is by far the singly linked list, solving problems by using lists and recursive functions is a common practice.
Another argument in favor of recursive solutions is the tight relation between recursion and induction. Using a recursive solution allows the programmer to think about the problem inductively, which arguably helps in solving it.
Again, as other posters said, the fact that the compiler optimizes tail-recursive functions (obviously, not all functions can benefit from tail-call optimization) is an implementation detail which lets your recursive algorithm run in constant space.
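The correspondence between the two forms can be shown with a small sketch. It is written in Python purely for illustration (the thread is about F#), and the function names are made up. Note that CPython does not eliminate tail calls, so here the recursive version really does consume one stack frame per element; that is exactly the cost the F# compiler's transformation removes.

```python
from typing import Sequence


def sum_recursive(xs: Sequence[int], i: int = 0, acc: int = 0) -> int:
    """Tail-recursive form: the recursive call is the last thing done."""
    if i == len(xs):                  # base case
        return acc
    return sum_recursive(xs, i + 1, acc + xs[i])   # "smaller" problem


def sum_loop(xs: Sequence[int]) -> int:
    """Roughly what a tail-call-optimizing compiler turns the above into."""
    i, acc = 0, 0
    while i != len(xs):
        i, acc = i + 1, acc + xs[i]   # parameters become mutable locals
    return acc
```

The recursive form states the base case and the step separately, while the loop form has to carry the same information in mutable state.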

What is parsing in terms that a new programmer would understand? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 6 years ago.
I am a college student getting my Computer Science degree. A lot of my fellow students really haven't done a lot of programming. They've done their class assignments, but let's be honest here those questions don't really teach you how to program.
I have had several other students ask me questions about how to parse things, and I'm never quite sure how to explain it to them. Is it best to start just going line by line looking for substrings, or just give them the more complicated lecture about using proper lexical analysis, etc. to create tokens, use BNF, and all of that other stuff? They never quite understand it when I try to explain it.
What's the best approach to explain this without confusing them or discouraging them from actually trying.
I'd explain parsing as the process of turning some kind of data into another kind of data.
In practice, for me this is almost always turning a string, or binary data, into a data structure inside my program.
For example, turning
":Nick!User@Host PRIVMSG #channel :Hello!"
into (C)
#include <stddef.h>  /* NULL */

struct irc_line {
    char *nick;
    char *user;
    char *host;
    char *command;
    char **arguments;  /* NULL-terminated list */
    char *message;
} sample = { "Nick", "User", "Host", "PRIVMSG", (char *[]){ "#channel", NULL }, "Hello!" };
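As a hedged illustration of how such a line could be split into those fields, here is a small Python sketch. It assumes the usual nick!user@host prefix form and a single trailing " :message" part; the function name `parse_irc_line` and the dictionary layout are choices made for this example, and real IRC parsing has more cases than this handles.

```python
from typing import Dict, List, Union


def parse_irc_line(line: str) -> Dict[str, Union[str, List[str]]]:
    """Split ':nick!user@host COMMAND args :message' into named fields."""
    if not line.startswith(":"):
        raise ValueError("this sketch only handles lines with a prefix")
    prefix, _, rest = line[1:].partition(" ")
    nick, _, user_host = prefix.partition("!")
    user, _, host = user_host.partition("@")
    head, _, message = rest.partition(" :")    # trailing text after ' :'
    command, *arguments = head.split()
    return {
        "nick": nick,
        "user": user,
        "host": host,
        "command": command,
        "arguments": arguments,
        "message": message,
    }


sample = parse_irc_line(":Nick!User@Host PRIVMSG #channel :Hello!")
```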
Parsing is the process of analyzing text made of a sequence of tokens to determine its grammatical structure with respect to a given (more or less) formal grammar.
The parser then builds a data structure based on the tokens. This data structure can then be used by a compiler, interpreter or translator to create an executable program or library.
If I gave you an English sentence, and asked you to break down the sentence into its parts of speech (nouns, verbs, etc.), you would be parsing the sentence.
That's the simplest explanation of parsing I can think of.
That said, parsing is a non-trivial computational problem. You have to start with simple examples, and work your way up to the more complex.
What is parsing?
In computer science, parsing is the process of analysing text to determine if it belongs to a specific language or not (i.e. is syntactically valid for that language's grammar). It is an informal name for the syntactic analysis process.
For example, consider the language a^n b^n (some number of A characters followed by the same number of B characters). A parser for that language would accept the input AABB and reject the input AAAB. That is what a parser does.
In addition, during this process a data structure could be created for further processing. In my previous example, it could, for instance, store the AA and BB in two separate stacks.
Anything that happens after that, like giving meaning to AA or BB, or transforming it into something else, is not parsing. Giving meaning to parts of an input sequence of tokens is called semantic analysis.
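A recognizer for that example language can be sketched in a few lines of Python. This version uses a single stack to match each B against an earlier A, which is one possible approach rather than the two-stack idea mentioned above, and it assumes n is at least 1 (the answer does not say whether the empty string counts). The name `accepts_anbn` is made up.

```python
def accepts_anbn(text: str) -> bool:
    """Accept n 'A's followed by exactly n 'B's, with n >= 1 (assumed)."""
    stack = []
    i = 0
    while i < len(text) and text[i] == "A":
        stack.append("A")             # remember each A
        i += 1
    while i < len(text) and text[i] == "B" and stack:
        stack.pop()                   # match one B against one A
        i += 1
    # Accept only if all input was consumed and every A was matched.
    return i == len(text) and not stack and i > 0
```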
What isn't parsing?
Parsing is not transforming one thing into another. Transforming A into B is, in essence, what a compiler does. Compiling takes several steps; parsing is only one of them.
Parsing is not extracting meaning from a text. That is semantic analysis, a step of the compiling process.
What is the simplest way to understand it?
I think the best way to understand the parsing concept is to begin with the simpler concepts. The simplest one in the subject of language processing is the finite automaton. It is a formalism for parsing regular languages, such as those described by regular expressions.
It is very simple: you have an input, a set of states and a set of transitions. Consider the following language built over the alphabet { A, B }: L = { w | w starts with 'AA' or 'BB' }. The automaton below represents a possible parser for that language, all of whose valid words start with 'AA' or 'BB'.
      A-->(q1)--A-->(qf)
     /
(q0)
     \
      B-->(q2)--B-->(qf)
It is a very simple parser for that language. You start at (q0), the initial state, and read a symbol from the input. If it is A, you move to the (q1) state; otherwise (it is a B, since the alphabet is only A and B) you move to the (q2) state, and so on. If you reach the (qf) state, the input is accepted.
Because it is visual, you only need a pencil and a piece of paper to explain what a parser is to anyone, including a child. I think this simplicity is what makes automata the most suitable way to teach language processing concepts such as parsing.
Finally, as a Computer Science student, you will study these concepts in depth in theoretical computer science classes such as Formal Languages and Theory of Computation.
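The same automaton can also be written down as a transition table. The Python sketch below is one possible encoding; the names TRANSITIONS and accepts are made up for the example, and the self-loops on qf are an assumption that follows from "starts with" (the diagram itself stops at qf).

```python
# Transition table for the automaton drawn above.
# The self-loops on "qf" are an assumption: the diagram stops at (qf), but
# "starts with AA or BB" means any remaining symbols are allowed.
TRANSITIONS = {
    ("q0", "A"): "q1",
    ("q0", "B"): "q2",
    ("q1", "A"): "qf",
    ("q2", "B"): "qf",
    ("qf", "A"): "qf",
    ("qf", "B"): "qf",
}


def accepts(word):
    """Run the automaton over word and report whether it ends in qf."""
    state = "q0"
    for symbol in word:
        state = TRANSITIONS.get((state, symbol))
        if state is None:  # no transition defined: reject immediately
            return False
    return state == "qf"


print(accepts("AAB"))  # True
print(accepts("ABB"))  # False
```

A missing entry in the table plays the role of a dead state, so inputs such as AB are rejected as soon as the second symbol is read.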
Have them try to write a program that can evaluate arbitrary simple arithmetic expressions. This is a simple problem to understand but as you start getting deeper into it a lot of basic parsing starts to make sense.
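For reference, one way such an exercise might end up is a small recursive-descent evaluator. The sketch below is only one possible solution, assuming the expressions contain numbers, + - * /, unary minus and parentheses; the names tokenize and evaluate and the grammar in the comments are choices made for the example.

```python
import re

# One token is a number or a single operator/parenthesis, after optional spaces.
TOKEN = re.compile(r"\s*(\d+\.?\d*|[-+*/()])")


def tokenize(text):
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character near position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def evaluate(text):
    tokens = tokenize(text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        token = peek()
        pos += 1
        return token

    def expression():  # expression := term (('+' | '-') term)*
        value = term()
        while peek() in ("+", "-"):
            if advance() == "+":
                value += term()
            else:
                value -= term()
        return value

    def term():  # term := factor (('*' | '/') factor)*
        value = factor()
        while peek() in ("*", "/"):
            if advance() == "*":
                value *= factor()
            else:
                value /= factor()
        return value

    def factor():  # factor := NUMBER | '(' expression ')' | '-' factor
        token = advance()
        if token == "(":
            value = expression()
            if advance() != ")":
                raise ValueError("expected ')'")
            return value
        if token == "-":
            return -factor()
        if token is None or not token[0].isdigit():
            raise ValueError(f"unexpected token {token!r}")
        return float(token)

    result = expression()
    if peek() is not None:  # leftover input means the expression was malformed
        raise ValueError(f"unexpected token {peek()!r}")
    return result


print(evaluate("2 + 3 * (4 - 1)"))  # 11.0
```

Each grammar rule becomes one function, which is why operator precedence falls out of the structure: term is called from expression, so * and / bind tighter than + and -.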
Parsing is about READING data in one format, so that you can use it to your needs.
I think you need to teach them to think like this. So, this is the simplest way I can think of to explain parsing for someone new to this concept.
We usually try to parse data one line at a time because it is easier for humans to think this way, dividing and conquering, and it is also easier to code.
We call each smallest indivisible piece of data a field. For example, Name is a field, Age is another field, and Surname is another.
A line can contain several fields. To tell them apart, we can either delimit the fields with separators or give each field a fixed maximum length.
For example:
By separating fields by comma
Paul,20,Jones
Or by padding with spaces to a fixed width (Name up to 20 letters, Age up to 3 digits, Surname up to 20 letters)
Paul                020Jones
Each set of fields above is called a record.
To separate delimited records from one another, we also need a record delimiter. A dot is enough (though you can also use CR/LF).
A list could be:
Michael,39,Jordan.Shaquille,40,O'neal.Lebron,24,James.
or with CR/LF
Michael,39,Jordan
Shaquille,40,O'neal
Lebron,24,James
You can ask them to list 10 NBA (or NFL) players they like and type them in according to a format, then write a program to parse the list and display each record. One group can write its list in comma-separated format and a program that parses fixed-width lists, and vice versa.
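A possible solution to that exercise, sketched in Python, is shown here. The function names and the dictionary keys are invented for the example, and the column positions assume the 20/3/20 layout described above.

```python
def parse_delimited(text, record_sep=".", field_sep=","):
    """Split records on record_sep, then fields on field_sep."""
    records = []
    for raw in text.split(record_sep):
        raw = raw.strip()
        if raw:  # skip the empty piece after the last separator
            name, age, surname = raw.split(field_sep)
            records.append({"name": name, "age": int(age), "surname": surname})
    return records


def parse_fixed_width(line):
    """Name: columns 0-19, age: columns 20-22, surname: columns 23-42."""
    return {
        "name": line[0:20].strip(),
        "age": int(line[20:23]),
        "surname": line[23:43].strip(),
    }


players = "Michael,39,Jordan.Shaquille,40,O'neal.Lebron,24,James."
for record in parse_delimited(players):
    print(record["name"], record["age"], record["surname"])

fixed = "Paul".ljust(20) + "020" + "Jones".ljust(20)
print(parse_fixed_width(fixed))
```

Passing record_sep="\n" handles the CR/LF variant. Note that a dot as record separator stops working as soon as a field itself contains a dot, which is a good discussion point for the class.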
Parsing to me is breaking down something into meaningful parts... using a definable or predefined known, common set of part "definitions".
For programming languages there would be keyword parts, usable punctuation sequences...
For pumpkin pie it might be something like the crust, filling and toppings.
For written languages there might be what a word is, a sentence, what a verb is...
For spoken languages it might be tone, volume, mood, implication, emotion, context
Syntax analysis (as well as common sense, after all) would tell you whether what you are parsing is a pumpkin pie or a programming language. Does it have a crust? Well, maybe it's pumpkin pudding, or perhaps a spoken language!
One thing to note about parsing stuff is there are usually many ways to break things into parts.
For example you could break up a pumpkin pie by cutting it from the center to the edge or from the bottom to the top or with a scoop to get the filling out or by using a sledge hammer or eating it.
And how you parse things would determine if doing something with those parts will be easy or hard.
In the "computer languages" world, there are common ways to parse text source code. These common methods (algorithms) have titles or names. Search the Internet for common methods/names for ways to parse languages. Wikipedia can help in this regard.
In linguistics, to parse is to divide language into small components that can be analyzed. For example, parsing this sentence would involve dividing it into words and phrases and identifying the type of each component (e.g., verb, adjective, or noun).
Parsing is a very important part of many computer science disciplines. For example, compilers must parse source code to be able to translate it into object code. Likewise, any application that processes complex commands must be able to parse the commands. This includes virtually all end-user applications.
Parsing is often divided into lexical analysis and semantic parsing. Lexical analysis concentrates on dividing strings into components, called tokens, based on punctuation and other keys. Semantic parsing then attempts to determine the meaning of the string.
http://www.webopedia.com/TERM/P/parse.html
Simple explanation: Parsing is breaking a block of data into smaller pieces (tokens) by following a set of rules (using delimiters, for example), so that the data can be processed piece by piece (managed, analysed, interpreted, transmitted, etc.).
Examples: Many applications (like spreadsheet programs) use the CSV (Comma Separated Values) file format to import and export data. The CSV format lets applications process this data with the help of a dedicated parser.
Web browsers have special parsers for HTML and CSS files. JSON parsers exist. All special file formats must have some parsers designed specifically for them.
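In practice you rarely write these parsers yourself. As a small illustration, Python's standard library ships csv and json modules; the sample data below is made up for the example.

```python
import csv
import io
import json

# CSV: each line becomes a list of string fields.
csv_text = "Michael,39,Jordan\nShaquille,40,O'neal\n"
rows = list(csv.reader(io.StringIO(csv_text)))
print(rows[0])  # ['Michael', '39', 'Jordan']

# JSON: the text becomes nested Python dicts, lists, strings and numbers.
json_text = '{"name": "Lebron", "age": 24, "surname": "James"}'
player = json.loads(json_text)
print(player["age"] + 1)  # 25
```

In both cases the parser only turns text into a data structure. What the program then does with the rows or the dictionary is a separate step.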

Resources