I am currently using the Lark parser for Python to try to read in some problem specifications. I am confused about what the "proper" syntax is for Extended Backus-Naur Form, especially about how the LHS and RHS are separated. The Wikipedia page uses an equals sign (=), while Lark expects just a colon; see the Lark cheat sheet. Other sources use the ::= separator, e.g. the Atom ebnf package.
Is there a definitive answer? The official ISO spec seems to suggest that the "defining-symbol" should be = but there seems to be wriggle room in the spec. So why all the different versions?
Since the world hasn't yet appointed a Lord High Commissioner of Grammar Formalisms, there is no definitive syntax. You're certainly free to use the ISO "Extended BNF" standard, particularly if you're writing some other ISO standard, but don't expect it to be implemented by a parser generator, even one which extends normal BNF. (There's no definitive standard for BNF, either.)
I have no way of knowing what was going on in the minds of the authors of the ISO standard, but I suspect that their expectations were realistic: it's intended to allow precise description of syntaxes for standards documents, but there are many features which are not suitable for automated implementation (including a way of writing rule restrictions in English to be used when the formalism isn't sufficiently general). It's often possible to automatically extract (most of) a grammar from an ISO standard, but the task is neither simple nor -- as far as I can see -- intended to be simple, since most ISO standards are not distributed as plain text documents and extracting formatted text from either PDF or HTML formats presents its own challenges.
The options you present for punctuation are most of the common ones, although mathematicians often write BNF using ⇒ to separate left- and right-hand sides. (Unfortunately, most keyboards lack that useful character.)
I'm personally not fond of the ::= separator, although it is used by various parser generators. It seems to me to be way too much typing for a simple punctuator, and it is also annoyingly difficult to align with alternatives flagged with |. But to each their own.
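To make the point concrete that the separator is only surface punctuation, here is a minimal sketch in Python that reads a single rule written with any of the three separators mentioned in the question. The function name split_rule and the one-rule-per-string input are assumptions made for illustration; this is not how Lark or any other tool parses grammars.

```python
import re

# Separators seen in the wild: "::=" (classic BNF), "=" (ISO EBNF), ":" (Lark).
# The longest alternative comes first so "::=" is not read as a bare ":".
_SEPARATOR = re.compile(r"\s*(::=|:|=)\s*")


def split_rule(rule):
    """Return (lhs, rhs) for one rule, whichever separator it uses."""
    # With a capturing group and maxsplit=1, split() yields [lhs, sep, rhs].
    parts = _SEPARATOR.split(rule.strip(), maxsplit=1)
    if len(parts) != 3:
        raise ValueError("no separator found in rule: %r" % rule)
    lhs, _sep, rhs = parts
    return lhs, rhs


for text in ("expr = term", "expr ::= term", "expr: term"):
    print(split_rule(text))
```

All three spellings yield the same (lhs, rhs) pair, which is why each tool is free to pick whichever punctuation its author prefers.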
I am making something like a formula validator and I am using the ParseKit framework to accomplish that. My approach is to create a proper grammar, and when the didMatchFormula callback method is called on a sample string I assume a formula has been found and therefore the string is valid.
There is one difficulty, however: a formula is detected in the sample string even if the string also contains other characters following the formula part. I would need something like a greedy mode for matching, where the entire string is matched against the formula grammar so that didMatchFormula is called only if the string contains a formula and no other characters.
Can you give me some hints on how to accomplish that with ParseKit or in some other way?
I cannot use regular expressions since my formulas would use recursion and regexp is not a good tool for handling that.
Developer of ParseKit here.
Probably the simplest and most elegant way to do this with ParseKit (or any parsing toolkit) is to design your formula language to have a terminator char after every statement. This would be the same concept as ; terminating statements in most C-like programming languages.
Here's an example toy formula language which uses . as the statement terminator:
#start = lang;
lang = statement+;
statement = Word+ terminator;
terminator = '.';
Notice how I have designed the language so that your "greedy" requirement is an inherent feature of the language. Think about it – if the input string ends with any junk content which is not a valid statement ending in a ., my lang production will not find a match and the parse will fail.
With this type of design, you won't need any "greedy" features in the parsing toolkit you use. Rather, your requirement will be naturally met by your language design.
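For readers who want to see the effect without ParseKit, here is a rough hand-written sketch in Python of the same toy language (one or more statements, each being one or more words followed by a '.'). It is not ParseKit code; the names tokenize and is_valid_lang are made up, and treating a Word as a run of ASCII letters is an assumption.

```python
import re

# Assumption: a Word is a run of ASCII letters and '.' is the terminator,
# mirroring the toy grammar: lang = statement+; statement = Word+ '.';
_TOKEN = re.compile(r"\s*([A-Za-z]+|\.)")


def tokenize(text):
    """Split text into words and '.' tokens; return None on any other character."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        match = _TOKEN.match(text, pos)
        if match is None:
            return None
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def is_valid_lang(text):
    """True only if the whole input is one or more terminated statements."""
    tokens = tokenize(text)
    if not tokens:
        return False
    pos = 0
    while pos < len(tokens):
        start = pos
        # Word+ : consume tokens up to the next terminator.
        while pos < len(tokens) and tokens[pos] != ".":
            pos += 1
        if pos == start or pos == len(tokens):
            return False  # no words before '.', or the terminator is missing
        pos += 1  # consume the '.'
    return True


print(is_valid_lang("foo bar. baz."))  # every statement is terminated
print(is_valid_lang("foo bar. junk"))  # trailing words without a terminator
```

Because every statement must end in the terminator, trailing junk cannot be silently ignored: the second call is rejected even though it starts with a valid statement.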
I am trying to understand how to use EBNF to define a formal grammar, in particular a sequence of words separated by a space, something like
<non-terminal> [<word>[ <word>[ <word>[ ...]]] <non-terminal>
What is the correct way to define a word terminal?
What is the correct way to represent required whitespace?
How are optional, repetitive lists represented?
Are there any show-by-example tutorials on EBNF anywhere?
Many thanks in advance!
You have to decide whether your lexical analyzer is going to return a token (terminal) for the spaces. You also have to decide how it (the lexical analyzer) is going to define words, or whether your grammar is going to do that (in which case, what is the lexical analyzer going to return as terminals?).
For the rest, it is mostly a question of understanding the niceties of EBNF notation, which is an ISO standard (ISO 14977:1996 — and it is available as a free download from Freely Available Standards, which you can also get to from ISO), but it is a standard that is largely ignored in practice. (The languages I deal with — C, C++, SQL — use a BNF notation in the defining documents, but it is not EBNF in any of them.)
Whatever you want to make the correct definition of a word. You need to think about how you'd want to treat the name P. J. O'Neill, for example. What tokens will the lexical analyzer return for that?
This is closely related to the previous issue; what are the terminals that lexical analyzer is going to return.
Optional repetitive lists are enclosed in { and } braces, or you can use the Kleene Star notation.
There is a paper Extended BNF — A generic base standard by R. S. Scowen that explains EBNF. There's also the Wikipedia entry on EBNF.
I think that a non-empty, space-separated word list might be defined using:
non_empty_word_list = word { space word }
where all the names there are non-terminals. You'd need to define those in terms of the relevant terminals of your system.
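As a rough illustration of how the rule word { space word } maps onto code, here is a small Python sketch. It assumes, purely for the example, that a word is a run of ASCII letters and that space is exactly one space character; as noted above, those definitions are decisions you have to make yourself (a name like O'Neill would be rejected by this sketch).

```python
import re

# Assumption for illustration only: a word is one or more ASCII letters.
_WORD = re.compile(r"[A-Za-z]+")


def parse_word_list(text):
    """Parse `word { space word }`; return the words or raise ValueError."""
    match = _WORD.match(text)
    if match is None:
        raise ValueError("expected a word at position 0")
    words = [match.group()]
    pos = match.end()
    # { space word } : the braces mean "repeat zero or more times".
    while pos < len(text):
        if text[pos] != " ":
            raise ValueError("expected a space at position %d" % pos)
        match = _WORD.match(text, pos + 1)
        if match is None:
            raise ValueError("expected a word at position %d" % (pos + 1))
        words.append(match.group())
        pos = match.end()
    return words


print(parse_word_list("alpha beta gamma"))
```

The loop body corresponds directly to the braces in the rule: each pass must find one space followed by one word, and the list ends when the input does.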
Notation in question is described by example below:
T for type
P for pointer
F for field
A for argument
L for local
Et cetera; at least S is missing from the list, but I'm not sure which string it designates.
The first three prefixes have been in Delphi since the very beginning; the last two I noticed relatively recently. I'd like to know the notation's name (if any) and read some normative whitepaper (and then perhaps adopt it).
Zarko Gajic has a pretty good Delphi-specific list here:
http://delphi.about.com/od/standards/l/bldnc.htm
Personally, I find some conventions like this useful. I still remember my first language, FORTRAN, where the convention for integers was to start their names with any letter from I to N, and it was easy to remember because those are the first two letters of INteger.
Section "3.3 Field Naming" of the Object Pascal Style Guide by Charles Calvert gives a brief but good guide as to when to use Hungarian notation, and also what single character identifier names are appropriate. My FORTRAN background (8 character names max) also made me use "N" as the count of items and led to code such as:
      DO 10 I = 1, N
        DO 20 J = I, N
          ...
   20   CONTINUE
   10 CONTINUE
Ouch! The memories hurt.
My personal favorite of all these standards is to obey the standards already established in the code you're in, not to try to impose a different standard 50% of the way through, and to religiously avoid bikeshed discussions.
But if you press me really hard, I'll admit, I prefer Charlie Calvert's standards as used by JVCL devs, same as "section 3.3" link by LKessler above.
Hungarian notation.
With modern IDEs (including Delphi's) many people (myself included) feel it is no longer necessary.
EDIT: Technically this is not true Hungarian notation, as sometimes the prefix indicates the scope rather than the type.
Let me explain. Suppose I want to teach Python to someone who only speaks Spanish. As you know, in most programming languages all keywords are in English. How complex would it be to create a program that will find all keywords in a given source code and translate them? Would I need to use a parser and stuff, or will a couple of regexes and string functions be enough?
If it depends on the source programming language, then Python and Javascript would be the most important.
What I mean by "how complex would it be" is that would it be enough to have a list of keywords, and parse the source code to find keywords not in quotes? Or are there enough syntactical weirdnesses that something more complicated is required?
If all you want is to translate keywords, then (while you definitely DO need a proper parser, as otherwise avoiding any change in strings, comments &c becomes a nightmare) the task is quite simple. For example, since you mentioned Python:
import cStringIO
import keyword
import token
import tokenize

samp = '''\
for x in range(8):
  if x%2:
    y = x
    while y>0:
      print y,
      y -= 3
    print
'''

translate = {'for': 'per', 'if': 'se', 'while': 'mentre', 'print': 'stampa'}

def toks(tokens):
    for tt, ts, src, erc, ll in tokens:
        if tt == token.NAME and keyword.iskeyword(ts):
            ts = translate.get(ts, ts)
        yield tt, ts

def main():
    rl = cStringIO.StringIO(samp).readline
    toki = toks(tokenize.generate_tokens(rl))
    print tokenize.untokenize(toki)

main()
I hope it's obvious how to generalize this to "translate" any Python source and in any language (I'm supplying only a very partial Italian keyword translation dict). This emits:
per x in range (8 ):
  se x %2 :
    y =x
    mentre y >0 :
      stampa y ,
      y -=3
    stampa
(strange though correct whitespace, but that could be easily enough remedied). As an Italian speaker I can tell you this is terrible to read, but that's par for the course for any "programming language translation" as you desire. Worse, NON-keywords such as range remain un-translated (as per your specs) -- of course, you don't have to constrain your translation to keywords-only (it's easy enough to remove the if that does that above;-).
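The code in this answer is Python 2 (cStringIO, the print statement). For anyone on Python 3, here is a hedged sketch of the same idea. Instead of untokenize it splices the replacements into the original text using the token positions, which keeps the original whitespace. The name translate_keywords and the tiny mapping are illustrative only. Note that print is an ordinary function in Python 3, so keyword.iskeyword no longer reports it.

```python
import io
import keyword
import tokenize

# Very partial, illustrative Italian mapping (same words as in the answer).
TRANSLATE = {"for": "per", "if": "se", "while": "mentre"}


def translate_keywords(source, mapping):
    """Replace keyword tokens via the tokenizer, leaving strings and comments alone."""
    lines = source.splitlines(keepends=True)
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    edits = []
    for tok in tokens:
        if (tok.type == tokenize.NAME and keyword.iskeyword(tok.string)
                and tok.string in mapping):
            edits.append((tok.start, tok.end, mapping[tok.string]))
    # Apply the edits back to front so earlier column offsets stay valid.
    for (row, col), (_end_row, end_col), new in reversed(edits):
        line = lines[row - 1]
        lines[row - 1] = line[:col] + new + line[end_col:]
    return "".join(lines)


SAMPLE = '''\
for x in range(8):
    if x % 2:
        s = "for if while"  # for
        while x > 0:
            x -= 3
'''

print(translate_keywords(SAMPLE, TRANSLATE))
```

The words for, if and while inside the string literal and the comment are left untouched, because the tokenizer reports them as STRING and COMMENT tokens rather than NAME tokens.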
The problem you will encounter is that, unless you have strict coding standards, people will not necessarily follow a consistent pattern in how they write their code. And in any dynamic language you will have the problem that code passed to the eval function contains keywords inside quotes.
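The eval point can be seen with a short Python sketch (the helper name keyword_tokens is made up for this example). A tokenizer treats the argument of eval as one opaque string, so any keywords inside it are invisible to a keyword translator unless that string is processed separately.

```python
import io
import keyword
import tokenize


def keyword_tokens(source):
    """Return the keywords that the tokenizer reports as real NAME tokens."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return [tok.string for tok in tokens
            if tok.type == tokenize.NAME and keyword.iskeyword(tok.string)]


# Same expression twice: once as plain code, once hidden inside a string.
print(keyword_tokens('y = 1 if x else 0\n'))
print(keyword_tokens('y = eval("1 if x else 0")\n'))
```

The first call lists if and else; the second lists nothing, even though the evaluated code contains the same keywords.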
If you are trying to teach a language, you could create a DSL that has keywords in Spanish, so that you can teach in your language and it can be processed in Python or JavaScript. You have then basically made your own language, with the constructs you want, for teaching.
Once they understand how to program, they will then need to start learning languages with the "English" keywords, so that they can communicate with others, but that could come after they understand how to program, if it would make your life easier.
So, to answer your question, there is enough syntactic weirdness that it would be considerably more complicated to translate the keywords.
This is not an optimistic answer nor a great one. However, I feel it has some merit.
I can speak about C# and the translation is not worth it. Here are reasons:
C# is based on English but it is not English literature per se. For example, what would "var" or "int" be in Spanish?
It is possible to create a program to let you use Spanish words in place of English keywords like "for", "in" and "as". However, some Spanish equivalent words may be compound words (two words instead of one, dealing with space can get tricky) or an English keyword may not have a direct Spanish equivalent.
Debugging may get tricky. Converting to English and to Spanish and back to English then Spanish has the marks of "loaded with bugs" written all over it.
The user will not have the benefit of existing learning resources. All C# code examples are written the way Microsoft designed the language. No one will try to Spanish-ize the syntax just for the few users who will use your app.
I have seen a few people discuss C# code in languages other than English. In all cases the authors explain the code in their native language but write it in English-looking code, as it naturally is. The best approach seems to be to learn enough English to be comfortable with C# as it naturally is.
It would be impossible to make a translation that would handle every case. Take for example this Javascript code:
var x = Math.random() < 0.5 ? window : { location : { href : '' } };
var y = x.location.href;
The x variable can either become a reference to the window object, or a reference to the newly created object. It would only make sense to translate the members if it's the window object, otherwise you would have to translate the variable names too, which would be a mess and could easily cause problems.
Besides, it's not really useful to know a language in the wrong language. All the documentation and examples out there are going to be in the original language, so they would be useless.
Keep in mind that English is the de facto language for keywords in commonly used programming languages. So, for purely educational purposes, teaching in a translated language can be harmful to your student(s).
But if you really want to translate a computer language's tokens, you should think about the following issues:
You should translate the language's primitive constructs. This is easy: you have to learn and use a basic parser generator like yacc or ANTLR.
You should translate the language's APIs. This can be painful and difficult: first, modern APIs like Java's are very extensive; second, you have to translate the APIs' documentation... no more words about that.
While I don't have an answer to the question, I think it's an interesting one. It brings up some issues which I have been thinking about:
As developing countries start introducing their population to higher technologies, naturally some will be interested in learning to program. Will English-only programming languages be an impediment?
Let's say a programming language was developed in a non-English part of the world: the keywords were written in the native language for that area and it used the native punctuation (eg, «» instead of " ", a comma as the decimal point (123,45), and so forth). It's a fantastic programming language, generating lots of buzz. Do you think it would see widespread adoption? Would you use it?
Most English-speaking people answer "no" to the first question. Even non-English (but educated) people answer no. But they also answer "no" to the second question, which seems to be a contradiction.
At one point I was thinking about something like that for bash scripts, but the idea can be implemented in other languages too:
#!/bin/bash
PrintOnScreen() {
    echo "$1 $2 $3 $4 $5 $6 $7 $8 $9"
}
PrintOnScreenWithoutNewline() {
    echo -n "$1 $2 $3 $4 $5 $6 $7 $8 $9"
}
MathAdd() {
    expr $1 + $2
}
Then we can add this to some script:
#!/bin/bash
. HumanLanguage.sh
PrintOnScreen Hello
PrintOnScreenWithoutNewline "Some number:"
MathAdd 2 3
This will produce:
Hello
Some number: 5
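Since the question is about Python, here is a rough equivalent of the same wrapper idea as a Python sketch. The Spanish function names are made up for illustration. As with the bash version, this only renames functions; keywords such as for and if cannot be wrapped this way.

```python
def imprimir_en_pantalla(*partes):
    """Hypothetical Spanish-named wrapper around print()."""
    print(*partes)


def sumar(a, b):
    """Hypothetical Spanish-named wrapper around addition."""
    return a + b


imprimir_en_pantalla("Hola")
imprimir_en_pantalla("Un número:", sumar(2, 3))
```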
You might find Perl's Lingua::Romana::Perligata interesting: it allows you to write your Perl programs in Latin. It's not quite the same as your idea, as it essentially restructures the language semantics around Latin ideas, rather than just translating the strings.
It is relatively easy to translate the keywords from one programming language into another language. There are several non-English-based programming languages, including Chinese Python, which replaces English keywords with Chinese keywords.
It would be much more difficult to translate each individual variable name from English into another natural language. If two different English variable names had only one translation in another language, there would be a name collision.
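A small Python sketch of the collision problem, assuming a hypothetical English-to-Spanish dictionary of identifier names (the function name find_collisions and the dictionary are invented for this example):

```python
from collections import defaultdict


def find_collisions(mapping):
    """Group source names by translation; return translations shared by 2+ names."""
    by_translation = defaultdict(list)
    for source_name, translated in mapping.items():
        by_translation[translated].append(source_name)
    return {translated: sorted(names)
            for translated, names in by_translation.items()
            if len(names) > 1}


# Hypothetical identifier dictionary: "count" and "account" both become "cuenta".
names = {"count": "cuenta", "account": "cuenta", "total": "total"}
print(find_collisions(names))
```

A translator would have to run a check like this per scope and rename one of the clashing identifiers, which is part of why translating variable names is harder than translating a fixed keyword list.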