Are there any example ontologies where the same word has different meanings in different contexts?
For example, when building an ontology for a large company, it is not uncommon for different departments and systems to have a different definition and understanding of common words like "customer", "account", etc.
Is there a generally accepted way to model this in Protege that preserves the original words in their context, while also introducing a layer of disambiguating words for enterprise use?
This is a problem we encounter often in the biological community. For example, the concept Eye is very dependent on context: human eye vs. fish eye vs. spider eye, etc. You can see a search for eye on the Ontology Lookup Service (OLS) and the results it returns for eye from different ontologies. Disclosure: I am responsible for this tool.
Provide an IRI for your concept. This IRI should act like a surrogate key for your concept. That is, instead of giving your Account concept an IRI like http://MyBusiness/someBusinessContex/Account, you give it an IRI like http://MyBusiness/someBusinessContex/Context0000001. For the Eye concept, the IRI for a human eye is http://purl.obolibrary.org/obo/NCIT_C12401 and for an insect eye it is http://purl.obolibrary.org/obo/SIBO_0000086.
I explain in this StackOverflow question the reason for using "surrogate keys".
Assign a context-specific label and definition to your concept. You can use rdfs:label for the label and rdfs:comment or skos:definition for the definition.
You may find that you need alternative names for your concept; for example, maybe you also refer to customers as members. In this case you can use skos:altLabel to provide alternative names for your concept and skos:prefLabel to define a preferred label.
So how does this work? For user interfaces, you use rdfs:label/skos:prefLabel and rdfs:comment/skos:definition for display purposes. From a data integration perspective, you use the IRI.
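As a minimal sketch of this pattern, here is how the disambiguation layer could be attached to an opaque IRI using Python's rdflib (the library choice and the literal values are my assumptions, not part of the original answer):

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import SKOS

# Hypothetical context-neutral IRI acting as a surrogate key
ctx = Namespace("http://MyBusiness/someBusinessContex/")
concept = ctx.Context0000001

g = Graph()
# Context-specific preferred label, alternative label, and definition
g.add((concept, SKOS.prefLabel, Literal("Customer", lang="en")))
g.add((concept, SKOS.altLabel, Literal("Member", lang="en")))
g.add((concept, SKOS.definition, Literal("A party that purchases goods from the retail division.", lang="en")))

print(g.serialize(format="turtle"))

A user interface would display the prefLabel, while data-integration code would join on the IRI alone.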
I have a database of words (including nouns and verbs). Now I would like to generate all the different (inflected) forms of those nouns and verbs. What would be the best strategy to do this?
As Latin is a highly inflected language, there is:
a) the declension of nouns
b) the conjugation of verbs
See this translated page for an example of a verb's conjugation ("mandare"): conjugation
I don't want to type in all those forms for all the words manually.
How can I generate them automatically? What is the best approach?
a list of complex rules describing how to inflect all the words
Bayesian methods
...
There's a program called "William Whitaker's Words". It creates inflections for Latin words as well, so it does exactly what I want to do.
Wikipedia says that the program works like this:
Words uses a set of rules based on natural pre-, in-, and suffixation, declension, and conjugation to determine the possibility of an entry. As a consequence of this approach of analysing the structure of words, there is no guarantee that these words were ever used in Latin literature or speech, even if the program finds a possible meaning to a given word.
The program's source is also available here. But I don't really understand how it is supposed to work. Can you help me? Maybe this would be the solution to my question...
You could do something similar to the hunspell dictionary format (see http://www.manpagez.com/man/4/hunspell/).
You define two tables. One contains the roots of the words (the part that never changes), and the other contains modifications for a given class. For a given class, for each declension (or conjugation), it tells you what characters to add at the end (or the beginning) of the root. It can even specify replacing a given number of characters. Now, to get a word in a specific declension, you take the root, apply the transformation from the class it belongs to, and voilà!
For example, for mandare, the root would be mand, and the class would contain suffixes like o, as, at, amus, atis, ant for the active indicative present.
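A minimal Python sketch of this two-table idea (the table contents are my assumption and cover only the first-conjugation active indicative present):

# Table 1: inflection classes, mapping a class name to its suffix list
SUFFIXES = {
    "conj1_act_ind_pres": ["o", "as", "at", "amus", "atis", "ant"],
}

# Table 2: lemmas, mapping to (root, inflection class)
ROOTS = {
    "mandare": ("mand", "conj1_act_ind_pres"),
}

def inflect(lemma):
    root, inflection_class = ROOTS[lemma]
    return [root + suffix for suffix in SUFFIXES[inflection_class]]

print(inflect("mandare"))
# ['mando', 'mandas', 'mandat', 'mandamus', 'mandatis', 'mandant']

Irregular words would bypass the class table and carry their own explicit form lists, which is essentially what hunspell's affix files let you express.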
I'll use as example the nouns, but it applies also to verbs.
First, I would create two classes: Regular and Irregular. For the regular nouns, I would make three classes for the three declensions, and make them all implement a Declensable (or however the word is in English :) interface (FirstDeclension extends Regular implements Declensable). The interface would define two static enums (NOMINATIVE, VOCATIVE, etc., and SINGULAR, PLURAL).
All would have a string for the root and a static hashmap of suffixes. The method FirstDeclension#get(case, number) would then append the right suffix based on the hashmap.
The Irregular class would have to define a local hashmap for each word and then implement the same Declensable interface.
Does it make any sense?
Addendum: To clarify, the constructor of class Regular would be

public Regular(String stem) {
    this.stem = stem;
}
Perhaps, you could follow the line of AOT in your implementation. (It's under LGPL.)
http://prometheus.altlinux.org/en/Sisyphus/srpms/aot
http://seman.sourceforge.net/
http://aot.ru/
There's no Latin morphology in AOT, only Russian, German, and English; but Russian is an example of an inflectional morphology as complex as Latin's, so AOT should be ready to serve as a framework for implementing it.
Still, I believe one has to have an elaborate, precise formal system for the morphology already clearly defined before one goes on to programming. As for Russian, I guess most of the working morphological computer systems are based on the serious analysis of Russian morphology done by Andrey Zalizniak in the Grammatical Dictionary of Russian and related works.
I have what I imagine will be a fairly involved technical challenge: I want to be able to reliably alpha-rename identifiers in multiple languages (as many as possible). This will require special consideration for each language, and I'm asking for advice for how to minimize the amount of work I need to do by sharing code. Something like a unified parsing or abstract syntax framework that already has support for many languages would be great.
For example, here is some python code:
def foo(x):
    def bar(y):
        return x+y
    return bar
An alpha renaming of x to y changes the x to a y and preserves semantics. So it would become:
def foo(y):
    def bar(y1):
        return y+y1
    return bar
See how we needed to rename y to y1 in order to keep from breaking the code? That is why this is a hard problem. It seems like the program would have to have a pretty good knowledge of what constitutes a scope, rather than just doing, say, a string search and replace.
I would also like to preserve as much of the formatting as possible: comments, spacing, indentation. But that is not 100% necessary, it would just be nice.
Any tips?
To do this safely, you need to be able to determine:
all the identifiers (and those things that are not, e.g., the middle of a comment) in your code
the scopes of validity for each identifier
how to substitute a new identifier for an old one in the text
whether renaming an identifier causes another name to be shadowed
To determine identifiers accurately, you need at least a language-accurate lexer. Identifiers in PHP look different than they do in COBOL.
To determine scopes of validity, you have to determine the program structure in practice, since most "scopes" are defined by that structure. This means you need a language-accurate parser; scopes in PHP are different from scopes in COBOL.
To determine which names are valid in which scopes, you need to know the language's scoping rules. Your language may insist that the identifier X refers to different Xes depending on the context in which X is found (consider object constructors named X with different arguments). Now you need to be able to traverse the scope structures according to the naming rules. Single inheritance, multiple inheritance, overloading, and default types will pretty much require you to build a model of the scopes for the programs, insert the identifiers and corresponding types into each scope, and then climb from the point of encounter of an identifier in the program text through the various scopes according to the language semantics. You will need symbol tables, inheritance linkages, ASTs, and the ability to navigate all of these. These structures are different for PHP and COBOL, but they share lots of common ideas, so you likely need a library with the common concept support.
To rename an identifier, you have to modify the text. In a million lines of code, you need to point carefully. Modifying an AST node is one way to point carefully. Actually, you need to modify all the identifiers that correspond to the one being renamed; you have to climb over the tree to find them all, or record in the AST where all the references exist so they can be found easily. After modifying the tree, you have to regenerate the source text. That's a lot of machinery; see my SO answer on how to prettyprint ASTs preserving all of the stuff you reasonably suggest should be preserved.
(Your other choice is to keep track in the AST of where the text for each identifier is, and then read/patch/write the file.)
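To make the modify-the-AST-and-regenerate step concrete, here is a minimal Python sketch using the standard ast module (my illustration, not DMS; it renames blindly, with none of the scope or capture analysis described here, and ast.unparse discards comments and formatting):

import ast

class Renamer(ast.NodeTransformer):
    # Rename every Name node and parameter matching `old` to `new`.
    # No scope analysis: this shows only the modify-and-regenerate step.
    def __init__(self, old, new):
        self.old, self.new = old, new

    def visit_Name(self, node):
        if node.id == self.old:
            node.id = self.new
        return node

    def visit_arg(self, node):
        if node.arg == self.old:
            node.arg = self.new
        return node

src = "def foo(x):\n    def bar(y):\n        return x + y\n    return bar\n"
tree = ast.parse(src)
Renamer("x", "z").visit(tree)
print(ast.unparse(tree))  # regenerated source text; formatting is lost

A real rename tool would first consult the symbol tables described above and refuse (or pick a fresh name) whenever the new identifier would be captured.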
Before you update the file, you need to check that you haven't shadowed something. Consider this code:
{ local x;
  x=1;
  { local y;
    y=2;
    { local z;
      z=y;
      print(x);
    }
  }
}
We agree this code prints "1". Now we decide to rename y to x.
We've broken the scoping: the print statement, which conceptually referred to the outer x, now refers to an x captured by the renamed y. The code now prints "2", so our rename broke it. This means one must check all the other identifiers in scopes in which the renamed variable might be found, to see if the new name "captures" some name we weren't expecting. (The rename would be legal if the print statement printed z.)
This is a lot of machinery.
Yes, there is a framework that has almost all of this, as well as a number of robust language front ends. See our DMS Software Reengineering Toolkit. It has parsers producing ASTs, prettyprinters to produce text back from ASTs, generic symbol table management machinery (including support for multiple inheritance), and AST visiting/modification machinery. It has front ends for C, C++, COBOL, and Java that implement name and type resolution (e.g., instantiating symbol table scopes and identifier-to-symbol-table-entry mappings); it has front ends for many other languages that don't have scoping implemented yet.
We've just finished an exercise in implementing "rename" for Java (all the above issues of course appeared). We are about to start one for C++.
You could try to create Xtext-based implementations for the languages involved. The Xtext framework provides reliable infrastructure for cross-language rename refactoring. However, you'll have to provide a grammar and at least a "good enough" scope resolution for each language.
Languages mostly guarantee tokens will be unique, whatever the context. A naive first approach (and this will break many, many pieces of code) would be:
cp file file.orig
sed -i 's/\bnewTokenName\b/TEMPTOKEN/g' file
sed -i 's/\boldTokenName\b/newTokenName/g' file
Even with GNU sed, this will break on PHP. Rewriting \b to a general token match, like ([^a-zA-Z~$_-][^a-zA-Z0-9~$_-]), would work for most C, Java, PHP, and Python, but not Perl (you'd need to add # and % to the token characters). Beyond that, it would require a plugin architecture that works for any language you wanted to add. At some point, there will be two languages whose variable- and function-naming rules are incompatible, and at that point you'll need to do more and more in the plugin.
I'm building an application that receives source code as input and analyzes several aspects of the code. It can accept code from many common languages, e.g. C/C++, C#, Java, Python, PHP, Pascal, SQL, and more (however many languages are unsupported, e.g. Ada, Cobol, Fortran). Once the language is known, my application knows what to do (I have different handlers for different languages).
Currently I'm asking the user to specify the programming language the code is written in, and this is error-prone: although users know the programming languages, a small percentage of them occasionally click the wrong option out of carelessness, and that breaks the system (i.e., my analysis fails).
It seems to me like there should be a way to figure out (in most cases) what the language is, from the input text itself. Several notes:
I'm receiving pure text and not file names, so I can't use the extension as a hint.
The user is not required to input complete source files, and can also input code snippets (i.e., the include/import part may not be present).
It's clear to me that any algorithm I choose will not be 100% foolproof, certainly for very short input code (e.g., code that could be accepted by both Python and Ruby), in which case I will still need the user's assistance; however, I would like to minimize user involvement in the process to minimize mistakes.
Examples:
If the text contains "x->y()", I may know for sure it's C++ (?)
If the text contains "public static void main", I may know for sure it's Java (?)
If the text contains "for x := y to z do begin", I may know for sure it's Pascal (?)
My question:
Are you familiar with any standard library/method for figuring out automatically what the language of an input source code is?
What are the unique code "tokens" with which I could certainly differentiate one language from another?
I'm writing my code in Python but I believe the question to be language agnostic.
Thanks
Vim has an autodetect-filetype feature. If you download the Vim source code, you will find a /vim/runtime/filetype.vim file.
For each language it checks the extension of the file, and for some of them (the most common) it has a function that can determine the filetype from the source code. You can check that out. The code is pretty easy to understand and there are some very useful comments there.
Build a generic tokenizer and then use a Bayesian filter on the tokens. Use the existing "user checks a box" system to train it.
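A minimal sketch of that idea in Python (the tokenizer regex and the toy training snippets are my assumptions; in practice you would train on the user-labelled submissions):

import math
import re
from collections import Counter, defaultdict

def tokenize(code):
    # crude, language-generic tokenizer: words plus single punctuation marks
    return re.findall(r"[A-Za-z_]\w*|[^\sA-Za-z0-9_]", code)

class LanguageGuesser:
    def __init__(self):
        self.token_counts = defaultdict(Counter)  # language -> token frequencies
        self.totals = Counter()                   # language -> total token count

    def train(self, language, code):
        tokens = tokenize(code)
        self.token_counts[language].update(tokens)
        self.totals[language] += len(tokens)

    def guess(self, code):
        vocab = len({t for counts in self.token_counts.values() for t in counts})
        def score(language):
            counts, total = self.token_counts[language], self.totals[language]
            # log-probability with add-one smoothing
            return sum(math.log((counts[t] + 1) / (total + vocab))
                       for t in tokenize(code))
        return max(self.token_counts, key=score)

guesser = LanguageGuesser()
guesser.train("python", "def foo(x):\n    return x + 1")
guesser.train("c", "int foo(int x) { return x + 1; }")
print(guesser.guess("def bar(y): return y * 2"))  # -> python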
Here is a simple way to do it: just run the parser for every language. Whichever language gets the farthest without encountering any errors (or has the fewest errors) wins.
This technique has the following advantages:
You already have most of the code necessary to do this.
The analysis can be done in parallel on multi-core machines.
Most languages can be eliminated very quickly.
This technique is very robust. Languages that might appear very similar under a fuzzy analysis (Bayesian, for example) would likely produce many errors when the actual parser is run.
If a program is parsed correctly in two different languages, then there was never any hope of distinguishing them in the first place.
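A sketch of the dispatch loop, assuming each of your existing parsers is wrapped in an error-counting function (only the Python entry below is real; the registry and its names are hypothetical):

import ast

def python_errors(code):
    # Return 1 if the snippet fails to parse as Python, 0 otherwise.
    try:
        ast.parse(code)
        return 0
    except SyntaxError:
        return 1

# Hypothetical registry: one error-counting function per supported language.
PARSERS = {"python": python_errors}

def guess_language(code):
    # The language whose parser reports the fewest errors wins.
    return min(PARSERS, key=lambda lang: PARSERS[lang](code))

print(guess_language("def f(x): return x"))  # -> python

Each per-language check is independent, so they can indeed run in parallel on a multi-core machine.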
I think the problem is impossible. The best you can do is to come up with some probability that a program is in a particular language, and even then I would guess producing a solid probability is very hard. Problems that come to mind at once:
use of features like the C pre-processor can effectively mask the underlying language altogether
looking for keywords is not sufficient as the keywords can be used in other languages as identifiers
looking for actual language constructs requires you to parse the code, but to do that you need to know the language
what do you do about malformed code?
Those seem enough problems to solve to be going on with.
One program I know of that can even distinguish several different languages within the same file is ohcount. You might get some ideas there, although I don't really know how they do it.
In general you can look for distinctive patterns:
Operators might be an indicator, such as := for Pascal/Modula/Oberon, => or the whole of LINQ in C#
Keywords would be another one as probably no two languages have the same set of keywords
Casing rules for identifiers, assuming the piece of code was written conforming to best practices. Probably a very weak rule.
Standard library functions or types, especially for languages that usually rely heavily on them; for PHP you might just use a long list of standard library functions.
You may create a set of rules, each of which indicates a possible set of languages if it matches. Intersecting the resulting lists will hopefully get you only one language.
The problem with this approach however, is that you need to do tokenizing and compare tokens (otherwise you can't really know what operators are or whether something you found was inside a comment or string). Tokenizing rules are different for each language as well, though; just splitting everything at whitespace and punctuation will probably not yield a very useful sequence of tokens. You can try several different tokenizing rules (each of which would indicate a certain set of languages as well) and have your rules match to a specified tokenization. For example, trying to find a single-quoted string (for trying out Pascal) in a VB snippet with one comment will probably fail, but another tokenizer might have more luck.
But since you want to perform analysis anyway, you probably have parsers for the languages you support, so you can just try running the snippet through each parser and take that as an indicator of which language it is (as suggested by OregonGhost as well).
Some thoughts:
$x->y() would be valid in PHP, so make sure there's no $ symbol if you think C++ (though I think you can store function pointers in a C struct, so this could also be C).
public static void main is Java if it is cased properly - write Main and it's C#. This gets complicated if you take case-insensitive languages like many scripting languages or Pascal into account. The [] attribute syntax in C# on the other hand seems to be rather unique.
You can also try to use the keywords of a language - for example, Option Strict or End Sub are typical for VB and the like, while yield is likely C# and initialization/implementation are Object Pascal / Delphi.
If your application is analyzing the source code anyway, you could try to throw your analysis code at it for every language, and if it fails really badly, it was the wrong language :)
My approach would be:
Create a list of strings or regexes (with and without case sensitivity), where each element has an assigned list of languages that the element is an indicator for:
class => C++, C#, Java
interface => C#, Java
implements => Java
[attribute] => C#
procedure => Pascal, Modula
create table / insert / ... => SQL
etc. Then parse the file line-by-line, match each element of the list, and count the hits.
The language with the most hits wins ;)
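A minimal Python sketch of this voting table (the indicator patterns are just the ones listed above, written as regexes; a real table would be much longer):

import re
from collections import Counter

# Hypothetical indicator table: pattern -> languages it hints at
INDICATORS = [
    (re.compile(r"\bclass\b"), ["C++", "C#", "Java"]),
    (re.compile(r"\binterface\b"), ["C#", "Java"]),
    (re.compile(r"\bimplements\b"), ["Java"]),
    (re.compile(r"\[\w+\]"), ["C#"]),  # attribute syntax
    (re.compile(r"\bprocedure\b", re.I), ["Pascal", "Modula"]),
    (re.compile(r"\b(create table|insert into)\b", re.I), ["SQL"]),
]

def guess(code):
    hits = Counter()
    for line in code.splitlines():
        for pattern, languages in INDICATORS:
            if pattern.search(line):
                hits.update(languages)
    return hits.most_common(1)[0][0] if hits else None

print(guess("public class Foo implements Bar {}"))  # -> Java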
How about word-frequency analysis (with a twist)? Parse the source code and categorise it much like a spam filter does. This way, when a code snippet entered into your app cannot be 100% identified, you can have it show the closest matches, which the user can pick from; the choice can then be fed back into your database.
Here's an idea for you. For each of your N languages, find some files in the language; something like 10-20 per language would be enough, each one not too short. Concatenate all the files for one language together. Call this lang1.txt. GZip it to lang1.txt.gz. You will have a set of N langX.txt and langX.txt.gz files.
Now, take the file in question and append it to each of the langX.txt files, producing langXapp.txt, and the corresponding gzipped langXapp.txt.gz. For each X, find the difference between the size of langXapp.txt.gz and langX.txt.gz. The smallest difference will correspond to the language of your file.
Disclaimer: this will work reasonably well only for longer files. Also, it's not very efficient. But on the plus side, you don't need to know anything about the language; it's completely automatic. And it can detect natural languages too, telling French from Chinese as well. Just in case you need it :) But the main reason is that I just think it's an interesting thing to try :)
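For what it's worth, the whole trick fits in a few lines of Python (my sketch; corpora maps each language name to the concatenated sample files described above, as bytes):

import gzip

def gz_size(data):
    return len(gzip.compress(data))

def guess_language(snippet, corpora):
    s = snippet.encode()
    # the snippet compresses best against the corpus whose statistics it shares
    return min(corpora,
               key=lambda lang: gz_size(corpora[lang] + s) - gz_size(corpora[lang]))

# corpora = {"python": open("lang_python.txt", "rb").read(), ...}
# print(guess_language(snippet, corpora))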
The most bulletproof but also most work-intensive way is to write a parser for each language and just run them in sequence to see which one accepts the code. This won't work well if the code has syntax errors, though, and you most probably will have to deal with code like that; people do make mistakes. A fast way to implement this is to get common compilers for every language you support, run them, and check how many errors they produce.
Heuristics work up to a certain point, and the more languages you support, the less help you will get from them. But for the first few versions it's a good start, mostly because it's fast to implement and works well enough in most cases. You could check for specific keywords, function/class names from APIs that are used often, some language constructs, etc. The best way is to check how many of these language-specific things a file has for each possible language; this helps with some syntax errors, with user-defined functions with names like this() in languages that don't have such keywords, and with stuff written in comments and string literals.
Anyhow, you will most likely fail sometimes, so some mechanism for the user to override the language choice is still necessary.
I think you should never rely on a single feature, since its absence in a fragment (e.g., somebody systematically using WHILE instead of for) might confuse you.
Also try to stay away from global identifiers like "IMPORT" or "MODULE" or "UNIT" or INITIALIZATION/FINALIZATION, since they might not always exist, may be optional in complete sources, and may be totally absent in fragments.
Dialects and similar languages (e.g. Modula2 and Pascal) are dangerous too.
I would create simple lexers for a bunch of languages that keep track of key tokens, and then simply calculate the ratio of key tokens to "other" identifiers. Give each token a weight, since some might be a key indicator for disambiguating between dialects or versions.
Note that this is also a convenient way to allow users to plug in "known" keywords to increase the detection ratio, e.g. by providing identifiers of runtime library routines or types.
Very interesting question. I don't know if it is possible to distinguish languages by code snippets, but here are some ideas:
One simple way is to watch out for single quotes: in some languages they wrap a single character, whereas in others they can contain a whole string
A unary asterisk or a unary ampersand operator is a strong indication that it's one of C/C++/C#.
Pascal is the only language (of the ones given) to use two characters for assignment, :=. Pascal has many unique keywords, too (begin, end, ...)
Class initialization with a function could be a nice hint for Java.
Functions that do not belong to a class eliminate Java (there is no free-standing max(), for example)
Naming of basic types (bool vs boolean)
Which reminds me: C++ can look very different across projects (#define boolean int), so you can never guarantee that you found the correct language.
If you run the source code through a hashing algorithm and it looks the same, you're most likely analyzing Perl
Indentation is a good hint for Python
You could use functions provided by the languages themselves, like token_get_all() for PHP, or third-party tools, like pychecker for Python, to check the syntax
Summing it up: This project would make an interesting research paper (IMHO) and if you want it to work well, be prepared to put a lot of effort into it.
There is no way of making this foolproof, but I would personally start with operators, since they are in most cases "set in stone" (I can't say this holds true for every language since I know only a limited set). This would narrow it down quite considerably, but not nearly enough. For instance, "->" is used in many languages (at least C, C++, and Perl).
I would go for something like this:
Create a list of features for each language; these could be operators or commenting styles (since most languages use some sort of easily detectable character or character combination).
For instance:
Some languages have lines that start with the character "#"; these include C, C++, and Perl. Do any languages other than the first two use #include and #define in their vocabulary? If you detect this character at the beginning of a line, the language is probably one of those. If the character is in the middle of a line, the language is most likely Perl.
Also, if you find the pattern := this would narrow it down to some likely languages.
Etc.
I would have a two-dimensional table of languages and patterns found, and after the analysis I would simply count which language had the most "hits". If I wanted it to be really clever, I would give each feature a weight signifying how likely or unlikely it is that this feature appears in a snippet of this language. For instance, if you find a snippet that starts with /* and ends with */, it is more than likely either C or C++.
The problem with keywords is that someone might use one as a normal variable name or even inside comments. They can be used as a decider (e.g., the word "class" is much more likely in C++ than in C, everything else being equal), but you can't rely on them.
After the analysis, I would offer the most likely language as the choice for the user, with the rest ordered by likelihood and also selectable. So the user would accept your guess by simply clicking a button, or could switch it easily.
In answer to 2: if there's a "#!" and the name of an interpreter at the very beginning, then you definitely know which language it is. (I can't believe this wasn't mentioned by anyone else.)
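A tiny Python sketch of that check (my illustration; it just peels the interpreter name off the first line, if one is present):

def shebang_language(code):
    lines = code.lstrip().splitlines()
    first = lines[0] if lines else ""
    if first.startswith("#!"):
        # "#!/usr/bin/env python3" -> "python3"; "#!/bin/bash" -> "bash"
        return first.rstrip().rsplit("/", 1)[-1].split()[-1]
    return None

print(shebang_language("#!/usr/bin/env python3\nprint('hi')"))  # -> python3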
This is something I've always wondered, and I can't find any mention of it anywhere online. When a shop from, say, Japan writes code, would I be able to read it in English? Or do languages like C, PHP, or anything else have Japanese translations that they write in?
I guess what I'm asking is: does every single coder in the world know enough English to use the exact same reserved words I do?
Would this code:
If (i < size){
    switch
        case 1:
            print "hi there"
        default:
            print "no, thank you"
} else {
    print "yes, thank you"
}
display exactly the same as what I'm seeing right now in English, or would some other non-English-speaking person see the words "if", "switch", "case", "default", "print", and "else" in their native language?
EDIT: Yes, this is serious. I didn't know whether different localizations of a language have different keywords, or whether there are even different localizations at all.
If I understood well, the question actually is: "does every single coder in the world know enough English to use the exact same reserved words as I do?"
Well, English is not the subject here, but programming-language reserved words. I mean, when I started about 10 years ago, I didn't have any clue of English, and still I was able to program simple things by learning the programming language, even when I did not know what the words meant (in English). As a matter of fact, this helped me learn English.
For example, I knew that to do an "iteración" (iteration, of course) I had to write:
for( i = 0 ; i < 100 ; i++ ) {}
To me, the "for", the ";" and the "++" were simple foreign words or symbols. Later I learned that "for" meant "para", "while" meant "mientras", etc. But, in the meantime, I did not need to know English, what I needed was to know was "C".
Of course when I needed to learn more things, I had to learn English, for the documentation is written in that language.
So the answer is: No, I don't see if, while, for etc. in my native language. I see them in English, but they didn't mean to me any other thing that they meant for the programming language in turn.
Is like switch statement in bash: case .. esac. What Is "esac"... for me the end of the switch statement in bash.
I guess that's what we call "abstraction"
In the Java language, some methods must be named (at least partially) in English because of the JavaBeans convention.
This convention requires that a property X be established via a pair of getX() and setX() methods. Here in French Canada, where some developers are obliged to code in French, this leads to the following travesty:
interface Foo {
    Color getCouleur();
    void setCouleur(Color couleur);
}
I'm having trouble finding references, but I'm reminded of three stories.
A Lisp hacker defends meaningless functions like "cdr" and "car" by comparing them to programming in your non-native language:
http://people.csail.mit.edu/gregs/ll1-discuss-archive-html/msg01171.html
When Yukihiro Matsumoto ("Matz") started developing Ruby, he used English keywords even though he was writing all the documentation in Japanese! There was no English documentation for Ruby for a couple of years, and very few Americans used the language. But now it's a world-class language, and the fact that it was born in Japan is only of historical interest. If the language had used keywords in hiragana, it would have had a much more difficult time gaining popularity.
I read an essay once (maybe someone else can find it; Google is no help today) that suggested that translating keywords was misguided because the words aren't actually English: they're jargon. Not only do (to use the examples above) para and pour not quite have the exact meaning that for has in English, but to non-programmers the phrase "for loop" is gibberish. Even Americans have to learn a new meaning. So translating a word's superficial meaning into another language is more like making a cross-language pun than actually being helpful.
I really have not thought too much about programming in Japanese before, but here we go, using the question's code sample.
Using only the language statements in Japanese with the variables in English:
// In Japanese, it makes more sense to put the keywords/modifiers as
// postfix expressions rather than prefix expressions.
(i < size)か {
    (l[i])は {
        1だ:
            「もしもし。」を書く;
        省略時値:
            「いいえ、いいですよ。」を書く;
    }
} ない {
    「はい、ありがとうございます。」を書く;
}
As many people have already pointed out, in most programming languages you just have to learn a few keywords, so it doesn't matter that much if they're in English (or a language other than yours, for that matter). It's just a symbol you associate with some construct. For instance, in VB you have "THEN", which in many C-style languages would be "{", and it doesn't make a big difference in readability (well, at least that's how I see it, being a non-native English speaker).
But where things can sometimes get hairy, and where the choice of (natural) language matters, is in naming identifiers. If the names of variables, functions, classes, etc. are not meaningful to you because of a language barrier, following even the simplest code can be rather challenging.
I remember someone once gave me a short snippet of ActionScript taken from some blog. The names were in German, and since I don't speak a word of that language, stuff could have been called var_123, var_562 or func_333 just as well (and it probably would have been easier for me to remember the names, or at least to have a chance of spelling them right without copying and pasting). Since this was a short, self-contained snippet, I used an online translator to give those vars and functions meaningful names in my native language (Spanish), and after that, everything was clear. The point is that the code was actually simple, but I was only able to make sense of it without too much (unnecessary) extra effort once I overcame the language barrier.
Since then, I've switched to using English for naming identifiers. Whether you like it or not, it's the "koine" of programming, engineering, and generally technical stuff. Most APIs are written in English, and so is most documentation (and probably the best resources you can find are in English as well). As a nice aside, it keeps your code more coherent with the code you're likely to be interacting with, and I think it tends to be more compact and succinct than other languages like Spanish (which otherwise would be my natural choice).
Of course, if you can't understand at least some English, the problem remains the same, so it's not a perfect solution. But, given a number of developers from many different countries, chances are that the common language for them to communicate (through code and, of course, other means) will be English. So choosing English is perhaps the best option, even though it is not a perfect solution to this problem.
The programming language defines keywords and standard class names, and it's best practice to give user-defined types, variables, and functions English names as well (as a non-native speaker I can tell ;-).
So yes, if all is well, you'll be able to read the code.
However, languages like Java and Perl allow the full Unicode set for identifiers, so if somebody writes his class names in Kanji, you'll likely have a problem.
Update: For Perl there's a joke module that allows you to write Perl in Latin. But it's really just that, a joke. Nobody uses things like this seriously.
Second update: The idea of localized programming languages isn't that ridiculous. Excel's macro language is localized, but luckily it's stored in one canonical language (English) in the file, so the localization is just a layer on top of the normal thing. Such things only make sense for small "programs"; for "real" programs it becomes hard to maintain.
Actually, there are some non-English-based programming languages (Wikipedia).
I'm Norwegian, but I've always used English for all code except output (ignoring some silly code from school). Actually, I usually write everything in English and then translate it to my native language, using gettext (or something similar).
I am British, and a problem we often run into is the American/British spelling clash. This often occurs with programming-related terms such as Initialise() or Initialize(), Analyse() or Analyze(), etc. This can (and has) led to problems when trying to override methods, and it is sometimes difficult to spot.
Since the framework (in our case C#) was designed by Americans, we found that it is best to be consistent and use American spellings. We even adopt Color.
We have a mix of nationalities in our development teams and most non-British people tend towards American spellings naturally.
AppleScript was once available in French and Japanese dialects. I do not know why they were withdrawn.
Taking this to the next level, what about being able to substitute symbols?
After seeing languages like Brainf**k and Whitespace I thought of making a language like this: it'd be identical to C except you use closing braces to open, opening braces to close, swap the meanings of + and -, * and /, ; and :, > and <, etc.
The concept is nothing more than a gimmicky altered C compiler. But, like thinking of keywords differently, it challenges you to rethink some basic assumptions if you've never thought of such things before. Ex:
int foo)int i, char c( }
    int six = 2 / 3:
    int two = six + 4:
    if )i > 0( }
        printf)"i is negative"(:
    {
{
I'm in a French team developing a software system in C#. Despite the fact that the programming-language keywords are ostensibly English, I imagine you would have great difficulty reading the code, as all the function names, variables, code comments, database tables and columns, technical specifications, protocols, and so on are in French, including those lovely accented characters ç, é, è, ù, etc. I'm not even certain the system would run elsewhere, due to localisation bugs such as relying on the comma being the default decimal separator.
Otherwise, WinDev is a popular programming platform in France, and its programming language, WLanguage, has keywords in either French or English; see an example here: link text
The only language I've seen localized is Excel with its macros. If you try to sum a column using an Italian version of Office, you have to write SOMMA(A1:A10) and not SUM. That's a shame.
By the way, just because it's fun, here's how your code would look with Italian keywords:
se (i < size){
    commuta
        caso 1:
            stampa "hi there"
        normalmente:
            stampa "no, thank you"
} altrimenti {
    stampa "yes, thank you"
}
I've seen VBA translated into Spanish-like commands. It's one of the ugliest things I've ever seen. I would be ashamed to have something like this on my computer.
PS: I happen to think that Spanish is a much nicer language than English, but translating is WRONG.
Well, as others have pointed out, the keywords and system calls would likely remain in English.
However, understanding the keywords of the language is only a small part of understanding the code. Variable names, function names, and comments all risk being in the native language of the author.
Edit: I just flashed back to my youth, when I went into the mapping tables of my TRS-80's built-in BASIC to switch the keywords to French. I could change all the keywords, but I couldn't make any of them longer. It made for funny programs.
Don't make fun of this. Some years ago, Microsoft announced G# (German Sharp): C# with German keywords and APIs. Of course, it was an April Fools' joke, but the entire site about it looked so real and professional (and was on microsoft.com). Scary.
At work, we use two field bus systems, both developed in German-speaking countries, which have a scary mix of German and English for identifiers, including some lovely false friends. It's a mess.
No, English keywords and identifiers are fine. Though some might argue if it should be Color or Colour :)
In several VBA projects I've worked on (yes, very early in my career) we had to detect the version of Office installed on the user's machine and change the formulas used in the spreadsheets accordingly.
As I program in Portuguese, "SUM" would have to be translated into "SOMA", and so on and so forth. I just can't imagine the work necessary to make this happen in several languages. Has anyone else suffered from this problem?
There are some languages that have translated keywords. Excel formulas, for example. If you write some calculations in a spreadsheet, this will be in your language.
Fortunately, this is not a general practice, and even non-English speakers like me thank God that there is a standard language for keywords:
it's easier to share your work.
it prevents documentation from becoming an even bigger nightmare than it already is.
English words and sentences are usually short and syntactically pragmatic. In literature, Latin languages are much more beautiful, but for technical stuff, English rocks.
And where do you stop? Can you imagine a C in ancient Greek?
Keywords must stay in one language, and well, it started with English, so let it stay that way. Things could have been worse (an Asian language?). And so we have to write methods and comments in English. OK, more work for us, but at least the international code base stays congruent.
There is, however, one case where using native-language method names and comments can be a good practice: in third-world countries. I'm going to Senegal in a few months to manage a Django project. Senegal has a huge illiteracy rate, and therefore it's already great that they spend energy on improving their programming knowledge. French is the native language there, so it would be inefficient to force them to learn computing AND a new tongue at the same time.
BTW, that would be your code with French keywords:
Si (i < taille) {
    cas par cas :
        cas 1:
            afficher "salut"
        défaut:
            afficher "non merci"
} sinon {
    afficher "oui, merci"
}
Note that translating the keywords has nothing to do with translating the strings. Of course, we have "hi there" translated into our language. European coders even tend to use i18n much more than Americans, so their services can reach a wider audience.
Generally speaking, most programmers adapt to the English form.
I learned to program when I was 7 years old, speaking only Hebrew (which is written right to left) and with no English, which made it quite a fascinating experience.
The problem you usually run into is with documentation, variables, and function names. I have seen my share of variables in other languages written with the English alphabet.
The only language I'm familiar with that actually got translated was good old Logo (still amazing to this day).
When I was a kid, we went to France, and in a museum we visited, I remember finding a display that showed you how to write computer programmes. The language was some kind of BASIC variant, and I distinctly remember it using POUR instead of FOR, and so on. I was 7 years old and had only just learned BASIC, and it seemed completely natural to me that the French would have their own dialect like this!
I guess it may have been LSE that I saw?
Filemaker's scripting language is localized. The scripts (and data!) are stored in a terrible "sorta canonical" form.
So if you write a script in the American version, then open it up in the French version, all the keywords and built-in function names will be in French. But why won't it run?! Aha! The French version uses "," as the decimal point, and therefore, to avoid ambiguity, uses ";" to separate function arguments, where the American version uses "." and "," respectively. This conversion you have to do yourself.
So you work through the incredibly bad script editing interface (you can't write scripts as text files) to fix all these things. It runs! Great! The results are all wrong! Oh no! Aha! The Jan-7-2004 date you entered in the American version is being interpreted as July-1-2004 -- apparently dates are not only displayed but stored in locale-dependent order. Am I kidding you? No.
[Note: Filemaker 8 and 9 may be sane -- I only ever worked with 3 - 7.]
Your question is an interesting one with regard to Perl, because its syntax is designed to follow (English) natural language. I wonder if that makes it more difficult for non-English speakers...
Of course, Perl and Perlers refuse to play by conventional rules. Mad scientist Damian Conway wrote the Lingua::Romana::Perligata module, which uses the black magic of source filters to allow you to write Perl in Latin!
Here in Australia we still need to spell colour like color.
However, I do find it annoying when other (Australian) developers, working on an Australian project, decide that internal variable names need to be spelt the American way.
It would be pointless, IMHO, to i18n a language's syntax. It would just kill any sort of portability.
The only exceptions are educational languages, such as LOGO. They were designed for ease of learning, so portability is not an issue.
I read a lot of code, and the problem is always with variable/method names and comments: if authors comment their code in their own language, using special characters like Japanese or Cyrillic, we are in trouble! But the keywords, I think, will stay in English as they are.
In Italian:
se (i < dimensione){
    scegli
        caso 1:
            stampa "ciao"
        mancante:
            stampa "no, grazie"
} altrimenti {
    stampa "sì, grazie"
}
To confirm the worries of some previous posters: I've seen Fortran code with a macro include to translate all the keywords from English to French. Allow me not to continue on this.
I also had to work with code simultaneously containing identifiers in Italian, German, English, and French, not only because it was developed in many different places, but also because the main developer thought it was fun and that it helped him avoid duplicating identifier names (of course, with a routine 2000 lines long...).
I think WordBasic was localized. WordBasic was used to write macros for Word before VBA was introduced.
If I remember correctly, only WordBasic written in the English version would execute on all localized versions. If you wrote a Dutch version, you could only execute it on a Dutch Word.