Doubts regarding config_setting with Bazel

Doubts regarding config_setting with Bazel - bazel

I'm trying to understand config_setting for detecting the underlying platform and had some doubts. Could you help me clarify them?
What is the difference between x64_windows and x64_windows_(msvc|msys) cpus? If I create config_setting's for all of them, will only one of them trigger? Should I just ignore x64_windows?
To detect Windows, what is the recommended way? Currently I'm doing:
config_setting(
name = "windows",
values = {"crosstool_top": "//crosstools/windows"},
)
config_setting(
name = "windows_msvc",
values = {
"crosstool_top": "//crosstools/windows",
"cpu": "x64_windows_msvc",
},
)
config_setting(
name = "windows_msys",
values = {
"crosstool_top": "//crosstools/windows",
"cpu": "x64_windows_msys",
},
)
By using this I want to use :windows to match all Windows versions and :windows_msvc, for example, to match only MSVC. Is this the best way to do it?
What is the difference between darwin and darwin_x86_64 cpus? I know they match macOS, but do I need to always specify both when selecting something for macOS? If not, is there a better way to detect macOS with only one config_setting? Like using //crosstools with Windows?
How do detect Linux? I know you can detect the operating systems you care about first and then use //conditions:default, but it'd be nice to have a way to detect specifically Linux and not leave it as the default.
What are k8, piii, etc? Is there any documentation somewhere describing all the possible cpu values and what they mean?
If I wanted to use //crosstools to detect each platform, is there somewhere I can look up all available crosstools?
Thanks!

Great questions, all. Let me tackle them one by one:
--cpu=x64_windows_msys triggers the C++ toolchain that relies on MSYS/Cygwin. --cpu=x64_windows_msvc triggers the Windows-native (MSVC) toolchain. -cpu=x64_windows triggers the default, which is still MSYS but being converted to MSVC.
Which ones you want to support is up to you, but it's probably safest to support all for generality (and if one is just an alias for the other it doesn't require very complicated logic).
Only one config_setting can trigger at a time.
Unless you'e using a custom -crosstool_top= flag to specify Windows builds, you'll probably want to trigger on --cpu, e.g:
config_setting(
name = "windows",
values = {"cpu": "x64_windows"}
There's not a great way now to define all Windows. This is a current deficiency in Bazel's ability to recognize platforms, which settings like --cpu and --crosstool_top don't quite model the right way. Ongoing work to create a first-class concept of platform will provide the best solution to what you want. But for now --cpu is probably your best option.
This would basically be the same story as Windows. But to my knowledge there's only darwin for default crosstools, no darwin_x86_64.
For the time being it's probably best to use the //conditions:default approach you'd rather not do. Once first-class platforms are available that'll give you the fidelity you want.
k8 and piii are pseudonyms for 86 64-bit and 32-bit CPUs, respectively. They also tend to be associated with "Linux" by convention, although this is not a guaranteed 1-1 match.
There is no definitive set of "all possible CPU values". Basically, --cpu is just a string that gets resolved in CROSSTOOL files to toolchains with identifiers that match that string. This allows you to write new CROSSTOOL files for new CPU types you want to encode yourself. So the exact set of available CPUs depends on who's using Bazel and how they have their workspace set up.
For the same reasons as 5., there is no definitive list. See Bazel's github tools/ directory for references to defaults.

Related

Distinguish between empty and non-existing environment variables in Octave

I mean to tell in Octave if a given environment variable exists.
Something that works in most cases is size(getenv(varname))(1).
If this equals to 1, the variable exists.
But if this equals 0, the variable may not exist or it may exist and be set to a null value.
How can I distinguish these two cases?
Can this be done "natively" in Octave?
I would like my script to work irrespective of the host OS and shell.
In POSIX one could issue a system call and read the results.
Then for Windows one would have to check in another way, and write a wrapper that deals with both cases separately.
EDIT (tl;dr)
I was using a defined/not-defined environment variable to handle compilation cases, via a combination of
if(DEFINED ENV{ENV_TEST})
add_definitions(-DENV_TEST=$ENV{ENV_TEST})
endif()
in CMakeLists.txt and
#ifdef ENV_TEST
cout << "ENV_TEST is defined and set to \'" ENV_TEST "\'" << endl;
#else
cout << "ENV_TEST is not defined" << endl;
#endif
in myprog.cc.
This is an MCVE-fication of my actual case, where the three possible states (1. not defined, 2. defined and null, 3. defined and not null) produce different results.
In my actual case, I only needed to distinguish #1 vs. #2 (then the question), but I could turn all my #2 cases into #3 cases so now Octave also knows about it.
The downside is that this was deployed across several computers, so instead of writing the Octave code that would handle this right away, I considered changing handling as described above. I am not sure this will not have side effects...

#sancho.sReinstateMonicaCellio, If you don't have a specific use case then this is either a XY problem (see http://xyproblem.info) or you're wasting people's time by writing a misleading question that boils down to "Shouldn't GNU Octave provide a way to determine if an env var is defined?" See https://octave.org/doc/v4.0.1/Environment-Variables.html.
I can't speak to why the GNU Octave developers did not provide a mechanism for determining if an env var is not defined. But I'll bet it has a lot to do with the general design of the language.
You wrote "I would like my script to work irrespective of the host OS and shell.". The problem is that your question is independent of the host OS and shell. The problem is that GNU Octave apparently does not provide an API for distinguishing whether an env var is unset or set to an empty string. That has nothing to do with the OS or shell from which you launched GNU Octave.

COBOL: What is the benefit of using paragraphs and sections instead of subprograms?

What is the benefit of using paragraphs and sections for executing pieces of code, instead of using a subprogram instead? As far as I can see paragraphs and sections are dangerous because they have an non intuitive control flow, its easy to fall through and execute stuff you never meant to execute, and there is no variable (item) scoping, therefore it encourages a style of programming where everything is visible to everything else. Its a slippery soup.
I read a lot, but I could not find anything related to the comparative benefit of paragraphs/sections vs a subprogram. I also asked online some people in some COBOL forum, but their answers were along the lines of "is this a joke" or "go learn programming"(!!!).
I do not wish to engage in a discussion of stylistic preferences, everyone writes the way that their brain works, I only want to know, is there any benefit to using paragraphs/sections for flow control? As in, are there any COBOL operations that can be done only by using paragraphs/sections? Or is it just a remnant of an early way of thinking about code?
Because no other language I know of has mimicked that, so it either has some mechanical concrete essential reason to exist in COBOL, or it is a stylistic preference of the COBOL people. Can someone illuminate me on what is happening?

These are multiple questions... the two most important ones:
Are there any COBOL operations that can be done only by using paragraphs/sections?
Yes. A likely not complete list:
USE statements in DECLARATIVES can only apply to a paragraph or a section. These are used for handling file errors and exceptions. Not all compilers support this COBOL standard feature in full.
Segmentation (primary: a program that is only partially loaded in memory) is only possible with sections; but that is to be considered a "legacy feature" (at least I don't know of people actually using it this way explicitly); see the comment of Gilbert Le Blanc for more details on this
fall-through, many other languages have this feature with a kind of a switch statement (COBOL's EVALUATE, which is not the same as a common switch but can be used similar has an explicit break and no fall-through)
GO TO DEPENDING ON (could be recoded to achieve something similar with EVALUATE and then either PERFORM, if the paragraphs are expected to fall-through, which is not uncommon, then that creates a lot of extra code)
GO TO in general and especially nice - the old obsolete ALTER statement
PERFORM statement, format 1 "out-of-line"
file state is only shared between programs when you define it as EXTERNAL, and you often want to have a file state being limited to a single program
up to COBOL85: EXIT statement (plain without anything else, actually doing nothing else then a CONTINUE would)
What is the benefit of using paragraphs and sections for executing pieces of code, instead of using a subprogram instead?
shared data (I guess you know of programs with static data or otherwise (module)global data that is shared between functions/methods and also different source code files)
much less overhead than a CALL is
consistency:
you know what's in your code, you don't know what another program does (or at least: you cannot guarantee that it will do the same some years later exactly the same)
easier to extend/change: adding another variable (and also removing part of it, change its size) to a CALL USING means that you also have to adjust the called program - and all programs that call this, even when you place the complete definition in a copybook, which is very reasonable, this means you have to recompile all programs that use this
a section/paragraph is always available (it is already loaded when the program runs), a CALLed program may not be available or lead to an exception, for example because it cannot be loaded as its parameters have changed
less stuff to code
Note: While not all compilers support this you can work around nearly all of the runtime overhead and consistency issue when you use one source files with multiple program definitions in (possibly nested) and using a static call-convention. This likely gives you the "modern" view you aim for with scope-limitation of variables, within the programs either persistent (like local-static) when defined in WORKING-STORAGE or always passed when in LINKAGE or "local-temporary" when in LOCAL-STORAGE.
Should all code of an application be in one program?
[I've added this one to not lead to bad assumptions] Of course not!
Using sub-programs and also user-defined functions (possibly even nested providing the option for "scoped" and "shared" data) is a good thing where you have a "feature boundary" (for example: access to data, user-interface, ...) or with "modern" COBOL where you have a "language boundary" (for example: direct CALLs of C/Java/whatever), but it isn't "jut for limiting a counter to a section" - in this case: either define a variable which state is not guaranteed to be available after any PERFORM or define one for the section/paragraph; in both cases it would be reasonable to use a prefix telling you this.
Using that "separate by boundary" approach also takes care of the "bad habit of everything being seen by everyone" issue (which is in any case only true for "all sections/paragraphs in the same program).
Personal side note: I would only use paragraphs where it is a "shop/team rule" (it is better to stay consistent then to do things different "just because they are better" [still providing an option to possibly change the common rule]) or for GO TO, which I normally not use.
SECTIONs and EXIT SECTION + EXIT PERFORM [CYCLE] (and very rarely GOBACK/EXIT PROGRAM) make paragraphs nearly unnecessary.

very short answer. subroutines!!
Subroutines execute in the context of the calling routine. Two virtues: no parameter passing, easy to create. In some languages, subroutines are private to (and are part of) the calling (invoking) routine (see various dialects of BASIC).
direct answer: Section and Paragraph support a different way of thinking about programming. Higher performance than call subprogram. Supports overlays. The "fall thru" aspect can be quite useful, a feature rather than a vice. They may be necessary depending on what you are doing with a specific COBOL compiler.
See also PL/1, BAL/360, architecture 360/370/...

As a veteran Cobol dinosaur, I would say asking about the benefit is not the right question. I used paragraph (or section) differently than a subprogram. The right question in my opinion is when to use them logically. If I can make an analogy, if you have a Dog java class, you will write Dog-appropriate methods within it. If there's a cat involved, you may need a helper class. In this case the helper class is the subprogram. Though, you can instead code the helper class methods inside the Dog class, but that will be bad coding.

In any other language I would recommend putting self contained functions into subroutines.
However in COBOL not so much. If the code is very likely to be used in other programs then a subroutine is a good idea. Otherwise not!
The reason being the total lack of any checks on the number type or existence of passed parameters at compile time. Small errors in call statements lead to program crashes at run time. Limiting the use of sub-routines and carefully checking the calling code for errors makes for a more reliable program.
Using paragraphs any type mismatch will be flagged at compile time, or, an automatic conversion will occur.

How can I change the formatter's decimal separator in Rust?

The function below results in "10.000". Where I live this means "ten thousand".
format!("{:.3}", 10.0);
I would like the output to be "10,000".

There is no support for internationalization (i18n) or localization (l10n) baked in the Rust standard library.
There are several reasons, in no particular order:
a locale-dependent output should be a conscious choice, not a default,
i18n and l10n are much more complicated than just formatting numbers,
the Rust std aims at being small.
The format! machinery is going to be used to write JSON or XML files. You really do NOT want to end up with a differently formatted file depending on the locale of the machine that encoded it. It's a recipe for disaster.
The detection of locale at run-time is also optimization unfriendly. Suddenly you cannot pre-compute things at compile-time (even partially), you cannot even know which size of buffer to allocate at compile-time.
And this ties in with a dubious usefulness. Dates and numbers are arguably important, however this American vs English formatting war is ultimately a drop in the ocean. A French grammar schooler will certainly appreciate that the number is formatted in the typical French format... but it will be of no avail to her if the surrounding text is in English (we French are notoriously bad at teaching/learning foreign languages). Locale should influence language selection, sorting order, etc... merely changing the format of numbers is pointless, everything should switch with it, and this requires much more serious support (check gettext for a C library that provides a good base).
Basing the detection of the locale on the host locale, and it being global to the whole process, is also a very dubious architectural choice in this age of multi-threaded web servers. Imagine if Facebook was served in Swedish in Europe just because its datacenter is running there.
Finally, all this language/date/... support requires a humongous amount of data. ICU has several dozens (or is it hundreds?) of MBs of such data embedded inside it. This would make the size of the std explode, and make it completely unsuitable for embedded development; which probably do not care about this anyway.
Of course, you could cut down on this significantly if you only chose to support a handful of languages... which is yet another argument for putting this outside the standard library.

Since the standard library doesn't have this functionality (localization of number format), you can just replace the dot with a comma:
fn main() {
println!("{}", format!("{:.3}", 10.0).replacen(".", ",", 1));
}
There are other ways of doing this, but this is probably the most straightforward solution.

This is not the role of the macro format!. This option should be handle by Rust. Unfortunately, my search lead me to the conclusion that Rust don't handle locale (yet ?).
There is a library rust-locale, but they are still in alpha.

Define every symbol as a command in LaTeX

I'm working on a large project involving multiple documents typeset in LaTeX. I want to be consistent in my use of symbols, so it might be a nice idea to define a command for every symbol that has a specific meaning throughout the project. Does anyone have any experience with this? Are there issues I should pay attention to?
A little more specific. Say that, throughout the document I would denote something called permability by a script P, would it be an idea to define
\providecommand{\permeability}{\mathscr{P}}
or would this be more like the case "defining a command for $n$"?

A few tips:
Using \providecommand will define that command only if it's not been previously defined. So if you're not getting the results you expected, you may be trying to define a command that's been defined elsewhere.
If you wrap the math in your commands with \ensuremath, it will do the right thing regardless of whether you're in math mode when you issue the command:
\providecommand{\permeability}{\ensuremath{\mathscr{P}}}
Now I can easily use \permeability in text or $\permeability$ in math mode.
Using your own commands allows you to easily change the typographical representation of something later. For instance:
\newcommand{\vect}[1]{\ensuremath{\mathbf{#1}}}
would print \vect{x} as a boldfaced x. If you later decide you prefer arrows above your vectors, you could change the command to:
\newcommand{\vect}[1]{\ensuremath{\vec{#1}}}

I have been doing this for anything that has a specific meaning and is longer than a single symbol, mostly to save typing:
\newcommand{\objId}{\mbox{$\mathit{objId}$}\xspace}
\newcommand{\insOp}[1]{#1\mbox{$^+$}\xspace}
\newcommand{\delOp}[1]{#1\mbox{$^-$}\xspace}
However then I noticed that I stopped making inconsistency errors (objId vs ObjId vs ObjID), so I agree that it is a nice idea.
However I am not sure if it is a good idea in case symbols in the output are, well, single Latin symbols, as in:
\newcommand{\numOfObjs}{$n$}
It is too easy to type a single symbol and forget about it even though a command was defined for it.
EDIT: using your example IMHO it'd be a good idea to define \permeability because it is more than a single P that you have to type in without the command. But it's a close call.

Parsing Source Code - Unique Identifiers for Different Languages? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 7 years ago.
Improve this question
I'm building an application that receives source code as input and analyzes several aspects of the code. It can accept code from many common languages, e.g. C/C++, C#, Java, Python, PHP, Pascal, SQL, and more (however many languages are unsupported, e.g. Ada, Cobol, Fortran). Once the language is known, my application knows what to do (I have different handlers for different languages).
Currently I'm asking the user to input the programming language the code is written in, and this is error-prone: although users know the programming languages, a small percentage of them (on rare occasions) click the wrong option just due to recklessness, and that breaks the system (i.e. my analysis fails).
It seems to me like there should be a way to figure out (in most cases) what the language is, from the input text itself. Several notes:
I'm receiving pure text and not file names, so I can't use the extension as a hint.
The user is not required to input complete source codes, and can also input code snippets (i.e. the include/import part may not be included).
it's clear to me that any algorithm I choose will not be 100% proof, certainly for very short input codes (e.g. that could be accepted by both Python and Ruby), in which cases I will still need the user's assistance, however I would like to minimize user involvement in the process to minimize mistakes.
Examples:
If the text contains "x->y()", I may know for sure it's C++ (?)
If the text contains "public static void main", I may know for sure it's Java (?)
If the text contains "for x := y to z do begin", I may know for sure it's Pascal (?)
My question:
Are you familiar with any standard library/method for figuring out automatically what the language of an input source code is?
What are the unique code "tokens" with which I could certainly differentiate one language from another?
I'm writing my code in Python but I believe the question to be language agnostic.
Thanks

Vim has a autodetect filetype feature. If you download vim sourcecode you will find a /vim/runtime/filetype.vim file.
For each language it checks the extension of the file and also, for some of them (most common), it has a function that can get the filetype from the source code. You can check that out. The code is pretty easy to understand and there are some very useful comments there.

build a generic tokenizer and then use a Bayesian filter on them. Use the existing "user checks a box" system to train it.

Here is a simple way to do it. Just run the parser on every language. Whatever language gets the farthest without encountering any errors (or has the fewest errors) wins.
This technique has the following advantages:
You already have most of the code necessary to do this.
The analysis can be done in parallel on multi-core machines.
Most languages can be eliminated very quickly.
This technique is very robust. Languages that might appear very similar when using a fuzzy analysis (baysian for example), would likely have many errors when the actual parser is run.
If a program is parsed correctly in two different languages, then there was never any hope of distinguishing them in the first place.

I think the problem is impossible. The best you can do is to come up with some probability that a program is in a particular language, and even then I would guess producing a solid probability is very hard. Problems that come to mind at once:
use of features like the C pre-processor can effectively mask the underlyuing language altogether
looking for keywords is not sufficient as the keywords can be used in other languages as identifiers
looking for actual language constructs requires you to parse the code, but to do that you need to know the language
what do you do about malformed code?
Those seem enough problems to solve to be going on with.

One program I know which even can distinguish several different languages within the same file is ohcount. You might get some ideas there, although I don't really know how they do it.
In general you can look for distinctive patterns:
Operators might be an indicator, such as := for Pascal/Modula/Oberon, => or the whole of LINQ in C#
Keywords would be another one as probably no two languages have the same set of keywords
Casing rules for identifiers, assuming the piece of code was writting conforming to best practices. Probably a very weak rule
Standard library functions or types. Especially for languages that usually rely heavily on them, such as PHP you might just use a long list of standard library functions.
You may create a set of rules, each of which indicates a possible set of languages if it matches. Intersecting the resulting lists will hopefully get you only one language.
The problem with this approach however, is that you need to do tokenizing and compare tokens (otherwise you can't really know what operators are or whether something you found was inside a comment or string). Tokenizing rules are different for each language as well, though; just splitting everything at whitespace and punctuation will probably not yield a very useful sequence of tokens. You can try several different tokenizing rules (each of which would indicate a certain set of languages as well) and have your rules match to a specified tokenization. For example, trying to find a single-quoted string (for trying out Pascal) in a VB snippet with one comment will probably fail, but another tokenizer might have more luck.
But since you want to perform analysis anyway you probably have parsers for the languages you support, so you can just try running the snippet through each parser and take that as indicator which language it would be (as suggested by OregonGhost as well).

Some thoughts:
$x->y() would be valid in PHP, so ensure that there's no $ symbol if you think C++ (though I think you can store function pointers in a C struct, so this could also be C).
public static void main is Java if it is cased properly - write Main and it's C#. This gets complicated if you take case-insensitive languages like many scripting languages or Pascal into account. The [] attribute syntax in C# on the other hand seems to be rather unique.
You can also try to use the keywords of a language - for example, Option Strict or End Sub are typical for VB and the like, while yield is likely C# and initialization/implementation are Object Pascal / Delphi.
If your application is analyzing the source code anyway, you code try to throw your analysis code at it for every language and if it fails really bad, it was the wrong language :)

My approach would be:
Create a list of strings or regexes (with and without case sensitivity), where each element has assigned a list of languages that the element is an indicator for:
class => C++, C#, Java
interface => C#, Java
implements => Java
[attribute] => C#
procedure => Pascal, Modula
create table / insert / ... => SQL
etc. Then parse the file line-by-line, match each element of the list, and count the hits.
The language with the most hits wins ;)

How about word frequency analysis (with a twist)? Parse the source code and categorise it much like a spam filter does. This way when a code snippet is entered into your app which cannot be 100% identified you can have it show the closest matches which the user can pick from - this can then be fed into your database.

Here's an idea for you. For each of your N languages, find some files in the language, something like 10-20 per language would be enough, each one not too short. Concatenate all files in one language together. Call this lang1.txt. GZip it to lang1.txt.gz. You will have a set of N langX.txt and langX.txt.gz files.
Now, take the file in question and append to each of he langX.txt files, producing langXapp.txt, and corresponding gzipped langXapp.txt.gz. For each X, find the difference between the size of langXapp.gz and langX.gz. The smallest difference will correspond to the language of your file.
Disclaimer: this will work reasonably well only for longer files. Also, it's not very efficient. But on the plus side you don't need to know anything about the language, it's completely automatic. And it can detect natural languages and tell between French or Chinese as well. Just in case you need it :) But the main reason, I just think it's interesting thing to try :)

The most bulletproof but also most work intensive way is to write a parser for each language and just run them in sequence to see which one would accept the code. This won't work well if code has syntax errors though and you most probably would have to deal with code like that, people do make mistakes. One of the fast ways to implement this is to get common compilers for every language you support and just run them and check how many errors they produce.
Heuristics works up to a certain point and the more languages you will support the less help you would get from them. But for first few versions it's a good start, mostly because it's fast to implement and works good enough in most cases. You could check for specific keywords, function/class names in API that is used often, some language constructions etc. Best way is to check how many of these specific stuff a file have for each possible language, this will help with some syntax errors, user defined functions with names like this() in languages that doesn't have such keywords, stuff written in comments and string literals.
Anyhow you most likely would fail sometimes so some mechanism for user to override language choice is still necessary.

I think you never should rely on one single feature, since the absence in a fragment (e.g. somebody systematically using WHILE instead of for) might confuse you.
Also try to stay away from global identifiers like "IMPORT" or "MODULE" or "UNIT" or INITIALIZATION/FINALIZATION, since they might not always exist, be optional in complete sources, and totally absent in fragments.
Dialects and similar languages (e.g. Modula2 and Pascal) are dangerous too.
I would create simple lexers for a bunch of languages that keep track of key tokens, and then simply calculate a key tokens to "other" identifiers ratio. Give each token a weight, since some might be a key indicator to disambiguate between dialects or versions.
Note that this is also a convenient way to allow users to plugin "known" keywords to increase the detection ratio, by e.g. providing identifiers of runtime library routines or types.

Very interesting question, I don't know if it is possible to be able to distinguish languages by code snippets, but here are some ideas:
One simple way is to watch out for single-quotes: In some languages, it is used as character wrapper, whereas in the others it can contain a whole string
A unary asterisk or a unary ampersand operator is a certain indication that it's either of C/C++/C#.
Pascal is the only language (of the ones given) to use two characters for assignments :=. Pascal has many unique keywords, too (begin, sub, end, ...)
The class initialization with a function could be a nice hint for Java.
Functions that do not belong to a class eliminates java (there is no max(), for example)
Naming of basic types (bool vs boolean)
Which reminds me: C++ can look very differently across projects (#define boolean int) So you can never guarantee, that you found the correct language.
If you run the source code through a hashing algorithm and it looks the same, you're most likely analyzing Perl
Indentation is a good hint for Python
You could use functions provided by the languages themselves - like token_get_all() for PHP - or third-party tools - like pychecker for python - to check the syntax
Summing it up: This project would make an interesting research paper (IMHO) and if you want it to work well, be prepared to put a lot of effort into it.

There is no way of making this foolproof, but I would personally start with operators, since they are in most cases "set in stone" (I can't say this holds true to every language since I know only a limited set). This would narrow it down quite considerably, but not nearly enough. For instance "->" is used in many languages (at least C, C++ and Perl).
I would go for something like this:
Create a list of features for each language, these could be operators, commenting style (since most use some sort of easily detectable character or character combination).
For instance:
Some languages have lines that start with the character "#", these include C, C++ and Perl. Do others than the first two use #include and #define in their vocabulary? If you detect this character at the beginning of line, the language is probably one of those. If the character is in the middle of the line, the language is most likely Perl.
Also, if you find the pattern := this would narrow it down to some likely languages.
Etc.
I would have a two-dimensional table with languages and patterns found and after analysis I would simply count which language had most "hits". If I wanted it to be really clever I would give each feature a weight which would signify how likely or unlikely it is that this feature is included in a snippet of this language. For instance if you can find a snippet that starts with /* and ends with */ it is more than likely that this is either C or C++.
The problem with keywords is someone might use it as a normal variable or even inside comments. They can be used as a decider (e.g. the word "class" is much more likely in C++ than C if everything else is equal), but you can't rely on them.
After the analysis I would offer the most likely language as the choice for the user with the rest ordered which would also be selectable. So the user would accept your guess by simply clicking a button, or he can switch it easily.

In answer to 2: if there's a "#!" and the name of an interpreter at the very beginning, then you definitely know which language it is. (Can't believe this wasn't mentioned by anyone else.)

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart