Static code analysis for SenseTalk - parsing

I am looking for a static code analysis tool for SenseTalk that can be extended with my own custom rules. I need to identify dead code and verify the naming conventions of variables and functions (bonus: identify duplicated code). I was not able to find a readily available tool that performs SCA on SenseTalk, although there are a few tools, such as PMD, that can be extended to a new language. I was also wondering whether NLP could be incorporated for this scenario, but I am not sure about that. Any suggestions on how I can perform SCA on SenseTalk would be useful.

Related

Understanding how coverage modules work

I came across a new dynamic language and would like to create a coverage tool for it. I started reading the source code of the Perl 5 and Python coverage modules, but it got complicated. Since it's a dynamic scripting language, I guess the source code of coverage tools for static languages (like Java and C++) won't help me here. Also, as I understand it, each language is built in a different way, so the same ideas won't transfer directly, but the big concepts could be similar.
My question is: how do I "attack" this task? What is the proper workflow to follow? What do I need to investigate? Are there any books or blogs I can read about this kind of thing?
There are two kinds of coverage collection mechanisms:
1) Real-time sampling of the program counter, typically driven by a timer firing every 1-10 ms. Difficulties: a) mapping an actual PC value back to a source line, and b) sampling means you might never observe execution of a rarely used bit of code, so your coverage reporting is inaccurate. Because of these issues, this approach isn't used very often.
2) Instrumenting the program so that it collects coverage as it runs. This is hard to do with object code: a) you have to decode the instructions to see where to put probes, and this can be very hard to do right; b) you have to patch the object code to include the probes (this can be really awkward; a "probe" might consist of a 5-byte subroutine call that has to replace a single-byte instruction); and c) you still have to figure out how to map a probe location back to a source code line. A more effective way is to instrument the source code, which requires fairly sophisticated machinery to read the source, insert the probes, and regenerate the instrumented code for execution/compilation.
My technical paper, Branch Coverage for Arbitrary Languages Made Easy, provides explicit detail on how to do this in a general way. My company has built commercial test coverage tools for a wide variety of languages (C, Python, PHP, COBOL, Java, C++, C#, ProC, ...) using this approach. This covers most static and dynamic languages. Some dynamic mechanisms are extremely difficult to instrument, e.g., eval(), but that is true of every approach.
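To make the source-instrumentation approach concrete, here is a minimal, illustrative Python sketch (a toy, not how any production coverage tool is built): it parses a small module, prepends a probe call to every statement, runs the instrumented tree, and reports which statement lines were and were not reached.

```python
import ast

SOURCE = """\
def greet(name):
    if name:
        return "hello " + name
    return "hello world"

print(greet("coverage"))
"""

executed_lines = set()

def _probe(lineno):
    # Runtime hook: record that the statement starting at this line was reached.
    executed_lines.add(lineno)

class Instrumenter(ast.NodeTransformer):
    # Prepend a _probe(lineno) call before every statement in every statement list.
    def _instrument(self, body):
        new_body = []
        for stmt in body:
            probe = ast.Expr(ast.Call(
                func=ast.Name(id="_probe", ctx=ast.Load()),
                args=[ast.Constant(stmt.lineno)],
                keywords=[]))
            ast.copy_location(probe, stmt)
            new_body.extend([probe, stmt])
        return new_body

    def generic_visit(self, node):
        super().generic_visit(node)
        for field in ("body", "orelse", "finalbody"):
            value = getattr(node, field, None)
            if isinstance(value, list) and value:
                setattr(node, field, self._instrument(value))
        return node

tree = ast.fix_missing_locations(Instrumenter().visit(ast.parse(SOURCE)))
exec(compile(tree, "<instrumented>", "exec"), {"_probe": _probe})

all_lines = {n.lineno for n in ast.walk(ast.parse(SOURCE)) if isinstance(n, ast.stmt)}
print("covered lines:", sorted(executed_lines))
print("missed lines: ", sorted(all_lines - executed_lines))
```

A real tool has to do the same kind of rewriting in whatever form the language's compiler or interpreter accepts, keep a precise statement/branch map, and persist the results, but the shape of the problem is the one shown here.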
In addition to Ira's answer, there is a third coverage collection mechanism: the language implementation provides a callback that can inform you about program events. For example, Python has sys.settrace: you give it a function, and Python calls your function for every function call and return, and for every line executed.
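As a small sketch of that callback mechanism (real coverage tools do considerably more bookkeeping than this), a line-coverage recorder built on sys.settrace can be as simple as:

```python
import sys

covered = set()

def tracer(frame, event, arg):
    # Python invokes this for 'call', 'line', 'return', and 'exception' events.
    if event == "line":
        covered.add((frame.f_code.co_filename, frame.f_lineno))
    return tracer  # returning the tracer keeps per-line tracing enabled in this frame

def classify(n):
    if n % 2 == 0:
        return "even"
    return "odd"

sys.settrace(tracer)
classify(4)          # only the 'even' branch runs, so the 'odd' line is never recorded
sys.settrace(None)

for filename, lineno in sorted(covered):
    print(f"{filename}:{lineno}")
```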

Programmatic access to fslex and fsyacc

The fslex and fsyacc tools currently require 2-stage compilation, generating files that are then compiled by fsc. It seems to me that these tools would be much easier to use if the source files were embedded resources, fed to fslex and fsyacc programmatically and the generated code compiled on-the-fly using the CodeDom.
Is this feasible and, if so, what would be required to implement this?
Jon, this is a great question; in fact, one of the design goals I have for fsharp-tools (new lexer- and parser-generator implementations for F#) is for them to be embeddable, specifically to enable scenarios like this.
As of now, I haven't implemented (yet) the functionality which would let you do this easily in fsharplex, but don't let that deter you; I've written fsharplex (and the other tools in fsharp-tools) in a more-or-less purely-functional style, so there shouldn't be any issues with global state or anything like that. It should be relatively straightforward to hack up the compiler code so you can build a regex AST using some combinators, run the compiler to get a compiled DFA, then emit IL for your state machine into a dynamic assembly (which you could then "bake" and execute).
fsharpyacc currently uses an approach where I've put the bulk of the compilation logic into a purely-functional library, Graham; the idea there is that the grammar analysis/manipulation and parser DFA compilation algorithms should be generic, reusable, and easy to test, so anyone else wanting to build language tools with F# will have a common framework on which to build them. Likewise, contributions/improvements to Graham can easily flow back to fsharpyacc. Eventually, I will modify fsharplex to use this same approach, which will allow you to embed the regex compiler in your own code simply by referencing the NuGet package (you'd just need to write the code to generate IL from the DFA).
fsharplex and fsharpyacc use MEF to allow various backends to be plugged in; for now, they only target fslex and fsyacc for compatibility reasons, but I'd like to implement code-based backends (as opposed to the current table-based backends) to get better performance in the future.
Update -- I just re-read your question and noticed you want to embed the *.fsl and *.fsy files themselves and invoke the respective compilers at run-time. You could accomplish this by compiling the tools and referencing the assemblies from your own projects. IIRC, I exposed an entry point in both compilers so they could be called from outside code; the main entry points (e.g., what gets executed when you invoke the tools from a console) simply parse the command-line arguments then pass them into this "external" entry point.
There is one problem with directly embedding the *.fsl and *.fsy files though; if you embed them, then run them through fsharplex and fsharpyacc at run-time, your user-defined actions (e.g., the code executed when a lexer or parser rule is matched) will still be specified as F# source code -- you'd need to decide how you want to compile them into executable code.
It should be feasible to provide a parser-combinator-like interface with a backend that uses expression trees (the LISP "eval" of F#) or something similar, for full integration with the language. Or else a type provider. There are many options. If table generation is an expensive computation, it could be cached, for example with a disk cache.
I think nothing except a lack of time, dedication, and expertise prevents us from having tools with a (non-monadic) parser-combinator-like interface yet an efficient compiled implementation.
Sometimes I get back to this pet project of mine, playing with an algebraic approach to optimizing regular expressions (and lexers) specified in source using combinators and then compiled to a state machine. It still lacks a few key pieces for efficiency, but there it is:
https://github.com/toyvo/ocaml-regex-algebraic
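As a language-neutral sketch of the "combinators compiled to a state machine" idea (written in Python for brevity, and unrelated to the linked OCaml project), regular-expression derivatives give a very compact path from a combinator-built regex AST to a matcher; memoizing the derivatives over a finite alphabet is what turns this into an explicit DFA:

```python
from dataclasses import dataclass

class Regex: pass

@dataclass(frozen=True)
class Empty(Regex): pass            # matches nothing
@dataclass(frozen=True)
class Eps(Regex): pass              # matches the empty string
@dataclass(frozen=True)
class Char(Regex):
    c: str
@dataclass(frozen=True)
class Alt(Regex):
    left: Regex
    right: Regex
@dataclass(frozen=True)
class Seq(Regex):
    left: Regex
    right: Regex
@dataclass(frozen=True)
class Star(Regex):
    inner: Regex

def nullable(r):
    # Does r accept the empty string?
    if isinstance(r, (Eps, Star)): return True
    if isinstance(r, (Empty, Char)): return False
    if isinstance(r, Alt): return nullable(r.left) or nullable(r.right)
    if isinstance(r, Seq): return nullable(r.left) and nullable(r.right)

def deriv(r, c):
    # Brzozowski derivative of r with respect to character c.
    if isinstance(r, (Empty, Eps)): return Empty()
    if isinstance(r, Char): return Eps() if r.c == c else Empty()
    if isinstance(r, Alt): return Alt(deriv(r.left, c), deriv(r.right, c))
    if isinstance(r, Star): return Seq(deriv(r.inner, c), r)
    if isinstance(r, Seq):
        d = Seq(deriv(r.left, c), r.right)
        return Alt(d, deriv(r.right, c)) if nullable(r.left) else d

def matches(r, s):
    for ch in s:
        r = deriv(r, ch)
    return nullable(r)

# (ab)*a, built combinator-style
pattern = Seq(Star(Seq(Char("a"), Char("b"))), Char("a"))
print(matches(pattern, "ababa"))  # True
print(matches(pattern, "abab"))   # False
```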

How does Fortify software work? [closed]

Fortify is an SCA tool used to find security vulnerabilities in software code. I was just curious how this software works internally. I know that you need to configure a set of rules against which the code will be run, but how exactly is it able to find the vulnerabilities in the code?
Does anyone have any thoughts about this?
Thanks in advance.
HP Fortify SCA has 6 analyzers: data flow, control flow, semantic, structural, configuration, and buffer. Each analyzer finds different types of vulnerabilities.
Data Flow
This analyzer detects potential vulnerabilities that involve tainted data (user-controlled input) put to potentially dangerous use. The data flow analyzer uses global, inter-procedural taint propagation analysis to detect the flow of data between a source (site of user input) and a sink (dangerous function call or operation). For example, the data flow analyzer detects whether a user-controlled input string of unbounded length is being copied into a statically sized buffer, and detects whether a user controlled string is being used to construct SQL query text.
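For illustration only (not tied to any specific Fortify rule), this is the kind of source-to-sink flow a data flow analyzer looks for, sketched in Python with a hypothetical users table:

```python
import sqlite3

def find_user(db_path):
    name = input("user name: ")                           # source: user-controlled input
    conn = sqlite3.connect(db_path)
    query = "SELECT * FROM users WHERE name = '" + name + "'"
    return conn.execute(query).fetchall()                 # sink: tainted text executed as SQL

def find_user_safe(db_path):
    name = input("user name: ")
    conn = sqlite3.connect(db_path)
    # Parameterized query: the tainted value never becomes part of the SQL text,
    # so there is no flow from the source into the dangerous sink.
    return conn.execute("SELECT * FROM users WHERE name = ?", (name,)).fetchall()
```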
Control Flow
This analyzer detects potentially dangerous sequences of operations. By analyzing control flow paths in a program, the control flow analyzer determines whether a set of operations are executed in a certain order. For example, the control flow analyzer detects time of check/time of use issues and uninitialized variables, and checks whether utilities, such as XML readers, are configured properly before being used.
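A simple illustration of a time-of-check/time-of-use sequence (again only a sketch, not a Fortify-specific example):

```python
import os

def read_config(path):
    # Time of check: test that the file is readable ...
    if os.access(path, os.R_OK):
        # ... time of use: the file can be swapped out between the check and the open,
        # which is exactly the ordering problem a control flow analyzer looks for.
        with open(path) as f:
            return f.read()
    return None

def read_config_safer(path):
    # Perform the operation directly and handle failure; there is no window to race.
    try:
        with open(path) as f:
            return f.read()
    except OSError:
        return None
```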
Structural
This analyzer detects potentially dangerous flaws in the structure or definition of the program. For example, the structural analyzer detects assignment to member variables in Java servlets, identifies the use of loggers that are not declared static final, and flags instances of dead code that will never be executed because of a predicate that is always false.
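As an illustration of the always-false-predicate case (a made-up Python snippet; the examples above are Java-specific):

```python
def normalize(value):
    if isinstance(value, str):
        value = value.strip()
    if value is None and isinstance(value, str):
        # Dead code: a value can never be both None and a str, so the predicate
        # is always false and this branch is unreachable. A structural analyzer
        # can flag it without reasoning about runtime data at all.
        return ""
    return value
```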
Semantic
This analyzer detects potentially dangerous uses of functions and APIs at the intra-procedural level. Basically a smart GREP.
Configuration
This analyzer searches for mistakes, weaknesses, and policy violations in an application's deployment configuration files.
Buffer
This analyzer detects buffer overflow vulnerabilities that involve writing or reading more data than a buffer can hold.
@LaJmOn has a very good answer, but at a completely different level of abstraction I can answer the question in another way:
Your source code is translated into an intermediate model, which is optimized for analysis by SCA.
Some types of code require multiple stages of translation. For instance, a C# file first needs to be compiled into a debug .DLL or .EXE, and then that .NET binary is disassembled into Microsoft Intermediate Language (MSIL) by the .NET SDK utility ildasm.exe. Other files, such as Java or ASP files, are translated in one pass by the appropriate Fortify SCA translator for that language.
SCA loads the model into memory and loads the analyzers. Each analyzer loads its rules and applies those rules to the functions in your program model, in a coordinated manner.
The matches are written into an FPR file, with the vulnerability match information, security advice, source code, source cross-reference and code navigation information, user filtering specification, any custom rules, and digital signatures all zipped into the package.
Also, an addendum to @Doug Held's comment above: starting with Fortify 16.20, SCA now supports scanning .NET C#/ASP/VB source code directly, no longer requiring pre-compilation.
Yes - Fortify SCA supports scanning Objective-C and Swift for iOS and about 20 other languages and numerous frameworks. See more in the Fortify SCA Data Sheet:
https://www.hpe.com/h20195/V2/GetPDF.aspx/4AA5-6055ENW.pdf
You can also use Fortify SCA as SaaS through Fortify on Demand and have experts run the scans and audit the results for you:
http://www8.hp.com/us/en/software-solutions/application-security-testing/index.html

Pipeline for writing a book for programmers in a collaboration

I'm a member of a group of enthusiast writers who decided to collaborate on a cookbook-style book for a particular programming language.
We're trying to pick a pipeline for the collaboration.
I like how ProGit is made: Markdown plus some custom pre-processing, processed by Pandoc. But I'm concerned that Markdown is too simple for our case.
I have looked at Sphinx, but I have no experience using it.
I know that LaTeX would work, but I'm afraid that it will scare off contributors. It may also be too powerful: it's too easy to build a byzantine pipeline if you don't have the necessary experience (which I do not).
Please do not suggest solutions where a person has to write XML by hand or must use some specific GUI (optionally available GUIs are fine, of course). Commercial and non-cross-platform solutions are not an option either.
It's hard to say whether pandoc's extended version of markdown would be too simple for your case unless you say what features you need. Note also that, if you're able to do a bit of very simple Haskell scripting, you can use the pandoc API to add features.
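If Haskell is a barrier for contributors, pandoc can also be extended with external JSON filters written in any language. Here is a hedged sketch in Python, assuming the third-party pandocfilters package is installed and that a literal "TODO" marker is something you would want to transform in your pipeline:

```python
#!/usr/bin/env python
# A pandoc JSON filter; run as: pandoc chapter.md --filter ./mark_todos.py -o chapter.html
# Assumes the third-party 'pandocfilters' package (pip install pandocfilters).
from pandocfilters import toJSONFilter, SmallCaps, Str

def mark_todos(key, value, fmt, meta):
    # Walk every inline element; rewrite the literal word "TODO" into small caps
    # as a stand-in for whatever custom pre-processing the book needs.
    if key == "Str" and value == "TODO":
        return SmallCaps([Str("TODO")])

if __name__ == "__main__":
    toJSONFilter(mark_todos)
```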

Standard format for concrete and abstract syntax trees

I have an idea for a hobby project which performs some code analysis and manipulation. This project will require both the concrete and abstract syntax trees of a given source file. Additionally, bi-directional references between the two trees would be helpful. I would like to avoid the work of transcribing a grammar to construct my own lexer and parser.
Is there a standard format for describing either concrete or abstract syntax trees?
Do any widely-used tool chains support outputting to these formats?
I don't have a particular target programming language in mind. Any popular one will do for a prototype, but I'd prefer one I know well: Python, C#, Javascript, or C/C++.
I'd like the ability to run a source file through a tool or library and get back both trees. In an ideal world, it would be practical to run this tool on code as it is being edited by a user and be tolerant of errors. Again, I am simply trying to develop a prototype, so these requirements are pretty lax.
Thanks!
The research community decided that graph exchange was the right thing to do when moving information from one program analysis tool to another.
See http://www.gupro.de/GXL
More recently, the OMG has defined a standard for interchanging Abstract Syntax Trees.
See http://www.omg.org/spec/ASTM/1.0/Beta1/
This problem seems to get solved over and over again. There have been half a dozen "tool bus" proposals over the years that all solved it, with none ever taking over the industry. The problem is that a) it is easy to represent ASTs using any kind of nestable notation [parentheses like LISP, like XML, ...], so people roll their own solution easily, and b) for one tool to exchange an AST with another, they both essentially have to agree on what the AST nodes mean; but most ASTs are rather accidentally derived from the particular grammar/parsing technology used by each tool, and there's almost always disagreement about that between tools. So, I've seen very few tools that exchange ASTs meaningfully.
If you're doing a hobby thing, I'd stick with a lisp-like encoding of trees, where each node is written as a parenthesized list: the node type followed by its children. It's easy to generate, and easy to read.
I work on a professional tool to manipulate programs. If we have to print out the AST, we do the above. Mostly, individual ASTs are far too complicated to look at in practice, so we hardly ever print out the entire AST, at best only a node and a few children deep. Our tool doesn't exchange ASTs with anybody (see the reasons above :) but does just fine building the AST in memory, doing whizzy things with it for analysis or transformation reasons, and then either just deleting it (no need to send it anywhere) or regenerating the original language text from the tree. [The latter means you need anti-parsing or "prettyprinting" technology.]
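To make the lisp-like encoding concrete, here is a small Python sketch (using Python's own ast module purely as a convenient source of trees) that prints each node as a parenthesized list of its type, simple attributes, and children:

```python
import ast

def to_sexpr(node):
    # Render an AST node as:  (NodeType simple-attributes... children...)
    parts = [type(node).__name__]
    for name, value in ast.iter_fields(node):
        if name == "ctx":                      # Load/Store context is just noise here
            continue
        if isinstance(value, ast.AST):
            parts.append(to_sexpr(value))
        elif isinstance(value, list):
            parts.extend(to_sexpr(v) for v in value if isinstance(v, ast.AST))
        elif value is not None:
            parts.append(f"{name}={value!r}")
    return "(" + " ".join(parts) + ")"

tree = ast.parse("total = price * quantity")
print(to_sexpr(tree))
# (Module (Assign (Name id='total') (BinOp (Name id='price') (Mult) (Name id='quantity'))))
```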
In our project we defined the AST metamodel in UML and use ANTLR (Java) to populate the model. We also maintain the token information from ANTLR after parsing, but we have not yet tried to update the underlying text-file with modifications made on the model.
This has a hideous overhead (in infrastructure, such as Eclipse UML2/EMF), but our goal is to use high-level tools for Model-based/driven Development (MDD, MDA) anyway, so we decided to use it on each level.
I think one of our students once played with OpenArchitectureWare and managed to get changes from the Eclipse-based, generated editor back into the syntax tree (not related to the UML model above) automatically, but I don't know the details about this.
You might also want to look at ANTLR's tree grammars.
You might expect language-specific standards, but more general-purpose standards may also be appropriate. Ira Baxter already mentioned GXL; RDF may be added too, although it would require an appropriate ontology and is oriented more toward semantics than syntax. It may still be an option worth investigating.
For specific standards, Ira Baxter already mentioned ASTM. Another one, although it targets a rather specific kind of programming language (logic languages), is the standard for semantic/conceptual graphs, ISO/IEC 24707:2007.
Not a standard on its own, but a paper on the matter: Towards Portable Source Code Representations Using XML.
I don't know of any standard that is effectively used (in this area it's always home-made cooking everywhere); I'm just interested in this topic too.
