Machine Learning for extracting text from a bunch of files - machine-learning

I have lots of specification files and I need to extract a specific kind of information (a block of text) from each of them. A RegExp solution isn't practical because the files are quite irregular (it could be done, but only with great effort to craft the RegExp, and I do not want to do that). My first thought was to use information extraction from the machine learning field, since I have lots of examples that could be used to train a model. My main language is C#, so I checked ML.NET, but it appears the library has no such functionality. So my question is: are there any libraries that would allow me to achieve this goal? Or does anyone have an idea for automating such a task without writing a complex RegExp?

Related

Are there visual/graphical MDD tools for Ruby on Rails?

This is going to be pretty vague, so I hope I don't get banned for it.
I've been learning about various dynamic web tools such as Ruby on Rails that can require a huge number of references between files (master view controllers, assets, etc.). Typically, when designing a Rails application, I now draw the whole thing out in Inkscape so that I have a visual representation of how all the files are connected to one another.
It would be really useful if I could translate the simple workflow diagram into some skeletal code. For example dragging a red block onto the page would create a controller, dragging an arrow in a direction towards a named view would then create the def in the controller etc etc. It's just an idea, but I wondered as a result if there were any graphical tools I could manipulate in order to do this kind of task?
If such a tool doesn't exist I'm happy to try and code one up myself - any ideas for a starting point?
A quick web search for model-driven rails came up with a master's thesis (pdf) comparing graphical model-driven development (in a J2EE context) with Rails' textual model-driven development approach. So one could assume that the usual way to develop a Ruby on Rails application is already considered model-driven, just that the domain-specific language used is a textual (and Ruby-based) one instead of a graphical one, and that this textual approach is deemed sufficient. This would make it unlikely for graphical modeling tools for Rails to exist.
But another search result is the ModelDriven Rails Plugin which claims to be just such a tool. It doesn't use SVG images but UML diagrams.
If you decide to actually come up with your own code generator, consider accepting UML input as well. UML is the standard for visual software modeling and much better suited than SVGs: SVGs are more about the look of the diagram than its semantics.
One problem with UML, though, is that I don't know of a single, universally accepted file format for exchanging UML. Almost every UML editor or modeling tool seems to come with a file format of its own.

Best way to parse text-based log files

I have these relatively big log files which are generated from a machine via a serial connection.
This log isn't structured and I need to check various different things. I wonder if there is some kind of existing language or tool that specializes in this kind of thing?
Languages I currently know:
C and C++
Python
some Java
various scripting languages
I hope some of you have a good recommendation!
Going with what you already know, just use regular expressions in Python.
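As a rough illustration of that suggestion, here is a minimal sketch using Python's standard `re` module. The line layout (`HH:MM:SS LEVEL message`), the group names, and the helper names `parse_lines` and `count_levels` are assumptions made up for this example; the real log is described as unstructured, so the pattern would have to be adapted to whatever the machine actually emits.

```python
import re
from collections import Counter

# Hypothetical line layout "<HH:MM:SS> <LEVEL> <message>"; adjust to the real log.
LINE_RE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2})\s+(?P<level>[A-Z]+)\s+(?P<msg>.*)$"
)

def parse_lines(lines):
    """Yield a dict of named groups for every matching line; skip the rest."""
    for line in lines:
        match = LINE_RE.match(line.rstrip("\r\n"))
        if match:
            yield match.groupdict()

def count_levels(lines):
    """Count how many parsed lines were seen per level."""
    return Counter(record["level"] for record in parse_lines(lines))

sample = [
    "12:00:01 INFO boot complete",
    "garbage from the serial port",
    "12:00:05 ERROR sensor timeout",
    "12:00:09 ERROR sensor timeout",
]

if __name__ == "__main__":
    print(count_levels(sample))
```

Because `parse_lines` is a generator, the same code can be fed an open file object, so a large log does not have to be loaded into memory at once.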

Software to identify patterns in text files

I work on some software that parses large text files and inserts data into a database. Every time we get a new client, we have to write new parsing code for their text files.
I'm looking for some software to help simplify analyzing the text files. It would be nice to have some software that could identify patterns in the file.
I'm also open to any general purpose parsing libraries (.NET) that may simplify the job. Or any other relevant software.
Thanks.
More Specific
I open a text file with some magic software that shows me repeating patterns that it has identified. Really I'm just looking for any tools that developers have used to help them parse files. If something has helped you do this, please tell me about it.
Well, likely not exactly what you are looking for, but clone detection might be the right kind of idea.
There are a variety of such detectors. Some work only on raw lines of text, and that might apply directly to you.
Some work only on the words ("tokens") that make up the text, for some definition of "token".
You'd have to define what you mean by tokens to such tools.
But you seem to want something that discovers the structure of the text and then looks for repeating blocks with some parametric variation. I think this is really hard to do, unless you know sort of what that structure is in advance.
Our CloneDR does this for programming language source code, where the "known structure" is that of the programming language itself, as described specifically by the BNF grammar rules.
You probably don't want to apply Java-biased duplicate detection to semi-structured text. But if you do know something about the structure of the documents, you could write that down as a grammar, and our CloneDR tool would then pick it up.
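To give a feel for the simplest variant mentioned above (detection on raw lines of text), here is a minimal Python sketch. It is not how CloneDR works; it only reports blocks of consecutive lines that repeat exactly after whitespace normalization. The function name `repeated_blocks` and the `window` parameter are invented for this illustration.

```python
from collections import defaultdict

def repeated_blocks(lines, window=3):
    """Map each block of `window` consecutive lines that occurs more than
    once to the 0-based line indexes where an occurrence starts."""
    # Collapse runs of whitespace so purely cosmetic differences are ignored.
    normalized = [" ".join(line.split()) for line in lines]
    seen = defaultdict(list)
    for start in range(len(normalized) - window + 1):
        block = tuple(normalized[start:start + window])
        seen[block].append(start)
    return {block: starts for block, starts in seen.items() if len(starts) > 1}

text = ["a  b", "c", "d", "x", "a b", "c", "d"]

if __name__ == "__main__":
    for block, starts in repeated_blocks(text).items():
        print(starts, block)
```

Exact matching like this cannot find blocks that differ by a parameter, which is the harder problem described in the answer.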

Pipeline for writing a book for programmers in a collaboration

I'm a member of a group of enthusiast writers who decided to collaborate on a cookbook-style book for a programming language.
We're trying to pick a pipeline for the collaboration.
I like how ProGit is made.
That is Markdown + some custom pre-processing, processed by Pandoc. But I'm concerned that Markdown is too simple for our case.
I've looked at Sphinx, but I have no experience using it.
I know that LaTeX would work — but I'm afraid that it will scare off the contributors. Also it may be too powerful, and too easy to build a byzantine pipeline if you don't have the necessary experience (which I do not).
Please do not suggest solutions where a person has to write XML by hand or must use some specific GUI (optionally available GUIs are good, of course). Commercial and non-cross-platform solutions are not an option either.
It's hard to say whether pandoc's extended version of markdown would be too simple for your case unless you say what features you need. Note also that, if you're able to do a bit of very simple Haskell scripting, you can use the pandoc API to add features.

Standard format for concrete and abstract syntax trees

I have an idea for a hobby project which performs some code analysis and manipulation. This project will require both the concrete and abstract syntax trees of a given source file. Additionally, bi-directional references between the two trees would be helpful. I would like to avoid the work of transcribing a grammar to construct my own lexer and parser.
Is there a standard format for describing either concrete or abstract syntax trees?
Do any widely-used tool chains support outputting to these formats?
I don't have a particular target programming language in mind. Any popular one will do for a prototype, but I'd prefer one I know well: Python, C#, JavaScript, or C/C++.
I'd like the ability to run a source file through a tool or library and get back both trees. In an ideal world, it would be practical to run this tool on code as it is being edited by a user and be tolerant of errors. Again, I am simply trying to develop a prototype, so these requirements are pretty lax.
Thanks!
The research community decided that graph exchange was the right thing to do when moving information from one program analysis tool to another.
See http://www.gupro.de/GXL
More recently, the OMG has defined a standard for interchanging Abstract Syntax Trees.
See http://www.omg.org/spec/ASTM/1.0/Beta1/
This problem seems to get solved over and over again.
There have been half a dozen "tool bus" proposals over the years
that all solved it, with none ever taking over the industry.
The problem is that a) it is easy to represent ASTs using
any kind of nestable notation [parentheses like LISP,
tags like XML, ...] so people roll their own solution easily,
and b) for one tool to exchange an AST with another, they
both have to agree essentially on what the AST nodes mean;
but most ASTs are rather accidentally derived from the particular
grammar/parsing technology used by each tool, and there's
almost always disagreement about that between tools.
So, I've seen very few tools that exchange ASTs meaningfully.
If you're doing a hobby thing, I'd stick with a lisp-like
encoding of trees, where each node has the following format:
( ... )
It's easy to generate, and easy to read.
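As a small illustration of such a lisp-like encoding, here is a sketch that uses Python's built-in `ast` module as the parser. The exact node layout in the answer above is elided, so this sketch assumes one: each node is written as its type name followed by its child nodes, all inside one pair of parentheses. The function name `to_sexpr` is made up for this example, and leaf values such as identifier names are deliberately left out to keep it short.

```python
import ast

def to_sexpr(node):
    """Render an ast node as "(NodeType child1 child2 ...)".

    Assumed layout: the node's type name, then its child nodes, recursively.
    """
    children = [to_sexpr(child) for child in ast.iter_child_nodes(node)]
    return "(" + " ".join([type(node).__name__] + children) + ")"

if __name__ == "__main__":
    tree = ast.parse("x = 1")
    print(to_sexpr(tree))
```

The same recursive shape works for any tree type that can enumerate its children, so the encoding is not tied to Python's own AST.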
I work on a professional tool to manipulate programs. If we
have to print out the AST, we do the above. Mostly individual
ASTs are far too complicated to look at in practice,
so we hardly ever print out the entire AST, at best only
a node and a few children deep. Our tool doesn't exchange
ASTs with anybody (see above reasons :) but does just
fine building it in memory, doing whizzy things with it
for analysis reasons or transformation reasons, and then
either just deleting it (no need to send it anywhere)
or regenerating the original language text from the tree.
[The latter means you need anti-parsing or "prettyprinting"
technology]
In our project we defined the AST metamodel in UML and use ANTLR (Java) to populate the model. We also maintain the token information from ANTLR after parsing, but we have not yet tried to update the underlying text-file with modifications made on the model.
This has a hideous overhead (in infrastructure, such as Eclipse UML2/EMF), but our goal is to use high-level tools for Model-based/driven Development (MDD, MDA) anyway, so we decided to use it on each level.
I think one of our students once played with OpenArchitectureWare and managed to get changes from the Eclipse-based, generated editor back into the syntax tree (not related to the UML model above) automatically, but I don't know the details about this.
You might also want to look at ANTLR's tree grammars.
One would expect specific standards here, but more general-purpose standards may also be appropriate. Ira Baxter already mentioned GXL; RDF could be added too, although it would require an appropriate ontology and is oriented more toward semantics than syntax. Still, it may be an option worth investigating.
As for specific standards, Ira Baxter already mentioned ASTM. Another one, although it targets a specific kind of programming language (logic languages), is a standard for semantic/conceptual graphs, known as ISO/IEC 24707:2007.
Not a standard on its own, but a paper about that matter: Towards Portable Source Code Representations Using XML.
I don't know of any standard that is actually used in practice (in this area, everything is home-made everywhere); I'm just interested in this topic too.

Resources