Effect of using Apache Beam schemas - google-cloud-dataflow

What is the use of specifying beam schemas in our code when we are reading source? How does it make our pipeline more efficient?

Schemas are not so much for facilitating code reading, but for making it easier to do type conversions inside the pipeline (better performance, less need to specify coders, etc.), and for being able to apply higher-level transforms, such as Beam SQL, joins in Java or dataframes in Python.
The pipeline is more efficient because of type conversions. Schema types have a direct mapping to the programming language types, and they also have efficient coders for serialization. But I would say the main purpose and advantage of schemas is in the possibilities I mention in the previous paragraph.

Related

Ply VS Jflex+Cup

I'm trying to build a compiler for a Pascal-like language, and trying to decide between PLY (Python) and a combination of JFlex + CUP (Java). I can't find any proper materials to evaluate which one is better.
I want to know about performance, ease of use and how large the communities around each are.
From what I can gather, the Java options for building compilers and parsers in general seem to be used more, but of these ones in particular, PLY seems to be much more widely used.
Which one is better?

Is there any way to write a stored procedure or something similar, but in Cypher, NOT Java?

I know that you can write extensions that you can call from Cypher, but I'd really like to avoid having to write Java. I'm thinking something similar to SQL Server stored procedures. Is this possible, or could I maybe write a Cypher query and wrap it in some minimal Java to make the current capabilities work?
Besides @InverseFalcon's answer, there are really no Transact-SQL or PL/SQL-like languages for graphs yet.
The closest language I am aware of is SAP's GraphScript:
GraphScript is a domain-specific, read-only graph query language tailored to serve advanced graph analysis tasks and to ease the specification of custom, complex graph algorithms.
Caveats: it is only available in the SAP HANA Graph product, and, as the quote says, it is read-only. For more details, see presentation slides and paper.
If you would like to avoid Java due to its verbosity but are fine with writing general-purpose code on the JVM, you might want to try the Kotlin language. However, using anything other than Java tends to introduce some integration issues (across all JVM-based applications, not just Neo4j in particular), so be prepared to tackle those. There is an example project on GitHub for Neo4j Kotlin procedures to get you started. Caveats: even though there is basic Kotlin support in the Eclipse IDE, it's not on par with the IntelliJ edition, so you will probably need an IntelliJ license.
If you have access to APOC Procedures, you can use apoc.cypher.run() (or apoc.cypher.doIt() for write queries) to execute a Cypher query passed as a string.
You can always follow the tutorial for creating your own procedure and have it call the appropriate APOC cypher run procedure with a hardcoded query.

What are common properties in an Abstract Syntax Tree (AST)?

I'm new to compiler design and have been watching a series of YouTube videos by Ravindrababu Ravula.
I am creating my own language for fun and I'm parsing it to an Abstract Syntax Tree (AST). My understanding is that these trees can be portable given they follow the same structure as other languages.
How can I create an AST that will be portable?
Side notes:
My parser is currently written in JavaScript but I might move it to C#.
I've been looking at SpiderMonkey's specs for guidance. Is that a good approach?
Portability (however defined) is not likely to be your primary goal in building an AST. Few (if any) compiler frameworks provide a clear interface which allows the use of an external AST, and particular AST structures tend to be badly-documented and subject to change without notice. (Even if they are well-documented, the complexity of a typical AST implementation is challenging.)
An AST is closely tied to the syntactic details of a language, as well as to the particular parsing strategy being used. While it is useful to be able to repurpose ASTs for multiple tasks -- compiling, linting, pretty-printing, interactive editing, static analysis, etc. -- the conflicting demands of these different use cases tend to increase complexity. Particularly at the beginning stages of language development, you'll want to give yourself a lot of scope for rapid prototyping.
The most tempting reason for portable ASTs would be to use some other language as a target, thereby saving the cost of writing code-generation, etc. However, in practice it is usually easier to generate the textual representation of the other language from your own AST than to force your parser to use a foreign AST. Even better is to target a well-documented virtual machine (LLVM, .Net IL, JVM, etc.), which is often not much more work than generating, say, C code.
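To make the "generate the textual representation from your own AST" point concrete, here is a minimal sketch in Python. The node classes (Num, Var, BinOp, Assign) and the emit functions are hypothetical names invented for this illustration, not part of any framework; the sketch only handles expressions and a single assignment, and it parenthesises everything rather than tracking operator precedence.

```python
from dataclasses import dataclass
from typing import Union


# Hypothetical node types for a toy language; a real AST would have many more.
@dataclass
class Num:
    value: int


@dataclass
class Var:
    name: str


@dataclass
class BinOp:
    op: str          # one of "+", "-", "*", "/"
    left: "Expr"
    right: "Expr"


@dataclass
class Assign:
    target: str
    value: "Expr"


Expr = Union[Num, Var, BinOp]


def emit_expr(node: Expr) -> str:
    """Render an expression node as C source text, fully parenthesised."""
    if isinstance(node, Num):
        return str(node.value)
    if isinstance(node, Var):
        return node.name
    if isinstance(node, BinOp):
        return f"({emit_expr(node.left)} {node.op} {emit_expr(node.right)})"
    raise TypeError(f"unknown node: {node!r}")


def emit_stmt(node: Assign) -> str:
    """Render an assignment statement as one line of C."""
    return f"{node.target} = {emit_expr(node.value)};"


# x = a + 2 * b, built directly as a tree instead of being parsed.
tree = Assign("x", BinOp("+", Var("a"), BinOp("*", Num(2), Var("b"))))
c_line = emit_stmt(tree)
print(c_line)  # x = (a + (2 * b));
```

The tree stays entirely in your own format; only the final string has to make sense to the other language's compiler.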
You might want to take a look at the LLVM Kaleidoscope tutorial (the second section covers ASTs, although implemented in C++). Also, you might find this question on a sister site interesting reading. And finally, if you are going to do your implementation in JavaScript, you should at least take a look at the jison parser generator, which takes a lot of the grunt-work out of maintaining a parser and scanner (and thus allows for easier experimentation).

Is Erlang the right choice for a webcrawler?

I am planning to write a web crawler for an NLP project that reads the thread structure of a forum at a regular interval and parses each thread with new content. Via regular expressions, the author, the date and the content of new posts are extracted. The result is then stored in a database.
The language and platform used for the crawler have to match the following criteria:
easily scalable on multiple cores and cpus
suited for high I/O loads
fast regular expression matching
easy to maintain / little operational overhead
After some research I think Erlang might be a fitting candidate, but I read it's not very good at string processing (and so at regular expression matching). Nor do I have any experience with the maintenance factor.
Is Erlang a good technology for the scenario described above? And if not, what would be a good alternative?
I am also evaluating Erlang for use as a web crawler and it looks good so far.
There are lots of existing helpful modules: HTML parser, HTTP client, XPath, regex, cache.
And other people are interested in the same use case, so you can learn from them.
However, if this is just a one-off project I recommend Python / Ruby / Perl because it will be easier to get started with.
If you're familiar and comfortable with Erlang then I'd stick with it if I were you, although I'm not familiar with Erlang myself. With that noted, I'll give you some pointers:
Don't use regular expressions to parse HTML, use XPath instead.
HTML, while structured, is still quite difficult to parse in the wild and regular expressions are fairly slow and unreliable for parsing HTML.
Determine what your crawler architecture is going to be and what is your re-visit policy.
Find the best selection policy for you and implement it.
A web crawler is a fairly complex system to build and you have to be concerned about speed, performance, scalability and concurrency. Some of the most notable crawlers are written in C++ and Java, but I have not heard of any crawlers written in Erlang.
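As an illustration of the first pointer, here is a minimal sketch of pulling author, date and content out of a forum page with path queries instead of regular expressions. It uses Python's standard xml.etree.ElementTree, which supports only a small XPath subset and requires well-formed markup; real-world HTML usually needs a tolerant HTML parser in front of it. The page layout and class names (post, author, date, content) are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed forum markup; real pages will differ.
PAGE = """
<html><body>
  <div class="post">
    <span class="author">alice</span>
    <span class="date">2011-03-01</span>
    <p class="content">First post.</p>
  </div>
  <div class="post">
    <span class="author">bob</span>
    <span class="date">2011-03-02</span>
    <p class="content">A reply.</p>
  </div>
</body></html>
"""


def extract_posts(markup):
    """Return one dict per post, located by structure rather than by regex."""
    root = ET.fromstring(markup.strip())
    posts = []
    # ".//div[@class='post']" selects every post container at any depth.
    for post in root.findall(".//div[@class='post']"):
        posts.append({
            "author": post.findtext("span[@class='author']"),
            "date": post.findtext("span[@class='date']"),
            "content": post.findtext("p[@class='content']"),
        })
    return posts


posts = extract_posts(PAGE)
for p in posts:
    print(p["author"], p["date"], p["content"])
```

The queries depend on element names and attributes only, so changes in whitespace or attribute order that would break a regex do not affect them.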
Erlang is fine for this. Its regex library delegates (nearly all) work to PCRE, which should be fast enough. But avoid strings and use binaries instead! Binaries use a lot less memory and are also faster to translate to C strings.

Standard format for concrete and abstract syntax trees

I have an idea for a hobby project which performs some code analysis and manipulation. This project will require both the concrete and abstract syntax trees of a given source file. Additionally, bi-directional references between the two trees would be helpful. I would like to avoid the work of transcribing a grammar to construct my own lexer and parser.
Is there a standard format for describing either concrete or abstract syntax trees?
Do any widely-used tool chains support outputting to these formats?
I don't have a particular target programming language in mind. Any popular one will do for a prototype, but I'd prefer one I know well: Python, C#, JavaScript, or C/C++.
I'd like the ability to run a source file through a tool or library and get back both trees. In an ideal world, it would be practical to run this tool on code as it is being edited by a user and be tolerant of errors. Again, I am simply trying to develop a prototype, so these requirements are pretty lax.
Thanks!
The research community decided that graph exchange was the right thing to do when moving information from one program analysis tool to another.
See http://www.gupro.de/GXL
More recently, the OMG has defined a standard for interchanging Abstract Syntax Trees.
See http://www.omg.org/spec/ASTM/1.0/Beta1/
This problem seems to get solved over and over again.
There have been half a dozen "tool bus" proposals made over the years
that all solved it, with none of them ever taking over the industry.
The problem is that a) it is easy to represent ASTs using
any kind of nestable notation [parentheses like LISP,
tags like XML, ...] so people roll their own solution easily,
and b) for one tool to exchange an AST with another, they
both have to agree essentially on what the AST nodes mean;
but most ASTs are rather accidentally derived from the particular
grammar/parsing technology used by each tool, and there's
almost always disagreement about that between tools.
So, I've seen very few tools that exchange ASTs meaningfully.
If you're doing a hobby thing, I'd stick with a lisp-like
encoding of trees, where each node has the following format:
( ... )
It's easy to generate, and easy to read.
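A minimal sketch of such a Lisp-like dump, in Python. The exact node layout is an assumption made for this example (the node type first, followed by its children, with leaf values printed as literals); the function name to_sexpr and the tuple-based tree are likewise hypothetical.

```python
import json


def to_sexpr(node):
    """Encode a tree of (node_type, child, ...) tuples as a Lisp-like string."""
    if isinstance(node, tuple):
        node_type, *children = node
        parts = [node_type] + [to_sexpr(child) for child in children]
        return "(" + " ".join(parts) + ")"
    # Leaves are plain strings or numbers; json.dumps quotes strings safely.
    return json.dumps(node)


# Assumed tree for the statement: x = a + 2
tree = ("assign", ("name", "x"), ("add", ("name", "a"), ("num", 2)))
print(to_sexpr(tree))  # (assign (name "x") (add (name "a") (num 2)))
```

Printing only a node and a few levels of children is then just a matter of adding a depth limit to the recursion.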
I work on a professional tool to manipulate programs. If we
have to print out the AST, we do the above. Mostly individual
ASTs are far too complicated to look at in practice,
so we hardly ever print out the entire AST, at best only
a node and a few children deep. Our tool doesn't exchange
ASTs with anybody (see above reasons :) but does just
fine building it in memory, doing whizzy things with it
for analysis reasons or transformation reasons, and then
either just deleting it (no need to send it anywhere)
or regenerating the original language text from the tree.
[The latter means you need anti-parsing or "prettyprinting"
technology]
In our project we defined the AST metamodel in UML and use ANTLR (Java) to populate the model. We also maintain the token information from ANTLR after parsing, but we have not yet tried to update the underlying text-file with modifications made on the model.
This has a hideous overhead (in infrastructure, such as Eclipse UML2/EMF), but our goal is to use high-level tools for Model-based/driven Development (MDD, MDA) anyway, so we decided to use it on each level.
I think one of our students once played with OpenArchitectureWare and managed to get changes from the Eclipse-based, generated editor back into the syntax tree (not related to the UML model above) automatically, but I don't know the details about this.
You might also want to look at ANTLR's tree grammars.
You can expect to find some specific standards, though more general-purpose ones may also be appropriate. Ira Baxter already mentioned GXL; RDF could be added too, except that it would require an appropriate ontology and is oriented more toward semantics than syntax. It may still be an option worth investigating.
For specific standards, Ira Baxter already mentioned ASTM. Another one, although it targets a specific kind of programming language (logic languages), is the standard for semantic/conceptual graphs, ISO/IEC 24707:2007.
Not a standard in its own right, but a paper on the subject: Towards Portable Source Code Representations Using XML.
I don't know of any standard that is actually used in practice (in this area, everyone seems to roll their own); I'm just interested in this topic too.
