Storing a decision tree in Neo4j

I've been looking for a way to 'productionize' R- or Python-based Random Forest / gradient-boosted tree models, and had thought that since all the individual component decision trees are binary trees, exporting to a graph database might be a workable solution (deploying by holding the models in memory and invoking them from a lightweight RESTful library like Flask doesn't scale that well). Here's how a decision tree is normally traversed:
1.) Data gets passed to the root node.
2.) We check whether the present node is a leaf node; if it is, we return a set of attributes (the predicted distribution/value).
3.) If not, the node stores a decision rule and checks the relevant column to decide which child node to pass the data to next (e.g., "if age > 9.5, move to the left node").
4.) Repeat steps 2-3 (sketched in code below).
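For concreteness, here is a minimal Python sketch of that loop; the dict-based node structure is hypothetical:

# hypothetical in-memory tree node: either a leaf carrying a prediction,
# or a split carrying a column name, a threshold, and left/right children
def predict(node, row):
    while not node["is_leaf"]:
        # e.g. "if age > 9.5, move to the left node"
        if row[node["column"]] > node["threshold"]:
            node = node["left"]
        else:
            node = node["right"]
    return node["prediction"]  # the predicted distribution/value

# usage: predict(tree, {"age": 12}) follows "age > 9.5" into the left subtree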
I'm new to neo4j and graph databases in general, and it wasn't clear to me whether it is possible to store (and subsequently traverse) decision rules in a node; all the examples I saw tended to be in the vein of
MATCH (neo:Database {name:"Neo4j"})
MATCH (johan:Person {name:"Johan"})
CREATE (johan)-[:FRIEND]->(:Person:Expert {name:"Max"})-[:WORKED_WITH]-> (neo)
where the conditional statements are prespecified in the query. Is this feasible with Neo4j, and if so, which areas of the documentation should I focus on?
Thank you for any guidance you could provide.

Interesting problem.
You need a way to export a model out of R or Python and translate it into a Neo4j graph.
The export mechanism can be PMML (if you're using the R rpart package to generate pruned trees), Google protobuf (if you're using the R gbm package to generate trees), or simply an Excel spreadsheet.
Parsing and unmarshalling that export into Neo4j is the part you have to solve.
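As a rough illustration of that unmarshalling step, here is a sketch using the official neo4j Python driver, assuming the tree has already been parsed out of PMML/protobuf into nested dicts; the node label, property names, and dict shape are all made up:

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# hypothetical parsed tree: a split on age with two leaf children
parsed_tree = {
    "is_leaf": False, "column": "age", "threshold": 9.5,
    "left": {"is_leaf": True, "prediction": 0.8},
    "right": {"is_leaf": True, "prediction": 0.2},
}

def store_tree(session, node, parent_id=None, side=None):
    # one graph node per tree node; the decision rule is held in node properties
    node_id = session.run(
        "CREATE (n:TreeNode {is_leaf: $is_leaf, column: $column, "
        "threshold: $threshold, prediction: $prediction}) RETURN id(n) AS id",
        is_leaf=node["is_leaf"], column=node.get("column"),
        threshold=node.get("threshold"), prediction=node.get("prediction"),
    ).single()["id"]
    if parent_id is not None:
        session.run(
            "MATCH (p:TreeNode), (c:TreeNode) WHERE id(p) = $pid AND id(c) = $cid "
            "CREATE (p)-[:BRANCH {side: $side}]->(c)",
            pid=parent_id, cid=node_id, side=side)
    if not node["is_leaf"]:
        store_tree(session, node["left"], node_id, "left")
        store_tree(session, node["right"], node_id, "right")

with driver.session() as session:
    store_tree(session, parsed_tree)

Traversal then becomes a matter of walking the BRANCH relationships, evaluating the stored rule at each node against the incoming row.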

I am not affiliated with Yhat in any way, but reading your question made me think of an alternative approach.
Yhat Science Ops
I don't know what that means for your team internally, but it seems like a pretty simple way to make a model easy to call via a basic API.

Related

Parse batch of SequenceExample

There is a function to parse a single SequenceExample: tf.parse_single_sequence_example().
But it parses only one SequenceExample at a time, which is not efficient.
Is there any way to parse a batch of SequenceExamples?
tf.parse_example can parse many Examples.
The documentation for tf.parse_example contains a little info about SequenceExample:
Each FixedLenSequenceFeature df maps to a Tensor of the specified type (or tf.float32 if not specified) and shape (serialized.size(), None) + df.shape. All examples in serialized will be padded with default_value along the second dimension.
But it is not clear how to do that. I have not found any examples on Google.
Is it possible to parse many SequenceExamples using parse_example(), or does some other function exist?
Edit:
Where can I ask the TensorFlow developers whether they plan to implement a parsing function for multiple SequenceExamples?
Any help will be appreciated.
If you have many small sequences where batching at this stage is important, I would recommend VarLenFeatures or FixedLenSequenceFeatures with regular Example protos (which, as you note, can be parsed in batches with parse_example). For examples of this, see the unit tests associated with example parsing (testSerializedContainingSparse parses Examples with FixedLenSequenceFeatures).
SequenceExamples are more geared toward cases where there is a significant amount of preprocessing work to be done for each SequenceExample (which can be done in parallel with queues). parse_example does not support SequenceExamples.
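For illustration, here is a minimal sketch of that recommendation against the TF 1.x API the question uses; the feature names are made up:

import tensorflow as tf

# a batch of serialized Example protos, e.g. coming from a reader/queue
serialized = tf.placeholder(tf.string, shape=[None])

parsed = tf.parse_example(serialized, features={
    # variable-length feature -> SparseTensor with one row per example
    "tokens": tf.VarLenFeature(tf.int64),
    # or padded out to a dense [batch_size, max_len] Tensor:
    "labels": tf.FixedLenSequenceFeature([], tf.int64, allow_missing=True),
})

tokens = parsed["tokens"]  # tf.SparseTensor
labels = parsed["labels"]  # dense Tensor, padded with the default value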

XOR, AND Tree in Neo4j Cypher

I have a problem trying to "decypher" a logical tree with Neo4j's Cypher.
I have a logical tree of Operations over Leaves, and I want to collect the valid sets of Leaves.
I am currently trying to collect the valid sets of Leaves on a ValidConfiguration node, so I can later quickly path through that configuration node.
Example
(1 AND 2) AND (3 AND 4)
is easy to match: MATCH (rule)-[:AND*]->(leaf) RETURN collect(leaf)
However
(1 XOR 2) AND (3 XOR 4)
is a problem, because whenever I collect 1, 2, 3, 4 into a single variable I cannot later properly take the cartesian product implied by the AND operation. The valid sets would be {1,3}, {1,4}, {2,3}, {2,4}.
In general I have a tree of variable depth (up to a max of about 3-4).
Operations are XOR, AND, Not AND, Not XOR
Is there a simple way in Cypher I am missing for navigating such trees?
Is trying to merge Valid Sets in a ValidConfiguration Node a good idea for fast Queries?
Later it should support a query of the form
(:Model)-->(:ValidConf)-->(:Leaf:Option)-->(:Feature)
then return all models that have a certain Feature in a valid configuration.
Or multiple Features at a certain configuration price.
Do I need UDFs or ObjectGraphMapper to get this problem solved?
Are there any UDFs that work with such decision trees which I can use?
Any help would be highly appreciated.
Create Example
CREATE (r:Rule {id:123})-[:COMPOSITION]->(startOp:AndOperation:Operation:Operand)
CREATE (startOp)-[:AND]->(intermediateOp1:OrOperation:Operation:Operand)
CREATE (startOp)-[:AND]->(intermediateOp2:OrOperation:Operation:Operand)
CREATE (intermediateOp1)-[:XOR]->(o1:Option:Operand{id:321})
CREATE (intermediateOp1)-[:XOR]->(o2:Option:Operand{id:564})
CREATE (intermediateOp2)-[:XOR]->(o3:Option:Operand{id:876})
CREATE (intermediateOp2)-[:XOR]->(o4:Option:Operand{id:227})
CREATE (o1)-[:CONSISTS_OF]->(f1:Feature{text:"magicwand"})
....
This tree is symmetric, but they usually aren't. I need o1 + o4 to be valid and o1 + o2 not to be valid. The OrOperation nodes are to be understood as XOR.
I don't think Cypher is going to work for evaluating a boolean binary expression tree. To quote cybersam's answer to a related question:
This is because Cypher has no looping statements powerful enough to iteratively calculate subresults (in the correct order) for trees of arbitrary depth.
You're going to have to look for some additional system to do the evaluation.
If you can code Java, you should be able to do this by implementing your own custom procedure to evaluate a boolean expression tree in the correct order.
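For illustration, here is a minimal Python sketch (run outside the database, e.g. on a tree fetched with a driver) of the recursive expansion such a procedure would perform: an XOR node contributes the union of its children's valid sets, an AND node their cartesian product. The dict-based tree shape is made up:

from itertools import product

def valid_sets(node):
    # a leaf option is its own (single-element) valid set
    if node["type"] == "Option":
        return [frozenset([node["id"]])]
    child_sets = [valid_sets(child) for child in node["children"]]
    if node["type"] == "XOR":
        # exactly one child branch is chosen: union of the children's alternatives
        return [s for alternatives in child_sets for s in alternatives]
    if node["type"] == "AND":
        # every child branch is chosen: cartesian product, merged per combination
        return [frozenset().union(*combo) for combo in product(*child_sets)]
    raise ValueError("unsupported operation: %s" % node["type"])

# (1 XOR 2) AND (3 XOR 4)
tree = {"type": "AND", "children": [
    {"type": "XOR", "children": [{"type": "Option", "id": 1},
                                 {"type": "Option", "id": 2}]},
    {"type": "XOR", "children": [{"type": "Option", "id": 3},
                                 {"type": "Option", "id": 4}]},
]}
print(valid_sets(tree))  # the four valid sets: {1,3}, {1,4}, {2,3}, {2,4}

The same expansion, written as a Neo4j user-defined procedure in Java, could then MERGE each resulting set onto a ValidConfiguration node.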

Plotting rules as a tree for Cubist package in R

Is there any way I can plot the rules obtained from a Cubist model in a decision tree format?
I can visualize the rules in text format (in the console) by viewing the model summary, but I am unable to obtain a graphical tree representation of the same. I have tried the "partykit", "rattle", "Rgraphviz" and "RWeka" packages.
I had the same problem - and didn't succeed.
Since cubist is originally written in C and the R library simply returns the output captured from the C code (see https://cran.r-project.org/web/packages/Cubist/Cubist.pdf, page 3), I am pretty sure that plotting routines from other R libraries won't work.
Hence, I only see these solutions:
write your own visualisation, based on parsing the text output (definitely a lot of work)
wait for a later version of Cubist with such a routine included (although I have no idea if this is even planned)

Source code logic evaluation

I was given a fragment of code (a function called bubbleSort(), written in Java, for example). How can I, or rather my program, tell if a given source code implements a particular sorting algorithm the correct way (using bubble method, for instance)?
I can enforce a user to give a legitimate function by analyzing the function signature: making sure the argument and the return value are arrays of integers. But I have no idea how to determine that the algorithm logic is implemented the right way. The input code could sort values correctly, but not via the aforementioned bubble method. How can my program discern that? I do realize a lot of code parsing would be involved, but maybe there's something else that I should know.
I hope I was somewhat clear.
I'd appreciate if someone could point me in the right direction or give suggestions on how to tackle such a problem. Perhaps there are tested ways that ease the evaluation of program logic.
In general, you can't do this because of the Halting problem. You can't even decide if the function will halt ("return").
As a practical matter, there's a bit more hope. If you are looking for a bubble sort, you can decide that it has a number of parts:
a to-be-sorted datatype S with a partial order,
a container datatype C with a single instance variable A ("the array") that holds the to-be-sorted data,
a key type K ("array index"), having a partial order, used to access the container such that container[K] is of type S,
a comparison of two members of the container, using key A and key B such that A < B according to the key partial order, that determines whether container[B] > container[A],
a swap operation on container[A], container[B] and some variable T of type S, that is conditionally dependent on the comparison,
a loop wrapped around the container that enumerates keys according to the partial order on K.
You can build bits of code that find each of these bits of evidence in your source code, and if you find them all, claim you have evidence of a bubble sort.
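For reference, here is a canonical bubble sort exhibiting each of those parts (shown in Python for brevity; the question itself is about Java, but the structure is the same):

def bubble_sort(a):                    # container C holding the to-be-sorted data
    n = len(a)                         # keys are integer indexes (type K)
    for i in range(n):                 # loop enumerating keys in key order
        for j in range(n - 1 - i):     # second key, ordered after the first
            if a[j] > a[j + 1]:        # comparison of container[A] and container[B]
                t = a[j]               # conditional swap through temporary T
                a[j] = a[j + 1]
                a[j + 1] = t
    return a

print(bubble_sort([5, 1, 4, 2, 8]))    # [1, 2, 4, 5, 8]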
To do this concretely, you need standard program analysis machinery:
to parse the source code and build an abstract syntax tree
build symbol tables (ST) that know the type of each identifier where it is used
construct a control flow graph (CFG) so that you can check that the various recognized bits occur in the appropriate order
construct a data flow graph (DFG), so that you can determine that values recognized in one part of the algorithm flow properly to another part
[That's a lot of machinery just to get started]
From here, you can write ad hoc procedural code to climb over the AST, ST, CFG and DFG to "recognize" each of the individual parts. This is likely to be pretty messy, as each recognizer will be checking these structures for evidence of its bit. But you can do it.
This is messy enough, and interesting enough, so there are tools which can do much of this.
Our DMS Software Reengineering Toolkit is one. DMS already contains all the machinery to do standard program analysis for several languages. DMS also has a dataflow pattern matching language, inspired by Rich and Waters' 1980s "Programmer's Apprentice" ideas.
With DMS, you can express this particular problem roughly like this (untested):
dataflow pattern domain C;
dataflow pattern swap(in out v1:S, in out v2:S, T:S):statements =
" \T = \v1;
\v1 = \v2;
\v2 = \T;";
dataflow pattern conditional_swap(in out v1:S, in out v2:S,T:S):statements=
" if (\v1 > \v2)
\swap(\v1,\v2,\T);"
dataflow pattern container_access(inout container C, in key: K):expression
= " \container.body[\K] ";
dataflow pattern size(in container:C, out: integer):expression
= " \container . size "
dataflow pattern bubble_sort(in out container:C, k1: K, k2: K):function
" \k1 = \smallestK\(\);
while (\k1<\size\(container\)) {
\k2 = \next\(k1);
while (\k2 <= \size\(container\)) {
\conditional_swap\(\container_access\(\container\,\k1\),
\container_access\(\container\,\k2\)\)
}
}
";
Within each pattern, you can write what amounts to the concrete syntax of the chosen programming language ("pattern domain"), referencing dataflows named in the pattern signature line. A subpattern can be mentioned inside another; one has to pass the dataflows to and from the subpattern by naming them. Unlike "plain old C", you have to pass the container explicitly rather than by implicit reference; that's because we are interested in the actual values that flow from one place in the pattern to another. (Just because two places in the code use the same variable, doesn't mean they see the same value).
Given these definitions, and ask to "match bubble_sort", DMS will visit the DFG (tied to CFG/AST/ST) to try to match the pattern; where it matches, it will bind the pattern variables to the DFG entries. If it can't find a match for everything, the match fails.
To accomplish the match, each of the patterns above is converted essentially into its own DFG, and then each pattern is matched against the DFG for the code using what is called a subgraph isomorphism test. Constructing the DFG for the pattern takes a lot of machinery: parsing, name resolution, control and data flow analysis, applied to fragments of code in the original language, intermixed with various pattern meta-escapes. The subgraph isomorphism is "sort of easy" to code, but can be very expensive to run. What saves the DMS pattern matchers is that most patterns have many, many constraints [tech point: and they don't have knots] and each attempted match tends to fail pretty fast, or succeed completely.
Not shown, but by defining the various bits separately, one can provide alternative implementations, enabling the recognition of variations.
We have used this to implement quite complete factory control model extraction tools from real industrial plant controllers for Dow Chemical, for their peculiar Dowtran language (which meant building parsers, etc., as above, for Dowtran). We have a version of this prototyped for C; the data flow analysis is harder.

Why build an AST walker instead of having the nodes responsible for their own output?

Given an AST, what would be the reason behind making a Walker class that walks over the tree and does the output, as opposed to giving each Node class a compile() method and having it responsible for its own output?
Here are some examples:
Doctrine 2 (an ORM) uses a SQLWalker to walk over an AST and generate SQL from nodes.
Twig (a templating language) has the nodes output their own code (this is an if statement node).
Using a separate Walker for code generation avoids combinatorial explosion in the number of AST node classes as the number of target representations increases. When a Walker is responsible for code generation, you can retarget it to a different representation just by altering the Walker class. But when the AST nodes themselves are responsible for compilation, you need a different version of each node for each separate target representation.
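A minimal sketch of the contrast, in Python with made-up node and walker names: with walkers, retargeting means writing one new class instead of touching every node.

# Nodes hold structure only; each walker owns one output format.
class Num:
    def __init__(self, value): self.value = value

class Add:
    def __init__(self, left, right): self.left, self.right = left, right

class SqlWalker:                      # one class per target representation
    def walk(self, node):
        if isinstance(node, Num):
            return str(node.value)
        return "(%s + %s)" % (self.walk(node.left), self.walk(node.right))

class LispWalker:                     # retargeting = adding another walker
    def walk(self, node):
        if isinstance(node, Num):
            return str(node.value)
        return "(+ %s %s)" % (self.walk(node.left), self.walk(node.right))

tree = Add(Num(1), Add(Num(2), Num(3)))
print(SqlWalker().walk(tree))   # (1 + (2 + 3))
print(LispWalker().walk(tree))  # (+ 1 (+ 2 3))

# The alternative: each node compiles itself (Num.compile(), Add.compile()),
# so every new target representation means touching every node class.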
Mostly because of old literature and available tools. Experimenting with both methods, you can easily find that AST traversal produces very slow and convoluted code. Moreover, code separated from the immediate syntax doesn't resemble it anymore. It's very much like supporting two synchronized code bases, which is always a bad idea. Debugging and maintenance become difficult.
Of course, it can also be difficult to process semantics on the nodes unless you have a well-designed state machine. In fact, you are never worse off than having to traverse the AST after the fact, because that is just one particular case of processing semantics on nodes.
You can often hear that AST traversal allows for implementation of multiple semantics for the same syntax. In reality you would never want that, not only because it's rarely needed, but also for performance reasons. And frankly, there is no difficulty in writing a separate syntax for a different semantics. The results were always better when both were designed together.
And finally, in every non-trivial task, getting the syntax parsed is the easiest part; getting the semantics correct and processing actions fast is the challenge. Focusing on the AST is approaching the task backwards.
To have support for a feature that the "internal AST walker" doesn't have.
For example, there are several ways to traverse a "hierarchical" or "tree" structure,
like "walk through the leaves first" or "walk through the branches first".
Or the node's siblings may have a sort index, and you want to "walk" / "visit" them decrementally by their index, instead of incrementally.
If the AST class or structure you have only works with one method, you may want to use another method using your custom "walker" / "visitor".
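For example, a small Python sketch of two such custom walk orders over a made-up node type:

class Node:
    def __init__(self, value, children=()):
        self.value, self.children = value, list(children)

def walk_branches_first(node, visit):      # pre-order: parent before children
    visit(node)
    for child in node.children:
        walk_branches_first(child, visit)

def walk_leaves_first(node, visit):        # post-order: children before parent
    for child in node.children:
        walk_leaves_first(child, visit)
    visit(node)

tree = Node("root", [Node("a", [Node("a1")]), Node("b")])
walk_branches_first(tree, lambda n: print(n.value))  # root, a, a1, b
walk_leaves_first(tree, lambda n: print(n.value))    # a1, a, b, root

A decrementally indexed walk would just iterate the children in reversed sorted order inside the same kind of visitor.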
