Is there any way I can plot the rules obtained from a Cubist model in a decision tree format?
I can visualize the rules in text format (in the console) by viewing the model summary, but I am unable to obtain a graphical tree representation of them. I have tried the "partykit", "rattle", "Rgraphviz" and "RWeka" packages.
I had the same problem and didn't succeed.
Since Cubist is originally written in C and the R package simply returns the output captured from the C code (see https://cran.r-project.org/web/packages/Cubist/Cubist.pdf, page 3), I am pretty sure that plotting routines from other R libraries won't work.
Hence, I only see these solutions:
write your own visualisation, based on parsing the text output (definitely a lot of work; a rough sketch of such a parser is shown below)
wait for a later version of Cubist with such a routine included (although I have no idea if this is even planned)
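For the first option, here is a very rough sketch (in Python, not R) of what such a parser could look like. The rule layout it assumes ("Rule N: ...", an "if" block of conditions, then an "outcome" line) is only an approximation of Cubist's actual summary output, so the string matching would almost certainly need adjusting to the real text:

def parse_cubist_rules(summary_text):
    """Very rough sketch: split a Cubist summary into rules, their condition
    lines and their linear models.  The assumed text layout will likely need
    adjusting to the real output of summary(model)."""
    rules = []
    current = None
    for line in summary_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Rule "):            # e.g. "Rule 1: [20 cases, ...]"
            current = {"header": stripped, "conditions": [], "model": None}
            rules.append(current)
        elif current is not None and stripped.startswith("outcome"):
            current["model"] = stripped             # the rule's linear model
        elif current is not None and stripped and stripped not in ("if", "then"):
            current["conditions"].append(stripped)  # e.g. "humidity > 80"
    return rules

From the resulting list of rules and conditions you could then build and draw a tree with whatever plotting library you prefer (e.g. igraph or graphviz).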
I am working on a custom programming language. On compiling it, the parser first converts the text into a simple stream of tokens. The tokens are then converted into a simple tree. The tree is then converted into an object graph (with holes in it, as the types haven't yet been necessarily fully figured out). The holey tree is then transformed into a compact object graph.
Then we can go further and compile it to, say, JavaScript. The compact object graph is then transformed into a JavaScript AST. The JS AST is then transformed into a "concrete" syntax tree (with whitespace and such), and then that is converted into the JS text.
So in going from text to compact object graph, there are 5 stages (text -> token_list -> tree -> holey_graph -> graph). In other situations (other languages), you might have more or less.
The way I am doing this transformation now is very ad-hoc and not keeping track of line numbers, so it's impossible to really tell where an error is coming from. I would like to fix that.
In my case, I am wondering how you could create a data model to keep track of the line of text where something was defined. This way, you could report any compilation errors nicely to the developer. The way I have modeled that so far is with a sort of "folding" model, as I'm calling it. The initial "fold" is on the text -> token_list transformation. For each token, it keeps track of three things: the token's line, column, and text length. At first you may model it like this:
{
token: 'function',
line: 10,
column: 2,
size: 8
}
But that is tying two concepts into one object: the token itself, and the "fold" as I am calling it. Really it would be better like this:
fold = {
line: 10,
column: 2,
size: 8
}
token = {
value: 'function'
}
// bind the two together.
fold.data = token
token.fold = fold
Then, you transform from token to AST node in the simple tree. That might be like:
treeNode = {
type: 'function'
}
fold = {
previous: tokenFold,
data: treeNode
}
And so connecting the dots like this. In the end, you would have a fold list, which could theoretically be traversed from the compact object graph back to the text, so if there was a compile error when doing typechecking, for example, you could report the exact line number and everything to the developer. The navigation would look something like this:
data = compactObjectGraph
.fold
.previous.previous.previous.previous
.data
data.line
data.column
data.size
In theory. But the problem is, the "compact object graph" might have been created not from a simple linear chain of inputs, but from a suite of inputs. While I have modeled this on paper so far, I am starting to think there isn't actually a clear way, with this sort of "fold" system, to map from object to object how each was transformed.
The question is, how can I define the data model to allow for getting back to the source text line/column number, given there is a complex sequence of transformations from one data structure to the next? That is, at a high level, what is a way to model this that will allow you to isolate the transformation data structures, yet be able to map from the last generated one to the first, to find how some compact object graph node was actually represented in the original source text?
I would create a data structure containing the filename, line and column. In C++ it may work well to store a reference to this structure, rather than copy it to many places.
There aren't really that many ways to solve this, but having a single structure that is reusable across your other data structures is almost certainly the right solution.
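A minimal sketch of that idea, in Python rather than C++ (the names SourceLocation, Token and TreeNode are only illustrative). Since Python objects are handled by reference anyway, the same location object can simply be attached to every artifact derived from it rather than copied:

from dataclasses import dataclass

@dataclass(frozen=True)
class SourceLocation:
    filename: str
    line: int
    column: int
    size: int

@dataclass
class Token:
    value: str
    location: SourceLocation        # shared, not copied

@dataclass
class TreeNode:
    type: str
    location: SourceLocation        # same object as on the token it came from

loc = SourceLocation("main.lang", 10, 2, 8)
tok = Token("function", loc)
node = TreeNode("function", tok.location)   # just propagate the reference

print(f"{node.location.filename}:{node.location.line}:{node.location.column}")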
I answered your question on Quora in July, so maybe you missed it: https://qr.ae/pvkrwJ
Basically you have to stamp all the compiler artifacts with the source information from which they are derived, best represented as some kind of structure (Mats' response). Yes, that takes effort, because you have to do it everywhere in the compiler.
To do a perfect job, you'd need to stamp it with the complete set of source items that caused its generation; you're essentially producing a dependency graph. (You could represent such sets as trees of subsets to maximize sharing). Then any complaint the compiler issued could clearly identify the set of causes.
To do a less perfect job, you can pick any one of the contributing items and use that as the source location dependency. That means a compiler complaint will only identify one source location that might be the cause, and the reader will have to guess at the others if that isn't the principal source of the problem. Judicious choice of which-cause source information can arrange it so the answer is right much of the time, and that's probably good enough.
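A hedged sketch of the "complete set of causes" variant (the names Artifact and derive, and the tuple-based location format, are purely illustrative): every derived artifact carries the union of the origin sets of the artifacts it was built from, so a complaint can list every contributing source position:

from dataclasses import dataclass

@dataclass
class Artifact:
    kind: str
    origins: frozenset   # set of (filename, line, column) tuples

def derive(kind, *inputs):
    """New artifact whose origin set is the union of its inputs' origins."""
    return Artifact(kind, frozenset().union(*(i.origins for i in inputs)))

tok_a = Artifact("token", frozenset({("main.lang", 10, 2)}))
tok_b = Artifact("token", frozenset({("main.lang", 11, 4)}))

node  = derive("tree_node", tok_a, tok_b)   # knows about both source positions
graph = derive("graph_node", node)          # still knows about both

for filename, line, column in sorted(graph.origins):   # report all contributing causes
    print(f"caused by {filename}:{line}:{column}")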
I am attempting to 'simplify' some smtlib2 files using the z3 Python API via the following:
reading in an SMTLIB2 file
applying some tactics & extracting a simplified goal
adding the simplified goal to a new solver
printing the new solver via to_smt2()
I have an odd use case where it would be ideal if the resulting smtlib file did not contain any let expressions. Is there a way to expand them via the python API?
Creation of let-expressions is controlled by the pretty-printer. Try something like:
set_option(max_args=10000000, max_lines=1000000, max_depth=10000000, max_visited=1000000)
You can play with the actual numbers to find a setting that works for your use case. Essentially, the larger the numbers, the less the sharing/chopping off will be.
Also of importance is the parameter min_alias_size. Also try setting that to a large number. (The default is 10, which forces let-expressions.)
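Putting the whole pipeline together, it could look roughly like the sketch below. The file name and the choice of tactic are placeholders; depending on the z3 version, parse_smt2_file returns either a single expression or an AstVector, and Goal.add accepts both:

from z3 import *

# Crank up the pretty-printer limits so it stops introducing let-bindings.
set_option(max_args=10000000, max_lines=1000000,
           max_depth=10000000, max_visited=1000000)
set_option(min_alias_size=1000000)   # default is 10, which forces let-expressions

g = Goal()
g.add(parse_smt2_file("input.smt2"))    # placeholder file name

simplified = Tactic('simplify')(g)      # placeholder tactic; use your own

s = Solver()
s.add(simplified[0].as_expr())          # first (and here only) subgoal

print(s.to_smt2())                      # should now be free of let-expressions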
I'm not too familiar with machine learning techniques, and I want to know if I can transfer a final trained model to another machine. More specifically, I'm trying to solve a sound classification problem by training a model on a regular PC and then implementing / transferring its output model to an embedded system where no libraries are allowed (C programming). The system does not support file reading either.
So my question is:
Are there learning methods with output models simple enough that they can be implemented easily on other systems? How would you implement it? (Something like Q-learning? Although Q-learning wouldn't be appropriate in my project.)
I would like some pointers, thanks in advance.
Any arbitrary "blob" of data can be converted into a C byte array and compled and linked directly with your code. A code generator is simple enough to write, but there are tools that will do that directly such a Segger Bin2C (and any number of other tools called "bin2c") or the swiss-army knife of embedded data converters SRecord.
Since SRecord can do so many things, getting it to do this one thing is less than obvious:
srec_cat mymodel.nn -binary -o model.c -C-Array model -INClude
will generate a model.c and model.h file defining a data array containing the byte content of mymodel.nn.
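If you would rather not depend on an external tool, a generator of this kind really is only a few lines. A rough Python sketch (file and array names are placeholders):

def bin2c(in_path, out_path, array_name="model"):
    """Minimal bin2c-style generator: turns an arbitrary binary blob into a
    C source file that can be compiled and linked with the firmware."""
    with open(in_path, "rb") as f:
        data = f.read()
    with open(out_path, "w") as out:
        out.write("#include <stddef.h>\n\n")
        out.write(f"const unsigned char {array_name}[] = {{\n")
        for i in range(0, len(data), 12):
            chunk = ", ".join(f"0x{b:02X}" for b in data[i:i + 12])
            out.write(f"    {chunk},\n")
        out.write("};\n")
        out.write(f"const size_t {array_name}_length = sizeof({array_name});\n")

bin2c("mymodel.nn", "model.c")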
There is a function to parse a SequenceExample: tf.parse_single_sequence_example().
But it parses only a single SequenceExample, which is not efficient.
Is there any possibility to parse a batch of SequenceExamples?
tf.parse_example can parse many Examples.
The documentation for tf.parse_example contains a little info about SequenceExample:
Each FixedLenSequenceFeature df maps to a Tensor of the specified type (or tf.float32 if not specified) and shape (serialized.size(), None) + df.shape. All examples in serialized will be padded with default_value along the second dimension.
But it is not clear how to do that. I have not found any examples on Google.
Is it possible to parse many SequenceExamples using parse_example() or may be other function exists?
Edit:
Where can I ask the TensorFlow developers whether they plan to implement a parse function for multiple SequenceExamples?
Any help will be appreciated.
If you have many small sequences where batching at this stage is important, I would recommend VarLenFeatures or FixedLenSequenceFeatures with regular Example protos (which, as you note, can be parsed in batches with parse_example). For examples of this, see the unit tests associated with example parsing (testSerializedContainingSparse parses Examples with FixedLenSequenceFeatures).
SequenceExamples are more geared toward cases where there is a significant amount of preprocessing work to be done for each SequenceExample (which can be done in parallel with queues). parse_example does not support SequenceExamples.
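A rough sketch of the Example-proto route, using the same TensorFlow 1.x-style API as the question (the feature names are placeholders):

import tensorflow as tf

# serialized: a string Tensor of shape [batch_size] holding serialized
# Example protos, e.g. coming from a batched TFRecord reader.
features = {
    "tokens": tf.VarLenFeature(tf.int64),        # variable-length sequence per example
    "label": tf.FixedLenFeature([], tf.int64),   # fixed-size context feature
}

def parse_batch(serialized):
    parsed = tf.parse_example(serialized, features)
    # VarLenFeature yields a SparseTensor; densify it if you need padding.
    tokens = tf.sparse_tensor_to_dense(parsed["tokens"], default_value=0)
    return tokens, parsed["label"]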
I've been looking for a way to 'productionize' R or Python based Random Forest / gradient boosted tree models, and had thought that since all the individual component decision trees are binary trees, exporting to a graph database might be a workable solution (deploying by holding the models in memory and invoking them from a lightweight RESTful library like Flask doesn't scale that well). Here's how a decision tree is normally traversed:
1.) Data gets passed to the root node.
2.) We check whether the present node is a leaf node; if it is, we return a set of attributes (the predicted distribution/value).
3.) If not, the node stores a decision rule and checks the relevant column to determine which node the data should be passed to next (e.g., "if age > 9.5, move to the left node").
4.) Repeat steps 2-3.
I'm new to neo4j and graph databases in general, and it wasn't clear to me whether it is possible to store (and subsequently traverse) decision rules in a node; all the examples I saw tended to be in the vein of
MATCH (neo:Database {name:"Neo4j"})
MATCH (johan:Person {name:"Johan"})
CREATE (johan)-[:FRIEND]->(:Person:Expert {name:"Max"})-[:WORKED_WITH]-> (neo)
where the conditional statements are prespecified in a query. Is this something which is feasible with neo4j, and if so, which areas of the documentation should I be focusing on?
Thank you for any guidance you could provide.
Interesting problem.
You need a way to export a model out of R or Python and translate that into a Neo4J graph.
The export mechanism can be PMML (if you're using the R rpart package to generate pruned trees), Google protobuf (if you're using the R gbm package to generate trees), or simply an Excel spreadsheet.
Parsing that export and loading it into Neo4j is then your issue.
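As a rough illustration of that loading step, one could store each tree node's decision rule as plain node properties and the branches as relationships, then evaluate the rules in application code while following the relationships. A sketch using the official Neo4j Python driver (labels, property names, credentials and values are all placeholders):

from neo4j import GraphDatabase   # official Neo4j Python driver

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def insert_node(tx, node_id, feature, threshold, value):
    # Internal nodes store the decision rule as plain properties; leaves store
    # the prediction. Labels and property names are illustrative only.
    tx.run("MERGE (n:TreeNode {id: $id}) "
           "SET n.feature = $feature, n.threshold = $threshold, n.value = $value",
           id=node_id, feature=feature, threshold=threshold, value=value)

def link(tx, parent_id, child_id, branch):
    tx.run("MATCH (p:TreeNode {id: $pid}), (c:TreeNode {id: $cid}) "
           "MERGE (p)-[:CHILD {branch: $branch}]->(c)",
           pid=parent_id, cid=child_id, branch=branch)

with driver.session() as session:
    session.write_transaction(insert_node, 0, "age", 9.5, None)   # root: "if age > 9.5"
    session.write_transaction(insert_node, 1, None, None, 0.42)   # left leaf
    session.write_transaction(insert_node, 2, None, None, 0.87)   # right leaf
    session.write_transaction(link, 0, 1, "le")   # age <= 9.5
    session.write_transaction(link, 0, 2, "gt")   # age >  9.5

Traversal would then be a loop that reads the current node's feature and threshold properties, compares them against the incoming data row, and follows the CHILD relationship whose branch property matches, until it reaches a node with a value.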
I am not affiliated with Yhat in any way, but reading your question made me think of an alternative approach.
Yhat Science Ops
I don't know what that means for your team internally, but it seems like a pretty simple way to make a model easy to invoke via a basic API call.