Extract underlying xgboost.core.booster from H2O model - machine-learning

I am using Pysparkling, and I think Pysparkling/H2O uses regular xgboost. In regular dmlc xgboost, there is a get_booster() method which allows me to extract the underlying booster. But I did not find any such method on the H2O/Pysparkling model object. Is it possible to do the same thing for a Pysparkling model?

Related

When to use mlflow.set_tag() vs mlflow.log_params()?

I am confused about the use case of mlflow.set_tag() vs mlflow.log_params(), as both take key-value pairs. Currently, I use mlflow.set_tag() to set tags for data version, code version, etc. and mlflow.log_params() to set model training parameters like loss, accuracy, optimizer, etc.
As teedak8s pointed out in the comments, tags and params are supposed to log different things. Params are something you want to tune based on the metrics, whereas tags are some extra information that doesn't necessarily associate with the model's performance. See how they use tags and params differently for sklearn, torch, and other packages in the Automatic Logging. That being said, as I understand it, there's no hard constraint on which to use to log which; they can be used interchangeably without error.

Is there a well-defined difference between "normalizing" and "canonicalizing" data?

I understand canonicalization and normalization to mean removing any non-meaningful or ambiguous parts of a data's presentation, turning effectively identical data into actually identical data.
For example, if you want to get the hash of some input data and it's important that anyone else hashing the canonically same data gets the same hash, you don't want one file indenting with tabs and the other using spaces (and no other difference) to cause two very different hashes.
In the case of JSON:
object properties would be placed in a standard order (perhaps alphabetically)
unnecessary white spaces would be stripped
indenting either standardized or stripped
the data may even be re-modeled in an entirely new syntax, to enforce the above
Is my definition correct, and are the terms interchangeable? Or is there a well-defined and specific difference between canonicalization and normalization of input data?
"Canonicalize" & "normalize" (from "canonical (form)" & "normal form") are two related general mathematical terms that also have particular uses in particular contexts per some exact meaning given there. It is reasonable to label a particular process by one of those terms when the general meaning applies.
Your characterizations of those specific uses are fuzzy. The formal meanings for general & particular cases are more useful.
Sometimes given a bunch of things we partition them (all) into (disjoint) groups, aka equivalence classes, of ones that we consider to be in some particular sense similar or the same, aka equivalent. The members of a group/class are the same/equivalent according to some particular equivalence relation.
We pick a particular member as the representative thing from each group/class & call it the canonical form for that group & its members. Two things are equivalent exactly when they are in the same equivalence class. Two things are equivalent exactly when their canonical forms are equal.
A normal form might be a canonical form or just one of several distinguished members.
To canonicalize/normalize is to find or use a canonical/normal form of a thing.
From Wikipedia, "Canonical form":
The distinction between "canonical" and "normal" forms varies by subfield. In most fields, a canonical form specifies a unique representation for every object, while a normal form simply specifies its form, without the requirement of uniqueness.
Applying the definition to your example: do you have a bunch of values that you are partitioning, and are you picking some member(s) of each class in place of the other members of that class? Well, you have JSON values, and short of re-modeling them you are partitioning them according to which same-class member they map to under a function. So you can reasonably call the resulting JSON values canonical forms of the inputs. If you characterize re-modeling as applicable to all inputs, then you can also reasonably call the post-re-modeling forms of those canonical values canonical forms of the re-modeled input values. But if not, people probably won't complain that you call the re-modeled values canonical forms of the input values, even though technically they wouldn't be.
Consider a set of objects, each of which can have multiple representations. From your example, that would be the set of JSON objects and the fact that each object has multiple valid representations, e.g., each with different permutations of its members, less white spaces, etc.
Canonicalization is the process of converting any representation of a given object to one and only one representation, unique per object (a.k.a. its canonical form). To test whether two representations are of the same object, it suffices to test equality on their canonical forms. See also Wikipedia's definition.
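As a minimal sketch of the idea for the JSON example from the question, the snippet below picks one textual form per JSON value (sorted keys, no optional whitespace) and compares or hashes that form. The function names are made up for illustration, and this is not a complete canonicalization scheme: for instance, it does nothing about numbers that are written differently but mean the same value.

```python
import hashlib
import json


def canonical_json(value):
    """Serialize a JSON-compatible value to one fixed textual form.

    Keys are sorted and all optional whitespace is dropped, so two inputs
    that differ only in key order or formatting produce the same string.
    """
    return json.dumps(value, sort_keys=True, separators=(",", ":"), ensure_ascii=True)


def canonical_hash(text):
    """Parse JSON text, canonicalize it, and hash the canonical bytes."""
    canonical = canonical_json(json.loads(text))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Two representations of the same object: different key order and indentation.
doc_tabs = '{\n\t"b": [1, 2],\n\t"a": "x"\n}'
doc_spaces = '{"a":   "x",    "b": [1,2]}'

# Equality of representations reduces to equality of canonical forms.
same_object = canonical_json(json.loads(doc_tabs)) == canonical_json(json.loads(doc_spaces))
same_hash = canonical_hash(doc_tabs) == canonical_hash(doc_spaces)
```

Hashing the raw texts directly would give two different digests here, which is exactly the tabs-versus-spaces problem described in the question.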
Normalization is the process of converting any representation of a given object to a set of representations (a.k.a. "normal forms") that is unique per object. In such a case, equality between two representations is tested by "subtracting" their normal forms and comparing the result with a normal form of "zero" (typically a trivial comparison). Normalization may be a better option when canonical forms are difficult to implement consistently, e.g., because they depend on arbitrary choices (like the ordering of variables).
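One way to read the "subtract and compare with zero" test is with a toy example that is not from the answer itself: fractions stored as unreduced (numerator, denominator) pairs. Many pairs denote the same number, so pairs cannot be compared directly, but zero is easy to recognize in any of them. The type alias and function names below are hypothetical.

```python
from math import gcd
from typing import Tuple

Ratio = Tuple[int, int]  # (numerator, denominator), denominator != 0


def subtract(x: Ratio, y: Ratio) -> Ratio:
    """Return x - y without reducing the result to lowest terms."""
    (a, b), (c, d) = x, y
    return (a * d - c * b, b * d)


def is_zero(x: Ratio) -> bool:
    """Zero is recognizable in every representation: the numerator is 0."""
    return x[0] == 0


def equivalent(x: Ratio, y: Ratio) -> bool:
    """Two pairs denote the same number when their difference is zero."""
    return is_zero(subtract(x, y))


def canonical(x: Ratio) -> Ratio:
    """For contrast: the unique form with lowest terms and a positive denominator."""
    n, d = x
    if d < 0:
        n, d = -n, -d
    g = gcd(n, d)
    return (n // g, d // g)


half_a, half_b = (1, 2), (2, 4)
same_by_zero_test = equivalent(half_a, half_b)
same_by_canonical_form = canonical(half_a) == canonical(half_b)
```

Both tests agree here; the difference is that `equivalent` never has to commit to a single preferred representative, while `canonical` does.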
Section 1.2 of the book "A=B" has some really good examples of both concepts.

Data Layer Convention

I am currently defining a data layer definition/convention that is to be used at a large organisation.
So every time someone defines new tags or collects some sort of information from a web page, they should follow the convention.
It covers variable naming, values, type description and when to use.
The convention is later to be used with GTM/Tealium iQ but it should be tool agnostic.
What is the best way, from a technical perspective, to define the data layer schema? I am thinking of Swagger or JSON Schema. Any thoughts?
It's important to define your data layer in a way that works for your organisation. That said, the best data layers have an easy-to-understand naming convention, are generally not nested, and contain good-quality data.
A good tag manager will be able to read your data layer in whatever format you would like, whether this is out of the box or a converter which runs before tag execution.
Here is Tealium's best practice:
https://community.tealiumiq.com/t5/Data-Layer/Data-Layer-Best-Practices/ta-p/15987
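To make the "clear naming convention" and "not nested" points concrete, here is a small tool-agnostic sketch of a convention check. The specific rules (lowercase snake_case names, scalar values or lists of scalars) and all names in the code are assumptions chosen for illustration, not rules taken from Tealium or GTM; the same rules could likely also be written down as a JSON Schema document.

```python
import re

# Hypothetical convention: lowercase snake_case names, flat scalar values.
NAME_PATTERN = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")
SCALAR_TYPES = (str, int, float, bool)


def check_data_layer(data_layer):
    """Return a list of human-readable convention violations (empty if none)."""
    problems = []
    for name, value in data_layer.items():
        if not NAME_PATTERN.match(name):
            problems.append(f"{name}: name is not lowercase snake_case")
        if isinstance(value, dict):
            problems.append(f"{name}: nested objects are not allowed")
        elif isinstance(value, list):
            if not all(isinstance(item, SCALAR_TYPES) for item in value):
                problems.append(f"{name}: list items must be scalars")
        elif not isinstance(value, SCALAR_TYPES):
            problems.append(f"{name}: unsupported value type {type(value).__name__}")
    return problems


# Hypothetical data layers: one that follows the convention, one that does not.
good = {"page_name": "home", "product_id": ["123", "456"], "order_total": 19.99}
bad = {"pageName": "home", "product": {"id": "123"}}

good_problems = check_data_layer(good)
bad_problems = check_data_layer(bad)
```

A check like this can run in a test suite or a pre-release step, independent of whichever tag manager later reads the data layer.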

Missing Values in WEKA output

I'm trying to compare J48 and MLP on a variety of datasets using WEKA. One of these is: https://archive.ics.uci.edu/ml/datasets/primary+tumor. I have converted this to CSV form which can be easily imported into WEKA. You can download this file here: https://ufile.io/8nj13
I used the "numeric to nominal" on the class and all the attributes to fit the natural structure of the data. However, when I ran J48 (and MLP), I got a bunch of question marks "?" in my output, presumably due to not having enough observations/instances of the appropriate type.
How can I get around this? I'm sure there must be a filter for this kind of thing. I've attached a picture below.
The detailed accuracy table displays a question mark because no instance was actually classified as that specific class. For example, since no instance was classified as class 16, WEKA cannot compute the per-class details for class 16 predictions. This image might help you understand.
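The reason such a value is undefined rather than zero can be shown with a short sketch. This is not WEKA's code, only an illustration with made-up labels: precision is TP / (TP + FP), and when nothing is predicted as a class the denominator is zero.

```python
from collections import Counter


def per_class_precision(actual, predicted, classes):
    """Return {class: precision}, with None where precision is undefined.

    Precision is TP / (TP + FP). When no instance is predicted as a class,
    the denominator is zero, so there is no value to report.
    """
    predicted_counts = Counter(predicted)
    true_positives = Counter(a for a, p in zip(actual, predicted) if a == p)
    result = {}
    for cls in classes:
        if predicted_counts[cls] == 0:
            result[cls] = None  # the case a report would have to show as "?"
        else:
            result[cls] = true_positives[cls] / predicted_counts[cls]
    return result


# Hypothetical labels: class "c" occurs in the data but is never predicted.
actual = ["a", "a", "b", "b", "c"]
predicted = ["a", "b", "b", "b", "a"]
precision = per_class_precision(actual, predicted, classes=["a", "b", "c"])
```

Here classes "a" and "b" get a numeric precision, while "c" has none because the classifier never predicted it.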
Regarding the number of instances of each class, you can use the ClassBalancer filter, found at weka/filters/supervised/instance/ClassBalancer. This should help balance out the weight of the various classes.
Also note that your dataset contains some missing values. This can be handled by either discarding the instances with missing data or running the ReplaceMissingValues filter, found at weka/filters/unsupervised/attribute/ReplaceMissingValues.

Does the coder we select significantly affect performance?

I'm having trouble understanding the purpose of "coders". My understanding is that we choose coders in order to "teach" dataflow how a particular object should be encoded in byte format and how equality and hash code should be evaluated.
By default, and perhaps by mistake, I tend to put the words "implements Serializable" on almost all my custom classes. This has the advantage that Dataflow tends not to complain. However, because some of these classes are huge objects, I'm wondering if performance suffers, and whether I should instead implement a custom coder in which I specify exactly which one or two fields can be used to determine equality, hash code, etc. Does this make sense? Put another way, does creating a custom coder (which may only use one or two small primitive fields) instead of the default serializable coder improve performance for very large classes?
Java serialization is very slow compared to other forms of encoding, and can definitely cause performance problems. However, only serializing part of your object means that the rest of the object will be dropped when it is sent between processes.
Much better than using Serializable, and pretty much just as easy, is to use AvroCoder by annotating your classes with
@DefaultCoder(AvroCoder.class)
This will automatically deduce an Avro schema from your class. Note that this does not work for generic types, so you'll likely want to use a custom coder in that case.
