How do i write the syntax to fit a model using glmer with fixed and random effects? - glm

How do i write the syntax to use glmer() to fit a model with the following fixed and random effects:
#Fixed effects: "rack" interacting with "amd", and "nutrient" (not interacting with any other variable)
#Random effects: "gen" nested within "popu" as a random intercept
Dataset used: "Arabidopsis" dataset from the "lme4" package.
d2=(Arabidopsis)
str(data)
d2$rack=as.factor(d2$rack)
d2$nutrient=as.factor(d2$nutrient)
str(d2)
mod2.1=glmer(total.fruits~rack*amd+nutrient+(1|popu/gen),family=poisson,nAGQ=1, data=d2)
Is this correct?

Related

How to create a language model with 2 different heads in huggingface?

I know I can create a language model with 1 head:
from transformers import AutoModelForMultipleChoice
model = AutoModelForMultipleChoice.from_pretrained("distilbert-base-cased").to(device)
But how can I create the same base model structure (e.g., distilbert-base-cased) with 2 heads? Say, one is AutoModelForMultipleChoice and the second is AutoModelForSequenceClassification. I need the only difference between the 2 models (1 head vs 2 heads) to be the additional head (from parameters perspective).
So now my input for the 2 heads model is something like [sequence_label, multiple_choice_labels]
In general case you will need to create a custom class derived from the DistilBertPreTrainedModel. Inside __init__() you will need to define your desired heads architectures. Then you will need to create your own forward() function and define inside it a custom loss involving both heads, and return result.
But if you are talking specifically about DistilBertForMultipleChoice and DistilBertForSequenceClassification, there is a shortcut, as the heads architecture happen to be identical (see source) and the difference is only in loss function. So you can try to train your model as multi label sequence classification problem, where the label per sequence will be [sequence_label, multiple_choice_label_0, multiple_choice_label_1, ...] . For example, in case you have an entry like {sequence, choice0, choice1, seq_label:True, correct_choice:0}
your dataset will be
[ {'text':(sequence, choice0), 'label':(1 1 0)},
{'text':(sequence, choice1), 'label':(1 0 0)} ]
This way the result of the sequence classification will be in the first position and to get the correct choice probability you will need to apply softmax function on the rest of the logits.

How to use inverse dynamics controller on under-actuated systems

I'm trying to use Drake's inverse dyanmics controller on an arm with a floating base, and based on this discussion it seems like the most straightforward way to go about this is to use two separate plants since the controller only supports fully actuated systems.
Following Python bindings error when adding two plants to a scene graph in pyDrake, I attempted to create two plants using the following code:
def register_plant_with_scene_graph(scene_graph, plant):
plant.RegsterAsSourceForSceneGraph(scene_graph)
builder.Connect(
plant.get_geometry_poses_output_port(),
scene_graph.get_source_pose_port(plant.get_source_id()),
)
builder.Connect(
scene_graph.get_query_output_port(),
plant.get_geometry_query_input_port(),
)
builder = DiagramBuilder()
scene_graph = builder.AddSystem(SceneGraph())
plant_1 = builder.AddSystem(MultibodyPlant(time_step=0.0))
register_plant_with_scene_graph(scene_graph, plant_1)
plant_2 = builder.AddSystem(MultibodyPlant(time_step=0.0))
register_plant_with_scene_graph(scene_graph, plant_2)
which produced the error
AttributeError: 'MultibodyPlant_[float]' object has no attribute 'RegsterAsSourceForSceneGraph'
Which seems odd because according to the documentation, the function should exist.
Is this function available in the python bindings for drake? Also, more broadly, is this the correct way to approach using the inverse dynamics controller on a free-floating manipulator?
Inverse dynamics takes desired positions, velocities, and accelerations and computes the required torques. If your robot has a floating base, then you cannot accept arbitrary acceleration commands. For instance the total center of mass of your robot will be falling according to gravity; any acceleration that does not satisfy this requirement will not have a feasible solution to the inverse dynamics. I think there must be something more that we need to understand about your problem formulation.
Often when people ask this question, they are thinking of a robot that is relying on contact forces in addition to generalized force/torques in order to achieve the requested accelerations. In that case, the problem needs to include those contact forces as decision variables, too. Since contact forces have unilateral constraints (e.g. feet cannot pull on the ground), and friction cone constraints, this inverse dynamics problem is almost always formulated as a quadratic program. For instance, as in this paper. We don't currently provide that QP formulation in Drake, but it is not hard to write it against the MathematicalProgram interface. And we do have some older code that was removed from Drake (since it wasn't actively developed) that we can point you to if it helps.

How to Implement Integrated Random Walk Trend Component in Tensorflow Probability

I'm running tensorflow 2.1 and tensorflow_probability 0.9. I have fit a Structural Time Series Model with a seasonal component.
I wish to implement an Integrated Random walk in order to smooth the trend component, as per Time Series Analysis by State Space Methods: Second Edition, Durbin & Koopman. The integrated random walk is achieved by setting the level component variance to equal 0.
Is implementing this constraint possible in Tensorflow Probability?
Further to this in Durbin & Koopman, higher order random walks are discussed. Could this be implemented?
Thanks in advance for your time.
If I understand correctly, an integrated random walk is just the special case of LocalLinearTrend in which the level simply integrates the randomly-evolving slope component (ie it has no independent source of variation). You could patch this in by subclassing LocalLinearTrend and fixing level_scale = 0. in the models it builds:
class IntegratedRandomWalk(sts.LocalLinearTrend):
def __init__(self,
slope_scale_prior=None,
initial_slope_prior=None,
observed_time_series=None,
name=None):
super(IntegratedRandomWalk, self).__init__(
slope_scale_prior=slope_scale_prior,
initial_slope_prior=initial_slope_prior,
observed_time_series=observed_time_series,
name=None)
# Remove 'level_scale' parameter from the model.
del self._parameters[0]
def _make_state_space_model(self,
num_timesteps,
param_map,
initial_state_prior=None,
initial_step=0):
# Fix `level_scale` to zero, so that the level
# cannot change except by integrating the
# slope.
param_map['level_scale'] = 0.
return super(IntegratedRandomWalk, self)._make_state_space_model(
num_timesteps=num_timesteps,
param_map=param_map,
initial_state_prior=initial_state_prior,
initial_step=initial_step)
(it would be mathematically equivalent to build a LocalLinearTrend with level_scale_prior concentrated at zero, but that constraint makes inference difficult, so it's generally better to just ignore or remove the parameter entirely, as I did here).
By higher-order random walks, do you mean autoregressive models? If so, sts.Autoregressive might be relevant.

Is it possible to split a dataset in Google Dataprep? If so, how?

I've been looking into Google Dataprep as an ETL solution to perform some basic data transformation before feeding it to a machine learning platform. I'm wondering if it's possible to use the Dataprep/Dataflow tools to split a dataset into train, test, and validation sets. Ideally I'm looking to do a stratified split on a target column, but for starters I'd settle for a simple uniform random split by percent of whole (e.g. 50% train, 30% validation, 20% test).
So far I haven't been able to find anything about whether this is even possible with Dataprep, so I'm wondering if anyone knows definitively if this is possible and, if so, how to accomplish it.
EDIT 1
Thanks #jakub-janoštík for getting me going in the right direction! I modified your answer slightly and came up with the following (in wrangle form):
case condition: customConditions cases: [false,0] default: rand() as: 'split_condition'
case condition: customConditions cases: [split_condition < 0.6,'train'],[split_condition >= 0.8,'test'] default: 'validation' as: 'dataset_type'
drop col: split_condition action: Drop
By assigning random values in a separate step, I got the guaranteed percentage split I was looking for. The flow ended up looking like this:
Image: final flow diagram with dataset splitting
EDIT 2
I just figured out how to do the stratified split too, so I thought I'd add it in case anyone else is trying to do this. Here's the rough steps:
Split your dataset based on whatever subpopulations you're targeting (e.g. target0, target1)
For each subpopulation, do the uniform random split described above (e.g. now you have target0-train, target0-test, target0-validation, target1-train, etc.)
For each set type (i.e. train, test, validation):
Create a new recipe from one of the sets
Edit the recipe, and use the Union transform to merge it with other datasets of the same type (e.g. target0-train union with target1-train). The union button is in the middle of the toolbar on the Edit Recipe page.
I hope that's helpful to someone!
I'm looking at the same problem and I was able to partially solve this using "case on custom condition" and "Random" functions. What I do is that I create new column named target and apply following logic:
After applying this you'll have new column with these 3 new labels and you can generate 3 new datasets by applying row filtering rules based on those values. Thing to keep in mind is that each time you'll run the job you'll get different validation set. So if you want to keep it fixed you need to use the dataset created in first run as input for future runs (and randomise only train and test sets).
If you need more control on the distribution of labels in your datasets there is ROWNUMBER window function that could potentially be used. But I haven't been able to make it work yet.

save binarizer together with sklearn model

I'm trying to build a service that has 2 components. In component 1, I train a machine learning model using sklearn by creating a Pipeline. This model gets serialized using joblib.dump (really numpy_pickle.dump). Component 2 runs in the cloud, loads the model trained by (1), and uses it to label text that it gets as input.
I'm running into an issue where, during training (component 1) I need to first binarize my data since it is text data, which means that the model is trained on binarized input and then makes predictions using the mapping created by the binarizer. I need to get this mapping back when (2) makes predictions based on the model so that I can output the actual text labels.
I tried adding the binarizer to the pipeline like this, thinking that the model would then have the mapping itself:
p = Pipeline([
('binarizer', MultiLabelBinarizer()),
('vect', CountVectorizer(min_df=min_df, ngram_range=ngram_range)),
('tfidf', TfidfTransformer()),
('clf', OneVsRestClassifier(clf))
])
But I get the following error:
model = p.fit(training_features, training_tags)
*** TypeError: fit_transform() takes 2 positional arguments but 3 were given
My goal is to make sure the binarizer and model are tied together so that the consumer knows how to decode the model's output.
What are some existing paradigms for doing this? Should I be serializing the binarizer together with the model in some other object that I create? Is there some other way of passing the binarizer to Pipeline so that I don't have to do that, and would I be able to get the mappings back from the model if I did that?
Your intuition that you should add the MultiLabelBinarizer to the pipeline was the right way to solve this problem. It would have worked, except that MultiLabelBinarizer.fit_transform does not take the fit_transform(self, X, y=None) method signature which is now standard for sklearn estimators. Instead, it has a unique fit_transform(self, y) signature which I had never noticed before. As a result of this difference, when you call fit on the pipeline, it tries to pass training_tags as a third positional argument to a function with two positional arguments, which doesn't work.
The solution to this problem is tricky. The cleanest way I can think of to work around it is to create your own MultiLabelBinarizer that overrides fit_transform and ignores its third argument. Try something like the following.
class MyMLB(MultiLabelBinarizer):
def fit_transform(self, X, y=None):
return super(MultiLabelBinarizer, self).fit_transform(X)
Try adding this to your pipeline in place of the MultiLabelBinarizer and see what happens. If you're able to fit() the pipeline, the last problem that you'll have is that your new MyMLB class has to be importable on any system that will de-pickle your now trained, pickled pipeline object. The easiest way to do this is to put MyMLB into its own module and place a copy on the remote machine that will be de-pickling and executing the model. That should fix it.
I misunderstood how the MultiLabelBinarizer worked. It is a transformer of outputs, not of inputs. Not only does this explain the alternative fit_transform() method signature for that class, but it also makes it fundamentally incompatible with the idea of inclusion in a single classification pipeline which is limited to transforming inputs and making predictions of outputs. However, all is not lost!
Based on your question, you're already comfortable with serializing your model to disk as [some form of] a .pkl file. You should be able to also serialize a trained MultiLabelBinarizer, and then unpack it and use it to unpack the outputs from your pipeline. I know you're using joblib, but I'll write this up this sample code as if you're using pickle. I believe the idea will still apply.
X = <training_data>
y = <training_labels>
# Perform multi-label classification on class labels.
mlb = MultiLabelBinarizer()
multilabel_y = mlb.fit_transform(y)
p = Pipeline([
('vect', CountVectorizer(min_df=min_df, ngram_range=ngram_range)),
('tfidf', TfidfTransformer()),
('clf', OneVsRestClassifier(clf))
])
# Use multilabel classes to fit the pipeline.
p.fit(X, multilabel_y)
# Serialize both the pipeline and binarizer to disk.
with open('my_sklearn_objects.pkl', 'wb') as f:
pickle.dump((mlb, p), f)
Then, after shipping the .pkl files to the remote box...
# Hydrate the serialized objects.
with open('my_sklearn_objects.pkl', 'rb') as f:
mlb, p = pickle.load(f)
X = <input data> # Get your input data from somewhere.
# Predict the classes using the pipeline
mlb_predictions = p.predict(X)
# Turn those classes into labels using the binarizer.
classes = mlb.inverse_transform(mlb_predictions)
# Do something with predicted classes.
<...>
Is this the paradigm for doing this? As far as I know, yes. Not only that, but if you desire to keep them together (which is a good idea, I think) you can serialize them as a tuple as I did in the example above so they stay in a single file. No need to serialize a custom object or anything like that.
Model serialization via pickle et al. is the sklearn approved way to save estimators between runs and move them between computers. I've used this process successfully many times before, including in productions systems with success.

Resources