Including interaction terms without main effects in rms::cph - interaction

I need to retain an interaction term but excluding main effects in a Cox model with rms::cph.
With survival::coxph i usually do
coxph(Surv(time, death)~a:b) instead of coxph(Surv(time, death)~a*b)
but this is not working with Harrel's rms::cph.
Any suggestion?

rms does not support this and I would not recommend it in general. If you need a special model fitted where you have paid enormous attention to the zero origin of each variable you can create product terms before passing to cph.

Related

How to use inverse dynamics controller on under-actuated systems

I'm trying to use Drake's inverse dyanmics controller on an arm with a floating base, and based on this discussion it seems like the most straightforward way to go about this is to use two separate plants since the controller only supports fully actuated systems.
Following Python bindings error when adding two plants to a scene graph in pyDrake, I attempted to create two plants using the following code:
def register_plant_with_scene_graph(scene_graph, plant):
plant.RegsterAsSourceForSceneGraph(scene_graph)
builder.Connect(
plant.get_geometry_poses_output_port(),
scene_graph.get_source_pose_port(plant.get_source_id()),
)
builder.Connect(
scene_graph.get_query_output_port(),
plant.get_geometry_query_input_port(),
)
builder = DiagramBuilder()
scene_graph = builder.AddSystem(SceneGraph())
plant_1 = builder.AddSystem(MultibodyPlant(time_step=0.0))
register_plant_with_scene_graph(scene_graph, plant_1)
plant_2 = builder.AddSystem(MultibodyPlant(time_step=0.0))
register_plant_with_scene_graph(scene_graph, plant_2)
which produced the error
AttributeError: 'MultibodyPlant_[float]' object has no attribute 'RegsterAsSourceForSceneGraph'
Which seems odd because according to the documentation, the function should exist.
Is this function available in the python bindings for drake? Also, more broadly, is this the correct way to approach using the inverse dynamics controller on a free-floating manipulator?
Inverse dynamics takes desired positions, velocities, and accelerations and computes the required torques. If your robot has a floating base, then you cannot accept arbitrary acceleration commands. For instance the total center of mass of your robot will be falling according to gravity; any acceleration that does not satisfy this requirement will not have a feasible solution to the inverse dynamics. I think there must be something more that we need to understand about your problem formulation.
Often when people ask this question, they are thinking of a robot that is relying on contact forces in addition to generalized force/torques in order to achieve the requested accelerations. In that case, the problem needs to include those contact forces as decision variables, too. Since contact forces have unilateral constraints (e.g. feet cannot pull on the ground), and friction cone constraints, this inverse dynamics problem is almost always formulated as a quadratic program. For instance, as in this paper. We don't currently provide that QP formulation in Drake, but it is not hard to write it against the MathematicalProgram interface. And we do have some older code that was removed from Drake (since it wasn't actively developed) that we can point you to if it helps.

RL: Self-Play with On-Policy and Off-Policy

I try to implement self play with PPO.
Suppose we have a game with 2 agents. We control one player on each side and get information like observation and reward after each step. As far as I know, you can use the information of the right and left player to generate training data and to optimize the model. But that is only possible for off-policy, isn't it?
Because with on-policy e.g. PPO, you expect that the training data to be generated by the current network version and that is usually not the case during self play?
Thanks!
Exactly, this is also the same reason why you can use experience-replay (Replay BUffers) only for off-policy methods like Q-learning. Using sample steps that were not generated by the current policy violates the mathematical assumptions behind the gradients that are being backpropagated.

State dependent action set in reinforcement learning

How do people deal with problems where the legal actions in different states are different? In my case I have about 10 actions total, the legal actions are not overlapping, meaning that in certain states, the same 3 states are always legal, and those states are never legal in other types of states.
I'm also interested in see if the solutions would be different if the legal actions were overlapping.
For Q learning (where my network gives me the values for state/action pairs), I was thinking maybe I could just be careful about which Q value to choose when I'm constructing the target value. (ie instead of choosing the max, I choose the max among legal actions...)
For Policy-Gradient type of methods I'm less sure of what the appropriate setup is. Is it okay to just mask the output layer when computing the loss?
There are two closely related works in recent two years:
[1] Boutilier, Craig, et al. "Planning and learning with stochastic action sets." arXiv preprint arXiv:1805.02363 (2018).
[2] Chandak, Yash, et al. "Reinforcement Learning When All Actions Are Not Always Available." AAAI. 2020.
Currently this problem seems to not have one, universal and straight-forward answer. Maybe because it is not that of an issue?
Your suggestion of choosing the best Q value for legal actions is actually one of the proposed ways to handle this. For policy gradients methods you can achieve similar result by masking the illegal actions and properly scaling up the probabilities of the other actions.
Other approach would be giving negative rewards for choosing an illegal action - or ignoring the choice and not making any change in the environment, returning the same reward as before. For one of my personal experiences (Q Learning method) I've chosen the latter and the agent learned what he has to learn, but he was using the illegal actions as a 'no action' action from time to time. It wasn't really a problem for me, but negative rewards would probably eliminate this behaviour.
As you see, these solutions don't change or differ when the actions are 'overlapping'.
Answering what you've asked in the comments - I don't believe you can train the agent in described conditions without him learning the legal/illegal actions rules. This would need, for example, something like separate networks for each set of legal actions and doesn't sound like the best idea (especially if there are lots of possible legal action sets).
But is the learning of these rules hard?
You have to answer some questions yourself - is the condition, that makes the action illegal, hard to express/articulate? It is, of course, environment-specific, but I would say that it is not that hard to express most of the time and agents just learn them during training. If it is hard, does your environment provide enough information about the state?
Not sure if I understand your question correctly, but if you mean that in certain states some actions are impossible then you simply reflect it in the reward function (big negative value). You can even decide to end the episode if it is not clear what state would the illegal action result in. The agent should then learn that those actions are not desirable in the specific states.
In exploration mode, the agent might still choose to take the illegal actions. However, in exploitation mode it should avoid them.
I recently built a DDQ agent for connect-four and had to address this. Whenever a column was chosen that was already full with tokens, I set the reward equivalent to losing the game. This was -100 in my case and it worked well.
In connect four, allowing an illegal move (effectively skipping a turn) can in some cases be advantageous for the player. This is why I set the reward equivalent to losing and not a smaller negative number.
So if you set the negative reward greater than losing, you'll have to consider in your domain what are the implications of allowing illegal moves to happen in exploration.

ELKI: Normalization undo for result

I am using the ELKI MiniGUI to run LOF. I have found out how to normalize the data before running by -dbc.filter, but I would like to look at the original data records and not the normalized ones in the output.
It seems that there is some flag called -normUndo, which can be set if using the command-line, but I cannot figure out how to use it in the MiniGUI.
This functionality used to exist in ELKI, but has effectively been removed (for now).
only a few normalizations ever supported this, most would fail.
there is no longer a well defined "end" with the visualization. Some users will want to visualize the normalized data, others not.
it requires carrying over normalization information along, which makes data structures more complex (albeit the hierarchical approach we have now would allow this again)
due to numerical imprecision of floating point math, you would frequently not get out the exact same values as you put in
keeping the original data in memory may be too expensive for some use cases, so we would need to add another parameter "keep non-normalized data"; furthermore you would need to choose which (normalized or non-normalized) to use for analysis, and which for visualization. This would not be hard with a full-blown GUI, but you are looking at a command line interface. (This is easy to do with Java, too...)
We would of course appreciate patches that contribute such functionality to ELKI.
The easiest way is this: Add a (non-numerical) label column, and you can identify the original objects, in your original data, by this label.

Does it help to duplicate original data in order to make more data for building model?

I just got an interview question.
"Assume you want to build a statistical or machine learning model, but you have very limited data on hand. Your boss told you can duplicate original data several times, to make more data for building the model" Does it help?
Intuitively, it does not help, because duplicating original data doesn't create more "information" to feed the model.
But is there anyone can explain it more statistically? Thanks
Consider e.g. variance. The data set with the duplicated data will have the exact same variance - you don't have a more precise estimate of the distrbution afterwards.
There are, however, some exceptions. For example bootstrap validation helps when evaluating your model, but you have very little data.
Well, it depends on exactly what one means by "duplicating the data".
If one is exactly duplicating the whole data set a number of times, then methods based on maximum likelihood (as with many models in common use) must find exactly the same result since the log likelihood function of the duplicated data is exactly a multiple of the unduplicated data's log likelihood, and therefore has the same maxima. (This argument doesn't apply to methods which aren't based on the likelihood function; I believe that CART and other tree models, and SVM's, are such models. In that case you'll have to work out a different argument.)
However, if by duplicating, one means duplicating the positive examples in a classification problem (which is common enough, since there are often many more negative examples than positive), then that does make a difference, since the likelihood function is modified.
Also if one means bootstrapping, then that, too, makes a difference.
PS. Probably you'll get more interest in this question on stats.stackexchange.com.

Resources