How to use previous values of calculation as the initial conditions in ABAQUS - abaqus

I am trying to implement a subroutine in ABAQUS.
It is a very simple non-linear elastic model in which the Young's modulus depends on the mean pressure: E=3*(1-2*poisson)*p/kap, where poisson=0.3 is Poisson's ratio and kap=0.005 is the swelling index. The initial stress is 1e5 Pa for sigma11, sigma22 and sigma33.
When I run the subroutine, it gives linear behavior with E=3*(1-2*0.3)*(3*1e5/3)/0.05 (the Young's modulus calculated from the initial stress). If the initial stress is 0 for all components, every result is 0 because E=3*(1-2*0.3)*(3*0/3)/0.05=0.
Could you help me solve this problem, i.e. how do I define the initial conditions as the previous values of each variable?


How does DQN work in an environment where reward is always -1

Given that the OpenAI Gym environment MountainCar-v0 ALWAYS returns -1.0 as a reward (even when goal is achieved), I don't understand how DQN with experience-replay converges, yet I know it does, because I have working code that proves it. By working, I mean that when I train the agent, the agent quickly (within 300-500 episodes) learns how to solve the mountaincar problem. Below is an example from my trained agent.
It is my understanding that ultimately there needs to be a "sparse reward" that is found. Yet as far as I can see from the openAI Gym code, there is never any reward other than -1. It feels more like a "no reward" environment.
What almost answers my question, but in fact does not: when the task is completed quickly, the return (sum of rewards) of the episode is larger. So if the car never finds the flag, the return is -1000. If the car finds the flag quickly the return might be -200. The reason this does not answer my question is because with DQN and experience replay, those returns (-1000, -200) are never present in the experience replay memory. All the memory has are tuples of the form (state, action, reward, next_state), and of course remember that tuples are pulled from memory at random, not episode-by-episode.
Another element of this particular OpenAI Gym environment is that the Done state is returned on either of two occasions: hitting the flag (yay) or timing out after some number of steps (boo). However, the agent treats both the same, accepting the reward of -1. Thus as far as the tuples in memory are concerned, both events look identical from a reward standpoint.
So, I don't see anything in the memory that indicates that the episode was performed well.
And thus, I have no idea why this DQN code is working for MountainCar.
The reason this works is because in Q-learning, your model is trying to estimate the SUM (technically the time-decayed sum) of all future rewards for each possible action. In MountainCar you get a reward of -1 every step until you win, so if you do manage to win, you’ll end up getting less negative reward than usual. For example, your total score after winning might be -160 instead of -200, so your model will start predicting higher Q-values for actions that have historically led to winning the game.
You are right. There is no direct association between the memory (experience replay) and the model's performance in terms of episode reward. The Q-value in DQN predicts each action's expected future reward at each step. The measure of how good your model is comes from the difference between the real reward and the expected reward (the TD error).
Using -1 for non-goal steps is a trick that pushes RL models towards actions that finish the episode sooner, because the Q-value is an action value. At each step, the model predicts a value for every possible move, and the policy (usually greedy or epsilon-greedy) chooses the action with the highest value. Imagine that at some moment going backward means 200 more steps to finish the episode, while going forward takes only 100. The Q-values will be -200 and -100 respectively (without discount). You might wonder how the model knows the value of each action: it learns it through repeated episodes and successive trial and error. The model is trained to minimise the difference between the real reward and the expected reward, i.e. the TD error.
In uniformly sampled experience replay, all experiences are sampled and deleted uniformly. In prioritized experience replay, however, experiences with a high estimated error are replayed more often. Usually, the priorities are proportional to the TD error (real_reward - expected_reward), the gap between the target Q-value and the current model's predicted Q-value. A larger priority means a more surprising experience, and replaying such experiences more often helps accelerate training.
You can check the idea in Prioritized Experience Replay, Schaul et al., 2016.
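As a rough illustration of priorities proportional to the TD error, here is a minimal sketch of the proportional variant, where each transition is sampled with probability p_i^alpha / sum_k p_k^alpha and p_i = |TD error| + eps. The function name and the values of alpha and eps are assumptions made for this example, not taken from any particular code base.

```python
def sampling_probabilities(td_errors, alpha=0.6, eps=1e-6):
    """Turn TD errors into replay sampling probabilities.

    alpha and eps are assumed example values: alpha=0 gives uniform
    sampling, and eps keeps zero-error transitions sampleable.
    """
    priorities = [(abs(delta) + eps) ** alpha for delta in td_errors]
    total = sum(priorities)
    return [p / total for p in priorities]


# A surprising transition (large TD error) next to two unsurprising ones.
probs = sampling_probabilities([799.0, 0.5, 0.0])
```

The transition with the large TD error receives most of the sampling mass, so it is replayed more often than the others.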
It might help to look at a reduced problem. Consider:
States:
───┬───┬───┬───┬───┐
...│L2 │L1 │ S │ R1│
───┴───┴───┴───┴───┘
Actions:
left or right
r = -1 for all states
episode terminates upon reaching R1 or after 2 steps
Memory:
(S, left, -1, L1) non-terminal
(S, right, -1, R1) terminal
(L1, left, -1, L2) terminal
(L1, right, -1, S) terminal
The obvious but important thing to note is that although the rewards are all identical, the states and actions are not. This information allows us to reason about the next state given the current one.
Let's look at the targets that we are updating towards (with no discount):
Q(S, left ) --> -1 + max{a}Q(L1, a)
Q(S, right) --> -1
Q(L1, left ) --> -1
Q(L1, right) --> -1
In this contrived example, only transition 1 presents an extra source of instability. But over time, as action value on L1 converges from having sampled enough of transitions 3 and 4, so should the target on transition 1.
At that point, when we encounter transition 1 again, we would have a better estimate Q(S, left) --> -1 + -1.
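The convergence described above can be reproduced with a small tabular sketch. It replays the four stored transitions in random order with no discount; the learning rate, the number of updates and the random seed are assumptions chosen for illustration, and a table stands in for the neural network that DQN would use.

```python
import random

# The four transitions from the memory above:
# (state, action, reward, next_state, terminal)
memory = [
    ("S", "left", -1.0, "L1", False),
    ("S", "right", -1.0, "R1", True),
    ("L1", "left", -1.0, "L2", True),
    ("L1", "right", -1.0, "S", True),
]
actions = ("left", "right")
q = {(s, a): 0.0 for s in ("S", "L1") for a in actions}

alpha = 0.5              # learning rate (assumed)
rng = random.Random(0)   # fixed seed so the run is repeatable

for _ in range(2000):
    s, a, r, s_next, terminal = rng.choice(memory)  # uniform replay
    if terminal:
        target = r                                   # no bootstrapping at episode end
    else:
        target = r + max(q[(s_next, b)] for b in actions)
    q[(s, a)] += alpha * (target - q[(s, a)])
```

After enough updates, the values at L1 settle near -1, which in turn lets Q(S, left) settle near -2, while Q(S, right) stays near -1. The agent therefore prefers right at S even though every stored reward is -1.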
It is not enough to look only at the rewards when asking how DQN learns from its memory, since it also uses the next observation to determine its current best estimate of the action value at the next step (relevant code), effectively linking everything together and slowly tallying up the rewards. It does so, however, in a much more unstable manner than traditional Q-learning.
As an exercise, consider extending this further and putting the terminal state on R2.
Then we can easily see that now max{a}Q(S, a) = -2: it takes the same amount of time to reach R2 as to simply time out, so it doesn't matter what we do (unless we start closer to R2). However, bump the number of timeout steps up and it should head towards R2 again. I.e. the mountain car also works because the timeout is set to a number larger than the number of time steps it takes to reach the goal. This path should eventually propagate its (negative but better) values back to our initial state.
Although this is true for any other environment, they can still at least learn to get closer to the goal before timing out provided they are rewarded for the effort. This is not so given the reward design in the mountain car environment.
In DQN you learn the Q-function, which basically approximates your return. In the memory, you store tuples with (s,a,s',r) and re-train your Q-function on those tuples. If, for a given tuple, you performed well (you reached the flag quickly) then you are going to re-experience it by re-using the tuple for training, because the Q-function is higher for that tuple.
Anyway, usually experience replay works better for any problem, not just for the mountain car.
You say:
those returns (-1000, -200) are never present in the experience replay memory.
What is present in the replay memory is a SARdS tuple with a done flag that tells you that the episode is finished. See the openai gym deepq example:
# Store transition in the replay buffer.
replay_buffer.add(obs, action, rew, new_obs, float(done))
In the final update, if float(done) == 1 then the future Q value is ignored. Hence at the end of the episode, the Q value is 0. If that happens on step 200, then the total return in the episode will be -200. If it happens on step 1000, then the total return will be -1000.
To put it a different way: if, completely at random, the episode ended on step 200, an agent that had been performing as badly as a random policy would have predicted a Q value (expected future return) of -800 at that point. When the episode then ends, the TD error is +799, representing the positive surprise of only losing another -1.
I note that the code you linked to does not seem to use the done flag in the replay buffer. Instead it relies on a final state of s_ == None to signify the end of the episode. Take that code out, and the agent won't learn.
if s_ is None:
    t[a] = r
else:
    t[a] = r + GAMMA * numpy.amax(p_[i])
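The same rule can be written with an explicit done flag, as in the replay buffer call quoted earlier. The sketch below is a hypothetical helper, not code from either linked project, and the default value of GAMMA is an assumed example.

```python
GAMMA = 0.99  # discount factor (assumed example value)


def td_targets(rewards, next_q_values, dones, gamma=GAMMA):
    """Compute one Q-learning target per sampled transition.

    next_q_values[i] holds the predicted Q-values of every action in the
    next state; it is ignored whenever dones[i] is true.
    """
    targets = []
    for r, next_q, done in zip(rewards, next_q_values, dones):
        bootstrap = 0.0 if done else gamma * max(next_q)
        targets.append(r + bootstrap)
    return targets


# Two transitions with the same reward; only the second ends the episode.
targets = td_targets([-1.0, -1.0], [[-5.0, -3.0], [-5.0, -3.0]],
                     [False, True], gamma=1.0)

# The surprise described above: predicted -800, actual target -1.
td_error = td_targets([-1.0], [[0.0, 0.0]], [True])[0] - (-800.0)
```

Both transitions carry a reward of -1, but their targets differ because the terminal one drops the future term. That difference is the signal that tells the agent where episodes end.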

Could you explain this question? I am new to ML and I faced this problem, but its solution is not clear to me

The problem is in the picture
Question's image:
Question 2
Many substances that can burn (such as gasoline and alcohol) have a chemical structure based on carbon atoms; for this reason they are called hydrocarbons. A chemist wants to understand how the number of carbon atoms in a molecule affects how much energy is released when that molecule combusts (meaning that it is burned). The chemist obtains the dataset below. In the column on the right, kJ/mole is the unit measuring the amount of energy released.
You would like to use linear regression (h_a(x) = a0 + a1 x) to estimate the amount of energy released (y) as a function of the number of carbon atoms (x). Which of the following do you think will be the values you obtain for a0 and a1? You should be able to select the right answer without actually implementing linear regression.
A) a0=−1780.0, a1=−530.9 B) a0=−569.6, a1=−530.9
C) a0=−1780.0, a1=530.9 D) a0=−569.6, a1=530.9
Since all the a0 values are negative but two of the a1 values are positive, let's figure out the latter first.
As you can see, as the number of carbon atoms increases the energy becomes more and more negative, so the relation cannot be positively correlated, which rules out options C and D.
Then, for the intercept, the value that produces the least error is the correct one. For x=1 and x=10 (the easiest to calculate), the outputs are about -2300 and -7000 for A, and about -1100 and -5900 for B, so one would prefer B over A.
PS: You might be thinking there should be obvious values for a0 and a1 from the data; there aren't. The intention of the question is to give you a general understanding of the best fit. Also, this way of solving it is kind of machine learning as well.
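The arithmetic for the two remaining options can be checked with a few lines. The dataset itself is not reproduced here, so this sketch only evaluates the candidate hypotheses at x=1 and x=10; comparing them against the table is left to the reader.

```python
# Candidate (a0, a1) pairs for the two options with a negative slope.
candidates = {"A": (-1780.0, -530.9), "B": (-569.6, -530.9)}


def h(a0, a1, x):
    """Linear hypothesis h_a(x) = a0 + a1 * x."""
    return a0 + a1 * x


predictions = {
    name: [h(a0, a1, x) for x in (1, 10)]
    for name, (a0, a1) in candidates.items()
}
```

Both options share the same slope, so they differ by a constant offset of about 1210 kJ/mole at every x. Whichever one lands closer to the first and last rows of the table is the better fit.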

Is it possible to solve a fractional knapsack including negative values using greedy algorithm?

I have a problem which I think can be converted to a variant of
fractional knapsack problem.
The objective function is in the form of:
$\sum_{i} x_iv_i$
However, my problem differs in that it allows the $v_i$ and $x_i$ to be negative.
I want to prove that this problem can be solved using the greedy algorithm (explained in the link).
I have tested this for many test cases and greedy algorithm seems to solve it, but I want a definite
proof that greedy algorithm is still applicable given the extra constraint.
In the fractional knapsack problem, you find the Value/Weight ratio of every item that you may put in the knapsack, and sort these items from the best V/W ratio to the worst. You then start with the best ratio and fill the knapsack until it is either full or you run out of that item. If you run out, you move on to the next item in the list and fill the knapsack with it. This pattern continues until the knapsack is full. It is greedy because, once the list is sorted, we can confidently add the items fractionally in this order and know that we will end with the greatest possible value in the bag.
By allowing the values and "weights" to be negative, as in this problem, however, the greedy algorithm no longer gives the optimal answer. It is ruined by the fact that an item could have a negative "weight" and a negative value, resulting in a positive V/W ratio. For example, take the following list of items:
V=-1, W=-1 -> V/W = 1.0
V=.9, W=1 -> V/W = 0.9
V=.8, W=1 -> V/W = 0.8
Following the greedy algorithm, we would want to add as much of item 1 as exists, because it has the best V/W ratio. However, adding item 1 really hurts us in the long run, because we lose more value per unit of weight than we can add back later on. For example, let's assume |W|=10 for each item and the max weight of the knapsack is 10. By adding all of item 1, we will have a weight of -10 and a value of -10. Then we add all of item 2, which results in a weight of 0 and a value of -1. Then we add all of item 3, which results in a weight of 10 and a value of 7.
If instead we just added all of item 2 from the start, we would have a weight of 10 and a value of 9. Therefore, by counterexample, the greedy algorithm is NOT optimal when weights and values can be negative.
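The counterexample can be checked numerically. The sketch below is a minimal greedy routine under the same assumptions as the example (|W|=10 for each item, capacity 10); the function name is hypothetical, and items are given as (total value, total weight) pairs with non-zero weight.

```python
def greedy_fractional_knapsack(items, capacity):
    """Fill the knapsack in descending value/weight order.

    items: list of (total_value, total_weight) pairs, weights non-zero.
    Returns (total_value, total_weight) of what was packed.
    """
    ordered = sorted(items, key=lambda item: item[0] / item[1], reverse=True)
    total_value = 0.0
    total_weight = 0.0
    for value, weight in ordered:
        room = capacity - total_weight
        # Take the whole item if it fits, otherwise only the part that fits.
        fraction = 1.0 if weight <= room else max(room, 0.0) / weight
        total_value += fraction * value
        total_weight += fraction * weight
    return total_value, total_weight


# The three items from the example, scaled so that |W| = 10.
items = [(-10.0, -10.0), (9.0, 10.0), (8.0, 10.0)]
greedy_value, greedy_weight = greedy_fractional_knapsack(items, capacity=10.0)
item2_only_value, _ = greedy_fractional_knapsack([items[1]], capacity=10.0)
```

Here the greedy order packs a total value of 7 at weight 10, while taking item 2 alone gives 9, which matches the hand calculation.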

What's the best objective function for the CartPole task?

I'm doing policy gradient and I'm trying to figure out what the best objective function is for the task. The task is the open ai CartPole-v0 environment in which the agent receives a reward of 1 for each timestep it survives and a reward of 0 upon termination. I'm trying to figure out which is the best way to model the objective function. I've come up with 3 possible functions:
def total_reward_objective_function(self, episode_data):
    return sum(timestep_data['reward'] for timestep_data in episode_data)

def average_reward_objective_function(self, episode_data):
    return self.total_reward_objective_function(episode_data) / len(episode_data)

def sum_of_discounted_rewards_objective_function(self, episode_data, discount_rate=0.7):
    return sum(timestep_data['reward'] * pow(discount_rate, timestep)
               for timestep, timestep_data in enumerate(episode_data))
Note that the average reward objective function will always return 1 unless I intervene and modify the reward function to return a negative value upon termination. The reason I'm asking rather than just running a few experiments is that there are errors elsewhere. So if someone could point me towards good practice in this area, I could focus on the more significant mistakes in the algorithm.
You should use the last one (sum of discounted rewards), since the cart-pole problem is an infinite horizon MDP (you want to balance the pole as long as you can). The answer to this question explains why you should use a discount factor in infinite horizon MDPs.
The first one, instead, is just an undiscounted sum of the rewards, which could be used if episodes have a fixed length (for instance, in the case of a robot performing a 10 seconds trajectory). The second one is usually used in finite horizon MDPs, but I am not very familiar with it.
For the cart-pole, a discount factor of 0.9 should work (or, depending on the algorithm used, you can search for scientific papers and see the discount factor used).
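To see what a discount factor of 0.9 does with the +1-per-step reward, here is a small sketch. The episode lengths are made-up examples, and the function is a standalone version of the discounted sum rather than part of any particular policy gradient implementation.

```python
def discounted_return(rewards, discount_rate=0.9):
    """Sum of rewards[t] * discount_rate**t over one episode."""
    return sum(r * discount_rate ** t for t, r in enumerate(rewards))


# Hypothetical episodes: the pole falls after 10 steps or after 200 steps.
short_episode = [1.0] * 10
long_episode = [1.0] * 200

short_return = discounted_return(short_episode)
long_return = discounted_return(long_episode)
```

With a reward of 1 at every step, the discounted sum grows with episode length but is bounded by 1 / (1 - 0.9) = 10. Surviving longer is therefore always better, yet very long episodes add little extra to the objective.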
A final note. The reward function you described (+1 at each timestep) is not the only one used in literature. A common one (and I think also the "original" one) gives 0 at each timestep and -1 if the pole falls. Other reward functions are related to the angle between the pole and the cart.

Using sklearn DictVectorizer in real-time systems

Any binary one-hot encoding is aware of only values seen in training, so features not encountered during fitting will be silently ignored. For real time, where you have millions of records in a second, and features have very high cardinality, you need to keep your hasher/mapper updated with the data.
How can we do an incremental update to the hasher (rather than recalculating the entire fit() every time we encounter a new feature-value pair)? What is the suggested approach to tackle this?
It depends on the learning algorithm that you are using. If you are using a method that has been designed for sparse data sets (FTRL, FFM, linear SVM), one possible approach is the following (note that it will introduce collisions in the features and a lot of constant columns).
First allocate for each element of your sample a (as large as possible) vector V, of length D.
For each categorical variable, evaluate hash(var_name + "_" + var_value) % D. This gives you an integer i, and you can store V[i] = 1.
Therefore, V never grows larger as new features appear. However, as soon as the number of features is large enough, some features will collide (i.e. be written at the same place) and this may result in an increased error rate...
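The hashing steps above can be sketched as follows. Python's built-in hash() is salted per process for strings, so this example swaps in an MD5 digest to keep indices stable across runs; that choice, the function name and the default D are assumptions. scikit-learn also ships a FeatureHasher class built on the same idea.

```python
import hashlib

D = 2 ** 20  # length of V (assumed example value)


def hash_features(record, dim=D):
    """Return the sorted indices i for which V[i] = 1.

    record maps each categorical variable name to its value. Only the
    non-zero positions are returned, so V is never materialised.
    """
    indices = set()
    for name, value in record.items():
        key = f"{name}_{value}".encode("utf-8")
        digest = hashlib.md5(key).digest()
        indices.add(int.from_bytes(digest[:8], "big") % dim)
    return sorted(indices)


# A value never seen before still gets a slot without any refitting.
row = hash_features({"city": "Paris", "device": "mobile"})
```

Nothing has to be fitted, so new feature values are handled on arrival. The trade-off is the collision risk already mentioned, which grows as D shrinks relative to the number of distinct features.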
Edit. You can write your own vectorizer to avoid collisions. First call L the current number of features. Prepare the same vector V of length 2L (the factor of 2 will allow you to avoid collisions as new features arrive, at least for some time, depending on the arrival rate of new features).
Starting with an empty dictionary<input_type,int>, associate an integer with each feature. If you have already seen the feature, return the int corresponding to the feature. If not, create a new entry with an integer corresponding to the new index. I think (but I am not sure) this is what LabelEncoder does for you.
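A minimal sketch of such a dictionary-based mapper is shown below. The class and method names are hypothetical, and resizing V once the index reaches 2L is left out.

```python
class IncrementalIndexer:
    """Assign a stable integer index to each feature the first time it is seen."""

    def __init__(self):
        self.index = {}

    def get(self, feature):
        # Known feature: return its index. New feature: take the next free one.
        if feature not in self.index:
            self.index[feature] = len(self.index)
        return self.index[feature]


indexer = IncrementalIndexer()
first = indexer.get(("color", "red"))
second = indexer.get(("color", "blue"))
repeat = indexer.get(("color", "red"))
```

Indices are handed out in order of first appearance and never change afterwards, so columns learned earlier keep their meaning as new features arrive.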
