What are some relevant A.I. techniques for programming a flock of entities? - machine-learning

Specifically, I am talking about programming for this contest: http://www.nodewar.com/about
The contest involves facing other teams with a swarm of spaceships in a two-dimensional world. The ships are constrained by a boundary (if they exit, they die), and they must continually avoid moons (which pull the ships in with their gravity). The goal is to kill the opposing queen.
I have attempted to program a few, relatively basic techniques, but I feel as though I am missing something fundamental.
For example, I implemented some of the boids behaviours (http://www.red3d.com/cwr/boids/), but they seemed to lack a... goal, so to speak.
Are there any common techniques (or, preferably, combinations of techniques) for this sort of game?
EDIT
I would just like to open this up again with a bounty since I feel like I'm still missing critical pieces of information. The following is my NodeWar code:
boundary_field = (o, position) ->
  distance = (o.game.moon_field - o.lib.vec.len(o.lib.vec.diff(position, o.game.center)))
  return distance

moon_field = (o, position) ->
  return o.lib.vec.len(o.lib.vec.diff(position, o.moons[0].pos))

ai.step = (o) ->
  torque = 0
  thrust = 0
  label = null
  fields = [boundary_field, moon_field]

  # Total the potential fields and determine a target.
  target = [0, 0]
  score = -1000000
  step = 1
  square = 1
  for x in [(-square + o.me.pos[0])..(square + o.me.pos[0])] by step
    for y in [(-square + o.me.pos[1])..(square + o.me.pos[1])] by step
      position = [x, y]
      continue if o.lib.vec.len(position) > o.game.moon_field
      value = (fields.map (f) -> f(o, position)).reduce (t, s) -> t + s
      target = position if value > score
      score = value if value > score

  label = target
  { torque, thrust } = o.lib.targeting.simpleTarget(o.me, target)
  return { torque, thrust, label }
I may have implemented potential fields incorrectly, however, since all the examples I could find are about discrete movements (whereas NodeWar is continuous and not exact).
The main problem is that my A.I. never stays within the game area for more than 10 seconds without flying off-screen or crashing into a moon.
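For reference, one way to adapt the field idea to a continuous world is to follow the local gradient of the summed fields rather than scoring a small grid of cells around the ship. A rough sketch with placeholder field functions (plain Python, not the NodeWar API):

import numpy as np

def boundary_field(position, center, radius):
    # Higher is better: distance left before hitting the boundary.
    return radius - np.linalg.norm(position - center)

def moon_field(position, moon_pos):
    # Higher is better: distance from the moon.
    return np.linalg.norm(position - moon_pos)

def total_field(position, center, radius, moon_pos):
    return boundary_field(position, center, radius) + moon_field(position, moon_pos)

def field_gradient(position, center, radius, moon_pos, eps=1e-3):
    # Central finite differences of the total field in x and y.
    grad = np.zeros(2)
    for i in range(2):
        step = np.zeros(2)
        step[i] = eps
        grad[i] = (total_field(position + step, center, radius, moon_pos)
                   - total_field(position - step, center, radius, moon_pos)) / (2 * eps)
    return grad

# Steer toward a point a short way up the gradient instead of toward the
# best cell of a 3x3 grid around the ship.
pos = np.array([10.0, 4.0])
target = pos + 5.0 * field_gradient(pos, center=np.array([0.0, 0.0]),
                                    radius=60.0, moon_pos=np.array([20.0, 0.0]))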

You can easily make the boids algorithm play the game of Nodewar by adding additional steering behaviours to your boids and/or modifying the default ones. For example, you would add a steering behaviour to avoid the moon, and a steering behaviour for enemy ships (repulsion or attraction, depending on the relative positions of your ship and the enemy ship). The weights of the attraction/repulsion forces should then be tweaked (possibly by genetic algorithms, playing different configurations against each other).
I believe this approach would already give you a relatively strong baseline player (see the sketch below), to which you can then start adding collaborative strategies etc.
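As an illustration of the weighted-sum idea (my own sketch in Python, with placeholder behaviours rather than NodeWar API calls):

import numpy as np

def seek(ship_pos, target_pos):
    return target_pos - ship_pos          # attraction

def flee(ship_pos, threat_pos):
    return ship_pos - threat_pos          # repulsion

def steering(ship_pos, queen_pos, moon_pos, enemy_pos, weights):
    behaviours = [
        seek(ship_pos, queen_pos),        # pursue the enemy queen
        flee(ship_pos, moon_pos),         # avoid the moon
        flee(ship_pos, enemy_pos),        # keep distance from an enemy ship
    ]
    total = sum(w * b for w, b in zip(weights, behaviours))
    norm = np.linalg.norm(total)
    return total / norm if norm > 0 else total

# The weight vector is what a genetic algorithm would tune by playing
# candidate weight sets against each other.
desired_direction = steering(np.array([0.0, 0.0]), np.array([5.0, 2.0]),
                             np.array([1.0, 1.0]), np.array([-2.0, 3.0]),
                             weights=[1.0, 2.0, 0.5])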

You can achieve the best possible flocking behaviour using control theory. We can start by expressing the state of a ship (a vector of positions and velocities) as variable X. We will use x to represent a small deviation from some state X0. For each node, linearize around the state you want the individual ships to be at:
d/dt(X) = f(X - X0)
where X0 is the state you want the ship to be at. f() can be nonlinear (as in the case of your potential field). Close to this point, the system will obey
d/dt(x) = Ax
A is a fixed matrix which you should tweak to achieve stability. Much of the field of control theory tells you how to do this. It's very important that A leads to a stable system and that it converges fast. You should now break out MATLAB or GNU Octave. You should also read up on what poles mean in control theory, and on Laplace transforms. The poles of the system (the eigenvalues of the matrix A) characterize the ship's response to disturbances of any kind, its stability, and its convergence speed. They will also tell you whether the ship will move towards X0 or oscillate around X0.
There are several ways of choosing A (you probably don't get full control over A because of the game's limited physics) without having to do much analysis yourself. Two such approaches are optimal control (you define the cost of controlling the system versus the cost of deviating from X0) and pole placement (you choose where the poles go).
The state-space theory also applies to an entire flock of ships, deviating from the configuration desired by every ship at that point in time. If the system is nonlinear, you might not be able to guarantee that all configurations are stable.
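To make the stability check concrete, here is a minimal numpy sketch (my own illustration, not tied to NodeWar's physics): a ship modelled as a 1-D double integrator with state feedback u = -Kx. The eigenvalues of the closed-loop matrix A - BK are the poles; if all of them have negative real parts, the ship converges to X0.

import numpy as np

A = np.array([[0.0, 1.0],    # d(position error)/dt = velocity
              [0.0, 0.0]])   # d(velocity)/dt = u (thrust)
B = np.array([[0.0],
              [1.0]])
K = np.array([[2.0, 3.0]])   # hand-picked state-feedback gains

A_cl = A - B @ K             # closed-loop dynamics: d/dt(x) = (A - B K) x
poles = np.linalg.eigvals(A_cl)
print(poles)                 # [-1, -2]: negative real parts, so x -> 0 (ship -> X0)

(scipy.signal.place_poles can compute K for you from desired pole locations.)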
Conclusion:
Use control theory to guarantee individual stability, calculating X0 based on the ships around you, and add a potential field to prevent ships from bumping into each other and into moons.
Disclaimer:
There are several ways of expressing a state-space system. This is the simplest, but you'll need to express it differently when considering the open-loop system (split A into one matrix describing the physical system and other matrices carrying the control parameters).

Related

Decision tree split implementation

I am doing this as a part of my university assignment, but I can't find any resources online on how to correctly implement this.
I have read tons of material on the metrics that define an optimal split (like entropy, Gini and others), so I understand how we would choose the optimal feature value to split the learning set into left and right nodes.
However, what I totally don't get is the complexity of the implementation: since we also have to choose the optimal feature, computing the optimal split at each node takes O(n^2), which is bad considering real ML datasets are shaped around 10^2 x 10^6, and that is really expensive to compute.
Am I missing some kind of approach that could be used here to help reduce complexity?
I currently have this baseline implementation for choosing best feature and value to split on, but I really want to make it better:
for f_idx in range(X_subset.shape[1]):
    sorted_values = X_subset.iloc[:, f_idx].sort_values()
    for v in sorted_values[self.min_samples_split - 1 : -self.min_samples_split + 1]:
        y_left, y_right = self.make_split_only_y(f_idx, v, X_subset, y_subset)
        G = calc_g(y_subset, y_left, y_right)
        # Keep the first candidate, then any candidate that improves the criterion.
        if threshold is None or G < tr_G:
            threshold = v
            feature_idx = f_idx
            tr_G = G
return feature_idx, threshold
So, since no one answered, here is some of the stuff I found out.
Firstly, yes, this task is very computationally intensive. However, several tricks can be used to reduce the number of splits you need to evaluate to "grow a tree".
This is especially important since you don't really want a giant overfitted tree: it just doesn't have any value. What is more important is to get a weak model that can be combined with others in some sort of ensembling technique.
As for the regularization tricks, here are a couple I used myself:
limit the maximum depth of the tree
limit the minimal number of items in a node
limit the maximum number of leaves in the tree
require a minimum quality change in the split criterion before performing a split
As for the algorithmic part, there is a way to build the tree in a smarter way. If you do it as in the code I posted earlier, the time complexity will be around O(h * N^2 * D), where h is the height of the tree. To work around this, there are several approaches, which I didn't personally code but have read about (a small sketch of the first one follows the list):
Use dynamic programming to accumulate statistics per feature, so you don't have to recalculate them for every split
Use data binning and bucket sort for O(N) sorting
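A rough sketch of the first trick (my own illustration, assuming binary 0/1 labels and the Gini criterion): sort each feature once, then sweep all thresholds while maintaining running class counts, so evaluating one feature costs O(N log N) instead of O(N^2).

import numpy as np

def gini(pos, count):
    # Gini impurity of a node with `pos` positives out of `count` samples.
    p = pos / count
    return 1.0 - p * p - (1.0 - p) * (1.0 - p)

def best_split_for_feature(x, y):
    # x: 1-D array of feature values, y: 1-D array of 0/1 labels.
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    n = len(y)
    total_pos = int(y_sorted.sum())

    best_gini, best_threshold = np.inf, None
    left_n = left_pos = 0
    for i in range(n - 1):                     # candidate split after sample i
        left_n += 1
        left_pos += int(y_sorted[i])
        if x_sorted[i] == x_sorted[i + 1]:     # cannot split between equal values
            continue
        right_n, right_pos = n - left_n, total_pos - left_pos
        weighted = (left_n * gini(left_pos, left_n)
                    + right_n * gini(right_pos, right_n)) / n
        if weighted < best_gini:
            best_gini = weighted
            best_threshold = (x_sorted[i] + x_sorted[i + 1]) / 2.0
    return best_threshold, best_gini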
Source of info: https://ml-handbook.ru/chapters/decision_tree/intro (use Google Translate, since the website is in Russian)

What is the way to understand Proximal Policy Optimization Algorithm in RL?

I know the basics of Reinforcement Learning, but what terms is it necessary to understand to be able to read the arXiv PPO paper?
What is the roadmap to learn and use PPO?
To better understand PPO, it is helpful to look at the main contributions of the paper, which are: (1) the Clipped Surrogate Objective and (2) the use of "multiple epochs of stochastic gradient ascent to perform each policy update".
From the original PPO paper:
We have introduced [PPO], a family of policy optimization methods that use multiple epochs of stochastic gradient ascent to perform each policy update. These methods have the stability and reliability of trust-region [TRPO] methods but are much simpler to implement, requiring only a few lines of code change to a vanilla policy gradient implementation, applicable in more general settings (for example, when using a joint architecture for the policy and value function), and have better overall performance.
1. The Clipped Surrogate Objective
The Clipped Surrogate Objective is a drop-in replacement for the policy gradient objective that is designed to improve training stability by limiting the change you make to your policy at each step.
For vanilla policy gradients (e.g., REINFORCE) --- which you should be familiar with, or familiarize yourself with before you read this --- the objective used to optimize the neural network looks like:
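L^PG(θ) = Ê_t[ log π_θ(a_t | s_t) Â_t ]
(the vanilla policy gradient objective as written in the paper, where Ê_t is the empirical average over timesteps and Â_t is the advantage estimate)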
This is the standard formula that you would see in the Sutton book, and other resources, where the A-hat could be the discounted return (as in REINFORCE) or the advantage function (as in GAE) for example. By taking a gradient ascent step on this loss with respect to the network parameters, you will incentivize the actions that led to higher reward.
The vanilla policy gradient method uses the log probability of your action (log π(a | s)) to trace the impact of the actions, but you could imagine using another function to do this. Another such function, introduced in this paper, uses the probability of the action under the current policy (π(a|s)), divided by the probability of the action under your previous policy (π_old(a|s)). This looks a bit similar to importance sampling if you are familiar with that:
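r_t(θ) = π_θ(a_t | s_t) / π_θ_old(a_t | s_t)
(the probability ratio as defined in the paper)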
This r(θ) will be greater than 1 when the action is more probable for your current policy than it is for your old policy; it will be between 0 and 1 when the action is less probable for your current policy than for your old.
Now to build an objective function with this r(θ), we can simply swap it in for the log π(a|s) term. This is what is done in TRPO:
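L^CPI(θ) = Ê_t[ r_t(θ) Â_t ]
(the "conservative policy iteration" surrogate used by TRPO)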
But what would happen here if your action is much more probable (like 100x more) for your current policy? r(θ) will tend to be really big and lead to taking big gradient steps that might wreck your policy. To deal with this and other issues, TRPO adds several extra bells and whistles (e.g., KL Divergence constraints) to limit the amount the policy can change and help guarantee that it is monotonically improving.
Instead of adding all these extra bells and whistles, what if we could build these stabilizing properties into the objective function? As you might guess, this is what PPO does. It gains the same performance benefits as TRPO and avoids the complications by optimizing this simple (but kind of funny looking) Clipped Surrogate Objective:
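L^CLIP(θ) = Ê_t[ min( r_t(θ) Â_t, clip(r_t(θ), 1 - e, 1 + e) Â_t ) ]
(where e is the clipping parameter, written as ε in the paper)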
The first term (blue) inside the minimization is the same r(θ)A term we saw in the TRPO objective. The second term (red) is a version where r(θ) is clipped between (1 - e, 1 + e). (In the paper they state a good value for e is about 0.2, so r can vary between roughly 0.8 and 1.2.) Then, finally, the minimum of these two terms is taken (green).
Take your time and look at the equation carefully and make sure you know what all the symbols mean, and mathematically what is happening. Looking at the code may also help; here is the relevant section in both the OpenAI baselines and anyrl-py implementations.
Great.
Next, let's see what effect the L clip function creates. Here is a diagram from the paper that plots the value of the clip objective for when the Advantage is positive and negative:
On the left half of the diagram, where (A > 0), this is where the action had an estimated positive effect on the outcome. On the right half of the diagram, where (A < 0), this is where the action had an estimated negative effect on the outcome.
Notice how on the left half, the r-value gets clipped if it gets too high. This will happen if the action became a lot more probable under the current policy than it was for the old policy. When this happens, we do not want to get greedy and step too far (because this is just a local approximation and sample of our policy, so it will not be accurate if we step too far), and so we clip the objective to prevent it from growing. (This will have the effect in the backward pass of blocking the gradient --- the flat line causing the gradient to be 0).
On the right side of the diagram, where the action had an estimated negative effect on the outcome, we see that the clip activates near 0, where the action under the current policy is unlikely. This clipping region will similarly prevent us from updating too much to make the action much less probable after we already just took a big step to make it less probable.
So we see that both of these clipping regions prevent us from getting too greedy and trying to update too much at once and leaving the region where this sample offers a good estimate.
But why are we letting r(θ) grow indefinitely on the far right side of the diagram? This seems odd at first, but what would cause r(θ) to grow really large in this case? Growth of r(θ) in this region will be caused by a gradient step that made our action a lot more probable, but which turned out to make our policy worse. If that is the case, we would want to be able to undo that gradient step. And it just so happens that the L clip function allows this. The function is negative here, so the gradient will tell us to walk in the other direction and make the action less probable by an amount proportional to how much we screwed it up. (Note that there is a similar region on the far left side of the diagram, where the action is good and we accidentally made it less probable.)
These "undo" regions explain why we must include the weird minimization term in the objective function. They correspond to the unclipped r(θ)A having a lower value than the clipped version and getting returned by the minimization. This is because they were steps in the wrong direction (e.g., the action was good but we accidentally made it less probable). If we had not included the min in the objective function, these regions would be flat (gradient = 0) and we would be prevented from fixing mistakes.
Here is a diagram summarizing this:
And that is the gist of it. The Clipped Surrogate Objective is just a drop-in replacement you could use in the vanilla policy gradient. The clipping limits the effective change you can make at each step in order to improve stability, and the minimization allows us to fix our mistakes in case we screwed it up. One thing I didn't discuss is what is meant by PPO objective forming a "lower bound" as discussed in the paper. For more on that, I would suggest this part of a lecture the author gave.
2. Multiple epochs for policy updating
Unlike vanilla policy gradient methods, and because of the Clipped Surrogate Objective function, PPO allows you to run multiple epochs of gradient ascent on your samples without causing destructively large policy updates. This allows you to squeeze more out of your data and reduce sample inefficiency.
PPO runs the policy using N parallel actors each collecting data, and then it samples mini-batches of this data to train for K epochs using the Clipped Surrogate Objective function. See full algorithm below (the approximate param values are: K = 3-15, M = 64-4096, T (horizon) = 128-2048):
The parallel actors part was popularized by the A3C paper and has become a fairly standard way for collecting data.
The newish part is that they are able to run K epochs of gradient ascent on the trajectory samples. As they state in the paper, it would be nice to run the vanilla policy gradient optimization for multiple passes over the data so that you could learn more from each sample. However, this generally fails in practice for vanilla methods because they take too big of steps on the local samples and this wrecks the policy. PPO, on the other hand, has the built-in mechanism to prevent too much of an update.
For each iteration, after sampling the environment with π_old (line 3) and when we start running the optimization (line 6), our policy π will be exactly equal to π_old. So at first, none of our updates will be clipped and we are guaranteed to learn something from these examples. However, as we update π using multiple epochs, the objective will start hitting the clipping limits, the gradient will go to 0 for those samples, and the training will gradually stop...until we move on to the next iteration and collect new samples.
....
And that's all for now. If you are interested in gaining a better understanding, I would recommend digging more into the original paper, trying to implement it yourself, or diving into the baselines implementation and playing with the code.
[edit: 2019/01/27]: For a better background and for how PPO relates to other RL algorithms, I would also strongly recommend checking out OpenAI's Spinning Up resources and implementations.
PPO, like TRPO before it, tries to update the policy conservatively, without adversely affecting its performance between policy updates.
To do this, you need a way to measure how much the policy has changed after each update. This measurement is done by looking at the KL divergence between the updated policy and the old policy.
This becomes a constrained optimization problem: we want to change the policy in the direction of maximum performance improvement, subject to the constraint that the KL divergence between the new policy and the old one does not exceed some predefined (or adaptive) threshold.
With TRPO, we compute the KL constraint during the update and find the step size for this problem (via the Fisher matrix and conjugate gradient). This is somewhat messy to implement.
With PPO, we simplify the problem by turning the KL divergence from a constraint into a penalty term, similar, for example, to the L1 or L2 weight penalty (which prevents weights from growing too large). PPO makes additional modifications that remove the need to compute the KL divergence altogether, by hard-clipping the policy ratio (the ratio of the updated policy to the old one) so that it stays within a small range around 1.0, where 1.0 means the new policy is the same as the old one.
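For reference, the KL-penalized variant from the paper is
L^KLPEN(θ) = Ê_t[ r_t(θ) Â_t - β KL[π_old(· | s_t), π(· | s_t)] ]
while the clipped variant replaces the penalty with the clip(r_t(θ), 1 - ε, 1 + ε) term shown in the answer above.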
PPO is a simple algorithm which falls into the class of policy optimization algorithms (as opposed to value-based methods such as DQN). If you "know" RL basics (I mean if you have at least thoughtfully read some of the first chapters of Sutton's book, for example), then a first logical step is to get familiar with policy gradient algorithms. You can read this paper or chapter 13 of the new edition of Sutton's book. Additionally, you may also read this paper on TRPO, which is previous work by PPO's first author (note that this paper has numerous notational mistakes). Hope that helps. --Mehdi
I think implementing PPO for a discrete action space such as CartPole-v1 is easier than for continuous action spaces. But for continuous action spaces, this is the most straightforward implementation I have found in PyTorch, as you can clearly see how they get mu and std, whereas I could not with more renowned implementations such as OpenAI Baselines, Spinning Up or Stable Baselines.
RL-Adventure PPO
These lines from the link above:
import torch
import torch.nn as nn
from torch.distributions import Normal

class ActorCritic(nn.Module):
    def __init__(self, num_inputs, num_outputs, hidden_size, std=0.0):
        super(ActorCritic, self).__init__()

        self.critic = nn.Sequential(
            nn.Linear(num_inputs, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, 1)
        )

        self.actor = nn.Sequential(
            nn.Linear(num_inputs, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, num_outputs),
        )
        self.log_std = nn.Parameter(torch.ones(1, num_outputs) * std)

        self.apply(init_weights)  # init_weights is defined elsewhere in the linked repo

    def forward(self, x):
        value = self.critic(x)
        mu = self.actor(x)
        std = self.log_std.exp().expand_as(mu)
        dist = Normal(mu, std)
        return dist, value
and the clipping:
def ppo_update(ppo_epochs, mini_batch_size, states, actions, log_probs, returns, advantages, clip_param=0.2):
    for _ in range(ppo_epochs):
        for state, action, old_log_probs, return_, advantage in ppo_iter(mini_batch_size, states, actions, log_probs, returns, advantages):
            dist, value = model(state)
            entropy = dist.entropy().mean()
            new_log_probs = dist.log_prob(action)

            ratio = (new_log_probs - old_log_probs).exp()  # r(θ), computed in log space
            surr1 = ratio * advantage                      # unclipped term
            surr2 = torch.clamp(ratio, 1.0 - clip_param, 1.0 + clip_param) * advantage  # clipped term
I found the link above in the comments on this video on YouTube:
arxiv insights PPO

Changing Kademlia Metric - Unidirectional Property Importance

Kademlia uses the XOR metric. Among other things, this has the so-called "unidirectional" property (i.e., for any given point x and distance e > 0, there is exactly one point y such that d(x,y) = e).
The first question is a general one: is this property of the metric critical to the functionality of Kademlia, or does it just help relieve pressure on certain nodes (as the original paper suggests)? In other words, if we want to change the metric, how important is it to come up with a metric that is "unidirectional" as well?
The second question is about a concrete change of the metric: assuming we have node identifiers (addresses) as X-bit numbers, would either of the following metrics work with Kademlia?
d(x,y) = abs(x-y)
d(x,y) = abs(x-y) + 1/(x xor y)
The first metric simply gives the difference between the numbers, so for node ID 100 the nodes with IDs 90 and 110 are equally distant, which means this is not a unidirectional metric. In the second case we fix that by adding 1/(x xor y): since we know that (x xor y) is unidirectional, adding 1/(x xor y) should preserve this property.
Thus for node ID 100, the distance to node ID 90 is d(100,90) = 10 + 1/62, while the distance to node ID 110 is d(100,110) = 10 + 1/10.
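To make the unidirectional property concrete, here is a small illustrative check (my own sketch, 8-bit IDs): for a fixed x and distance e, XOR admits exactly one y, whereas absolute difference can admit two.

x, e = 100, 10

xor_candidates = [y for y in range(256) if (x ^ y) == e]
abs_candidates = [y for y in range(256) if abs(x - y) == e]

print(xor_candidates)   # [110]      -> exactly one node at XOR distance 10
print(abs_candidates)   # [90, 110]  -> two nodes at absolute distance 10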
You wouldn't be dealing with Kademlia anymore. There are many other routing algorithms which use different distance metrics, some even non-uniform distance metrics, but they do not rely on Kademlia-specific assumptions and sometimes incorporate other features to compensate for some undesirable aspect of those metrics.
Since there can be ties in the metric (two candidates at each distance from a given point), lookups could no longer converge on a precise set of closest nodes.
Bucket splitting and other routing table maintenance algorithms would need to be changed since they assume that identical distances can only occur with node identity.
I'm not sure whether it would affect Big-O properties or other guarantees of kademlia.
Anyway, this seems like an X-Y problem. You want to modify the metric to serve a particular goal. Maybe you should look for routing overlays designed with that goal in mind instead.
d(x,y) = abs(x-y) + 1/(x xor y)
This seems impractical: integer division suffers from rounding, and in reality you would not be dealing with such small numbers but with much larger ones (e.g. 160-bit), which makes divisions more expensive too.

How to find an eigenvector given eigenvalue 1, minimising memory use

I'd be grateful if people could help me find an efficient way (probably low memory algorithm) to tackle the following problem.
I need to find the stationary distribution x of a transition matrix P. The transition matrix is an extremely large, extremely sparse matrix, constructed such that all the columns sum to 1. Since the stationary distribution is given by the equation Px = x, then x is simply the eigenvector of P associated with eigenvalue 1.
I'm currently using GNU Octave to both generate the transition matrix, find the stationary distribution, and plot the results. I'm using the function eigs(), which calculates both eigenvalues and eigenvectors, and it is possible to return just one eigenvector, where the eigenvalue is 1 (I actually had to specify 1.1, to prevent an error). Construction of the transition matrix (using a sparse matrix) is fairly quick, but finding the eigenvector gets increasingly slow as I increase the size, and I'm running out of memory before I can examine even moderately sized problems.
My current code is
[v l] = eigs(P, 1, 1.01);
x = v / sum(v);
Given that I know that 1 is the eigenvalue, I'm wondering if there is either a better method to calculate the eigenvector, or a way that makes more efficient use of memory, given that I don't really need an intermediate large dense matrix. I naively tried
n = size(P,1); % number of states
Q = P - speye(n,n);
x = Q\zeros(n,1); % solve (P-I)x = 0
which fails, since Q is singular (by definition).
I would be very grateful if anyone has any ideas on how I should approach this, as it's a calculation I have to perform a great number of times, and I'd like to try it on larger and more complex models if possible.
As background to this problem, I'm solving for the equilibrium distribution of the number of infectives in a cattle herd in a stochastic SIR model. Unfortunately the transition matrix is very large for even moderately sized herds. For example: in an SIR model with an average of 20 individuals (95% of the time the population is between 12 and 28 individuals), P is 21169 by 21169 with 20340 non-zero values (i.e. 0.0005% dense), and uses up 321 Kb (a full matrix of that size would be 3.3 Gb), while for around 50 individuals P uses 3 Mb. x itself should be pretty small. I suspect that eigs() has a dense matrix somewhere, which is causing me to run out of memory, so I should be okay if I can avoid using full matrices.
Power iteration is a standard way to find the dominant eigenvalue of a matrix. You pick a random vector v, then hit it with P repeatedly until you stop seeing it change very much. You want to periodically divide v by sqrt(v^T v) to normalise it.
The rate of convergence here is proportional to the separation between the largest and the second-largest eigenvalue. Each iteration takes just a matrix-vector multiply (plus the occasional normalisation).
There are fancier-pants ways to do this ("PageRank" is one good thing to search for here) that improve speed for really huge sparse matrices, but I don't know that they're necessary or useful here.
Your approach seems like a good one. However, what you're calling x is the null space of Q. null(Q) would work if it supported sparse matrices, but it doesn't. There's a bunch of material on the web about finding the null space of a sparse matrix. For example:
http://www.mathworks.co.uk/matlabcentral/newsreader/view_thread/249467
http://www.mathworks.com/matlabcentral/fileexchange/42922-null-space-for-sparse-matrix/content/nulls.m
http://www.mathworks.com/matlabcentral/fileexchange/11120-null-space-of-a-sparse-matrix
It seems the best solution is to use the Power Iteration method, as suggested by tmyklebu.
The method is to iterate x = Px; x /= sum(x) until x converges. I'm assuming convergence when the L1 norm of the difference between successive iterates is less than 1e-5, as that seems to give good results.
Convergence can take a while, since the largest two eigenvalues are fairly close (the number of iterations needed to converge can vary considerably, from around 200 to 2000 depending on the model used and population sizes, but it gets there in the end). However, the memory requirements are low, and it's very easy to implement.
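For reference, a minimal sketch of that scheme (written in Python with SciPy here, but the same idea works in Octave; it assumes P is a sparse, column-stochastic matrix):

import numpy as np

def stationary_distribution(P, tol=1e-5, max_iter=10000):
    # P may be a scipy.sparse matrix; only matrix-vector products are used.
    n = P.shape[0]
    x = np.full(n, 1.0 / n)                # start from the uniform distribution
    for _ in range(max_iter):
        x_new = P @ x                      # one sparse matrix-vector product
        x_new /= x_new.sum()               # renormalise so the entries sum to 1
        if np.abs(x_new - x).sum() < tol:  # L1 distance between successive iterates
            return x_new
        x = x_new
    return x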

Quadcopter PID controller

Context
My task is to design and build a velocity PID controller for a micro quadcopter that flies indoors. The room where the quadcopter flies is equipped with a camera-based, high-accuracy indoor tracking system that can provide both velocity and position data for the quadcopter. The resulting system should be able to accept a target velocity for each axis (x, y, z) and drive the quadcopter at that velocity.
The control inputs for the quadcopter are roll/pitch/yaw angles and thrust percentage for altitude.
My idea is to implement a PID controller for each axis where the SP is the desired velocity in that direction, the measured value is the velocity provided by the tracking system and the output value is the roll/pitch/yaw angle and respectively thrust percent.
Unfortunately, because this is my first contact with control theory, I am not sure if I am heading in the right direction.
Questions
I understand the basic PID controller principle, but it is still unclear to me how it can convert a velocity (m/s) into roll/pitch/yaw angles (radians) by only summing errors and multiplying by constants. Velocity and roll/pitch are directly proportional, so does that mean that multiplying by the right constant yields a correct result?
For the case of the vertical velocity controller, if the velocity is set to 0, the quadcopter should just maintain its altitude without ascending or descending. How can this be integrated with the PID controller so that the thrust value is not 0 (it should keep hovering, not fall) when the error is actually 0? Should I add a constant term to the output?
Once the system has been implemented, what would be a good approach to tuning the PID gain parameters? Manual trial and error?
The next step in the development of the system is an additional layer of position PID controllers that take a desired position (x, y, z) as the setpoint; the measured positions are provided by the indoor tracking system and the outputs are x/y/z velocities. Is this a good approach?
The reason for the separation of these PID control layers is that the project is part of a larger framework that promotes re-usability. Would it be better to just use a single layer of PID controllers that directly take position coordinates as setpoints and output roll/pitch/yaw/thrust values?
That's the basic idea. You could theoretically wind up with several constants in the system to convert from one unit to another. In your case, your P, I, D constants will implicitly include the right conversion factors. For example, if in an idealized system you would want to control the angle by feedback with P = 0.5 in degrees, but all you have access to is speed, and you know that 5 degrees of tilt gives 1 m/s, then you would program the P controller to multiply the measured speed by 0.1 and use the result as the control input for the angle.
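As a concrete illustration of folding the conversion into the gains, here is a minimal single-axis velocity PID sketch in Python (the gains, the hover offset, and the "angle per m/s" scale are placeholders, not tuned values). The constant output_offset term is also one way to handle the hover-thrust question above.

class AxisVelocityPID:
    def __init__(self, kp, ki, kd, output_offset=0.0):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.offset = output_offset       # e.g. hover thrust for the vertical axis
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, target_velocity, measured_velocity, dt):
        error = target_velocity - measured_velocity
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        # The output is an angle (or thrust) command; the gains implicitly
        # carry the m/s -> radians (or percent) conversion.
        return (self.offset + self.kp * error
                + self.ki * self.integral + self.kd * derivative)

# Example: a pitch controller where roughly 0.1 rad of tilt gives 1 m/s.
pitch_pid = AxisVelocityPID(kp=0.1, ki=0.02, kd=0.0)
pitch_cmd = pitch_pid.update(target_velocity=1.0, measured_velocity=0.4, dt=0.02)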
I think the best way to do that is to implement a SISO PID controller not on height (h) but on vertical speed (dh/dt), with thrust percentage as the output, but I am not sure about that.
Unfortunately, if you have no advance knowledge about the parameters, that will be the only way to go. I recommend an environment that won't do much damage if the quadcopter crashes...
That sounds like a good way to go
My blog may be of interest to you: http://oskit.se/quadcopter-control-and-how-to-really-implement-it/
The basic principle is easy to understand. Tuning is much harder. It took me the most time to figure out what to feed into the PID controller, and to get small things right like the + and - signs on variables. But eventually I solved these problems.
The best way to test a PID controller is to have some way to set your values dynamically so that you don't have to recompile your firmware every time. I use MAVLink in the quadcopter firmware that I work on. With MAVLink, the copter can easily be configured using any ground control software that supports the MAVLink protocol.
Start small: build a working copter using a ready-made controller, then write your own. Then test it, tune it and experiment. You will never truly understand PID controllers unless you try building a copter yourself. My bare-metal software development library can make it easier to write your software without having to worry about hardware driver implementation. Link: https://github.com/mkschreder/martink
If you are using the Python programming language, you can use the ddcontrol framework.
You can estimate a transfer function from SISO data using the tfest function, and then optimize a PID controller with the pidopt function.
