Why Dice Coefficient and not IOU for segmentation tasks? - machine-learning

I have seen people using IOU as the metric for detection tasks and Dice Coeff for segmentation tasks. The two metrics looks very much similar in terms of equation except that dice gives twice the weightage to the intersection part. If I am correct, then
Dice: (2 x (A*B) / (A + B))
IOU : (A * B) / (A + B)
Is there any particular reason for preferring dice for segmentation and IOU for detection?

This is not exactly right.
The Dice coefficient (also known as the Sørensen–Dice coefficient and F1 score) is defined as two times the area of the intersection of A and B, divided by the sum of the areas of A and B:
Dice = 2 |A∩B| / (|A|+|B|) = 2 TP / (2 TP + FP + FN)
(TP=True Positives, FP=False Positives, FN=False Negatives)
The IOU (Intersection Over Union, also known as the Jaccard Index) is defined as the area of the intersection divided by the area of the union:
Jaccard = |A∩B| / |A∪B| = TP / (TP + FP + FN)
Note that the sum of the areas of A and B is not the same as the area of the union of A and B. In particular, if there is 100% overlap, then the one is twice the other. This is the reason of the "two times" in the Dice coefficent: they are both defined such that, with 100% overlap, the values are 1, and with 0% overlap the values are 0.
Which one to use depends on personal preference and customs in each field. That you see one used more in one field is related more to chance than anything else. Someone started using the Dice coefficient for segmentation, and other people just followed along. Someone started using IOU for detection, and other people just followed along.

In segmentation tasks, Dice Coeff (Dice loss = 1-Dice coeff) is used as a Loss function because it is differentiable where as IoU is not differentiable.
Both can be used as metric to evaluate the performance of your model but as a loss function only Dice Coeff/loss is used

Related

How L1 norm select features?

It has been often said that L1 regularization helps in feature selection? How the L1 norm does that?
And also why L2 normalization is not able to do that?
At the beginning please notice L1 and L2 regularization may not always work like that, there are various quirks and it depends on applied strength and other factors.
First of all, we will consider Linear Regression as the simplest case.
Secondly it's easiest to consider only two weights for this problem to get some intuition.
Now, let's introduce a simple constraint: sum of both weights has to be equal to 1.0 (e.g. first weight w1=0.2 and second w2=0.8 or any other combination).
And the last assumptions:
x1 feature has perfect positive correlation with target (e.g. 1.0 * x1 = y, where y is our target)
x2 has almost perfect positive correlation with target (e.g. 0.99 * x2 = y)
(alpha (meaning regularization strength) will be set to 1.0 for both L1 and L2 in order no to clutter the picture further).
L2
Weights values
For two variables (weights) and L2 regularization we would have the following formula:
alpha * (w1^2 + w2^2)/2 (mean of their squares)
Now, we would like to minimize the above equation as it's part of the cost function.
One can easily see both has to be set to 0.5 (remember, their sum has to be equal 1.0!), because 0.5 ^ 2 + 0.5 ^ 2 = 0.5. For any other two values summing to 1 we would get a greater value (e.g. 0.8 ^ 2 + 0.2 ^ 2 = 0.64 + 0.68), hence 0.5 and 0.5 is optimal solution.
Target predictions
In this case we are pretty close for all data points, because:
0.5 * 1.0 + 0.5 + 0.99 = 0.995 (of `y`)
So we are "off" only by 0.005 for each sample. What this means is that regularization on weights has greater effect on cost function than this small difference (that's why w1 wasn't chosen as the only variable and the values were "split").
BTW. Exact values above will differ slightly (e.g. w1 ~0.49 but it's easier to follow along this way I think).
Final insight
With L2 regularization two similar weights tend to be "split" in half as it minimizes the regularization penalty
L1
Weights values
This time it will be even easier: for two variables (weights) and L1 regularization we would have the following formula:
alpha * (|w1| + |w2|)/2 (mean of their absolute values)
This time it doesn't matter what w1 or w2 is set to (as long as their sum has to be equal to 1.0), so |0.5| + |0.5| = |0.2| + |0.8| = |1.0| + |0.0|... (and so on).
In this case L1 regularization will prefer 1.0, the reason below
Target predictions
As the distribution of weights does not matter in this case it's loss value we are after (under the 1.0 sum constraint). For perfect predictions it would be:
1.0 * 1.0 + 0.0 * 0.99 = 1.0
This time we are not "off" at all and it's "best" to choose just w1, no need for w2 in this case.
Final insight
With L1 regularization similar weights tend to be zeroed out in favor of the one connected to feature being better at predicting final target with lowest coefficient.
BTW. If we had x3 which would once again be correlated positively with our values to predict and described by equation
0.1 * x3 = y
Only x3 would be chosen with weight equal to 0.1
Reality
In reality there is almost never "perfect correlation" of variables, there are many features interacting with each other, there are hyperparameters and imperfect optimizers amongst many other factors.
This simplified view should give you an intuition to "why" though.
A common application of your question is in different types of regression. Here is a link that explains the difference between Ridge (L2) and Lasso (L1) regression:
https://stats.stackexchange.com/questions/866/when-should-i-use-lasso-vs-ridge

ML Classification - Decision Boundary Algorithm

Given a classification problem in Machine Learning the hypothesis is described as below.
hθ(x)=g(θ'x)
z = θ'x
g(z) = 1 / (1+e^−z)
In order to get our discrete 0 or 1 classification, we can translate the output of the hypothesis function as follows:
hθ(x)≥0.5→y=1
hθ(x)<0.5→y=0
The way our logistic function g behaves is that when its input is greater than or equal to zero, its output is greater than or equal to 0.5:
g(z)≥0.5
whenz≥0
Remember.
z=0,e0=1⇒g(z)=1/2
z→∞,e−∞→0⇒g(z)=1
z→−∞,e∞→∞⇒g(z)=0
So if our input to g is θTX, then that means:
hθ(x)=g(θTx)≥0.5
whenθTx≥0
From these statements we can now say:
θ'x≥0⇒y=1
θ'x<0⇒y=0
If The decision boundary is the line that separates the area where y = 0 and where y = 1 and is created by our hypothesis function:
What part of this relates to the Decision Boundary? Or where does the Decision Boundary algorithm come from?
This is basic logistic regression with a threshold. So your theta' * x is just the vector notation of your weight vector multiplied by your input. If you put that into the logistic function which outputs a value between 0 and 1 exclusively, you'll threshold that value at 0.5. So if it's equal and above this, you'll treat it as a positive sample and as a negative one otherwise.
The classification algorithm is just that simple. The training is a bit more complicated and the goal of it is the find a weight vector theta which satisfies the condition to correctly classify all your labeled data...or at least as much as possible. The way to do this is to minimize a cost function which measures the difference between the output of your function and the expected label. You can do this using gradient descent. I guess, Andrew Ng is teaching this.
Edit: Your classification algorithm is g(theta'x)>=0.5 and g(theta'x)<0.5, so a basic step function.
Courtesy of other posters on a different tech forum.
Solving for theta'*x >= 0 and theta'*x<0 gives the decision boundary. The RHS of the inequality ( i.e. 0) comes from the sigmoid function.
Theta gives you the hypothesis that best fits the training set.
From theta, you can compute the decision boundary - it is the locus of points where (X * theta) = 0, or equivalently where g(X * theta) = 0.5.

Neural Networks: Why does the perceptron rule only work for linearly separable data?

I previously asked for an explanation of linearly separable data. Still reading Mitchell's Machine Learning book, I have some trouble understanding why exactly the perceptron rule only works for linearly separable data?
Mitchell defines a perceptron as follows:
That is, it is y is 1 or -1 if the sum of the weighted inputs exceeds some threshold.
Now, the problem is to determine a weight vector that causes the perceptron to produce the correct output (1 or -1) for each of the given training examples. One way of achieving this is through the perceptron rule:
One way to learn an acceptable weight vector is to begin with random
weights, then iteratively apply the perceptron to each training
example, modify- ing the perceptron weights whenever it misclassifies
an example. This process is repeated, iterating through the training
examples as many times as needed until the perceptron classifies all
training examples correctly. Weights are modified at each step
according to the perceptron training rule, which revises the weight wi
associated with input xi according to the rule:
So, my question is: Why does this only work with linearly separable data? Thanks.
Because the dot product of w and x is a linear combination of xs, and you, in fact, split your data into 2 classes using a hyperplane a_1 x_1 + … + a_n x_n > 0
Consider a 2D example: X = (x, y) and W = (a, b) then X * W = a*x + b*y. sgn returns 1 if its argument is greater than 0, that is, for class #1 you have a*x + b*y > 0, which is equivalent to y > -a/b x (assuming b != 0). And this equation is linear and divides a 2D plane into 2 parts.

How to update the bias in neural network backpropagation?

Could someone please explain to me how to update the bias throughout backpropagation?
I've read quite a few books, but can't find bias updating!
I understand that bias is an extra input of 1 with a weight attached to it (for each neuron). There must be a formula.
Following the notation of Rojas 1996, chapter 7, backpropagation computes partial derivatives of the error function E (aka cost, aka loss)
∂E/∂w[i,j] = delta[j] * o[i]
where w[i,j] is the weight of the connection between neurons i and j, j being one layer higher in the network than i, and o[i] is the output (activation) of i (in the case of the "input layer", that's just the value of feature i in the training sample under consideration). How to determine delta is given in any textbook and depends on the activation function, so I won't repeat it here.
These values can then be used in weight updates, e.g.
// update rule for vanilla online gradient descent
w[i,j] -= gamma * o[i] * delta[j]
where gamma is the learning rate.
The rule for bias weights is very similar, except that there's no input from a previous layer. Instead, bias is (conceptually) caused by input from a neuron with a fixed activation of 1. So, the update rule for bias weights is
bias[j] -= gamma_bias * 1 * delta[j]
where bias[j] is the weight of the bias on neuron j, the multiplication with 1 can obviously be omitted, and gamma_bias may be set to gamma or to a different value. If I recall correctly, lower values are preferred, though I'm not sure about the theoretical justification of that.
The amount you change each individual weight and bias will be the partial derivative of your cost function in relation to each individual weight and each individual bias.
∂C/∂(index of bias in network)
Since your cost function probably doesn't explicitly depend on individual weights and values (Cost might equal (network output - expected output)^2, for example), you'll need to relate the partial derivatives of each weight and bias to something you know, i.e. the activation values (outputs) of neurons. Here's a great guide to doing this:
https://medium.com/#erikhallstrm/backpropagation-from-the-beginning-77356edf427d
This guide states how to do these things clearly, but can sometimes be lacking on explanation. I found it very helpful to read chapters 1 and 2 of this book as I read the guide linked above:
http://neuralnetworksanddeeplearning.com/chap1.html
(provides essential background for the answer to your question)
http://neuralnetworksanddeeplearning.com/chap2.html
(answers your question)
Basically, biases are updated in the same way that weights are updated: a change is determined based on the gradient of the cost function at a multi-dimensional point.
Think of the problem your network is trying to solve as being a landscape of multi-dimensional hills and valleys (gradients). This landscape is a graphical representation of how your cost changes with changing weights and biases. The goal of a neural network is to reach the lowest point in this landscape, thereby finding the smallest cost and minimizing error. If you imagine your network as a traveler trying to reach the bottom of these gradients (i.e. Gradient Descent), then the amount you will change each weight (and bias) by is related to the the slope of the incline (gradient of the function) that the traveler is currently climbing down. The exact location of the traveler is given by a multi-dimensional coordinate point (weight1, weight2, weight3, ... weight_n), where the bias can be thought of as another kind of weight. Thinking of the weights/biases of a network as the variables for the network's cost function make it clear that ∂C/∂(index of bias in network) must be used.
I understand that the function of bias is to make level adjust of the
input values. Below is what happens inside the neuron. The activation function of course
will make the final output, but it is left out for clarity.
O = W1 I1 + W2 I2 + W3 I3
In real neuron something happens already at synapses, the input data is level adjusted with average of samples and scaled with deviation of samples. Thus the input data is normalized and with equal weights they will make the same effect. The normalized In is calculated from raw data in (n is the index).
Bn = average(in); Sn = 1/stdev((in); In= (in+Bn)Sn
However this is not necessary to be performed separately, because the neuron weights and bias can do the same function. When you subsitute In with the in, you get new formula
O = w1 i1 + w2 i2 + w3 i3+ wbs
The last wbs is the bias and new weights wn as well
wbs = W1 B1 S1 + W2 B2 S2 + W3 B3 S3
wn =W1 (in+Bn) Sn
So there exists a bias and it will/should be adjusted automagically with the backpropagation

What is the role of the bias in neural networks? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 2 years ago.
Improve this question
I'm aware of the gradient descent and the back-propagation algorithm. What I don't get is: when is using a bias important and how do you use it?
For example, when mapping the AND function, when I use two inputs and one output, it does not give the correct weights. However, when I use three inputs (one of which is a bias), it gives the correct weights.
I think that biases are almost always helpful. In effect, a bias value allows you to shift the activation function to the left or right, which may be critical for successful learning.
It might help to look at a simple example. Consider this 1-input, 1-output network that has no bias:
The output of the network is computed by multiplying the input (x) by the weight (w0) and passing the result through some kind of activation function (e.g. a sigmoid function.)
Here is the function that this network computes, for various values of w0:
Changing the weight w0 essentially changes the "steepness" of the sigmoid. That's useful, but what if you wanted the network to output 0 when x is 2? Just changing the steepness of the sigmoid won't really work -- you want to be able to shift the entire curve to the right.
That's exactly what the bias allows you to do. If we add a bias to that network, like so:
...then the output of the network becomes sig(w0*x + w1*1.0). Here is what the output of the network looks like for various values of w1:
Having a weight of -5 for w1 shifts the curve to the right, which allows us to have a network that outputs 0 when x is 2.
A simpler way to understand what the bias is: it is somehow similar to the constant b of a linear function
y = ax + b
It allows you to move the line up and down to fit the prediction with the data better.
Without b, the line always goes through the origin (0, 0) and you may get a poorer fit.
Here are some further illustrations showing the result of a simple 2-layer feed forward neural network with and without bias units on a two-variable regression problem. Weights are initialized randomly and standard ReLU activation is used. As the answers before me concluded, without the bias the ReLU-network is not able to deviate from zero at (0,0).
Two different kinds of parameters can
be adjusted during the training of an
ANN, the weights and the value in the
activation functions. This is
impractical and it would be easier if
only one of the parameters should be
adjusted. To cope with this problem a
bias neuron is invented. The bias
neuron lies in one layer, is connected
to all the neurons in the next layer,
but none in the previous layer and it
always emits 1. Since the bias neuron
emits 1 the weights, connected to the
bias neuron, are added directly to the
combined sum of the other weights
(equation 2.1), just like the t value
in the activation functions.1
The reason it's impractical is because you're simultaneously adjusting the weight and the value, so any change to the weight can neutralize the change to the value that was useful for a previous data instance... adding a bias neuron without a changing value allows you to control the behavior of the layer.
Furthermore the bias allows you to use a single neural net to represent similar cases. Consider the AND boolean function represented by the following neural network:
(source: aihorizon.com)
w0 corresponds to b.
w1 corresponds to x1.
w2 corresponds to x2.
A single perceptron can be used to
represent many boolean functions.
For example, if we assume boolean values
of 1 (true) and -1 (false), then one
way to use a two-input perceptron to
implement the AND function is to set
the weights w0 = -3, and w1 = w2 = .5.
This perceptron can be made to
represent the OR function instead by
altering the threshold to w0 = -.3. In
fact, AND and OR can be viewed as
special cases of m-of-n functions:
that is, functions where at least m of
the n inputs to the perceptron must be
true. The OR function corresponds to
m = 1 and the AND function to m = n.
Any m-of-n function is easily
represented using a perceptron by
setting all input weights to the same
value (e.g., 0.5) and then setting the
threshold w0 accordingly.
Perceptrons can represent all of the
primitive boolean functions AND, OR,
NAND ( 1 AND), and NOR ( 1 OR). Machine Learning- Tom Mitchell)
The threshold is the bias and w0 is the weight associated with the bias/threshold neuron.
The bias is not an NN term. It's a generic algebra term to consider.
Y = M*X + C (straight line equation)
Now if C(Bias) = 0 then, the line will always pass through the origin, i.e. (0,0), and depends on only one parameter, i.e. M, which is the slope so we have less things to play with.
C, which is the bias takes any number and has the activity to shift the graph, and hence able to represent more complex situations.
In a logistic regression, the expected value of the target is transformed by a link function to restrict its value to the unit interval. In this way, model predictions can be viewed as primary outcome probabilities as shown:
Sigmoid function on Wikipedia
This is the final activation layer in the NN map that turns on and off the neuron. Here also bias has a role to play and it shifts the curve flexibly to help us map the model.
A layer in a neural network without a bias is nothing more than the multiplication of an input vector with a matrix. (The output vector might be passed through a sigmoid function for normalisation and for use in multi-layered ANN afterwards, but that’s not important.)
This means that you’re using a linear function and thus an input of all zeros will always be mapped to an output of all zeros. This might be a reasonable solution for some systems but in general it is too restrictive.
Using a bias, you’re effectively adding another dimension to your input space, which always takes the value one, so you’re avoiding an input vector of all zeros. You don’t lose any generality by this because your trained weight matrix needs not be surjective, so it still can map to all values previously possible.
2D ANN:
For a ANN mapping two dimensions to one dimension, as in reproducing the AND or the OR (or XOR) functions, you can think of a neuronal network as doing the following:
On the 2D plane mark all positions of input vectors. So, for boolean values, you’d want to mark (-1,-1), (1,1), (-1,1), (1,-1). What your ANN now does is drawing a straight line on the 2d plane, separating the positive output from the negative output values.
Without bias, this straight line has to go through zero, whereas with bias, you’re free to put it anywhere.
So, you’ll see that without bias you’re facing a problem with the AND function, since you can’t put both (1,-1) and (-1,1) to the negative side. (They are not allowed to be on the line.) The problem is equal for the OR function. With a bias, however, it’s easy to draw the line.
Note that the XOR function in that situation can’t be solved even with bias.
When you use ANNs, you rarely know about the internals of the systems you want to learn. Some things cannot be learned without a bias. E.g., have a look at the following data: (0, 1), (1, 1), (2, 1), basically a function that maps any x to 1.
If you have a one layered network (or a linear mapping), you cannot find a solution. However, if you have a bias it's trivial!
In an ideal setting, a bias could also map all points to the mean of the target points and let the hidden neurons model the differences from that point.
Modification of neuron WEIGHTS alone only serves to manipulate the shape/curvature of your transfer function, and not its equilibrium/zero crossing point.
The introduction of bias neurons allows you to shift the transfer function curve horizontally (left/right) along the input axis while leaving the shape/curvature unaltered.
This will allow the network to produce arbitrary outputs different from the defaults and hence you can customize/shift the input-to-output mapping to suit your particular needs.
See here for graphical explanation:
http://www.heatonresearch.com/wiki/Bias
In a couple of experiments in my masters thesis (e.g. page 59), I found that the bias might be important for the first layer(s), but especially at the fully connected layers at the end it seems not to play a big role.
This might be highly dependent on the network architecture / dataset.
If you're working with images, you might actually prefer to not use a bias at all. In theory, that way your network will be more independent of data magnitude, as in whether the picture is dark, or bright and vivid. And the net is going to learn to do it's job through studying relativity inside your data. Lots of modern neural networks utilize this.
For other data having biases might be critical. It depends on what type of data you're dealing with. If your information is magnitude-invariant --- if inputting [1,0,0.1] should lead to the same result as if inputting [100,0,10], you might be better off without a bias.
Bias determines how much angle your weight will rotate.
In a two-dimensional chart, weight and bias can help us to find the decision boundary of outputs.
Say we need to build a AND function, the input(p)-output(t) pair should be
{p=[0,0], t=0},{p=[1,0], t=0},{p=[0,1], t=0},{p=[1,1], t=1}
Now we need to find a decision boundary, and the ideal boundary should be:
See? W is perpendicular to our boundary. Thus, we say W decided the direction of boundary.
However, it is hard to find correct W at first time. Mostly, we choose original W value randomly. Thus, the first boundary may be this:
Now the boundary is parallel to the y axis.
We want to rotate the boundary. How?
By changing the W.
So, we use the learning rule function: W'=W+P:
W'=W+P is equivalent to W' = W + bP, while b=1.
Therefore, by changing the value of b(bias), you can decide the angle between W' and W. That is "the learning rule of ANN".
You could also read Neural Network Design by Martin T. Hagan / Howard B. Demuth / Mark H. Beale, chapter 4 "Perceptron Learning Rule"
In simpler terms, biases allow for more and more variations of weights to be learnt/stored... (side-note: sometimes given some threshold). Anyway, more variations mean that biases add richer representation of the input space to the model's learnt/stored weights. (Where better weights can enhance the neural net’s guessing power)
For example, in learning models, the hypothesis/guess is desirably bounded by y=0 or y=1 given some input, in maybe some classification task... i.e some y=0 for some x=(1,1) and some y=1 for some x=(0,1). (The condition on the hypothesis/outcome is the threshold I talked about above. Note that my examples setup inputs X to be each x=a double or 2 valued-vector, instead of Nate's single valued x inputs of some collection X).
If we ignore the bias, many inputs may end up being represented by a lot of the same weights (i.e. the learnt weights mostly occur close to the origin (0,0).
The model would then be limited to poorer quantities of good weights, instead of the many many more good weights it could better learn with bias. (Where poorly learnt weights lead to poorer guesses or a decrease in the neural net’s guessing power)
So, it is optimal that the model learns both close to the origin, but also, in as many places as possible inside the threshold/decision boundary. With the bias we can enable degrees of freedom close to the origin, but not limited to origin's immediate region.
In neural networks:
Each neuron has a bias
You can view bias as a threshold (generally opposite values of threshold)
Weighted sum from input layers + bias decides activation of a neuron
Bias increases the flexibility of the model.
In absence of bias, the neuron may not be activated by considering only the weighted sum from the input layer. If the neuron is not activated, the information from this neuron is not passed through rest of the neural network.
The value of bias is learnable.
Effectively, bias = — threshold. You can think of bias as how easy it is to get the neuron to output a 1 — with a really big bias, it’s very easy for the neuron to output a 1, but if the bias is very negative, then it’s difficult.
In summary: bias helps in controlling the value at which the activation function will trigger.
Follow this video for more details.
Few more useful links:
geeksforgeeks
towardsdatascience
Expanding on zfy's explanation:
The equation for one input, one neuron, one output should look:
y = a * x + b * 1 and out = f(y)
where x is the value from the input node and 1 is the value of the bias node;
y can be directly your output or be passed into a function, often a sigmoid function. Also note that the bias could be any constant, but to make everything simpler we always pick 1 (and probably that's so common that zfy did it without showing & explaining it).
Your network is trying to learn coefficients a and b to adapt to your data.
So you can see why adding the element b * 1 allows it to fit better to more data: now you can change both slope and intercept.
If you have more than one input your equation will look like:
y = a0 * x0 + a1 * x1 + ... + aN * 1
Note that the equation still describes a one neuron, one output network; if you have more neurons you just add one dimension to the coefficient matrix, to multiplex the inputs to all nodes and sum back each node contribution.
That you can write in vectorized format as
A = [a0, a1, .., aN] , X = [x0, x1, ..., 1]
Y = A . XT
i.e. putting coefficients in one array and (inputs + bias) in another you have your desired solution as the dot product of the two vectors (you need to transpose X for the shape to be correct, I wrote XT a 'X transposed')
So in the end you can also see your bias as is just one more input to represent the part of the output that is actually independent of your input.
To think in a simple way, if you have y=w1*x where y is your output and w1 is the weight, imagine a condition where x=0 then y=w1*x equals to 0.
If you want to update your weight you have to compute how much change by delw=target-y where target is your target output. In this case 'delw' will not change since y is computed as 0. So, suppose if you can add some extra value it will help y = w1x + w01, where bias=1 and weight can be adjusted to get a correct bias. Consider the example below.
In terms of line slope, intercept is a specific form of linear equations.
y = mx + b
Check the image
image
Here b is (0,2)
If you want to increase it to (0,3) how will you do it by changing the value of b the bias.
For all the ML books I studied, the W is always defined as the connectivity index between two neurons, which means the higher connectivity between two neurons.
The stronger the signals will be transmitted from the firing neuron to the target neuron or Y = w * X as a result to maintain the biological character of neurons, we need to keep the 1 >=W >= -1, but in the real regression, the W will end up with |W| >=1 which contradicts how the neurons are working.
As a result, I propose W = cos(theta), while 1 >= |cos(theta)|, and Y= a * X = W * X + b while a = b + W = b + cos(theta), b is an integer.
Bias acts as our anchor. It's a way for us to have some kind of baseline where we don't go below that. In terms of a graph, think of like y=mx+b it's like a y-intercept of this function.
output = input times the weight value and added a bias value and then apply an activation function.
The term bias is used to adjust the final output matrix as the y-intercept does. For instance, in the classic equation, y = mx + c, if c = 0, then the line will always pass through 0. Adding the bias term provides more flexibility and better generalisation to our neural network model.
The bias helps to get a better equation.
Imagine the input and output like a function y = ax + b and you need to put the right line between the input(x) and output(y) to minimise the global error between each point and the line, if you keep the equation like this y = ax, you will have one parameter for adaptation only, even if you find the best a minimising the global error it will be kind of far from the wanted value.
You can say the bias makes the equation more flexible to adapt to the best values

Resources