You're building a Keras model. If you haven't been doing deep learning for very long, getting the output activations and cost function right might involve some memorization (or lookup). You might be trying to recall the general guidelines like so:
So with my cats and dogs, I'm doing 2-class classification, so I have to use sigmoid activation in the output layer, right, and then, it's binary crossentropy for the cost function…
Or: I'm doing classification on ImageNet, that's multi-class, so that was softmax for activation, and then, cost should be categorical crossentropy…
It's fine to memorize stuff like this, but knowing a bit about the reasons behind it often makes things easier. So we ask: Why is it that these output activations and cost functions go together? And, do they always have to?
In a nutshell
Put simply, we choose activations that make the network predict what we want it to predict.
The cost function is then determined by the model.
This is because neural networks are usually optimized using maximum likelihood, and depending on the distribution we assume for the output units, maximum likelihood yields different optimization objectives. All of these objectives then minimize the cross entropy (pragmatically: mismatch) between the true distribution and the predicted distribution.
Let's start with the simplest, the linear case.
Regression
For the botanists among us, here's a super simple network meant to predict sepal width from sepal length:
Our model's assumption here is that sepal width is normally distributed, given sepal length. Most often, we're trying to predict the mean of a conditional Gaussian distribution:
\[p(y|\mathbf{x}) = N(y; \mathbf{w}^t\mathbf{h} + b)\]
In that case, the cost function that minimizes cross entropy (equivalently: maximizes the likelihood) is mean squared error.
And that's exactly what we're using as a cost function above.
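To see why, a short derivation may help. It is only a sketch and makes an assumption not stated above: the Gaussian has a fixed variance \(\sigma^2\). Here \(\hat{y}_i\) stands for the network output \(\mathbf{w}^t\mathbf{h}_i + b\) on the \(i\)-th of \(n\) training examples.

```latex
\begin{aligned}
-\sum_{i=1}^{n} \log N(y_i; \hat{y}_i, \sigma^2)
  &= \sum_{i=1}^{n} \left[ \frac{(y_i - \hat{y}_i)^2}{2\sigma^2} + \frac{1}{2}\log(2\pi\sigma^2) \right] \\
  &= \frac{n}{2\sigma^2} \cdot \underbrace{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}_{\text{mean squared error}} + \frac{n}{2}\log(2\pi\sigma^2)
\end{aligned}
```

With \(\sigma^2\) held constant, the second term does not depend on the weights, so minimizing the negative log-likelihood and minimizing mean squared error pick the same weights.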
Alternatively, we might wish to predict the median of that conditional distribution. In that case, we'd change the cost function to use mean absolute error:
model %>% compile(
  optimizer = "adam",
  loss = "mean_absolute_error"
)
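The link between the loss and the quantity being predicted can be checked without any network at all. The following base R sketch (an illustration, not part of the Keras model) predicts a single constant for every flower in the built-in `iris` data and searches for the constant that minimizes each loss. The helper names `mse` and `mae` are made up for this example.

```r
# Sepal widths from the built-in iris data set.
y <- iris$Sepal.Width

# Loss incurred when predicting the same constant `guess` for every flower.
mse <- function(guess) mean((y - guess)^2)
mae <- function(guess) mean(abs(y - guess))

# One-dimensional numerical minimization over the observed range of y.
best_mse <- optimize(mse, interval = range(y))$minimum
best_mae <- optimize(mae, interval = range(y))$minimum

# Compare the minimizers with the sample mean and the sample median.
print(c(best_mse = best_mse, mean = mean(y)))
print(c(best_mae = best_mae, median = median(y)))
```

Up to the tolerance of the optimizer, the squared-error minimizer should agree with the mean and the absolute-error minimizer with the median.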
Now let's move on beyond linearity.
Binary classification
We're enthusiastic bird watchers and want an application to notify us when there's a bird in our garden (not when the neighbors have landed their airplane, though). We'll thus train a network to distinguish between two classes: birds and airplanes.
library(keras)

# Using the CIFAR-10 dataset that conveniently comes with Keras.
cifar10 <- dataset_cifar10()
x_train <- cifar10$train$x / 255
y_train <- cifar10$train$y

# In CIFAR-10, class 2 is "bird" and class 0 is "airplane" (5000 training images each).
is_bird <- cifar10$train$y == 2
x_bird <- x_train[is_bird, , ,]
y_bird <- rep(0, 5000)
is_plane <- cifar10$train$y == 0
x_plane <- x_train[is_plane, , ,]
y_plane <- rep(1, 5000)

# Stack both classes along the first (sample) dimension.
x <- abind::abind(x_bird, x_plane, along = 1)
y <- c(y_bird, y_plane)

model <- keras_model_sequential() %>%
  layer_conv_2d(
    filters = 8,
    kernel_size = c(3, 3),
    padding = "same",
    input_shape = c(32, 32, 3),
    activation = "relu"
  ) %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_conv_2d(
    filters = 8,
    kernel_size = c(3, 3),
    padding = "same",
    activation = "relu"
  ) %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_flatten() %>%
  layer_dense(units = 32, activation = "relu") %>%
  # A single sigmoid unit outputs the probability of class 1 (airplane).
  layer_dense(units = 1, activation = "sigmoid")

model %>% compile(
  optimizer = "adam",
  loss = "binary_crossentropy",
  metrics = "accuracy"
)

model %>% fit(
  x = x,
  y = y,
  epochs = 50
)
Although we normally talk about "binary classification," the way the outcome is usually modeled is as a Bernoulli random variable, conditioned on the input data. So:
\[P(y = 1|\mathbf{x}) = p, \quad 0 \leq p \leq 1\]
A Bernoulli random variable itself only takes the values \(0\) and \(1\); its parameter \(p\), the probability of observing a \(1\), lies between \(0\) and \(1\). So that's what our network should produce.
One idea might be to just clip all values of \(\mathbf{w}^t\mathbf{h} + b\) outside that interval. But if we do this, the gradient in those regions will be \(0\): The network cannot learn.
A better way is to squish the complete incoming interval into the range (0,1), using the logistic sigmoid function
\[ \sigma(x) = \frac{1}{1 + e^{-x}} \]
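A small base R sketch can make the contrast between clipping and the sigmoid concrete. The functions `clip`, `clip_grad`, `sigmoid` and `sigmoid_grad` are illustrative helpers written for this example, and the inputs in `x` are arbitrary.

```r
# Logistic sigmoid and its derivative, sigma(x) * (1 - sigma(x)).
sigmoid <- function(x) 1 / (1 + exp(-x))
sigmoid_grad <- function(x) sigmoid(x) * (1 - sigmoid(x))

# Hard clipping to [0, 1]; its derivative is 1 inside the interval and 0 outside.
clip <- function(x) pmin(pmax(x, 0), 1)
clip_grad <- function(x) as.numeric(x > 0 & x < 1)

# Arbitrary example inputs, from strongly negative to strongly positive.
x <- c(-10, -2, 0, 0.5, 2, 10)
print(data.frame(
  x = x,
  clip = clip(x),
  clip_grad = clip_grad(x),
  sigmoid = sigmoid(x),
  sigmoid_grad = sigmoid_grad(x)
))
```

Outside the unit interval the clipped version has a gradient of exactly zero, while the sigmoid's gradient gets very small for extreme inputs but stays positive.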

The sigmoid function saturates when its input gets very large, or very small (strongly negative). Is this problematic?
It depends. In the end, what we care about is whether the cost function saturates. Were we to choose mean squared error here, as in the regression task above, that's indeed what could happen.
However, if we follow the general principle of maximum likelihood/cross entropy, the loss will be
\[- \log P(y|\mathbf{x})\]
where the \(\log\) undoes the \(\exp\) in the sigmoid.
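Written out, with \(z = \mathbf{w}^t\mathbf{h} + b\) and \(P(y = 1|\mathbf{x}) = \sigma(z)\) (the symbol \(z\) is introduced here only for brevity), the two possible cases look as follows:

```latex
\begin{aligned}
y = 1:&\quad -\log \sigma(z) = -\log\frac{1}{1 + e^{-z}} = \log\left(1 + e^{-z}\right) \\
y = 0:&\quad -\log\left(1 - \sigma(z)\right) = -\log\frac{e^{-z}}{1 + e^{-z}} = \log\left(1 + e^{z}\right)
\end{aligned}
```

For a confidently wrong prediction with \(y = 1\), \(z\) is strongly negative and \(\log(1 + e^{-z}) \approx -z\): the loss grows roughly linearly, and its gradient with respect to \(z\) approaches \(-1\) rather than \(0\).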
In Keras, the corresponding loss function is binary_crossentropy. For a single item, the loss will be
- \(-\log(p)\) when the ground truth is 1
- \(-\log(1-p)\) when the ground truth is 0
From these two expressions, you can see that when, for an individual example, the network predicts the wrong class and is highly confident about it, this example will contribute very strongly to the loss.
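To put numbers on this, here is a minimal base R sketch of the per-example loss. `binary_ce` is a helper written for illustration, not the Keras function, and it ignores any numerical safeguards a library might apply. The probabilities in `p` are arbitrary.

```r
# Per-example binary cross-entropy, where p is the predicted probability of class 1.
binary_ce <- function(y_true, p) {
  -(y_true * log(p) + (1 - y_true) * log(1 - p))
}

# Arbitrary predicted probabilities, from "confident 1" to "confident 0".
p <- c(0.99, 0.9, 0.5, 0.1, 0.01)
print(data.frame(
  p = p,
  loss_if_truth_is_1 = binary_ce(1, p),
  loss_if_truth_is_0 = binary_ce(0, p)
))
```

The loss is small when the prediction is confident and correct, equals \(\log 2\) at \(p = 0.5\), and grows without bound as the prediction becomes confident and wrong.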

What happens when we distinguish between more than two classes?
Multi-class classification
CIFAR-10 has 10 classes; so now we want to decide which of 10 object classes is present in the image.
Here first is the code: Not many differences to the above, but note the changes in activation and cost function.
cifar10 <- dataset_cifar10()
x_train <- cifar10$train$x / 255
y_train <- cifar10$train$y

model <- keras_model_sequential() %>%
  layer_conv_2d(
    filters = 8,
    kernel_size = c(3, 3),
    padding = "same",
    input_shape = c(32, 32, 3),
    activation = "relu"
  ) %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_conv_2d(
    filters = 8,
    kernel_size = c(3, 3),
    padding = "same",
    activation = "relu"
  ) %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_flatten() %>%
  layer_dense(units = 32, activation = "relu") %>%
  # One output unit per class; softmax turns the 10 raw outputs into probabilities.
  layer_dense(units = 10, activation = "softmax")

model %>% compile(
  optimizer = "adam",
  loss = "sparse_categorical_crossentropy",
  metrics = "accuracy"
)

model %>% fit(
  x = x_train,
  y = y_train,
  epochs = 50
)
So now we have softmax combined with categorical crossentropy. Why?
Again, we want a valid probability distribution: Probabilities for all disjoint (mutually exclusive) events should sum to 1.
CIFAR-10 has one object per image; so events are disjoint. Then we have a single-draw multinomial distribution (popularly known as "Multinoulli," mostly due to Murphy's Machine Learning (Murphy 2012)) that can be modeled by the softmax activation:
\[\mathrm{softmax}(\mathbf{z})_i = \frac{e^{z_i}}{\sum_j{e^{z_j}}}\]
Just like the sigmoid, the softmax can saturate. In this case, that will happen when differences between outputs become very large.
Also like with the sigmoid, a \(\log\) in the cost function undoes the \(\exp\) that's responsible for saturation:
\[\log \mathrm{softmax}(\mathbf{z})_i = z_i - \log\sum_j{e^{z_j}}\]
Here \(z_i\) is the raw output for the class whose probability we're estimating. We see that its contribution to the loss is linear and thus can never saturate.
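The identity can be checked numerically. The base R sketch below uses made-up raw outputs; `softmax` and `log_softmax` are illustrative helpers, and subtracting the maximum before exponentiating is a common numerical precaution that does not change the result.

```r
# Softmax, shifted by max(z) to avoid overflow in exp().
softmax <- function(z) {
  e <- exp(z - max(z))
  e / sum(e)
}

# Log-softmax computed directly as z_i - log(sum_j exp(z_j)).
log_softmax <- function(z) {
  m <- max(z)
  z - (m + log(sum(exp(z - m))))
}

# Hypothetical raw outputs with moderate differences: both routes agree.
z <- c(2, -1, 0.5, 3)
print(rbind(log_of_softmax = log(softmax(z)), log_softmax = log_softmax(z)))

# Hypothetical raw outputs with an extreme difference.
z_extreme <- c(1000, 0)
print(log(softmax(z_extreme)))   # taking the log after the softmax has saturated
print(log_softmax(z_extreme))    # the linear term z_i keeps the value finite
```

In the extreme case, the softmax output for the second class underflows to zero, so its logarithm is no longer finite, while the directly computed log-softmax remains finite and linear in the raw output.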
In Keras, the loss function that does this for us is called categorical_crossentropy. In the code, we use sparse_categorical_crossentropy, which computes the same loss as categorical_crossentropy but does not need conversion of integer labels to one-hot vectors.
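That the two variants compute the same quantity can be sketched in base R. The probabilities and labels below are hypothetical, and the code re-implements the per-example loss by hand instead of calling Keras.

```r
# Hypothetical predicted probabilities for 3 examples and 4 classes (rows sum to 1).
probs <- rbind(
  c(0.7, 0.1, 0.1, 0.1),
  c(0.2, 0.5, 0.2, 0.1),
  c(0.1, 0.1, 0.2, 0.6)
)

# Integer class labels, 0-based as in CIFAR-10.
labels <- c(0, 1, 3)

# One-hot encoding: row i has a single 1 in column labels[i] + 1.
one_hot <- diag(4)[labels + 1, ]

# Categorical cross-entropy with one-hot targets.
categorical_ce <- -rowSums(one_hot * log(probs))

# Sparse variant: look up the log-probability of the true class directly.
sparse_ce <- -log(probs[cbind(seq_along(labels), labels + 1)])

print(rbind(categorical_ce, sparse_ce))
```

Multiplying by a one-hot vector simply selects the log-probability of the true class, which is what the sparse variant does by indexing.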
Let's take a closer look at what softmax does. Assume these are the raw outputs of our 10 output units:

Now this is what the normalized probability distribution looks like after taking the softmax:

Do you see where the winner takes all in the title comes from? This is an important point to keep in mind: Activation functions are not just there to produce certain desired distributions; they can also change relationships between values.
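The effect is easy to reproduce with numbers. The raw outputs in this base R sketch are made up for illustration (they are not the values from the original plots), and `softmax` is a helper defined here.

```r
# Softmax, shifted by max(z) for numerical stability.
softmax <- function(z) {
  e <- exp(z - max(z))
  e / sum(e)
}

# Hypothetical raw outputs of the 10 output units.
z <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
p <- softmax(z)

# Compare each unit's share of the raw total with its softmax probability.
print(round(rbind(share_of_raw_total = z / sum(z), softmax = p), 3))
```

The largest raw output is only slightly bigger than its neighbor, yet after the softmax it holds well over half of the probability mass, because differences between raw outputs turn into ratios of exponentials.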
Conclusion
We started this post alluding to common heuristics, such as "for multi-class classification, we use softmax activation, combined with categorical crossentropy as the loss function." Hopefully, we've succeeded in showing why these heuristics make sense.
However, knowing that background, you can also infer when these rules don't apply. For example, say you want to detect multiple objects in an image. In that case, the winner-takes-all strategy is not the most useful, as we don't want to exaggerate differences between candidates. So here, we'd use sigmoid on all output units instead, to determine a probability of presence per object.
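As a rough illustration of the difference, consider an image that contains both a bird and a plane. The raw outputs and class names in this base R sketch are hypothetical, and `sigmoid` and `softmax` are helpers defined here.

```r
sigmoid <- function(x) 1 / (1 + exp(-x))
softmax <- function(z) {
  e <- exp(z - max(z))
  e / sum(e)
}

# Hypothetical raw outputs for an image showing a bird and a plane, but no cat.
z <- c(bird = 4, plane = 3.5, cat = -3)

print(round(rbind(softmax = softmax(z), sigmoid = sigmoid(z)), 3))
```

With softmax, bird and plane have to share a total probability of 1, so the plane ends up below one half even though its raw output is high. With a sigmoid on each unit, both objects can be reported as present with high probability at the same time.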
Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press.
Murphy, Kevin. 2012. Machine Learning: A Probabilistic Perspective. MIT Press.
