R Loss Function

Different loss functions lead to different behaviours. From a purely theoretical viewpoint, and under a convexity assumption – one that is met by all loss functions commonly used in the literature – the choice of loss function can be shown to change the theoretical behaviour of a model. In a quality-control setting, the loss function can be specified from the tolerance (Δ) and the cost of poor quality of an individual item (L0). In the example given, k = 0.001/0.5 = 0.002, so the loss function for the process of making bolts is L(Y) = 0.002(Y − 10)^2, where 10 is the target value. We can plot this function in R with the following code.
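
A minimal base R sketch of this plot (an illustrative reconstruction, not necessarily the post's original code):

    # Quadratic loss L(Y) = 0.002 * (Y - 10)^2, centred on the target value Y = 10.
    curve(0.002 * (x - 10)^2, from = 8, to = 12,
          xlab = "Y", ylab = "L(Y)")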

Logarithmic Loss, or simply Log Loss, is a classification loss function often used as an evaluation metric in Kaggle competitions. Since success in these competitions hinges on effectively minimising the Log Loss, it makes sense to have some understanding of how this metric is calculated and how it should be interpreted.

Log Loss quantifies the accuracy of a classifier by penalising false classifications. Minimising the Log Loss is basically equivalent to maximising the accuracy of the classifier, but there is a subtle twist which we’ll get to in a moment.

In order to calculate Log Loss the classifier must assign a probability to each class rather than simply yielding the most likely class. Mathematically Log Loss is defined as

$$-\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} y_{ij} \log p_{ij},$$

where N is the number of samples or instances, M is the number of possible labels, y_ij is a binary indicator of whether or not label j is the correct classification for instance i, and p_ij is the model probability of assigning label j to instance i. A perfect classifier would have a Log Loss of precisely zero. Less ideal classifiers have progressively larger values of Log Loss. If there are only two classes then the expression above simplifies to

$$-\frac{1}{N} \sum_{i=1}^{N} \bigl[ y_{i} \log p_{i} + (1 - y_{i}) \log(1 - p_{i}) \bigr].$$

Note that for each instance only the term for the correct class actually contributes to the sum.

Log Loss Function

Let’s consider a simple implementation of a Log Loss function:
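
(A plausible sketch rather than the original implementation; the eps argument is an added safeguard that clips probabilities away from 0 and 1 so that log() never returns -Inf.)

    # Binary Log Loss: actual is a vector of 0/1 labels and predicted is a vector
    # of predicted probabilities for the positive class.
    log_loss <- function(actual, predicted, eps = 1e-15) {
      predicted <- pmin(pmax(predicted, eps), 1 - eps)
      -mean(actual * log(predicted) + (1 - actual) * log(1 - predicted))
    }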

Suppose that we are training a binary classifier and consider an instance which is known to belong to the target class. We’ll have a look at the effect of various predictions for class membership probability.
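
Using the log_loss() sketch above, the three predictions discussed below can be reproduced as:

    log_loss(1, 0.5)   # neutral: equal probability to both classes  -> 0.69315
    log_loss(1, 0.9)   # confident and correct                       -> 0.10536
    log_loss(1, 0.1)   # equally confident, but wrong                -> 2.30259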

In the first case the classification is neutral: it assigns equal probability to both classes, resulting in a Log Loss of 0.69315. In the second case the classifier is relatively confident in the first class. Since this is the correct classification the Log Loss is reduced to 0.10536. The third case is an equally confident classification, but this time for the wrong class. The resulting Log Loss escalates to 2.3026. Relative to the neutral classification, being confident in the wrong class resulted in a far greater change in Log Loss. Obviously the amount by which Log Loss can decrease is constrained, while increases are unbounded.

Looking Closer

Let’s take a closer look at this relationship. The plot below shows the Log Loss contribution from a single positive instance where the predicted probability ranges from 0 (the completely wrong prediction) to 1 (the correct prediction). It’s apparent from the gentle downward slope towards the right that the Log Loss gradually declines as the predicted probability improves. Moving in the opposite direction though, the Log Loss ramps up very rapidly as the predicted probability approaches 0. That’s the twist I mentioned earlier.
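
A rough equivalent of the plot described above can be generated with:

    # Log Loss contribution from a single positive instance as a function of the
    # predicted probability assigned to the correct class.
    p <- seq(0.001, 1, length.out = 500)
    plot(p, -log(p), type = "l", lwd = 2,
         xlab = "Predicted probability of the correct class",
         ylab = "Log Loss contribution")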

Log Loss heavily penalises classifiers that are confident about an incorrect classification. For example, if for a particular observation, the classifier assigns a very small probability to the correct class then the corresponding contribution to the Log Loss will be very large indeed. Naturally this is going to have a significant impact on the overall Log Loss for the classifier. The bottom line is that it’s better to be somewhat wrong than emphatically wrong. Of course it’s always better to be completely right, but that is seldom achievable in practice! There are at least two approaches to dealing with poor classifications:

  1. Examine the problematic observations relative to the full data set. Are they simply outliers? In this case, remove them from the data and re-train the classifier.
  2. Consider smoothing the predicted probabilities using, for example, Laplace Smoothing. This will result in a less “certain” classifier and might improve the overall Log Loss (see the sketch after this list).
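
As an illustration of the second approach (a generic additive-smoothing sketch; the function name and the alpha parameter are hypothetical, not from the original post):

    # Shrink predicted probabilities slightly towards 0.5 so the classifier is
    # never absolutely certain; alpha controls the strength of the smoothing.
    smooth_probs <- function(p, alpha = 0.01) (p + alpha) / (1 + 2 * alpha)

    log_loss(1, 0.001)                # emphatically wrong: very large loss
    log_loss(1, smooth_probs(0.001))  # smoothed: still wrong, but less costly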

Code Support for Log Loss

Using Log Loss in your models is relatively simple. XGBoost has logloss and mlogloss options for the eval_metric parameter, which allow you to optimise your model with respect to binary and multiclass Log Loss respectively. Both metrics are available in caret's train() function as well. The Metrics package also implements a number of Machine Learning metrics including Log Loss.
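
For example (a hedged sketch: the vectors below are illustrative, and X and y in the commented XGBoost call are placeholders for real training data):

    # Binary Log Loss via the Metrics package.
    library(Metrics)
    logLoss(actual = c(1, 0, 1, 1), predicted = c(0.9, 0.2, 0.6, 0.4))

    # XGBoost: track binary Log Loss during training.
    # library(xgboost)
    # bst <- xgboost(data = X, label = y, nrounds = 50,
    #                objective = "binary:logistic", eval_metric = "logloss")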

The purpose of loss functions is to compute the quantity that a model should seek to minimize during training.

Available losses

Note that all losses are available both via a class handle and via a function handle. The class handles enable you to pass configuration arguments to the constructor (e.g. loss_fn = CategoricalCrossentropy(from_logits=True)), and they perform reduction by default when used in a standalone way (see details below).

Probabilistic losses

Regression losses

Hinge losses for 'maximum-margin' classification

Usage of losses with compile() & fit()

A loss function is one of the two arguments required for compiling a Keras model:
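
A minimal sketch using the R interface to Keras (the architecture and dimensions below are purely illustrative):

    library(keras)

    model <- keras_model_sequential() %>%
      layer_dense(units = 64, activation = "relu", input_shape = c(20)) %>%
      layer_dense(units = 10, activation = "softmax")

    model %>% compile(
      optimizer = "adam",
      loss = loss_sparse_categorical_crossentropy,
      metrics = "accuracy"
    )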

All built-in loss functions may also be passed via their string identifier:
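
For instance, the compile step sketched above could equally use the string identifier:

    model %>% compile(
      optimizer = "adam",
      loss = "sparse_categorical_crossentropy",
      metrics = "accuracy"
    )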

Loss functions are typically created by instantiating a loss class (e.g. keras.losses.SparseCategoricalCrossentropy). All losses are also provided as function handles (e.g. keras.losses.sparse_categorical_crossentropy).

Using classes enables you to pass configuration arguments at instantiation time, e.g.:
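
In the R interface the built-in losses are exposed as functions, so one way to fix a configuration argument such as from_logits is to wrap the built-in loss in a plain R function (a sketch, assuming a version of the keras package whose loss_categorical_crossentropy() accepts a from_logits argument):

    model %>% compile(
      optimizer = "adam",
      loss = function(y_true, y_pred)
        loss_categorical_crossentropy(y_true, y_pred, from_logits = TRUE)
    )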

Standalone usage of losses

A loss is a callable with arguments loss_fn(y_true, y_pred, sample_weight=None):

  • y_true: Ground truth values, of shape (batch_size, d0, ... dN). For sparse loss functions, such as sparse categorical crossentropy, the shape should be (batch_size, d0, ... dN-1)
  • y_pred: The predicted values, of shape (batch_size, d0, .. dN).
  • sample_weight: Optional sample_weight acts as a reduction weighting coefficient for the per-sample losses. If a scalar is provided, then the loss is simply scaled by the given value. If sample_weight is a tensor of size [batch_size], then the total loss for each sample of the batch is rescaled by the corresponding element in the sample_weight vector. If the shape of sample_weight is (batch_size, d0, ... dN-1) (or can be broadcasted to this shape), then each loss element of y_pred is scaled by the corresponding value of sample_weight. (Note on dN-1: all loss functions reduce by 1 dimension, usually axis=-1.)

By default, loss functions return one scalar loss value per input sample, e.g.
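
For example (a sketch using the R interface; the matrices are illustrative):

    y_true <- rbind(c(0, 1), c(0, 0))
    y_pred <- rbind(c(0.6, 0.4), c(0.4, 0.6))
    loss_mean_squared_error(y_true, y_pred)
    # A tensor with one mean-squared-error value per sample (here, two values).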

However, loss class instances feature a reduction constructor argument, which defaults to 'sum_over_batch_size' (i.e. average). Allowable values are 'sum_over_batch_size', 'sum', and 'none':

  • 'sum_over_batch_size' means the loss instance will return the average of the per-sample losses in the batch.
  • 'sum' means the loss instance will return the sum of the per-sample losses in the batch.
  • 'none' means the loss instance will return the full array of per-sample losses.

Note that this is an important difference between loss functions like tf.keras.losses.mean_squared_error and default loss class instances like tf.keras.losses.MeanSquaredError: the function version does not perform reduction, but by default the class instance does.

When using fit(), this difference is irrelevant since reduction is handled by the framework.

Here's how you would use a loss class instance as part of a simple training loop:

Creating custom losses

Any callable with the signature loss_fn(y_true, y_pred) that returns an array of losses (one per sample in the input batch) can be passed to compile() as a loss. Note that sample weighting is automatically supported for any such loss.

Here's a simple example:
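
A sketch of a custom mean-squared-error loss in R, written with the backend helpers k_mean() and k_square() from the keras package:

    # The custom loss must return one value per sample, so reduce over the last
    # axis only.
    my_mse <- function(y_true, y_pred) {
      k_mean(k_square(y_true - y_pred), axis = -1)
    }

    model %>% compile(optimizer = "adam", loss = my_mse)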

The add_loss() API

Loss functions applied to the output of a model aren't the only way to create losses.

When writing the call method of a custom layer or a subclassed model, you may want to compute scalar quantities that you want to minimize during training (e.g. regularization losses). You can use the add_loss() layer method to keep track of such loss terms.

Here's an example of a layer that adds a sparsity regularization loss based on the L2 norm of the inputs:

Loss values added via add_loss can be retrieved in the .losses list property of any Layer or Model (they are recursively retrieved from every underlying layer):

These losses are cleared by the top-level layer at the start of each forward pass -- they don't accumulate. So layer.losses always contain only the losses created during the last forward pass. You would typically use these losses by summing them before computing your gradients when writing a training loop.

When using model.fit(), such loss terms are handled automatically.

When writing a custom training loop, you should retrieve these terms by hand from model.losses, like this:

See the add_loss() documentation for more details.