What is Depth Map?

For every pixel, depth map gives the distance from the camera to the object

Predicting Depth Maps

We input an RGB image and want to predict the depth map. We achieve this by using a fully convolutional network to process RGB input image

If we put a cat $L$ away from me, it is the same as putting a cat twice as large as the original cat at $2 L$

The feature of scale invariant loss is that we only care if the ratio between pixels are the same.

Intuitively, if pixel A is twice as far as pixel B in the ground truth, the loss is zero if this 2:1 ratio is maintained in the prediction

Its mathematical expression is as follows:

D (y, y^{⋆}) = \frac{1}{2 n ^{2}} i, j \sum ((lo g y_{i} - lo g y_{j}) - (lo g y_{i}^{⋆} - lo g y_{j}^{⋆}))^{2} = \frac{1}{n} i \sum d_{i}^{2} - \frac{1}{n ^{2}} i, j \sum d_{i} d_{j} = \frac{1}{n} i \sum d_{i}^{2} - \frac{1}{n ^{2}} (i \sum d_{i})^{2}

where $d_{i} = lo g y_{i} - lo g y_{j}$