3D Representation

Depth Map

A depth map gives, for each pixel, the distance from the camera to the object visible at that pixel.
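A minimal sketch of how a depth map relates to 3D points, assuming a pinhole camera and that the stored value is depth along the optical axis (the notes do not say which convention is used). The function name `depth_to_points` and the intrinsics `fx`, `fy`, `cx`, `cy` are illustrative assumptions.

```python
import numpy as np

def depth_to_points(depth, fx, fy, cx, cy):
    """Back-project an (H, W) depth map into an (H*W, 3) array of camera-space points."""
    h, w = depth.shape
    # v is the pixel row index and u the pixel column index, both of shape (H, W).
    v, u = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Pinhole model (assumed): shift by the principal point, scale by depth / focal length.
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)

# Toy 2x2 depth map where every pixel is 2 units away.
depth = np.full((2, 2), 2.0)
points = depth_to_points(depth, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
print(points)
```

Note that a depth map only covers surfaces visible from the camera, so the resulting points describe one side of the object.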

Surface Normals

Surface normals give, for every pixel in the image, a vector perpendicular to the object’s surface at that pixel.

Voxel Grid

Represents a 3D object as a grid of small blocks (voxels), each marked as occupied or empty.
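As a minimal sketch, a voxel grid can be stored as a boolean array with one entry per block. The resolution of 16, the sphere shape, and the function name `sphere_voxels` are assumptions chosen for illustration.

```python
import numpy as np

def sphere_voxels(resolution=16, radius=0.8):
    """Return a (resolution, resolution, resolution) boolean occupancy grid of a sphere."""
    # Sample positions spread evenly over the cube [-1, 1]^3.
    axis = np.linspace(-1.0, 1.0, resolution)
    x, y, z = np.meshgrid(axis, axis, axis, indexing="ij")
    # A voxel is occupied when its sample position lies inside the sphere.
    return x**2 + y**2 + z**2 <= radius**2

grid = sphere_voxels()
print(grid.shape, int(grid.sum()), "occupied voxels")
```

Memory grows with the cube of the resolution, which is why fine detail is expensive in this representation.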

Implicit Surface

We define a function that takes a 3D coordinate as input and outputs 1, 1/2, or 0:

  • 1 means the coordinate is inside the object
  • 0 means outside
  • 1/2 means on the object

Writing this function as $o$, the object surface can be expressed as the level set $\{x \in \mathbb{R}^3 : o(x) = 1/2\}$.

This function is learned during training.

An implicit surface allows multiscale outputs, like octrees, because the function can be queried at any resolution.
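A minimal sketch of the inside/outside function, using an analytic sphere as a stand-in for the learned network (the real function would be a trained model). The names `occupancy` and `evaluate_on_grid` are hypothetical.

```python
import numpy as np

def occupancy(points, radius=0.5):
    """Analytic stand-in for the learned function: 1 inside, 0 outside, 1/2 on the surface."""
    dist = np.linalg.norm(points, axis=-1)
    out = np.where(dist < radius, 1.0, 0.0)
    out[np.isclose(dist, radius)] = 0.5
    return out

def evaluate_on_grid(resolution):
    """Query the function on a regular grid; any resolution works because it accepts arbitrary coordinates."""
    axis = np.linspace(-1.0, 1.0, resolution)
    pts = np.stack(np.meshgrid(axis, axis, axis, indexing="ij"), axis=-1)
    return occupancy(pts)

# One query inside the sphere, one on its surface, one outside.
queries = np.array([[0.0, 0.0, 0.0], [0.5, 0.0, 0.0], [1.0, 0.0, 0.0]])
print(occupancy(queries))

# The same function sampled at two different resolutions.
coarse, fine = evaluate_on_grid(8), evaluate_on_grid(32)
print(coarse.shape, fine.shape)
```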

Point Cloud

Represents the surface of the object as a set of 3D points.

Triangle Mesh

Represents the surface of an object as a set of points (vertices) connected to each other to form triangular faces.
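A minimal sketch of the two surface representations as arrays, assuming the common layout of one 3D position per row and one triple of vertex indices per triangle. The tetrahedron and the helper `triangle_areas` are illustrative.

```python
import numpy as np

# Vertices of a tetrahedron: one 3D position per row.
vertices = np.array([[0.0, 0.0, 0.0],
                     [1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0],
                     [0.0, 0.0, 1.0]])
# Each face lists the indices of the three vertices it connects.
faces = np.array([[0, 2, 1], [0, 1, 3], [0, 3, 2], [1, 2, 3]])

def triangle_areas(vertices, faces):
    """Area of every triangle, from the cross product of two edge vectors."""
    a, b, c = (vertices[faces[:, i]] for i in range(3))
    return 0.5 * np.linalg.norm(np.cross(b - a, c - a), axis=1)

# Dropping the faces leaves a point cloud: the same positions with no connectivity.
point_cloud = vertices.copy()
areas = triangle_areas(vertices, faces)
print(areas)
```

The connectivity in `faces` is what lets a mesh define a continuous surface, which a point cloud alone cannot do.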


Shape Comparison Metrics

Intersection over Union (IoU)

Problems:

  • Struggles to capture thin structures
  • Cannot be applied to point clouds, since they have no volume
  • Small differences in IoU do not provide meaningful information
  • A triangle mesh must first be converted to a voxel grid before IoU can be computed
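A minimal sketch of IoU on voxel grids, assuming both shapes are already boolean occupancy grids of the same size. The name `voxel_iou` and the handling of two empty grids are choices made for this sketch.

```python
import numpy as np

def voxel_iou(pred, target):
    """IoU between two boolean occupancy grids of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    # Two empty grids are treated as a perfect match (a convention chosen for this sketch).
    return 1.0 if union == 0 else float(intersection / union)

# Two slabs of 32 voxels each that share 16 voxels.
a = np.zeros((4, 4, 4), dtype=bool)
a[:2] = True
b = np.zeros((4, 4, 4), dtype=bool)
b[1:3] = True
print(voxel_iou(a, b))
```

Here the intersection has 16 voxels and the union 48, so the IoU is 1/3.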

Chamfer Distance

Problem:

  • Very few badly placed points can dramatically skew the entire metric
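A minimal sketch of the Chamfer distance between two point sets, assuming the variant that sums squared nearest-neighbor distances in both directions (other variants average instead of summing). The random point sets and the outlier position are made up to illustrate the problem above.

```python
import numpy as np

def chamfer_distance(s1, s2):
    """Symmetric Chamfer distance between point sets of shape (N, 3) and (M, 3)."""
    # Pairwise squared Euclidean distances, shape (N, M).
    d2 = ((s1[:, None, :] - s2[None, :, :]) ** 2).sum(axis=-1)
    # Nearest-neighbor term in each direction, summed over points.
    return float(d2.min(axis=1).sum() + d2.min(axis=0).sum())

rng = np.random.default_rng(0)
target = rng.uniform(-1.0, 1.0, size=(200, 3))
# A good prediction: the target points with a little noise.
pred = target + 0.01 * rng.standard_normal(target.shape)
# The same prediction with a single badly placed point.
pred_outlier = pred.copy()
pred_outlier[0] = [10.0, 10.0, 10.0]

print(chamfer_distance(pred, target))
print(chamfer_distance(pred_outlier, target))
```

Because distances are squared, the one outlier contributes far more than all the well-placed points combined.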

F1 Score

Precision & Recall

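For a distance threshold $t$:

  • Precision@t = fraction of predicted points within $t$ of some ground-truth point
  • Recall@t = fraction of ground-truth points within $t$ of some predicted point

We compute the F1 score by

$$\mathrm{F1@}t = \frac{2 \cdot \mathrm{Precision@}t \cdot \mathrm{Recall@}t}{\mathrm{Precision@}t + \mathrm{Recall@}t}$$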

The F1 score is the best shape prediction metric in common use.
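A minimal sketch of the F1 score on point sets, assuming "within" means a Euclidean distance of at most a threshold `t` and that F1 is the harmonic mean of precision and recall. The function name `f1_score` and the toy points are illustrative.

```python
import numpy as np

def f1_score(pred, target, t):
    """F1 at threshold t between predicted points (N, 3) and ground-truth points (M, 3)."""
    # Pairwise Euclidean distances, shape (N, M).
    dist = np.linalg.norm(pred[:, None, :] - target[None, :, :], axis=-1)
    precision = float((dist.min(axis=1) <= t).mean())  # predicted points near the ground truth
    recall = float((dist.min(axis=0) <= t).mean())     # ground-truth points near a prediction
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)

target = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
# Two accurate points and one far-away outlier.
pred = np.array([[0.0, 0.05, 0.0], [1.0, 0.05, 0.0], [50.0, 0.0, 0.0]])
print(f1_score(pred, target, t=0.1))
```

In this toy case precision and recall are both 2/3, so F1 is 2/3. The outlier costs only its share of the points, no matter how far away it is.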


Camera System

Canonical Coordinates

Introduction

Definition: A fixed, standard coordinate system where objects are always oriented in the same way, regardless of how they appear in the input image.

Example: Regardless of which way a chair faces in the image, we always predict it facing the same direction.

Problem

Neural networks learn by associating input features with output predictions. With canonical coordinates, however, the predicted shape is no longer spatially aligned with the input image, so it becomes harder for the network to learn consistent mappings.

View Coordinates

Definition: A coordinate system aligned with the camera’s viewpoint - objects are oriented relative to how the camera sees them.
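A minimal sketch of how the two coordinate systems relate, assuming the object's pose relative to the camera is a rigid transform (a rotation plus a translation). The +z "front" axis, the 90 degree turn, and the function names are assumptions for illustration.

```python
import numpy as np

def rotation_about_y(angle_rad):
    """Rotation matrix for a turn of angle_rad about the vertical (y) axis."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    return np.array([[c, 0.0, s],
                     [0.0, 1.0, 0.0],
                     [-s, 0.0, c]])

def canonical_to_view(points, rotation, translation):
    """Map (N, 3) points from canonical coordinates into the camera's frame."""
    return points @ rotation.T + translation

def view_to_canonical(points, rotation, translation):
    """Inverse mapping: undo the translation, then the rotation."""
    return (points - translation) @ rotation

# A point one unit along the object's canonical "front" axis (+z, an assumed convention).
front = np.array([[0.0, 0.0, 1.0]])
rotation = rotation_about_y(np.pi / 2)     # object seen turned by 90 degrees
translation = np.array([0.0, 0.0, 3.0])    # object placed 3 units in front of the camera
in_view = canonical_to_view(front, rotation, translation)
print(in_view)
```

Predicting in view coordinates means the network outputs `in_view` directly. Predicting in canonical coordinates means it must output `front` whatever the rotation is.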


Datasets

ShapeNet

Cons:

  • Objects are isolated, with no surrounding context

Pix3D

Pros:

  • Real images with context

Cons:

  • Only 1 object per image

Mesh R-CNN

Motivation

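Topology tells us that an ellipsoid cannot be deformed into a doughnut shape, because deformation cannot create a hole. This restricts Pixel2Mesh, which starts from an ellipsoid mesh and deforms it, so every output has the same topology as the ellipsoid.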

Mesh R-CNN resolves this problem by changing how the input mesh of Pixel2Mesh is initialized.

Implementation

  1. Predict a coarse 3D shape of the object as a voxel grid
  2. Convert the surface of the voxel object into a triangle mesh
  3. Refine this mesh with Pixel2Mesh to get a more accurate shape
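A simplified sketch of step 2 only, assuming the voxel prediction is a boolean grid: every voxel face not hidden by a neighboring occupied voxel becomes two triangles. This is a stand-in for illustration, not the Mesh R-CNN implementation, and `voxels_to_mesh` is a hypothetical name. Steps 1 and 3 need trained networks and are omitted.

```python
import numpy as np

# For each of the six neighbor directions, the four corners of the voxel face
# pointing that way (corner offsets from the voxel's origin).
FACES = {
    (1, 0, 0): [(1, 0, 0), (1, 1, 0), (1, 1, 1), (1, 0, 1)],
    (-1, 0, 0): [(0, 0, 0), (0, 0, 1), (0, 1, 1), (0, 1, 0)],
    (0, 1, 0): [(0, 1, 0), (0, 1, 1), (1, 1, 1), (1, 1, 0)],
    (0, -1, 0): [(0, 0, 0), (1, 0, 0), (1, 0, 1), (0, 0, 1)],
    (0, 0, 1): [(0, 0, 1), (1, 0, 1), (1, 1, 1), (0, 1, 1)],
    (0, 0, -1): [(0, 0, 0), (0, 1, 0), (1, 1, 0), (1, 0, 0)],
}

def voxels_to_mesh(grid):
    """Turn a boolean voxel grid into (vertices, faces) covering its exposed surface."""
    grid = np.asarray(grid, dtype=bool)
    padded = np.pad(grid, 1)  # empty border so every neighbor lookup is valid
    vertex_ids, vertices, faces = {}, [], []

    def vid(corner):
        # Reuse a vertex when several faces touch the same corner.
        if corner not in vertex_ids:
            vertex_ids[corner] = len(vertices)
            vertices.append(corner)
        return vertex_ids[corner]

    for i, j, k in np.argwhere(grid).tolist():
        for (di, dj, dk), corners in FACES.items():
            if padded[i + 1 + di, j + 1 + dj, k + 1 + dk]:
                continue  # face is hidden by a neighboring occupied voxel
            a, b, c, d = (vid((i + ci, j + cj, k + ck)) for ci, cj, ck in corners)
            faces.append((a, b, c))  # split the square face into two triangles
            faces.append((a, c, d))
    return np.array(vertices, dtype=float), np.array(faces, dtype=int)

# Two voxels side by side: the shared face is dropped, leaving 10 exposed squares.
grid = np.ones((2, 1, 1), dtype=bool)
vertices, faces = voxels_to_mesh(grid)
print(vertices.shape, faces.shape)
```

Because the mesh is built from the predicted voxels, it inherits their topology, so a voxel shape with a hole yields a mesh with a hole.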