Project 1: 3D Object Classification via MVCNN

CS 6501 -- 3D Reconstruction and Understanding

Due: Sun, Oct 8 (11:59 PM)

This project involves fine-tuning a convolutional neural network (CNN) to perform 3D object classification, based on the multi-view CNN (MVCNN) paper by Su et al. (2015). We recommend solving this project in Python (particularly Anaconda Python, which comes with many useful libraries pre-installed) together with the machine learning library Keras. Both are open-source projects.

For training CNNs, computers equipped with recent NVIDIA GPUs can be an order of magnitude or more faster than CPUs. Most people do not have such a high-end GPU, so we recommend using a cloud computing service that provides convenient access to one. One service we recommend is AWS Educate, which appears to provide $35 worth of free computing credits to UVa students. We suggest signing up immediately (i.e. do not wait until the last minute, or you will miss the assignment deadline), requesting the GPU instance as well, since this appears to require approval, and making sure you can load the deep learning instance. No extensions will be given due to failure to plan ahead for this. A GPU-equipped instance such as p2.xlarge (priced at $0.90 per hour as of Sept 2017) can be selected. Note that Google Cloud also provides some free credits ($300 as of Sept 2017) to get started, and has GPU instances, but you have to set up CUDA and the deep learning libraries yourself.

If you prefer to work on a local computer, one strategy is to first get your program working on your local machine on the CPU, and then deploy it to the Amazon instance to run more efficiently on the GPU. You can do this easily by installing Anaconda Python on your computer, then installing the Keras library and its dependency TensorFlow (usually via conda install tensorflow followed by conda install keras; if that does not work, replace conda with pip, or sudo pip if you installed Python as root).

A few useful sources of information:


For this project, feel free to collaborate on solving the problem but write your code individually. In particular, do not copy code from other students or from online resources.

Assignment Overview

Your goal for this assignment is to implement an MVCNN 3D object classifier. Based upon a figure from their paper, the full CNN architecture looks like this:

We break this task into two parts. The first is implementing a single-view classifier, which always takes view 1 as input and produces the object class as output (e.g. airplane, bathtub, bed, etc.). The second is extending the single-view classifier into a multi-view classifier. Please submit code for both the single-view and multi-view classifiers.

Part I: Single view classifier (50%)

  1. For the dataset, we will use the same ModelNet-40 dataset as in the MVCNN paper. Since the MVCNN paper requires models to be rendered to images, to make your life easier we make available a ModelNet-40 dataset with all models pre-rendered to images. Please download it to your project 1 directory on your cloud instance and/or local computer.

  2. Again, to make your life easier, we have provided a Python ModelNet-40 loader for the above dataset. Please download this to your project directory and rename the extension to .py. The main function in this module is modelnet40_generator. This function returns a generator, which produces training/testing images and their corresponding classes for ModelNet-40. Please see the docstring for modelnet40_generator for extensive documentation on how it works, and the bottom of the module for an example of how to use it.

    If you are not familiar with the concept of a generator, this is important computer science knowledge, so please see the Wikipedia article on generators. You can also see the wiki article on how to implement generators in Python.
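    As a minimal illustration, a Python generator produces values lazily, one per request, rather than building the whole sequence in memory; execution pauses at each yield and resumes where it left off:

```python
def count_up_to(n):
    """A generator: pauses at each yield, resumes on the next request."""
    i = 0
    while i < n:
        yield i
        i += 1

gen = count_up_to(3)
print(next(gen))   # 0
print(list(gen))   # [1, 2] -- the generator resumes where it paused
```

    The ModelNet-40 loader works the same way, except each yielded value is an (image, label) pair.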

  3. For this project, we will fine-tune an existing CNN. We suggest using ResNet-50 (described in the residual networks paper), which is already included with Keras in keras.applications, with weights that have been pre-trained on ImageNet classification. We suggest starting from this example fine-tuning code, although you may also want to look through this tutorial on fine-tuning if you are not very familiar with the concept of fine-tuning.

  4. (25 points) Fine-tune your CNN (e.g. ResNet-50) on the ModelNet-40 dataset. You can do this by following along with the fine-tuning example: (1) instantiate a ResNet50 instance without the top layers, (2) add a Flatten layer followed by a dense (fully connected) layer with 40 softmax outputs, for the classification, (3) set the first p fraction of layers to be non-trainable (I used p=0.7, but you can experiment and see what gets the best results), and (4) call model.fit_generator() with the ModelNet-40 training and testing set generators as arguments. Make sure this initial fine-tuning network runs (you do not need to run it to convergence). Note: you will want to use the categorical_crossentropy loss.
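    Steps (1)-(3) can be sketched as follows. This is only a sketch, not the required implementation: the optimizer, input size, and p are choices to experiment with, and weights=None here only keeps the sketch self-contained (for the assignment, use weights='imagenet' to start from pre-trained weights):

```python
import keras
from keras import layers
from keras.applications import ResNet50

# (1) ResNet-50 without its top (classification) layers
base = ResNet50(include_top=False, weights=None, input_shape=(224, 224, 3))

# (2) Flatten, then a 40-way softmax dense layer for the ModelNet-40 classes
x = layers.Flatten()(base.output)
out = layers.Dense(40, activation='softmax')(x)
model = keras.Model(base.input, out)

# (3) Freeze the first p fraction of layers so only the top of the network trains
p = 0.7
for layer in model.layers[:int(p * len(model.layers))]:
    layer.trainable = False

model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])
# (4) model.fit_generator(train_gen, steps_per_epoch=..., validation_data=test_gen, ...)
```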

  5. (10 points) The above network is not very efficient, because the ModelNet-40 generator produces only a single image at a time, so the model effectively uses a batch size of 1. Implement a "batching" generator that takes as input an existing ModelNet-40 generator and produces mini-batches of size n. This can be done, for example, by collecting n elements from the ModelNet-40 generator and concatenating them along the batch dimension (dimension 0 in our case). The produced data should be in the same format as that of the original ModelNet-40 generator, just with a larger batch size. Use "batching" generators for both training and testing to improve efficiency.
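    One possible shape for such a wrapper, as a sketch only (it assumes the underlying generator yields (image, label) NumPy pairs with a leading batch dimension of 1, as the provided loader does):

```python
import numpy as np

def batching_generator(gen, n):
    """Wrap a single-sample generator so it yields mini-batches of size n.

    Assumes `gen` yields (image, label) pairs shaped (1, H, W, C) and
    (1, num_classes); output keeps the same format, with batch size n.
    """
    while True:
        images, labels = zip(*(next(gen) for _ in range(n)))
        # Concatenate along the batch dimension (axis 0)
        yield np.concatenate(images, axis=0), np.concatenate(labels, axis=0)
```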

  6. (5 points) In your writeup text file, report (A) the test accuracy of this initial fine-tuned network after convergence (you can stop training once the test accuracy levels off).

  7. (10 points) The MVCNN paper also uses dataset augmentation to improve generalization (the paper calls this "jittering"). The paper reports that two kinds of augmentation were used: (A) rotations by -45, 0, or 45 degrees (with equal probability), and (B) horizontal flips (with 50% probability). Given that the models are already oriented to be upright, which of these two augmentations (or both) do you think would be beneficial? Report (B) this in your writeup. Implement one of the dataset augmentation methods (each element of the mini-batch should be augmented independently). Report (C) in your writeup the test accuracy after convergence when augmentation is used.
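    For instance, a horizontal-flip wrapper over a batched generator might look like this sketch (NumPy only; names are illustrative, and each mini-batch element is flipped independently with the given probability):

```python
import numpy as np

def flip_augmenting_generator(gen, prob=0.5):
    """Wrap a batched generator, flipping each image in the mini-batch
    horizontally with probability `prob`, independently per element."""
    while True:
        x, y = next(gen)
        x = x.copy()  # avoid mutating the loader's arrays
        for i in range(x.shape[0]):
            if np.random.rand() < prob:
                x[i] = x[i, :, ::-1, :]  # reverse the width axis
        yield x, y
```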

  8. Save your final code for the single-view classifier as its own file.

    For reference, I obtained a test accuracy of about 89% with and without augmentation.

Part II: Multi-view classifier (50%)

  1. For the second part, we will implement the full network. Please see the figure above for a reminder of how it works. There are two CNN components: CNN1, which shares parameters across all views, and CNN2, a single network used to output the class scores.

  2. We suggest creating CNN1 by truncating ResNet-50, so that you can easily modify your existing code from Part I to solve Part II. Make a copy of your Part I code, under a new filename, that implements the multi-view classifier.

  3. (20 points) To construct this more complex network, we suggest using the Keras functional API. In particular, you can create a truncated ResNet model to serve as CNN1 with code such as the following:
        # Load ResNet-50 without its classification head, then truncate at STOP_LAYER:
        resnet = keras.applications.resnet50.ResNet50(include_top=False)
        resnet = keras.models.Model(resnet.input, resnet.layers[STOP_LAYER].output)
    Here STOP_LAYER is the index of the layer at which to truncate CNN1. In your writeup, report (D) at which layer you decided to truncate CNN1. You may want to see Table 1 of the ResNet paper for the list of layers; in your code you can also print resnet.layers. Next, construct a list of Input instances to represent the 12 input images for the multiple views, and pass each of them through your shared-parameter CNN1 (in the functional API, resnet(x) applies the model to tensor x). For details of how to do this, see the sections on multiple inputs in the Keras functional API documentation.

  4. (20 points) Implement the CNN2 part of the network. As in the MVCNN paper, use the Keras element-wise maximum function to take the maximum over all views, reducing your 12 tensors to a single tensor. Run this through one or more convolutional layers (you can experiment to find the best architecture and kernel size). We suggest using batch normalization after each convolutional layer to prevent vanishing and exploding gradients. Then, as in Part I, run the result through layers that flatten and apply a 40-way softmax. As a first step, train and test on the ModelNet-40 generators directly (no augmentation or batching yet), and make sure this works. The generators accept an argument single=False to indicate that multi-view data should be generated.
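    Conceptually, the view-pooling step takes an element-wise maximum across the 12 per-view feature tensors; in NumPy terms (an illustration only; in the Keras model itself, apply keras.layers.maximum to the 12 symbolic tensors):

```python
import numpy as np

# 12 per-view feature tensors, each shaped (batch, H, W, channels)
views = [np.random.rand(2, 7, 7, 64) for _ in range(12)]

# Element-wise maximum across views; the result keeps the per-view shape,
# so CNN2 sees a single (batch, H, W, channels) tensor
pooled = np.maximum.reduce(views)
```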

  5. (10 points) Implement dataset augmentation and batching, the same as in Part I. Report (E) in your writeup the final test accuracy achieved by your network, the architecture choices you made for CNN2, and how this accuracy compares with the state-of-the-art methods reported in the O-CNN paper. For reference, my network achieved 92.5% test accuracy after 20 epochs.

Extra credit: Best performance

  1. We will give +7 points of extra credit to the student who achieves the highest test accuracy after 20 epochs, +5 points to the student who achieves second-highest test accuracy, and +3 points to the student who achieves third-highest test accuracy. We will compute the accuracies ourselves based upon a single run of your multi-view CNN code (code that does not run will be disqualified from the contest). This reward is designed to encourage you to experiment with different architectures and hyperparameter settings to obtain the best performance.


Submit your assignment as a zip file. Please include your source code for both Part I and Part II as separate files, and include the dataset loader module (so your submission can be run as-is). Please also include a text-file writeup, writeup.txt, briefly describing how to run the two programs (if any arguments are needed) and reporting the bolded points (A-E) mentioned above.

Finally, submit your zip file to UVA Collab.