CVPR 2018 Workshop and Challenge: Automated Analysis of Marine Video for Environmental Monitoring
Data Set Download Links
All data currently released for the challenge is listed below. The training data is meant for the performers to train algorithms on.
There are two tarballs: one containing the imagery and another containing the truth for those images.
The data releases comprise images and annotations from five different data sources, with six datasets in total.
- HabCam: habcam_seq0
- MOUSS: mouss_seq0, mouss_seq1
- AFSC DropCam: afsc_seq0
- MBARI: mbari_seq0
- NWFSC: nwfsc_seq0
Each dataset contains different types of imagery: different lighting conditions, camera angles, and wildlife. How the released data was selected depends on the nature of the data in the entire dataset. The following gallery contains an example image from each dataset.
The HabCam imagery is collected from a downward-looking RGB camera about 1.5m above the ocean floor. The platform includes synchronized flashes near the cameras, as there is no ambient illumination at the collection depths (>200m). The images tend to show scallops, sand dollars, rocks, sand and the occasional fish. The annotations include scallops and fish. The initial dataset has 10465 images. The images are 2720x1024. Note that HabCam imagery was captured in stereo: each image in this dataset contains both the left and right camera views placed side by side (stacked horizontally). Annotations are only provided for the left camera.
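Because each HabCam frame holds both camera views side by side, a detector trained on the left-camera annotations will usually want only one half of the frame. The following is a minimal sketch of computing the crop boxes. It assumes that the left camera occupies the left half of the frame, which the text above does not state explicitly, and the function name `split_stereo_frame` is hypothetical. The boxes use the (x0, y0, x1, y1) convention accepted by, for example, Pillow's `Image.crop`.

```python
def split_stereo_frame(width, height):
    """Return (left_box, right_box) crop boxes as (x0, y0, x1, y1).

    Assumption: the two camera views are placed side by side and the
    left camera is the left half of the frame.
    """
    if width % 2 != 0:
        raise ValueError("expected an even width for a side-by-side stereo frame")
    half = width // 2
    left_box = (0, 0, half, height)
    right_box = (half, 0, width, height)
    return left_box, right_box


# HabCam frames are 2720x1024, so each view is 1360x1024.
left_box, right_box = split_stereo_frame(2720, 1024)
```

If the left view really is the left half, annotation coordinates should already be valid inside the left crop, since cropping from x0 = 0 does not shift them.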
The MOUSS data is collected from a stationary, horizontal, grayscale camera 1-2 meters above the ocean floor, using ambient illumination. Typically the camera is on the bottom for 15 minutes in each position. The initial data is from three such collections, consisting of 159 images with species labels on all fish close enough to the camera to be identified by a human expert. Fish that are smaller than this (about 30 pixels in length) will be disregarded in scoring, i.e. missing them will not count against recall and detecting them will not count as false alarms. The test data without released annotations will include images from the training collections as well as novel collections. The images have a resolution of 968x728.
The AFSC data is collected from an underwater remotely operated vehicle (ROV) equipped with an RGB video camera looking horizontally. The overall dataset consists of a number of videos from different ROV missions. Because the platform moves slowly, some of the images have a fair amount of spatial overlap; such groups of images are called “clusters”. The released data is randomly sampled from some of the clusters, while other clusters are sequestered and will only be released in the test data. There are 571 images in this dataset. The image resolution is 2112x2816.
The MBARI dataset was collected by the Monterey Bay Aquarium Research Institute. It contains a single video consisting of 666 RGB frames. Each image has a 1920x1080 resolution.
The NWFSC data was also collected from an ROV, but looking downward at the ocean floor. The spatial overlap in this collection is minimal, and the released data is randomly sampled from the overall set. There are 123 images in the initial release and these images have a resolution of 2448x2050. The annotations in this dataset are actually keypoints instead of bounding boxes.
Ground Truth Annotation Format
The annotations are formatted according to the MSCOCO standard. This is a JSON-compliant format which contains the images included, the annotations for those images, and the categories for those annotations.
See Sections 1 and 2 on the official documentation page for more information.
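As a rough illustration of the three top-level parts described above (images, annotations, categories), the sketch below parses a tiny hand-written MSCOCO-style document and groups annotations by image. The file name, IDs, and box values are invented for illustration and do not come from the challenge data. In MSCOCO, a `bbox` is `[x, y, width, height]`.

```python
import json
from collections import defaultdict

# Hand-written stand-in for a ground-truth file; every value is made up.
EXAMPLE = """
{
  "images": [
    {"id": 1, "file_name": "example_0001.png", "width": 968, "height": 728}
  ],
  "categories": [
    {"id": 7, "name": "Fish", "supercategory": null}
  ],
  "annotations": [
    {"id": 100, "image_id": 1, "category_id": 7, "bbox": [10, 20, 50, 40]}
  ]
}
"""


def index_annotations(dataset):
    """Group annotation dicts by the image they belong to."""
    by_image = defaultdict(list)
    for ann in dataset["annotations"]:
        by_image[ann["image_id"]].append(ann)
    return by_image


dataset = json.loads(EXAMPLE)
anns_by_image = index_annotations(dataset)
category_names = {cat["id"]: cat["name"] for cat in dataset["categories"]}
```

For a real file, `json.load` on the opened `.mscoco.json` file would replace the `json.loads(EXAMPLE)` call.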
Results Annotation Format
The result annotations should be formatted according to the MSCOCO results standard. In particular, detections should be in either the object detection or keypoint detection format. Both are JSON-compliant formats in which, for each detection, you include the image ID and category ID (both of which are defined in the input MSCOCO file), the score of your detector, and either the bounding box or the keypoints of the detection.
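The fields listed above can be sketched as follows. This is an illustrative sketch of the standard MSCOCO results layout, not the challenge's own submission tooling. The helper names `bbox_result` and `keypoint_result` and all IDs and scores are hypothetical. The keypoint helper sets each visibility flag to 1, following the common MSCOCO recommendation for results files.

```python
import json


def bbox_result(image_id, category_id, bbox, score):
    """One object-detection result; bbox is (x, y, width, height)."""
    x, y, w, h = bbox
    return {
        "image_id": image_id,
        "category_id": category_id,
        "bbox": [float(x), float(y), float(w), float(h)],
        "score": float(score),
    }


def keypoint_result(image_id, category_id, points, score):
    """One keypoint-detection result; points is a list of (x, y) pairs."""
    flat = []
    for x, y in points:
        # MSCOCO flattens keypoints as x, y, visibility triples.
        flat.extend([float(x), float(y), 1])
    return {
        "image_id": image_id,
        "category_id": category_id,
        "keypoints": flat,
        "score": float(score),
    }


results = [
    bbox_result(1, 7, (10, 20, 50, 40), 0.9),
    keypoint_result(1, 7, [(35, 40)], 0.5),
]
payload = json.dumps(results)  # a results file is a JSON list of such dicts
```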
The tarballs are structured as detailed below:
The imagery tarball contains 6 folders, each corresponding to a dataset. Each folder contains the images belonging to that dataset either in jpeg or png format. The image names have not been changed and should be considered arbitrary.
The root annotations tarball has 6 symlinks and 6 folders. Each folder corresponds to a dataset and contains 5 different “flavors” of the dataset (see dataset details for more information on different flavors). The 6 symlinks link to the default “coarse-bbox-only” flavor of the dataset.
- afsc_seq0.mscoco.json -> afsc_seq0/afsc_seq0-coarse-bbox-only.mscoco.json
- habcam_seq0.mscoco.json -> habcam_seq0/habcam_seq0-coarse-bbox-only.mscoco.json
- mouss_seq0.mscoco.json -> mouss_seq0/mouss_seq0-coarse-bbox-only.mscoco.json
- mouss_seq1.mscoco.json -> mouss_seq1/mouss_seq1-coarse-bbox-only.mscoco.json
- mbari_seq0.mscoco.json -> mbari_seq0/mbari_seq0-coarse-bbox-only.mscoco.json
- nwfsc_seq0.mscoco.json -> nwfsc_seq0/nwfsc_seq0-coarse-bbox-only.mscoco.json
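The symlink targets above follow a regular naming pattern, so the path of each default flavor can be built from the dataset name. The sketch below only reproduces that pattern. The root folder name `annotations` and the helper name `default_flavor_path` are assumptions for illustration, since the actual name of the extracted folder is not given here.

```python
from pathlib import PurePosixPath

DATASETS = [
    "afsc_seq0", "habcam_seq0", "mbari_seq0",
    "mouss_seq0", "mouss_seq1", "nwfsc_seq0",
]


def default_flavor_path(root, name):
    """Path of the default coarse-bbox-only file for one dataset."""
    return PurePosixPath(root) / name / f"{name}-coarse-bbox-only.mscoco.json"


# "annotations" is a hypothetical extraction folder.
paths = {name: default_flavor_path("annotations", name) for name in DATASETS}
```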
The six original datasets consist of several disparate annotation formats. Some objects were annotated using boxes, others using lines, and others using points. Furthermore, the raw class labelings were inconsistent between datasets. For these reasons we have taken steps to preprocess and standardize the data, which itself was a challenge. First, all datasets have been converted into the MSCOCO format.
To both capture the original nature of the datasets and provide ready-to-use annotations, we made the decision to create 4 flavors of the annotations. For each dataset we create variants with either coarse or fine-grained categories (see notes about category standardization) and variants with or without the more challenging keypoint annotations (see notes about keypoint annotations). We also include a version with the original raw categories.
Thus, for each dataset (afsc_seq0, habcam_seq0, mbari_seq0, mouss_seq0, mouss_seq1, nwfsc_seq0) there are 5 annotation files.
Category standardization and hierarchy
In an effort to standardize the class labels between the different datasets we have relabeled the categories in the original datasets by mapping each category to the appropriate and most specific scientific organism name. This mapping defines the fine-grained categorization.
Because many classes only had a few examples we made the choice to coarsen the categorization and merge related classes together (e.g. all types of rockfish, greenlings, etc. were merged into the Scorpaeniformes category). This reduction significantly increases the number of examples-per-class for most categories.
In both the coarse and fine-grained cases, we provide a category hierarchy, created using the NCBI taxonomic database. This encodes the information that annotations originally labeled as “Rockfish” might reasonably be labeled as “Sebastes maliger” or “Sebastes ruberrimus” in the fine-grained case. An example in the coarse-grained case is the category “Fish” and its children “Pleuronectiformes” (flat fish) and “NotPleuronectiformes” (round fish). The category hierarchy is encoded as a tree (actually a forest) in the MSCOCO data using the “supercategory” attribute. Note that each dataset contains annotations from both leaf and non-leaf categories.
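The hierarchy can be recovered from the category list alone. The sketch below is one possible way to do so, assuming that “supercategory” holds the name of the parent category and that root categories have a null or missing value. Neither detail is confirmed by the text above. The category IDs are invented, and the names come from the coarse-grained example.

```python
from collections import defaultdict

# Invented IDs; names taken from the coarse-grained example above.
categories = [
    {"id": 1, "name": "Fish", "supercategory": None},
    {"id": 2, "name": "Pleuronectiformes", "supercategory": "Fish"},
    {"id": 3, "name": "NotPleuronectiformes", "supercategory": "Fish"},
]


def build_forest(categories):
    """Return (roots, children) where children maps parent name -> child names."""
    children = defaultdict(list)
    roots = []
    for cat in categories:
        parent = cat.get("supercategory")
        if parent:
            children[parent].append(cat["name"])
        else:
            roots.append(cat["name"])
    return roots, dict(children)


def ancestors(name, categories):
    """List the ancestors of a category, nearest parent first."""
    parent_of = {cat["name"]: cat.get("supercategory") for cat in categories}
    chain = []
    current = parent_of.get(name)
    while current:
        chain.append(current)
        current = parent_of.get(current)
    return chain


roots, children = build_forest(categories)
```

Since annotations may use non-leaf categories, a lookup such as `ancestors` can help decide whether a prediction of a child category is compatible with a coarser ground-truth label.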
Images without annotations
In some cases there may be an image that has no annotations, but contains objects of interest. In an effort to provide information about when this is the case we augment each image object in the MSCOCO json dataset with an attribute “has_annots”. The value of “has_annots” can be either true, false, or null. If it is true, then the image contains objects of interest even if there are no annotation objects associated with it (e.g. if its keypoint annotations were removed). If “has_annots” is false, then that image was explicitly labeled as having no objects of interest. Otherwise, if “has_annots” is null, then the image might or might not have objects of interest. However, in most circumstances if “has_annots” is null the image contains no objects of interest.
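The three cases above can be turned into a small helper when deciding which images are safe to use as negatives during training. This is a sketch of one reading of the rules. The function name `image_status` and the returned labels are hypothetical.

```python
def image_status(image, num_annotations):
    """Classify an image dict using its annotation count and has_annots flag."""
    if num_annotations > 0:
        return "annotated"
    flag = image.get("has_annots")
    if flag is True:
        # Objects exist but their annotations are absent, so do not
        # treat this image as a negative example.
        return "objects present but unannotated"
    if flag is False:
        return "explicit negative"
    # null (None in Python): unknown, although usually no objects.
    return "unknown, probably negative"


status_true = image_status({"id": 1, "has_annots": True}, 0)
status_false = image_status({"id": 2, "has_annots": False}, 0)
status_null = image_status({"id": 3, "has_annots": None}, 0)
```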
Bounding Box, Line, and Keypoint Annotations
Originally, the datasets contained annotations in the forms of boxes, lines, and points. In our preprocessing step, we have converted line annotations into boxes by interpreting each line as the diameter of a circle and taking the box that circumscribes that circle. The majority of annotations are provided with bounding box annotations. However, there are a significant number of images where each object is labeled with a keypoint.
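The line-to-box conversion described above amounts to a few lines of geometry. The sketch below is an illustration of that rule rather than the organizers' actual preprocessing code. It assumes an axis-aligned box in MSCOCO `[x, y, width, height]` form and does not clip the box to the image bounds.

```python
import math


def line_to_bbox(x1, y1, x2, y2):
    """Box around the circle whose diameter is the segment (x1, y1)-(x2, y2)."""
    center_x = (x1 + x2) / 2
    center_y = (y1 + y2) / 2
    diameter = math.hypot(x2 - x1, y2 - y1)
    radius = diameter / 2
    # The circumscribing axis-aligned box is a square with side = diameter.
    return [center_x - radius, center_y - radius, diameter, diameter]


box = line_to_bbox(0, 0, 6, 8)  # segment of length 10 centered at (3, 4)
```

Note that the resulting box is always square, so it can be noticeably larger than an elongated object whose length the line was measuring.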
For these keypoint annotations, the points are not always in consistent locations on the object. Often the point does not even directly touch the object of interest. The general rule used when creating the keypoint annotations is that each point should be able to be unambiguously associated with a single object. This means that using these keypoint annotations as groundtruth for training an object detector is a tricky challenge.
For these reasons, keypoint annotations will not count towards final scoring. We provide them for optional use in training a bounding box detector. For convenience, we also provide a flavor of each dataset where the keypoint annotations have been removed. Note that we do not remove the image from the dataset, only the annotations. This can cause an image to appear as if it has no objects in it when in fact it does (see the section on images without annotations for more details).
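The bbox-only flavors are already provided, but the idea behind them can be sketched as follows. This is an illustration only. It assumes that keypoint annotations are the ones lacking a “bbox” field, which is not confirmed by the text above, and the function name `drop_keypoint_annotations` is hypothetical.

```python
def drop_keypoint_annotations(dataset):
    """Return a copy of the dataset with non-bbox annotations removed.

    Images are kept; an image that loses all of its annotations is
    flagged with has_annots = True so it is not mistaken for a negative.
    """
    kept = [ann for ann in dataset["annotations"] if "bbox" in ann]
    kept_ids = {ann["image_id"] for ann in kept}
    removed_ids = {
        ann["image_id"] for ann in dataset["annotations"] if "bbox" not in ann
    }
    images = []
    for image in dataset["images"]:
        image = dict(image)  # avoid mutating the caller's data
        if image["id"] in removed_ids and image["id"] not in kept_ids:
            image["has_annots"] = True
        images.append(image)
    return {**dataset, "images": images, "annotations": kept}


# Invented example: image 1 has a box, image 2 only has a keypoint.
example = {
    "images": [{"id": 1, "has_annots": None}, {"id": 2, "has_annots": None}],
    "annotations": [
        {"id": 10, "image_id": 1, "category_id": 7, "bbox": [10, 20, 50, 40]},
        {"id": 11, "image_id": 2, "category_id": 7, "keypoints": [35, 40, 1]},
    ],
}
bbox_only = drop_keypoint_annotations(example)
```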
The following gallery illustrates images with different styles of annotations.
Phase1 Dataset Statistics
The following table summarizes statistics for each dataset. The “roi shapes” row indicates the number of annotations of each type (e.g. bbox, keypoints, line) in the original data. The “# negative images” row indicates the number of images with no objects of interest. In the case of nwfsc_seq0, images were explicitly labeled as negative, but for habcam_seq0, images without annotations might contain unannotated objects of interest.
|# negative images||0||maybe: 31011||0||0||0||279|
|roi shapes||keypoints: 4587||line: 41752, bbox: 5611||bbox: 175||bbox: 1628||keypoints: 307|
For each dataset we summarize the number of annotations for each coarse category.
Finally, the following trees illustrate the coarse-grained and fine-grained category hierarchy. The suffix of each node is the number of annotations in the phase1 data with that label in all datasets. We then summarize this data on a per-dataset basis.