ACM Multimedia Systems Conference Dataset Archive

This page hosts traces from the 2013–2016 ACM Multimedia Systems Conferences (ACM MMSys).

Datasets from MMSys 2012 and MMSys 2011 are available for download here.


Use of the datasets in published work should be acknowledged by a full citation to the authors' papers at the MMSys conference:

Proceedings of ACM MMSys '16, Klagenfurt am Wörthersee, Austria, May 10-13, 2016

GSET Somi: A Game-Specific Eye Tracking Dataset for Somi

In this paper, we present an eye tracking dataset of computer game players who played the side-scrolling cloud game Somi. The game was streamed in the form of video from the cloud to the player. This dataset can be used for designing and testing game-specific visual attention models. The source code of the game is also available to facilitate further modifications and adjustments. For collecting this data, male and female candidates were asked to play the game in front of a remote eye-tracking device. For each player, we recorded gaze points, video frames of the gameplay, and mouse and keyboard commands. For each video frame, a list of its game objects with their locations and sizes was also recorded. This data, synchronized with eye-tracking data, allows one to calculate the amount of attention that each object or group of objects draws from each player. As a benchmark, we also show that various attention patterns can be identified among players.
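The per-object attention calculation described above can be sketched as follows. This is a minimal illustration only, not the authors' method: the record layout (a gaze point per frame plus a list of named bounding boxes) and the function name `attention_share` are assumptions, and the actual file format of the dataset may differ.

```python
from collections import Counter

def attention_share(frames):
    """Return the fraction of valid gaze samples that fall on each object.

    `frames` is a hypothetical layout: an iterable of (gaze, objects) pairs,
    where gaze is an (x, y) tuple or None when the tracker lost the eye, and
    objects is a list of (name, x, y, width, height) bounding boxes.
    """
    hits = Counter()
    total = 0
    for gaze, objects in frames:
        if gaze is None:
            continue  # skip frames without a valid gaze sample
        total += 1
        gx, gy = gaze
        for name, x, y, w, h in objects:
            # Count the sample if it lies inside the object's bounding box.
            if x <= gx < x + w and y <= gy < y + h:
                hits[name] += 1
    if total == 0:
        return {}
    return {name: count / total for name, count in hits.items()}

# Made-up frames for illustration only.
example = [
    ((10, 10), [("player", 0, 0, 20, 20), ("enemy", 50, 50, 10, 10)]),
    ((55, 52), [("player", 0, 0, 20, 20), ("enemy", 50, 50, 10, 10)]),
    ((12, 8), [("player", 0, 0, 20, 20)]),
    (None, [("player", 0, 0, 20, 20)]),
]
shares = attention_share(example)
```

A bounding-box hit test is the simplest choice; a study might instead weight samples by distance to the object centre or by fixation duration.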

GeoUGV: User-Generated Mobile Video Dataset with Fine Granularity Spatial Metadata

When analyzing and processing videos, it has become increasingly important in many applications to also consider contextual information, in addition to the content. With the ubiquity of sensor-rich smartphones, acquiring a continuous stream of geo-spatial metadata that includes the location and orientation of a camera together with the video frames has become practical. However, no such detailed dataset is publicly available. In this paper we present an extensive geo-tagged video dataset named GeoUGV that has been collected as part of the MediaQ and GeoVid projects. The key features of the dataset are that each video file is accompanied by a metadata sequence of geo-tags consisting of GPS locations, compass directions, and spatial keywords at fine-grained intervals. The GeoUGV dataset has been collected by volunteer users and its statistics can be summarized as follows: 2,397 videos containing 208,976 video frames that are geo-tagged, collected by 289 users in more than 20 cities across the world over a period of 10 years (2007–2016). We hope that this dataset will be useful for researchers, scientists and practitioners alike in their work.

Comprehensive Mobile Bandwidth Traces from Vehicular Networks

Bandwidth fluctuation in mobile networks severely affects the quality of service (QoS) of bandwidth-sensitive applications such as video streaming. Using bandwidth statistics, it is possible to predict the network behaviour and take proactive actions to counter network fluctuations, which in turn can improve the QoS. In this paper, we present comprehensive bandwidth datasets from extensive measurement campaigns conducted in Sydney on both 3G and 4G networks under vehicular driving conditions. A particularly distinguishing feature of our dataset is that we have collected data from repeated trips along a few routes. Thus our data can be useful to obtain statistically significant results on network performance in an urban setting. We outline the measurement methodology and present key insights obtained from the collected traces. We have made our dataset available to the wider research community.

  • Paper: Comprehensive Mobile Bandwidth Traces from Vehicular Networks
  • Authors: A. Bokani, M. Hassan, S. Kanhere, J. Yao, G. Zhong
  • Link:
  • Data set: Local copy pending

Right Inflight? A Dataset for Exploring the Automatic Prediction of Movies Suitable for a Watching Situation

The dataset Right Inflight was developed to support the exploration of the match between video content and the situation in which that content is watched. Specifically, we look at videos that are suitable to be watched on an airplane, where the main assumption is that viewers watch movies with the intent of relaxing and letting time pass quickly, despite the inconvenience and discomfort of flight. The aim of the dataset is to support the development of recommender systems, as well as computer vision and multimedia retrieval algorithms capable of automatically predicting which videos are suitable for inflight consumption. Our ultimate goal is to promote a deeper understanding of how people experience video content, and of how technology can support people in finding or selecting video content that supports them in regulating their internal states in certain situations. Right Inflight consists of 318 human-annotated movies, for which we provide links to trailers, a set of pre-computed low-level visual, audio and text features as well as user ratings. The annotation was performed by crowdsourcing workers, who were asked to judge the appropriateness of movies for inflight consumption.

Div150Multi: A Social Image Retrieval Result Diversification Dataset with Multi-topic Queries

This dataset is designed to support research in the areas of information retrieval that foster new technologies for improving both the relevance and the diversification of search results, with explicit focus on the social media context. The dataset consists of Creative Commons data for around 153 one-concept Flickr queries and 45,375 images for development and 139 Flickr queries (69 one-concept and 70 multi-concept) and 41,394 images for testing, together with metadata, Wikipedia pages and content descriptors for text and visual modalities. Data is annotated for the relevance and the diversity of the photos. An additional dataset used to train the credibility descriptors (an automatic estimation of the quality, i.e. correctness, of a particular user's tags) provides information for ca. 685 Flickr users and metadata for more than 3.5M images. Important: much of the information has been obtained by crawling the Internet and from Flickr. Every possible measure has been taken to ensure that the content has been released under a Creative Commons license that allows redistribution. However, the authors cannot fully guarantee that the collection contains absolutely no content without a Creative Commons license. Such content could potentially enter the collection if it was not correctly marked at the source. As for the content descriptors, features are provided on an as-is basis with no guarantee of being correct. The dataset was validated during the 2015 Retrieving Diverse Social Images Task at the MediaEval Benchmarking Initiative for Multimedia Evaluation.

Heimdallr: A Dataset for Sport Analysis

Heimdallr is a dataset that aims to serve two different purposes. The first purpose is action recognition and pose estimation, which requires a dataset of annotated sequences of athlete skeletons. We employed a crowdsourcing platform where people around the world were asked to annotate frames and obtained more than 3000 fully annotated frames for 42 different sequences with a variety of poses and actions. The second purpose is an improved understanding of crowdworkers, and for this purpose, we collected over 10000 pieces of written feedback from 592 crowdworkers. This is valuable information for crowdsourcing researchers who explore algorithms for worker quality assessment. In addition to the complete dataset, we also provide the code for the application that has been used to collect the data as open source software.

A new HD and UHD video eye tracking dataset

The emergence of the UHD video format induces larger screens and involves a wider stimulated visual angle. Therefore, its effect on visual attention can be questioned, since it can impact not only quality assessment and metrics but also the whole chain of video processing and creation. Moreover, changes in visual attention from different viewing conditions challenge visual attention models. In this paper, we present a new HD and UHD video eye tracking dataset composed of 37 high quality videos observed by more than 35 naive observers. This dataset can be used to compare viewing behavior and visual saliency in HD and UHD, as well as for any study on dynamic visual attention in videos.

SMART: a Light Field image quality dataset

In this article, the design of a Light Field image dataset is presented. The availability of an image dataset is useful for designing, testing, and benchmarking Light Field image processing algorithms. As a first step, the image content selection criteria have been defined based on selected image quality key attributes, i.e. spatial information, colorfulness, texture key features, depth of field, etc. Next, image scenes have been selected and captured using the Lytro Illum Light Field camera. The analysis performed shows that the considered set of images is sufficient for addressing a wide range of attributes relevant to assessing Light Field image quality.

USED: A Large Scale Social Event Detection Dataset

Event discovery from single pictures is a challenging problem that has raised significant interest in the last decade. During this time, a number of interesting solutions have been proposed to tackle event discovery in still images. However, a large scale benchmarking image dataset for the evaluation and comparison of event discovery algorithms from single images is still lagging behind. To this aim, in this paper we provide a large-scale properly annotated and balanced dataset of 490,000 images, covering every aspect of 14 different types of social events, selected among the most shared ones in the social network. In the dataset we tried our best to cover every aspect of the considered social events by collecting images for the same event-types with diverse contents in terms of viewpoints, colors, group pictures vs. single portrait and outdoor vs. indoor images, where the high variability of the represented information can be effectively explored to ensure better performances in event classification. Such a large-scale collection of event-related images is intended to become a powerful support tool for the research community in multimedia analysis by providing a common benchmark for training, testing, validation and comparison of existing and novel algorithms.

Datasets for AVC (H.264) and HEVC (H.265) for Evaluating Dynamic Adaptive Streaming over HTTP (DASH)

In this work we present datasets for both trace-based simulation and real-time testbed evaluation of Dynamic Adaptive Streaming over HTTP (DASH). Our trace-based simulation dataset provides a means of evaluation in frameworks such as NS-2 and NS-3, while our testbed evaluation dataset offers a means of analysing the delivery of content over a physical network and associated adaptation mechanisms at the client. Our datasets are available in both H.264 and H.265 with encoding rates comparable to the representations and resolutions of content distribution providers such as Netflix, Hulu and YouTube.

The goal of our dataset is to provide researchers with a dataset that is sufficiently large, in both the number and the duration of clips, and that provides a comparison between both encoding schemes. We provide options for evaluating not only different content and genres, but also the underlying encoding metrics, such as transmission cost, segment distribution (the range of the oscillation of the segment sizes) and associated delivery issues such as jitter and re-buffering. Finally, we also offer our datasets in a header-only compressed format, which allows researchers to download the entire dataset and uncompress locally, thus ensuring that our datasets are accessible both online via remote and local servers.
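As a rough illustration of the segment distribution metric mentioned above, the range of the oscillation of segment sizes can be computed from a list of per-segment sizes. This is a sketch under stated assumptions: the function name `segment_stats`, the byte units and the fixed segment duration are hypothetical and are not taken from the dataset's own tooling.

```python
def segment_stats(sizes_bytes, segment_seconds):
    """Summarise the size oscillation of one DASH representation.

    `sizes_bytes` is an assumed list of segment sizes in bytes and
    `segment_seconds` an assumed constant segment duration in seconds.
    """
    if not sizes_bytes:
        raise ValueError("at least one segment size is required")
    smallest = min(sizes_bytes)
    largest = max(sizes_bytes)
    total_bits = sum(sizes_bytes) * 8
    total_seconds = len(sizes_bytes) * segment_seconds
    return {
        "min_bytes": smallest,
        "max_bytes": largest,
        "range_bytes": largest - smallest,  # oscillation range of segment sizes
        "mean_bitrate_kbps": total_bits / total_seconds / 1000,
    }

# Made-up sizes for three 4-second segments.
stats = segment_stats([500_000, 750_000, 250_000], 4.0)
```

Comparing `range_bytes` between the H.264 and H.265 encodings of the same clip is one simple way to contrast the two schemes.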


Use of the datasets in published work should be acknowledged by a full citation to the authors' papers at the MMSys conference:

Proceedings of ACM MMSys '15, Portland, Oregon, March 18-20, 2015

Multi-sensor Concert Recording Dataset Including Professional and User-generated Content

We present a novel dataset for multi-view video and spatial audio. An ensemble of ten musicians from the BBC Philharmonic Orchestra performed in the orchestra's rehearsal studio in Salford, UK, on 25th March 2014. This presented a controlled environment in which to capture a dataset that could be used to simulate a large event, whilst allowing control over the conditions and performance. The dataset consists of hundreds of video and audio clips captured during 18 takes of performances, using a broad range of professional- and consumer-grade equipment, up to 4K video and high-end spatial microphones. In addition to the audiovisual essence, sensor metadata has been captured, and ground truth annotations, in particular for temporal synchronization and spatial alignment, have been created. A part of the dataset has also been prepared for adaptive content streaming. The dataset is released under a Creative Commons Attribution Non-Commercial Share Alike license and hosted on a specifically adapted content management platform.

Div150Cred: A Social Image Retrieval Result Diversification with User Tagging Credibility Dataset

In this paper we introduce a new dataset and its evaluation tools, Div150Cred, that was designed to support shared evaluation of diversification techniques in different areas of social media photo retrieval and related areas. The dataset comes with associated relevance and diversity assessments performed by human annotators. The data consists of 300 landmark locations represented via 45,375 Flickr photos, 16M photo links for around 3,000 users, metadata, Wikipedia pages and content descriptors for text and visual modalities. To facilitate distribution, only Creative Commons content was included in the dataset. The proposed dataset was validated during the 2014 Retrieving Diverse Social Images Task at the MediaEval Benchmarking Initiative.

A Scalable Video Coding Dataset and Toolchain for Dynamic Adaptive Streaming over HTTP

With video streaming becoming more and more popular, the number of devices that are capable of streaming videos over the Internet is growing. This leads to a heterogeneous device landscape with varying demands. Dynamic Adaptive Streaming over HTTP (DASH) offers an elegant solution to these demands. Smart adaptation logics are able to adjust the clients' streaming quality according to several (local) parameters. Recent research indicated benefits of blending Scalable Video Coding (SVC) with DASH, especially considering Future Internet architectures. However, except for the DASH Dataset with a single SVC encoded video, no other datasets are publicly available. The contribution of this paper is two-fold. First, a DASH/SVC dataset, containing multiple videos at varying bitrates and spatial resolutions including 1080p, is presented. Second, a toolchain for multiplexing SVC encoded videos is provided, therefore making our results reproducible and allowing researchers to generate their own datasets.

RAISE - A Raw Images Dataset for Digital Image Forensics

Digital forensics is a relatively new research area which aims at authenticating digital media by detecting possible digital forgeries. Indeed, the ever increasing availability of multimedia data on the web, coupled with the great advances reached by computer graphics tools, makes the modification of an image and the creation of visually compelling forgeries an easy task for any user. This in turn creates the need for reliable tools to validate the trustworthiness of the represented information. In such a context, we present here RAISE, a large dataset of 8156 high-resolution raw images, depicting various subjects and scenarios, properly annotated and available together with accompanying metadata. Such a wide collection of untouched and diverse data is intended to become a powerful resource for, but not limited to, forensic researchers by providing a common benchmark for a fair comparison, testing and evaluation of existing and next generation forensic algorithms. In this paper we describe how RAISE has been collected and organized, discuss how digital image forensics and many other multimedia research areas may benefit from this new publicly available benchmark dataset, and test a very recent forensic technique for JPEG compression detection.

YouTube Live and Twitch: A Tour of User-Generated Live Streaming Systems

User-generated live video streaming systems are services that allow anybody to broadcast a video stream over the Internet. These Over-The-Top services have recently gained popularity, in particular with e-sport, and can now be seen as competitors of traditional cable TV. In this paper, we present a dataset for further work on these systems. This dataset contains data on the two main user-generated live streaming systems: Twitch and the live service of YouTube. We collected three months of traces of these services from January to April 2014. Our dataset includes, at five-minute intervals, the identifier of the online broadcaster, the number of people watching the stream, and various other media information. In this paper, we introduce the dataset and present a preliminary study to show the size of the dataset and its potential. We first show that both systems generate significant traffic with frequent peaks at more than 1 Tbps. Thanks to more than a million unique uploaders, Twitch in particular is able to offer a rich service at any time. Our second main observation is that the popularity of these channels is more heterogeneous than what has been observed in other services gathering user-generated content.
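One way such five-minute snapshots could be turned into an aggregate traffic estimate is sketched below. This is illustrative only and is not the authors' analysis: the tuple layout, the presence of a per-stream bitrate field and the function name `traffic_per_snapshot` are assumptions, and the sample records are made up.

```python
from collections import defaultdict

def traffic_per_snapshot(records):
    """Estimate delivered traffic in bits per second for each snapshot.

    `records` is a hypothetical layout: an iterable of
    (timestamp, channel_id, viewers, bitrate_kbps) tuples.
    """
    totals = defaultdict(float)
    for timestamp, _channel, viewers, bitrate_kbps in records:
        # Each viewer is assumed to receive the full stream bitrate.
        totals[timestamp] += viewers * bitrate_kbps * 1000
    return dict(totals)

# Made-up records for illustration only.
records = [
    ("2014-01-06T12:00", "a", 1000, 2000),
    ("2014-01-06T12:00", "b", 500, 1000),
    ("2014-01-06T12:05", "a", 1200, 2000),
]
totals = traffic_per_snapshot(records)
peak_time = max(totals, key=totals.get)
```

Multiplying viewers by bitrate ignores buffering, rate adaptation and viewers who leave between snapshots, so the result is only a coarse upper-level estimate.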

The Toulouse Vanishing Points Dataset

In this paper we present the Toulouse Vanishing Points Dataset, a public database of photographs of Manhattan scenes taken with an iPad Air 1. The purpose of this dataset is the evaluation of vanishing point estimation algorithms. Its originality is the addition of Inertial Measurement Unit (IMU) data, synchronized with the camera, in the form of rotation matrices. Moreover, contrary to existing works, which provide reference vanishing points in the form of single points, we computed uncertainty regions.

Stanford I2V: A News Video Dataset for Query-by-Image Experiments

Reproducible research in the area of visual search depends on the availability of large annotated datasets. In this paper, we address the problem of querying a video database by images that might share some contents with one or more video clips. We present a new large dataset, called Stanford I2V. We have collected more than 3,800 hours of newscast videos and annotated more than 200 ground-truth queries. In the following, the dataset is described in detail, the collection methodology is outlined and retrieval performance for a benchmark algorithm is presented. These results may serve as a baseline for future research and provide an example of the intended use of the Stanford I2V dataset.

Data Set of Fall Events and Daily Activities from Inertial Sensors

Wearable sensors are becoming popular for remote health monitoring as technology improves and cost reduces. One area in which wearable sensors are increasingly being used is falls monitoring. The elderly, in particular, are vulnerable to falls and require continuous monitoring. Indeed, many attempts, with insufficient success, have been made towards accurate, robust and generic falls and Activities of Daily Living (ADL) classification. A major challenge in developing solutions for fall detection is access to a sufficiently large data set. This paper presents a description of the data set and the experimental protocols designed by the authors for the simulation of falls, near-falls and ADL. Forty-two volunteers were recruited to participate in an experiment that involved a set of scripted protocols. Four types of falls (forward, backward, lateral left and right) and several ADL were simulated. This data set is intended for the evaluation of fall detection algorithms by combining daily activities and transitions from one posture to another with falls. In our prior work, machine learning based fall detection algorithms were developed and evaluated. Results showed that our algorithm was able to discriminate between falls and ADL with an F-measure of 94%.

A Multi-Lens Stereoscopic Synthetic Video Dataset

This dataset paper describes a multi-lens stereoscopic synthetically generated video dataset and model. Creating a multi-lens video stream requires that the lenses be placed at a spacing of less than one inch. While such cameras exist on the market, they are not “professional” enough to allow for necessary things such as zoom-lens control or synchronization between cameras. This dataset provides 20 synthetic models, an associated multi-lens walkthrough, and the uncompressed video from its generation. This dataset can be used for multi-view compression research, view-interpolation, or other computer graphics related research.


Use of the datasets in published work should be acknowledged by a full citation to the authors' papers at the MMSys conference:

Proceedings of ACM MMSys '14, March 19 - March 21, 2014, Singapore, Singapore

Ultra high definition HEVC DASH data set

This is an Ultra High Definition HEVC DASH dataset ranging from HD to UHD in different bit rates. This data set may be used to simulate UHD DASH services, whether on-demand or live, using real-life professional quality content.

LaRED: A Large RGB-D Extensible Hand Gesture Dataset

This is a Large RGB-D Extensible hand gesture data set, recorded with Intel's newly developed short-range depth camera.

Div400: A Social Image Retrieval Result Diversification Dataset

This data set, Div400, was designed to support shared evaluation in different areas of social media photo retrieval, e.g., machine analysis (re-ranking, machine learning), human-based computation (crowdsourcing) or hybrid approaches (relevance feedback, machine-crowd integration).

Measuring DASH Streaming Performance from the End Users Perspective using Neubot

This data set provides data collected by a DASH module built on top of Neubot, an open source tool for the collection of network measurements.

World-Wide Scale Geotagged Image Dataset for Automatic Image Annotation and Reverse Geotagging

This is a dataset of geotagged photos on a world-wide scale. The dataset contains a sample of more than 14 million geotagged photos crawled from Flickr with the corresponding metadata.

ReSEED: Social Event dEtection Dataset

This set consists of about 430,000 photos from Flickr together with the underlying ground truth consisting of about 21,000 social events. All the photos are accompanied by their textual metadata. The ground truth for the event groupings has been derived from event calendars on the Web that have been created collaboratively by people.

Fashion 10000: An Enriched Social Image Dataset for Fashion and Clothing

The dataset contains more than 32000 images, their context and social metadata, related to the fashion and clothing domain.

The EBU MIM-SCAIE Content Set for Automatic Information Extraction on Broadcast Media

This data set has been made available by the European Broadcasting Union (EBU). The content in the set consists of broadcast media content collected from different broadcasters around the world. This content set is made available to the research community in order to evaluate automatic information extraction tools on this broadcast media. The set also contains ground truth data and annotations for several automatic information extraction tasks.

Soccer Video and Player Position Dataset

This is a dataset of body-sensor traces and corresponding videos from several professional soccer games captured in late 2013 at the Alfheim Stadium in Tromsø, Norway. Player data, including field position, heading, and speed, are sampled at 20 Hz using the highly accurate ZXY Sport Tracking system.

YawDD: A Yawning Detection Dataset

YawDD provides two video datasets of drivers with various facial characteristics, to be used for designing and testing algorithms and models for yawning detection.


Use of the datasets in published work should be acknowledged by a full citation to the authors' papers at the MMSys conference:

Proceedings of ACM MMSys '13, February 27 - March 1, 2013, Oslo, Norway

The 2012 Social Event Detection Dataset

More than 160 thousand Flickr photos and their accompanying metadata, as well as a list of 149 manually selected and annotated target events, each of which is defined as a set of relevant photos.

A Professionally Annotated and Enriched Multimodal Data Set on Popular Music

A multimodal data set of professionally annotated music, including editorial metadata about songs, albums, and artists, as well as MusicBrainz identifiers to facilitate linking to other data sets.

Commute Path Bandwidth Traces from 3G Networks: Analysis and Applications

Real-world measurements of throughput achieved at the application layer when adaptive HTTP streaming was performed over 3G networks using mobile devices.

Video Surveillance Online Repository (ViSOR)

An open platform for collecting, annotating, and sharing surveillance videos. Most of the included videos are annotated, based on a reference ontology which integrates hundreds of concepts, some of them coming from the LSCOM and MediaMill ontologies.

Fashion-focused Creative Commons Social dataset

A mix of general images as well as images that are focused on fashion (i.e., relevant to particular clothing items or fashion accessories). The dataset contains 4810 images and related metadata.

Blip10000: A social Video Dataset containing SPUG Content for Tagging and Retrieval

A dataset containing comprehensive semi-professional user-generated (SPUG) content, including audiovisual content, user-contributed metadata, automatic speech recognition transcripts, automatic shot boundary files, and social information for multiple 'social levels'.

The Jiku Mobile Video Dataset

A dataset containing videos that could represent characteristics of mobile videos captured in realistic scenarios, consisting of videos simultaneously recorded using mobile devices by multiple users attending performance events.

SopCast P2P Live Streaming Traces

Logs from SopCast, a very popular P2P live streaming application.

Monitoring Mobile Video Delivery to Android Devices

A dataset of wireless network behavior, geo-coordinates, and packet traces for popular streaming applications on Android certified devices, gathered in a 3G network for both HTTP and peer-to-peer video streaming applications.

Distributed DASH Dataset

D-DASH is a dataset of content for the Dynamic Adaptive Streaming over HTTP (DASH) standard from MPEG.

Consumer video dataset with marked head trajectories

A dataset gathered using a handheld camcorder and a mobile phone that includes ground truth data on person head trajectories and other people marked in the background in an MPEG-7-based metadata model.

Page last modified on September 27, 2016, at 11:22 AM