This poster describes a reinterpretation of Samuel Beckett’s theatrical text Play for virtual reality (VR). It is an aesthetic reflection on practice that follows up on a technical project description submitted to ISMAR 2017 [O’Dwyer et al. 2017]. Actors are captured in a green screen environment using free-viewpoint video (FVV) techniques, and the scene is built in a game engine, complete with binaural spatial audio and six degrees of freedom of movement. The project explores how ludic qualities in the original text help elicit the conversational and interactive specificities of the digital medium. The work affirms the potential for interactive narrative in VR, opens new experiences of the text, and highlights the reorganisation of the author-audience dynamic.
We present the first generative adversarial network (GAN) for natural image matting. Our novel generator network is trained to predict visually appealing alphas with the addition of an adversarial loss from a discriminator that is trained to classify well-composited images. Further, we improve existing encoder-decoder architectures to better deal with the spatial localization issues inherent in convolutional neural networks (CNNs) by using dilated convolutions to capture global context information without downscaling feature maps and losing spatial information. We present state-of-the-art results on the alphamatting online benchmark for the gradient error and comparable results for the other error measures. Our method is particularly well suited for fine structures like hair, which is of great importance in practical matting applications, e.g. in film/TV production.
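As an illustration of the dilated-convolution idea mentioned in this abstract, the following is a minimal NumPy sketch, not the network described in the paper. The function names `dilated_conv2d` and `receptive_field` are hypothetical, and zero padding and stride 1 are assumptions. It shows that a dilated kernel keeps the feature map at full resolution while the receptive field grows with the dilation rate.

```python
import numpy as np

def dilated_conv2d(image, kernel, dilation=1):
    """'Same'-padded 2D cross-correlation with a dilated (atrous) kernel.

    Assumes an odd kernel size, stride 1 and zero padding (illustrative choices).
    """
    kh, kw = kernel.shape
    # Effective kernel extent once (dilation - 1) gaps are inserted between taps.
    eff_h = (kh - 1) * dilation + 1
    eff_w = (kw - 1) * dilation + 1
    pad_h, pad_w = eff_h // 2, eff_w // 2
    padded = np.pad(image, ((pad_h, pad_h), (pad_w, pad_w)), mode="constant")
    out = np.zeros_like(image, dtype=float)
    rows, cols = image.shape
    for i in range(kh):
        for j in range(kw):
            # Each kernel tap reads a copy of the input shifted by i*dilation, j*dilation.
            di, dj = i * dilation, j * dilation
            out += kernel[i, j] * padded[di:di + rows, dj:dj + cols]
    return out

def receptive_field(kernel_size, dilations):
    """Per-axis receptive field of stacked stride-1 dilated convolutions."""
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf

# A single impulse is spread over a 5x5 footprint by a 3x3 kernel with dilation 2,
# while the output keeps the 8x8 input resolution.
impulse = np.zeros((8, 8))
impulse[4, 4] = 1.0
response = dilated_conv2d(impulse, np.ones((3, 3)), dilation=2)
print(response.shape, receptive_field(3, [1, 2, 4]))
```

Stacking layers with increasing dilation rates enlarges the context seen by each output pixel without any pooling or striding, which is the property the abstract relies on for fine structures.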
We present a scalable pipeline for Free-Viewpoint Video (FVV) content creation, considering also visualisation in Augmented Reality (AR) and Virtual Reality (VR). We support a range of scenarios where there may be a limited number of handheld consumer cameras, but also demonstrate how our method can be applied in professional multi-camera setups. Our novel pipeline extends many state-of-the-art techniques (such as structure-from-motion, shape-from-silhouette and multi-view stereo) and incorporates bio-mechanical constraints through 3D skeletal information as well as efficient camera pose estimation algorithms. We introduce multi-source shape-from-silhouette (MS-SfS) combined with fusion of different geometry data as crucial components for accurate reconstruction in sparse camera settings. Our approach is highly flexible and our results indicate suitability either for affordable content creation for VR/AR or for interactive FVV visualisation where a user can choose an arbitrary viewpoint or sweep between known views using view synthesis.
Since the early years of the twenty-first century, the performing arts have been party to an increasing number of digital media projects that bring renewed attention to questions about, on the one hand, new working processes involving capture and distribution techniques, and on the other hand, how particular works (with bespoke hardware and software) can exert an efficacy over how work is created by the artist/producer or received by the audience. The evolution of author/audience criteria demands that digital arts practice modify aesthetic and storytelling strategies towards types that are more appropriate to communicating ideas over interactive digital networks, wherein AR/VR technologies are rapidly becoming the dominant interface. This project explores these redefined criteria through a reimagining of Samuel Beckett’s Play (1963) for digital culture. This paper offers an account of the working processes and of the aesthetic and technical considerations that guide artistic decisions, and describes how we attempt to place the overall work in the state of the art.
Video surveillance has always had a negative connotation, among other reasons because of the loss of privacy and because it may not automatically increase public safety. If it were able to detect atypical (i.e. dangerous) situations in real time, autonomously and anonymously, this could change. A prerequisite for this is the automatic detection of possibly dangerous situations from video data. From trajectories derived from the video data, we then want to determine dangerous situations by detecting atypical trajectories. However, it is better to develop such a system without people being threatened or even harmed, and with them knowing that such a tracking system is installed. In the artistic project leave a trace the tracked people become actors and thus part of the installation. Visualization in real time allows interaction by these actors, which in turn creates many atypical interaction situations on which we could develop our situation detection.
Video surveillance has always had a negative connotation, among other reasons because of the loss of privacy and because it may not automatically increase public safety. If it were able to detect atypical (i.e. dangerous) situations in real time, autonomously and anonymously, this could change. A prerequisite for this is a reliable automatic detection of possibly dangerous situations from video data. This is done classically by object extraction and tracking. From the derived trajectories, we then want to determine dangerous situations by detecting atypical trajectories. However, due to ethical considerations it is better to develop such a system on data recorded without people being threatened or even harmed, and with them knowing that such a tracking system is installed. Another important point is that these situations do not occur very often in real, public CCTV areas and are captured properly even less often. In the artistic project leave a trace the tracked objects, people in an atrium of an institutional building, become actors and thus part of the installation. Visualisation in real time allows interaction by these actors, which in turn creates many atypical interaction situations on which we can develop our situation detection. The data set has evolved over three years and is hence huge. In this article we describe the tracking system and several approaches for the detection of atypical trajectories.
The “digital Michelangelo project” was a seminal computer vision project in the early 2000s that pushed the capabilities of acquisition systems and involved multiple people from diverse fields, many of whom are now leaders in industry and academia. Reviewing this project with modern eyes provides us with the opportunity to reflect on several issues, relevant now as then to the field of computer vision and research in general, that go beyond the technical aspects of the work.
This article was written in the context of a reading group competition at the week-long International Computer Vision Summer School 2017 (ICVSS) in Sicily, Italy. To deepen the participants’ understanding of computer vision and to foster a sense of community, various reading groups were tasked to highlight important lessons which may be learned from the provided literature, going beyond the contents of the paper. This report is the winning entry of this guided discourse (Fig. 1). The authors closely examined the origins, fruits and, most importantly, lessons about research in general which may be distilled from the “digital Michelangelo project”. Discussions leading to this report were held within the group as well as with Hao Li, the group mentor.
This paper addresses the problem of detecting and segmenting human instances in a point cloud. Both fields have been well studied during the last decades, showing impressive results not only in accuracy but also in computational performance. With the rapid spread of depth sensors, the need to improve existing state-of-the-art algorithms by integrating depth information as an additional constraint has become more apparent. Current challenges involve combining RGB and depth information for reasoning about the location and spatial extent of the object of interest. We make use of an improved deformable part model algorithm, which allows the individual parts to deform across multiple scales, to approximate the location of the person in the scene, and of a conditional random field energy function to specify the object’s spatial extent. Our proposed energy function models up to pairwise relations defined in the RGBD domain, enforcing label consistency for regions sharing similar unary and pairwise measurements. Experimental results show that our proposed energy function provides a fairly precise segmentation even when the resulting detection box is imprecise. Reasoning about the detection algorithm could potentially enhance the quality of the detection box, allowing the object of interest to be captured as a whole.
Deformable Part Models, RGBD Data, Conditional Random Fields, Graph Cuts, Human Recognition
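For orientation, an energy function that "models up to pairwise relations" generally takes the standard pairwise conditional random field form sketched below. The symbols are assumptions introduced here for illustration: $x_i$ is the label (person or background) of region $i$, $\mathcal{V}$ the set of regions, $\mathcal{E}$ the set of neighbouring region pairs, and $\psi_i$, $\psi_{ij}$ the unary and pairwise potentials. The abstract does not give the concrete RGBD potentials, so they are left unspecified.

```latex
E(\mathbf{x}) \;=\; \sum_{i \in \mathcal{V}} \psi_i(x_i)
\;+\; \sum_{(i,j) \in \mathcal{E}} \psi_{ij}(x_i, x_j),
\qquad x_i \in \{0, 1\}
```

For binary labels and submodular pairwise terms, an energy of this form can be minimised exactly with a graph cut, which is consistent with the "Graph Cuts" keyword listed above.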
Human monitoring and tracking has been a prominent research area for many scientists around the globe. Several algorithms have been introduced and improved over the years, eliminating false positives and enhancing monitoring quality. While the majority of approaches are restricted to the 2D and 2.5D domain, 3D still remains an unexplored field. Microsoft Kinect is a low cost commodity sensor extensively used by the industry and research community for several indoor applications. Within this framework, an accurate and fast-to-implement pipeline is introduced working in two main directions: pure 3D foreground extraction of moving people in the scene and interpretation of the human movement using an ellipsoid as a mathematical reference model. The proposed work is part of an industrial transportation research project whose aim is to monitor the behavior of people and make a distinction between normal and abnormal behaviors in public train wagons. Ground truth was generated by the OpenNI human skeleton tracker and used for evaluating the performance of the proposed method.
Calibration, Bundle Adjustment, 3D Object Extraction, Ellipsoid, 3D Human, Tracking
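One simple way to use an ellipsoid as a reference model for a person is sketched below. It assumes, purely for illustration, that the ellipsoid is taken from the mean and covariance of the person's extracted 3D points. The abstract does not state how the ellipsoid is actually fitted, and the name `fit_ellipsoid` and the `n_sigma` scaling are hypothetical.

```python
import numpy as np

def fit_ellipsoid(points, n_sigma=2.0):
    """Summarise an (N, 3) point cluster by a covariance ellipsoid.

    Returns the centre, the semi-axis lengths (longest first) and the
    corresponding unit axis directions as the columns of a 3x3 matrix.
    n_sigma is an assumed scale factor, not a value from the paper.
    """
    points = np.asarray(points, dtype=float)
    centre = points.mean(axis=0)
    cov = np.cov(points - centre, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]           # reorder: longest axis first
    semi_axes = n_sigma * np.sqrt(np.maximum(eigvals[order], 0.0))
    return centre, semi_axes, eigvecs[:, order]

# Synthetic "standing person": a cluster elongated along the vertical (z) axis.
rng = np.random.default_rng(0)
cloud = rng.normal(size=(5000, 3)) * [0.15, 0.10, 0.45] + [1.0, 2.0, 0.9]
centre, semi_axes, axes = fit_ellipsoid(cloud)
print(centre, semi_axes, axes[:, 0])
```

Tracking the centre over time gives a 3D trajectory, and the direction of the longest axis gives a coarse posture cue (for example upright versus lying), which is one plausible way to interpret movement with such a model.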
Human tracking in computer vision is a very active and growing research area. Previous works analyze this topic by applying algorithms and feature extraction in 2D, while 3D tracking is a largely unexplored field, especially concerning multi-camera systems. The approach discussed in this paper focuses on the detection and tracking of human postures using multiple RGB-D data together with stereo cameras. We use low-cost devices, such as the Microsoft Kinect and a people counter based on a stereo system. The novelty of our technique concerns the synchronization of multiple devices and the determination of their exterior and relative orientation in space, based on a common world coordinate system. Furthermore, this is used for applying bundle adjustment to obtain a unique 3D scene, which is then used as a starting point for detecting and tracking humans and for extracting significant metrics from the acquired datasets. In this article, the approaches for the determination of the exterior and absolute orientation are described. Subsequently, it is shown how a common point cloud is formed. Finally, some results for object detection and tracking, based on 3D point clouds, are presented.
Multi Camera System, Bundle Adjustment, 3D Similarity Transformation, 3D Fused Human Cloud
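The 3D similarity transformation named in the keywords can be illustrated with the well-known closed-form least-squares solution (Umeyama's method), sketched here with NumPy. This is a generic sketch under the assumption that corresponding 3D points are available in both sensor frames. It is not necessarily the estimation procedure used in the paper, and `similarity_transform` is a hypothetical name.

```python
import numpy as np

def similarity_transform(src, dst):
    """Least-squares scale s, rotation R and translation t with dst ~= s * R @ src + t.

    src, dst: (N, 3) arrays of corresponding points (N >= 3, not collinear).
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / len(src)                  # cross-covariance of the two point sets
    U, S, Vt = np.linalg.svd(cov)
    D = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        D[2, 2] = -1.0                          # guard against a reflection
    R = U @ D @ Vt
    var_s = (xs ** 2).sum() / len(src)          # variance of the source points
    s = np.trace(np.diag(S) @ D) / var_s
    t = mu_d - s * R @ mu_s
    return s, R, t

# Synthetic check: sensor B's frame is rotated 30 degrees about z and shifted.
rng = np.random.default_rng(1)
cloud_a = rng.uniform(-1.0, 1.0, size=(20, 3))
angle = np.deg2rad(30.0)
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle),  np.cos(angle), 0.0],
                   [0.0,            0.0,           1.0]])
cloud_b = 1.02 * cloud_a @ R_true.T + np.array([0.5, -1.0, 2.0])
s, R, t = similarity_transform(cloud_a, cloud_b)
merged = np.vstack([s * cloud_a @ R.T + t, cloud_b])   # both clouds in one frame
print(round(s, 3), t)
```

Applying the estimated transformation to every point of one sensor's cloud and concatenating the results is the simplest way to form a common point cloud. A subsequent bundle adjustment, as described in the abstract, would refine such initial orientations jointly.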
Human detection and tracking has been a prominent research area for scientists around the globe. State-of-the-art algorithms have been implemented, refined and accelerated to significantly improve the detection rate and eliminate false positives. While 2D approaches are well investigated, 3D human detection and tracking is still a largely unexplored research field. In both the 2D and the 3D case, introducing a multi-camera system could vastly improve the accuracy and confidence of the tracking process. Within this work, a quality evaluation is performed on an indoor tracking system with multiple RGB-D cameras to examine how camera calibration and pose can affect the quality of human tracks in the scene, independently of the detection and tracking approach used.
After performing a calibration step on every Kinect sensor, state-of-the-art single-camera pose estimators were evaluated to check how well the poses are estimated using planar objects such as an ordinary chessboard. With this information, a bundle block adjustment and ICP were performed to verify the accuracy of the single pose estimators in a multi-camera configuration. Results have shown that single-camera estimators provide high-accuracy results of less than half a pixel, so that the bundle adjustment converges after very few iterations. With regard to ICP, the relative information between cloud pairs is more or less preserved, giving a low fitting score between concatenated pairs. Finally, sensor calibration proved to be an essential step for achieving maximum accuracy in the point clouds generated by each sensor, and therefore in the accuracy of the resulting 3D trajectories.
Calibration, Single Camera Orientation, Multi Camera System, Bundle Block Adjustment, ICP, 3D human tracking
Human extraction and tracking is an ongoing field on which many researchers have been working for more than 20 years. Although several approaches in the 2D domain have been introduced, the 3D literature is limited and requires further investigation. Within this framework, an accurate and fast-to-implement pipeline is introduced working in two main directions: pure 3D foreground extraction of moving people in the scene and interpretation of the human movement using an ellipsoid as a mathematical reference model. The proposed work is part of an industrial transportation research project whose aim is to monitor the behaviour of people and make a distinction between normal and abnormal behaviours in public train wagons using a network of low-cost commodity sensors such as the Microsoft Kinect.
The detection and tracking of people with camera systems is a very interesting and rapidly developing research area, and it plays a major role in security research in particular. Previous work concentrates on 2D algorithms, whereas detection, extraction and tracking in 3D is still a fairly unexplored area, especially with regard to multi-camera systems. Our approach focuses on the detection and tracking of persons in local public transport from the data of several stereo and RGB-D systems (RGB-D denotes the combination of grey/colour and distance information, as provided for example by the Microsoft Kinect). Key points of the approach described here concern the synchronisation of multiple acquisition systems and the determination of their orientation parameters in space. In addition, a bundle block adjustment is used to geometrically generate a unified 3D scene, which then forms a starting point for the detection and tracking of people in the observed space. For this purpose, significant metrics are derived from the acquired datasets. The contribution presents and discusses an overview of the point clouds generated by several RGB-D and stereo sensors and of the data derived from them.