We present Forge, a volumetric video processing platform running on the cloud that works with different camera configurations and adapts to the user's needs. From high-end studios with dozens of synchronised cameras to scenes casually captured with smartphones, Forge allows creators to produce high-quality 3D human content autonomously with the equipment they already have, in an affordable way, unlocking new and accessible ways to create 3D human content at scale.
Building on a poster presentation at Siggraph 2018, this article describes an investigation of interactive narrative in virtual reality (VR) through Samuel Beckett’s theatrical text Play. Actors are captured in a green screen environment using free-viewpoint video (FVV). Built in a game engine, the scene is complete with binaural spatial audio and six degrees of freedom of movement. The project explores how ludic qualities in the original text elicit the conversational and interactive specificities of the digital medium. The work affirms potential for interactive narrative in VR, opens new experiences of the text, and highlights the reorganisation of the author–audience dynamic.
The method comprises: providing a plurality of images of a scene captured by a plurality of image capturing devices (101); providing silhouette information of at least one object in the scene (102); generating a point cloud for the scene in 3D space using the plurality of images (103); extracting an object point cloud from the generated point cloud, the object point cloud being a point cloud associated with the at least one object in the scene (104); estimating a 3D shape volume of the at least one object from the silhouette information (105); and combining the object point cloud and the shape volume of the at least one object to generate a three-dimensional model (106). An apparatus for generating a 3D model and a computer readable medium for generating the 3D model are also described.
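Steps (105) and (106) above can be illustrated with a minimal sketch. It assumes two orthographic views (along the z and x axes), binary silhouette masks on a tiny voxel grid, and a simple fusion rule that keeps only object points lying inside the carved silhouette volume. These choices and all function names are hypothetical and are not taken from the method itself, which does not specify how the two sources are combined.

```python
import math

# Illustrative sketch only: the orthographic cameras and the
# "keep points inside the hull" rule are assumptions, not the claimed method.

def carve_visual_hull(front_mask, side_mask, size):
    """Return the set of voxels (x, y, z) consistent with both silhouettes.

    front_mask[y][x] is the silhouette seen along the z axis,
    side_mask[y][z] is the silhouette seen along the x axis.
    """
    hull = set()
    for x in range(size):
        for y in range(size):
            for z in range(size):
                if front_mask[y][x] and side_mask[y][z]:
                    hull.add((x, y, z))
    return hull


def filter_points_by_hull(points, hull):
    """Keep only the 3D points whose containing voxel lies inside the hull."""
    kept = []
    for px, py, pz in points:
        voxel = (math.floor(px), math.floor(py), math.floor(pz))
        if voxel in hull:
            kept.append((px, py, pz))
    return kept


SIZE = 4
# 1 marks object pixels; both views see a 2x2 square in the middle.
front = [[0, 0, 0, 0],
         [0, 1, 1, 0],
         [0, 1, 1, 0],
         [0, 0, 0, 0]]
side = [[0, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 1, 1, 0],
        [0, 0, 0, 0]]

hull = carve_visual_hull(front, side, SIZE)
# Hypothetical object point cloud; the last two points are outliers.
cloud = [(1.5, 1.5, 1.5), (2.2, 1.1, 2.9), (0.3, 0.3, 0.3), (3.5, 1.5, 1.5)]
model_points = filter_points_by_hull(cloud, hull)
print(len(hull), model_points)
```

In this toy setup the silhouette volume acts as a gate that rejects stray points from the stereo-style point cloud, while the point cloud supplies surface detail inside the volume.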
This demo paper describes a project that engages cutting-edge free viewpoint video (FVV) techniques for developing content for an augmented reality prototype. The article traces the evolutionary process from concept, through narrative development, to completed AR prototypes for the HoloLens and handheld mobile devices. It concludes with some reflections on the affordances of the various hardware formats and posits future directions for the research.
This poster describes a reinterpretation of Samuel Beckett’s theatrical text Play for virtual reality (VR). It is an aesthetic reflection on practice that follows up on a technical project description submitted to ISMAR 2017 [O’Dwyer et al. 2017]. Actors are captured in a green screen environment using free-viewpoint video (FVV) techniques, and the scene is built in a game engine, complete with binaural spatial audio and six degrees of freedom of movement. The project explores how ludic qualities in the original text help elicit the conversational and interactive specificities of the digital medium. The work affirms the potential for interactive narrative in VR, opens new experiences of the text, and highlights the reorganisation of the author-audience dynamic.
We present the first generative adversarial network (GAN) for natural image matting. Our novel generator network is trained to predict visually appealing alphas with the addition of an adversarial loss from a discriminator that is trained to classify well-composited images. Further, we improve existing encoder-decoder architectures to better deal with the spatial localization issues inherent in convolutional neural networks (CNNs) by using dilated convolutions to capture global context information without downscaling feature maps and losing spatial information. We present state-of-the-art results on the alphamatting online benchmark for the gradient error and give comparable results on the others. Our method is particularly well suited for fine structures like hair, which is of great importance in practical matting applications, e.g. in film/TV production.
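The role of dilated convolutions mentioned above can be illustrated with a small 1D sketch. It is not the network from the paper; the function names, the odd-length kernel and the zero padding are assumptions made for illustration. It shows that dilation widens the receptive field while the output keeps the input resolution.

```python
def dilated_conv1d(signal, kernel, dilation):
    """Same-size 1D cross-correlation with zero padding and a dilation factor.

    Assumes an odd-length kernel so that it can be centred on each sample.
    """
    centre = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, weight in enumerate(kernel):
            j = i + (k - centre) * dilation  # taps are spread 'dilation' apart
            if 0 <= j < len(signal):
                acc += weight * signal[j]
        out.append(acc)
    return out


def receptive_field(kernel_size, dilations):
    """Receptive field of stacked stride-1 layers with the given dilations."""
    field = 1
    for d in dilations:
        field += (kernel_size - 1) * d
    return field


impulse = [0, 0, 0, 0, 1, 0, 0, 0, 0]
response = dilated_conv1d(impulse, [1, 1, 1], dilation=2)
print(response)                         # same length as the input
print(receptive_field(3, [1, 1, 1]))    # three ordinary 3-tap layers
print(receptive_field(3, [1, 2, 4]))    # same layers with growing dilation
```

With the same number of weights, the stack with dilations 1, 2 and 4 sees more than twice as many input samples as the undilated stack, and no downscaling step is needed to get there.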
We present a scalable pipeline for Free-Viewpoint Video (FVV) content creation, considering also visualisation in Augmented Reality (AR) and Virtual Reality (VR). We support a range of scenarios where there may be a limited number of handheld consumer cameras, but also demonstrate how our method can be applied in professional multi-camera setups. Our novel pipeline extends many state-of-the-art techniques (such as structure-from-motion, shape-from-silhouette and multi-view stereo) and incorporates bio-mechanical constraints through 3D skeletal information as well as efficient camera pose estimation algorithms. We introduce multi-source shape-from-silhouette (MS-SfS) combined with fusion of different geometry data as crucial components for accurate reconstruction in sparse camera settings. Our approach is highly flexible and our results indicate suitability either for affordable content creation for VR/AR or for interactive FVV visualisation where a user can choose an arbitrary viewpoint or sweep between known views using view synthesis.
Since the early years of the twenty-first century, the performing arts have been party to an increasing number of digital media projects that bring renewed attention to questions about, on the one hand, new working processes involving capture and distribution techniques, and on the other hand, how particular works, with bespoke hardware and software, can exert an efficacy over how work is created by the artist/producer or received by the audience. The evolution of author/audience criteria demands that digital arts practice modify aesthetic and storytelling strategies towards types that are more appropriate to communicating ideas over interactive digital networks, wherein AR/VR technologies are rapidly becoming the dominant interface. This project explores these redefined criteria through a reimagining of Samuel Beckett’s Play (1963) for digital culture. This paper offers an account of the working processes, the aesthetic and technical considerations that guide artistic decisions, and how we attempt to place the overall work in the state of the art.
Video surveillance has always had a negative connotation, among other reasons because of the loss of privacy and because it may not automatically increase public safety. If it were able to detect atypical (i.e. dangerous) situations in real time, autonomously and anonymously, this could change. A prerequisite for this is the automatic detection of possibly dangerous situations from video data. From the trajectories derived from the video data, we then want to determine dangerous situations by detecting atypical trajectories. However, it is better to develop such a system without people being threatened or even harmed, and with their knowledge that such a tracking system is installed. In the artistic project leave a trace, the tracked people become actors and thus part of the installation. Visualization in real time allows interaction by these actors, which in turn creates many atypical interaction situations on which we can develop our situation detection.
Video surveillance has always had a negative connotation, among other reasons because of the loss of privacy and because it may not automatically increase public safety. If it were able to detect atypical (i.e. dangerous) situations in real time, autonomously and anonymously, this could change. A prerequisite for this is a reliable automatic detection of possibly dangerous situations from video data. This is done classically by object extraction and tracking. From the derived trajectories, we then want to determine dangerous situations by detecting atypical trajectories. However, due to ethical considerations it is better to develop such a system on data recorded without people being threatened or even harmed, and with their knowledge that such a tracking system is installed. Another important point is that these situations do not occur very often in real, public CCTV areas and are captured properly even less often. In the artistic project leave a trace, the tracked objects, people in an atrium of an institutional building, become actors and thus part of the installation. Visualisation in real time allows interaction by these actors, which in turn creates many atypical interaction situations on which we can develop our situation detection. The data set has evolved over three years and hence is very large. In this article we describe the tracking system and several approaches for the detection of atypical trajectories.
The “digital Michelangelo project” was a seminal computer vision project in the early 2000s that pushed the capabilities of acquisition systems and involved multiple people from diverse fields, many of whom are now leaders in industry and academia. Reviewing this project with modern eyes provides us with the opportunity to reflect on several issues, as relevant now as then to the field of computer vision and to research in general, that go beyond the technical aspects of the work.
This article was written in the context of a reading group competition at the week-long International Computer Vision Summer School 2017 (ICVSS) in Sicily, Italy. To deepen the participants' understanding of computer vision and to foster a sense of community, various reading groups were tasked with highlighting important lessons that may be learned from the provided literature, going beyond the contents of the paper. This report is the winning entry of this guided discourse (Fig. 1). The authors closely examined the origins, fruits and, most importantly, lessons about research in general which may be distilled from the “digital Michelangelo project”. Discussions leading to this report were held within the group as well as with Hao Li, the group mentor.
Object recognition is a natural process of the human brain that is performed in the visual cortex and relies on a binocular depth perception system that renders a three-dimensional representation of the objects in a scene. Computer and software systems have long been used to simulate the perception of three-dimensional environments with the aid of sensors that capture real-time images. Such images are used as input data for further analysis and for the development of algorithms, an essential ingredient for simulating the complexity of human vision, so as to achieve scene interpretation for object recognition similar to the way the human brain perceives it. The rapid pace of technological advancement in hardware and software is continuously bringing the machine-based process for object recognition nearer to the human vision prototype. The key in this field is the development of algorithms that achieve robust scene interpretation. A great deal of significant work has been carried out successfully over the years in 2D object recognition, as opposed to 3D. Within this context, this dissertation aims to contribute to the enhancement of 3D object recognition: a better interpretation and understanding of reality and of the relationships between objects in a scene. Using low-cost commodity sensors such as Microsoft Kinect, RGB and depth data of a scene have been retrieved and manipulated in order to generate human-like visual perception data. The goal is to show how RGB and depth information can be utilised to develop a new class of 3D object recognition algorithms, analogous to the perception process of the human brain.
3D Human Segmentation, Object Detection, Conditional Random Fields, Kinect Sensor, RGBD Data, 3D Reconstructions, Bundle Adjustment, ICP registration
This paper addresses the problem of detecting and segmenting human instances in a point cloud. Both problems have been well studied over the last decades, with impressive results not only in accuracy but also in computational performance. With the rapid uptake of depth sensors, the need to improve existing state-of-the-art algorithms by integrating depth information as an additional constraint has become more evident. Current challenges involve combining RGB and depth information for reasoning about the location and spatial extent of the object of interest. We make use of an improved deformable part model algorithm, which allows the individual parts to deform across multiple scales, to approximate the location of the person in the scene, and of a conditional random field energy function to specify the object’s spatial extent. Our proposed energy function models up to pairwise relations defined in the RGBD domain, enforcing label consistency for regions sharing similar unary and pairwise measurements. Experimental results show that our proposed energy function provides a fairly precise segmentation even when the resulting detection box is imprecise. Reasoning about the detection algorithm could potentially enhance the quality of the detection box, allowing the object of interest to be captured as a whole.
Deformable Part Models, RGBD Data, Conditional Random Fields, Graph Cuts, Human Recognition
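The idea of an energy with unary and pairwise terms over RGBD regions, as described in the abstract above, can be sketched as follows. This is a generic Potts-style formulation with a contrast-sensitive weight, chosen for illustration; the abstract does not give the exact form of the proposed energy, and the function names, the feature layout (grey value, depth in metres) and the `beta` parameter are assumptions.

```python
import math

def pairwise_weight(feat_a, feat_b, beta=1.0):
    """Contrast-sensitive weight: close to 1 when two regions look alike."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(feat_a, feat_b))
    return math.exp(-beta * sq_dist)


def crf_energy(labels, unary, edges, features, beta=1.0):
    """Sum of per-region label costs plus penalties for label changes.

    unary[i][l] is the cost of giving region i the label l;
    a label change across an edge costs more when the regions are similar.
    """
    data_term = sum(unary[i][labels[i]] for i in range(len(labels)))
    smooth_term = 0.0
    for i, j in edges:
        if labels[i] != labels[j]:
            smooth_term += pairwise_weight(features[i], features[j], beta)
    return data_term + smooth_term


# Three hypothetical regions: (grey value, depth). Regions 0 and 1 look alike.
features = [(0.9, 1.0), (0.9, 1.0), (0.1, 3.0)]
unary = [[2.0, 0.2],   # region 0 prefers label 1 (person)
         [1.0, 0.8],   # region 1 is ambiguous
         [0.1, 2.0]]   # region 2 prefers label 0 (background)
edges = [(0, 1), (1, 2)]

consistent = crf_energy([1, 1, 0], unary, edges, features)
inconsistent = crf_energy([1, 0, 0], unary, edges, features)
print(consistent, inconsistent)
```

Cutting between the two similar regions is penalised heavily, while cutting at the depth discontinuity is almost free, so the lower-energy labelling keeps regions 0 and 1 together. Minimising such an energy over all labellings is typically done with graph cuts, which this sketch does not implement.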
Human monitoring and tracking has been a prominent research area for many scientists around the globe. Several algorithms have been introduced and improved over the years, eliminating false positives and enhancing monitoring quality. While the majority of approaches are restricted to the 2D and 2.5D domains, 3D remains an unexplored field. Microsoft Kinect is a low-cost commodity sensor used extensively by industry and the research community for several indoor applications. Within this framework, an accurate and fast-to-implement pipeline is introduced that works in two main directions: pure 3D foreground extraction of moving people in the scene and interpretation of the human movement using an ellipsoid as a mathematical reference model. The proposed work is part of an industrial transportation research project whose aim is to monitor the behavior of people and make a distinction between normal and abnormal behaviors in public train wagons. Ground truth was generated by the OpenNI human skeleton tracker and used for evaluating the performance of the proposed method.
Calibration, Bundle Adjustment, 3D Object Extraction, Ellipsoid, 3D Human, Tracking
Human tracking in computer vision is a very active and growing research area. Previous works approach this topic by applying algorithms and feature extraction in 2D, while 3D tracking is still a largely unexplored field, especially concerning multi-camera systems. The approach discussed in this paper focuses on the detection and tracking of human postures using multiple RGB-D data together with stereo cameras. We use low-cost devices, such as Microsoft Kinect and a people counter based on a stereo system. The novelty of our technique concerns the synchronization of multiple devices and the determination of their exterior and relative orientation in space, based on a common world coordinate system. Furthermore, this is used for applying bundle adjustment to obtain a unique 3D scene, which is then used as a starting point for detecting and tracking humans and for extracting significant metrics from the acquired datasets. In this article, the approaches for the determination of the exterior and absolute orientation are described. Subsequently, it is shown how a common point cloud is formed. Finally, some results for object detection and tracking, based on 3D point clouds, are presented.
Multi Camera System, Bundle Adjustment, 3D Similarity Transformation, 3D Fused Human Cloud
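Forming a common point cloud from several sensors, as outlined in the abstract above, amounts to mapping each sensor's points into the shared world coordinate system with a 3D similarity transformation. The sketch below assumes that the transformation parameters are already known and, for brevity, that the rotation is about the vertical axis only. The function name and the example poses are hypothetical.

```python
import math

def to_world(points, scale, yaw_deg, translation):
    """Apply X_world = scale * R_z(yaw) * X_sensor + translation to each point."""
    c = math.cos(math.radians(yaw_deg))
    s = math.sin(math.radians(yaw_deg))
    tx, ty, tz = translation
    world = []
    for x, y, z in points:
        world.append((scale * (c * x - s * y) + tx,
                      scale * (s * x + c * y) + ty,
                      scale * z + tz))
    return world


# The same local measurement seen by two sensors with different poses.
cloud_a = [(1.0, 0.0, 0.5)]
cloud_b = [(1.0, 0.0, 0.5)]

# Sensor A defines the world frame; sensor B is rotated by 90 degrees
# and shifted 2 m along the world x axis (hypothetical values).
fused = (to_world(cloud_a, 1.0, 0.0, (0.0, 0.0, 0.0))
         + to_world(cloud_b, 1.0, 90.0, (2.0, 0.0, 0.0)))
print(fused)
```

In practice the scale, rotation and translation would come from the orientation and bundle adjustment steps; a full 3x3 rotation matrix would replace the single yaw angle used here.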
Human detection and tracking has been a prominent research area for several scientists around the globe. State-of-the-art algorithms have been implemented, refined and accelerated to significantly improve the detection rate and eliminate false positives. While 2D approaches are well investigated, 3D human detection and tracking is still an unexplored research field. In both the 2D and 3D cases, introducing a multi-camera system could greatly improve the accuracy and confidence of the tracking process. Within this work, a quality evaluation is performed on a multi RGB-D camera indoor tracking system to examine how camera calibration and pose can affect the quality of human tracks in the scene, independently of the detection and tracking approach used.
After performing a calibration step on every Kinect sensor, state-of-the-art single-camera pose estimators were evaluated to check how well poses are estimated from planar objects such as an ordinary chessboard. With this information, a bundle block adjustment and ICP were performed to verify the accuracy of the single pose estimators in a multi-camera configuration. The results show that single-camera estimators are highly accurate, with errors of less than half a pixel, so that the bundle adjustment converges after very few iterations. With ICP, the relative information between cloud pairs is largely preserved, giving a low fitting score between concatenated pairs. Finally, sensor calibration proved to be an essential step for achieving maximum accuracy in the point clouds generated by each sensor, and therefore in the accuracy of the resulting 3D trajectories.
Calibration, Single Camera Orientation, Multi Camera System, Bundle Block Adjustment, ICP, 3D human tracking
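A common way to express pose accuracy in pixels is the reprojection error of known target points such as chessboard corners. The sketch below assumes that the half-pixel figure in the abstract above refers to this kind of error, which the abstract does not state explicitly. It uses an ideal pinhole model without lens distortion and takes the points as already expressed in the camera frame. The intrinsics and observations are made-up values.

```python
import math

def project(point, fx, fy, cx, cy):
    """Project a 3D point given in the camera frame with a pinhole model."""
    x, y, z = point
    return (fx * x / z + cx, fy * y / z + cy)


def rms_reprojection_error(points_3d, observed_px, fx, fy, cx, cy):
    """Root mean square distance between projected and observed pixels."""
    squared = 0.0
    for point, (u_obs, v_obs) in zip(points_3d, observed_px):
        u, v = project(point, fx, fy, cx, cy)
        squared += (u - u_obs) ** 2 + (v - v_obs) ** 2
    return math.sqrt(squared / len(points_3d))


# Hypothetical intrinsics and two chessboard corners in the camera frame.
FX, FY, CX, CY = 500.0, 500.0, 320.0, 240.0
corners_3d = [(0.0, 0.0, 2.0), (0.2, 0.1, 2.0)]
detected_px = [(320.3, 240.4), (370.0, 264.6)]

rms = rms_reprojection_error(corners_3d, detected_px, FX, FY, CX, CY)
print(round(rms, 3))
```

A value below 0.5 here would correspond to the sub-pixel accuracy reported for the single-camera estimators, under the assumption stated above.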
Human extraction and tracking is an ongoing field in which many researchers have been working for more than 20 years. Although several approaches in the 2D domain have been introduced, the 3D literature is limited and requires further investigation. Within this framework, an accurate and fast-to-implement pipeline is introduced that works in two main directions: pure 3D foreground extraction of moving people in the scene and interpretation of the human movement using an ellipsoid as a mathematical reference model. The proposed work is part of an industrial transportation research project whose aim is to monitor the behaviour of people and make a distinction between normal and abnormal behaviours in public train wagons using a network of low-cost commodity sensors such as the Microsoft Kinect sensor.
The detection and tracking of people with camera systems is a very interesting and rapidly developing research area and plays a major role in security research in particular. Previous work concentrates on 2D algorithms, whereas detection, extraction and tracking in 3D is still a fairly unexplored area, especially with regard to multi-camera systems. Our approach focuses on the detection and tracking of people in local public transport using data from several stereo and RGB-D systems (RGB-D denotes the combination of grey/colour and distance information, as provided, for example, by the Microsoft Kinect). Key points of the approach described here concern the synchronisation of multiple acquisition systems and the determination of their orientation parameters in space. In addition, a bundle block adjustment is used to generate a geometrically uniform 3D scene, which then forms the starting point for the detection and tracking of people in the observed space. To this end, significant metrics are derived from the acquired data sets. The article presents and discusses an overview of the point clouds generated by several RGB-D and stereo sensors and of the data derived from them.
This master's thesis examines the use of a multi-resolution Active Shape Model (ASM) applied to facial features, utilizing the Viola/Jones face detector. The method, initially introduced by Cootes et al., requires good initial pose parameter values for placing a face model from its local system into the image’s system. This is one of the most critical parts of the process, on which the convergence of the method depends. For this reason, the Viola/Jones detector is used to first detect the face and subsequently estimate the initial pose parameters for positioning the face model in the search image. The face detector and the quality of the model’s initial position were tested on face images provided by the Milborrow / University of Cape Town (MUCT) online database. For building a face model, a set of training images provided by Cootes was used, and the search images were chosen randomly from the same training set.
Experiments made initially on some frontal upright images showed that the face detector succeeded in all images and that the placement of the face model was quite accurate in most cases. Subsequently, an evaluation of the quality of the model fit using the multi-resolution active shape model approach showed that the method converged quite well for the inner part of the face, but in some cases was less precise for the outer part.
This thesis is about creating a number of projective reconstructions of objects from a set of point correspondences measured in a pair of images. From these points, the Fundamental Matrix is determined first, and the Essential Matrix is subsequently derived from it. In both cases, 4 different types of reconstructions are created (2 with linear DLT intersections and 2 with non-linear ones). The main difference is that the reconstructions arising from the Fundamental Matrix lack the internal (calibration) information of the cameras and hence are projectively distorted. In contrast, the reconstruction arising from the Essential Matrix is similar to the real object, since the pair of projection camera matrices computed includes the interior orientation of the images. For this purpose, algorithms have been developed in MATLAB which take as input the coordinates of the corresponding points in two images, calculate first the Fundamental Matrix and subsequently the Essential Matrix from the former. Then, a number of pairs of projective camera matrices are computed using the Fundamental and Essential Matrices, whereby, in the photogrammetric sense, the projective camera matrices reflect the interior and relative orientations of the cameras. Next, an intersection is programmed using the linear and non-linear methods of the Direct Linear Triangulation (DLT). In this thesis, the fundamental theoretical background is presented, as well as the analysis of the algorithms developed for this purpose. Finally, a presentation, analysis and testing of the application algorithms with simulated and real data is included.
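The step from the Fundamental Matrix to the Essential Matrix mentioned above follows the standard relation E = K2^T F K1, where K1 and K2 are the calibration matrices of the two cameras. The sketch below is in Python rather than the MATLAB used in the thesis, and it is not the thesis code: it builds a synthetic pair of cameras with made-up intrinsics and relative orientation, forms F from them, and checks both the epipolar constraint and the recovery of E. All names and values are illustrative assumptions.

```python
import math

def matmul(a, b):
    """Product of two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def transpose(a):
    return [[a[j][i] for j in range(3)] for i in range(3)]

def matvec(a, v):
    return [sum(a[i][k] * v[k] for k in range(3)) for i in range(3)]

def skew(t):
    """Cross-product matrix [t]x, so that matvec(skew(t), v) equals t x v."""
    tx, ty, tz = t
    return [[0.0, -tz, ty], [tz, 0.0, -tx], [-ty, tx, 0.0]]

def essential_from_fundamental(f, k1, k2):
    """E = K2^T F K1."""
    return matmul(transpose(k2), matmul(f, k1))

# Hypothetical shared intrinsics and their closed-form inverse.
fx, fy, cx, cy = 800.0, 800.0, 320.0, 240.0
K = [[fx, 0.0, cx], [0.0, fy, cy], [0.0, 0.0, 1.0]]
K_inv = [[1.0 / fx, 0.0, -cx / fx], [0.0, 1.0 / fy, -cy / fy], [0.0, 0.0, 1.0]]

# Hypothetical relative orientation: camera 2 = R * X + t in camera-1 coordinates.
angle = math.radians(10.0)
R = [[math.cos(angle), 0.0, math.sin(angle)],
     [0.0, 1.0, 0.0],
     [-math.sin(angle), 0.0, math.cos(angle)]]
t = [-1.0, 0.0, 0.1]

E_true = matmul(skew(t), R)                           # E = [t]x R
F = matmul(transpose(K_inv), matmul(E_true, K_inv))   # F = K^-T E K^-1

# Project one 3D point into both images (homogeneous pixel coordinates).
X = [0.5, -0.2, 4.0]
x1 = matvec(K, X)
x2 = matvec(K, [a + b for a, b in zip(matvec(R, X), t)])

# Epipolar constraint x2^T F x1, which should vanish up to rounding.
residual = sum(p * q for p, q in zip(x2, matvec(F, x1)))
E = essential_from_fundamental(F, K, K)
print(residual)
```

With real correspondences, F would first be estimated from the measured points (for instance with the eight-point algorithm) and would only satisfy the constraint approximately. The calibrated E is what allows a reconstruction similar to the real object rather than a projectively distorted one.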