This WP aims to build synthetic audiovisual objects from camera and “phone” views of natural objects, for use in a fully immersive virtual environment. The critical challenges will be the naturalness of the composed immersive audiovisual scenes, including the accuracy of audiovisual synchronisation.
Natural visual 3D object modelling: Automatic modelling using 2D and 3D, mono, stereo and multiple camera views will support manual modelling within graphics packages. Static and dynamic 3D data from multiple cameras will be used to obtain reference models for a given class of objects (e.g. faces), while 2D views of an individual object will be processed to produce an individual 3D model ready for animation. The concepts of feature detection, 3D correspondence field, principal subspace and discriminating models for 3D shape and texture will serve as the starting point of our developments, in parallel with image based rendering (IBR) techniques. Next, neural networks will be explored for creating 3D models by means of neuro-informative zones that can define polygon vertices, e.g. the lips, the mouth extremities and the main vertices of the nose. Finally, a method based on camera defocusing will be investigated for 3D object modelling. The feasibility of the approaches will be verified for human faces using head views with general pose, illumination, and face decorations (moustache, glasses, beard).
Manufactured objects (e.g. vehicles, roads, amusement parks, buildings) will be modelled in computer graphics environments (optionally augmented by 3D scanner, laser range finder, and video camera) using 3D natural textures and compared with automatic modelling from 2D views. Moreover, the integration of complementary video and laser sensors in 3D reconstruction techniques will be further explored in order to obtain both 3D model accuracy and high 3D spatial resolution. By this means we will create high quality, photo-realistic, dimensionally accurate 3D models for a wide range of distances. Finally, we will investigate the feasibility of radial basis functions for modelling architectural objects from images of real monuments.
Natural audio object modelling: Automatic modelling based on sample audio recordings will cover: an individual human voice model; sound models of musical instruments; models of birds’ voices. The accuracy of the audio models will be verified by psychoacoustic experiments and, for human voice modelling, additionally in a text to speech application. In particular, musical sound synthesis based on physical modelling will be used to create virtual instruments. Interfaces to the control engine will be designed and implemented.
Audiovisual synchronisation: Dynamic morphing models will be compared and the most efficient one for immersive environments will be chosen in order to synchronise the voice or sound of an audio object with the object’s visual appearance. Tools for audiovisual synchronisation will be developed and verified in several generic applications: reading head application integrated with text to speech functionality;
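As an illustration of what physical modelling synthesis of a virtual instrument can look like, the sketch below uses the well-known Karplus-Strong plucked-string algorithm. It is only a minimal example chosen for brevity, not the model this WP commits to; the function name `pluck` and all parameter values (sample rate, damping, seed) are assumptions made for the example.

```python
import random
from collections import deque


def pluck(frequency_hz, duration_s, sample_rate=8000, damping=0.996, seed=0):
    """Synthesise a plucked-string tone with the Karplus-Strong algorithm."""
    rng = random.Random(seed)
    # The delay-line length sets the pitch: one pass through the line is one period.
    period = max(2, round(sample_rate / frequency_hz))
    # The initial noise burst stands in for the energy injected by the pluck.
    line = deque(rng.uniform(-1.0, 1.0) for _ in range(period))
    samples = []
    for _ in range(int(duration_s * sample_rate)):
        first = line.popleft()
        # Averaging two neighbouring samples is a simple low-pass filter that
        # models frequency-dependent losses along the string.
        line.append(damping * 0.5 * (first + line[0]))
        samples.append(first)
    return samples


tone = pluck(frequency_hz=220.0, duration_s=0.5)
print(len(tone))
```

The control parameters (`frequency_hz`, `damping`) hint at the kind of interface a control engine could expose for such an instrument.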
Audiovisual scene rendering and interaction
This WP aims to implement hybrid audiovisual scene rendering using hybrid (natural/synthetic) camera and “microphone” 3D models. Moreover, an animation and interaction control engine for the fully virtual interactive immersive environment will be added. The critical challenge will be merging natural audiovisual objects, represented by audiovisual data streams, with synthetic audiovisual objects so as to avoid spatial and temporal collisions while creating realistic behaviour of objects in the 3D audiovisual scene.
Rendering of hybrid audiovisual scene: The hybrid (natural/synthetic) camera and microphone models will be designed and implemented. We will also investigate the problem of image based rendering for complex scenes in which individual objects were previously viewed from mutually incompatible viewpoints. A 3D sound rendering module will be developed. In particular, the efficient spatialization of sound sources in a virtual environment, using information on the geometry of the environment itself, will be a concern in the design of virtual microphones. Finally, the single and multi view audiovisual rendering engine will be built. The Web3D format will be used for model representation in storage and transmission. For rendering, Web3D viewers will be used initially and later replaced by our own renderer based on OpenGL (to obtain the final interfacing with the animation and interaction control engine of the immersive audiovisual environment).
Animation and interactivity in hybrid audiovisual scene: The animation and interaction control engine to be used in the immersive environment will be designed and implemented. Synthetic object control (by text, voice and/or sensorial devices) will be incorporated, together with an interface to the natural object tracker (WP4), to avoid collisions and enhance the naturalness of the immersive environment.
AV content coding
In WP2.1, the coding of AV material will include the development of object-based, scalable and error-resilient methods aimed at optimising perceptual quality through the efficient use of available resources. Joint source-channel coding will also be used with fine granular scalability (FGS). Adaptive error control will be investigated with a view to improving the performance of the H.264/MPEG-4 AVC standard under error-prone conditions. WP2.1 will also focus on the scalable and error-resilient compression of 3D AV content. Research will also explore the use of metadata in the encoding of still pictures and video sequences in order to improve the coding efficiency of standard compression algorithms.
AV content transcoding
WP2.2 addresses content adaptation for inter-network communications. In this WP, compressed-domain downscaling techniques for object-based video data will be developed, aiming at scalability of shape, motion and texture data. New transcoding algorithms will be developed with a view to converting object-based AV data from DVB quality (e.g. MPEG-2) to 3G quality (e.g. MPEG-4, H.264). The research activities will also focus on developing new strategies for 2D/3D transcoding. 3D to 2D content conversion will enable users with limited display and mobile capabilities to access a 2D version of the high-detail 3D video scene. Techniques for converting 2D to 3D using single and multiple views will also be developed. WP2.2 will also investigate new transcoding methods supported by content and environment descriptions. These metadata-based transcoding strategies will be instrumental in achieving content adaptation based on the network and user environment characteristics.
Transmission over heterogeneous networks
This workpackage addresses the problem of QoS optimisation for inter-network audiovisual communications. The work will be carried out along two main lines: one based on the development of channel simulators and media adaptation gateways, and the other based on the deployment of a testbed with real systems and applications, comprising heterogeneous networks. The development and evaluation of QoS optimisation tools is common to both approaches and therefore constitutes a basis for convergence and integration.
QoS optimisation: The end-to-end media transfer is accomplished using adaptation gateways at the edges of the networks. The gateways carry out the adaptation process using the required QoS levels obtained from the network, and hence provide a number of AV streams at different error resilience and bit/frame rate levels. The network produces periodic reports for the media gateways using available transport protocols, such as RTCP. The media gateways take necessary actions to:
- Adapt the media according to network conditions
- Perform parameter mapping from one network to another according to reported network conditions and user profiles and requirements.
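The adaptation step described above can be sketched as a small selection rule: given the latest network report, the gateway picks one of the pre-encoded stream variants. This is a hedged illustration only. The classes `StreamVariant` and `NetworkReport`, the function `select_variant`, and all numeric values are hypothetical; in particular, the bandwidth estimate is assumed to be supplied by the network and is not a field of a standard RTCP receiver report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StreamVariant:
    name: str
    bitrate_kbps: int
    frame_rate: float
    max_loss: float  # highest packet-loss fraction this variant is protected against


@dataclass(frozen=True)
class NetworkReport:
    """Hypothetical summary of one periodic report (e.g. derived from RTCP)."""
    loss_fraction: float  # packets lost / packets expected, 0.0 to 1.0
    available_kbps: int   # assumed bandwidth estimate supplied by the network


def select_variant(variants, report):
    """Pick the highest-bitrate variant that fits the bandwidth and tolerates the loss."""
    usable = [v for v in variants
              if v.bitrate_kbps <= report.available_kbps
              and v.max_loss >= report.loss_fraction]
    if usable:
        return max(usable, key=lambda v: v.bitrate_kbps)
    # Nothing satisfies both constraints: fall back to the most resilient variant.
    return max(variants, key=lambda v: (v.max_loss, -v.bitrate_kbps))


VARIANTS = [
    StreamVariant("high", 2000, 25.0, 0.01),
    StreamVariant("medium", 768, 25.0, 0.05),
    StreamVariant("robust", 128, 12.5, 0.20),
]

choice = select_variant(VARIANTS, NetworkReport(loss_fraction=0.03, available_kbps=1000))
print(choice.name)  # prints "medium"
```

A real gateway would also have to map user profiles and requirements onto these constraints; the sketch covers only the network-condition part.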
Networking protocols: In the future, transport of AV information will likely occur in heterogeneous environments, which include different network technologies and protocols. It is therefore necessary to test and implement interoperable solutions with provision of QoS. This will be carried out over a real test-bed, according to a stepped approach:
- Specification of the architecture of a heterogeneous network and associated protocols capable of supporting the transmission of A/V content with QoS;
- Deployment of the heterogeneous test-bed;
- Evaluation of different interoperability solutions (measure their effectiveness and conduct experiments of A/V services over the test-bed).
Efficient storage and search schemes
This workpackage studies and develops efficient storage and search schemes and harmonises different standardised formats to promote interoperability.
Efficient storage and search schemes: The main focus of the work will be on:
- The development of a query-by-humming system for audio data.
- The optimisation of indexing data structures for visual descriptors. Index representation and its parameters will be matched to particular visual descriptions such as dominant colour, texture, face recognition index, etc.
- The introduction of robust shot cut detection algorithms using both static and dynamic features. The problem will be treated as a classification problem, and audio information will be used in addition to video data when appropriate. Research will focus on techniques that automatically recognise camera effects such as zooms in and out, fades and dissolves, or camera motion.
- The generation of hierarchical audio-visual summaries, using representative keyframes, salient stills and associated audio/textual keywords, in order to present content to the end-user in an intelligent way.
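To make the shot cut detection item concrete, the sketch below declares a hard cut wherever the grey-level histograms of two consecutive frames differ strongly. It is a deliberately simple stand-in: it uses a single static feature and a fixed threshold, whereas the work described above treats detection as a classification problem over static and dynamic features, with audio where appropriate. Frames are assumed to be flat lists of 8-bit grey values; the function names, bin count and threshold are illustrative assumptions.

```python
def histogram(frame, bins=8, levels=256):
    """Normalised grey-level histogram of a frame given as a flat list of pixels."""
    counts = [0] * bins
    for pixel in frame:
        counts[pixel * bins // levels] += 1
    total = len(frame)
    return [c / total for c in counts]


def histogram_distance(h1, h2):
    """Half the L1 distance: 0.0 for identical histograms, 1.0 for disjoint ones."""
    return 0.5 * sum(abs(a - b) for a, b in zip(h1, h2))


def detect_cuts(frames, threshold=0.5):
    """Return indices i where a hard cut is declared between frame i-1 and frame i."""
    hists = [histogram(f) for f in frames]
    return [i for i in range(1, len(hists))
            if histogram_distance(hists[i - 1], hists[i]) > threshold]


# Toy sequence: three dark frames followed by three bright frames.
dark = [20, 25, 30, 22] * 4
bright = [200, 210, 220, 205] * 4
frames = [dark, dark, dark, bright, bright, bright]
print(detect_cuts(frames))  # prints [3]
```

Gradual effects such as fades and dissolves spread the change over many frames, which is one reason a fixed per-frame threshold is not sufficient on its own.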
Interoperability among different formats and data models: The work towards this objective will focus on:
- The use of middleware layers and distributed technologies to allow the access and retrieval of multimedia content and associated descriptions regardless of their location and to promote the adaptation among different formats;
- Harmonisation of formats for the storage and exchange of multimedia content and associated descriptions, in particular between MPEG-21, MPEG-7 and MXF.
- Interoperability between different metadata formats regarding: descriptions for AV content protection and rights management and interoperability with other metadata formats different from MPEG-7 and MPEG-21.
Development of audio analysis tools. The work proposed will consolidate the integration of VISNET through an effective exchange of knowledge between the cooperating institutions, aiming at the efficient analysis of audio and speech and creating a potential basis for further integration and joint dissemination activities. To enhance the cooperation, exchanges of researchers (e.g. PhD students) are foreseen. Part of the work will be used as input for the multimodal analysis workpackage.
Audio analysis: Development of automatic audio analysis tools. The main work will focus on:
- Audio segmentation: Audio recordings are classified and segmented into voice, music, various kinds of environmental sounds, and silence. Morphological and statistical analysis of the temporal curves of the low short-time energy ratio, high zero-crossing ratio, high centroid, and high harmonicity of audio signals are among the techniques to exploit.
- Sound recognition: Classification with MPEG-7 Descriptors can be used for sound recognition. The media can be automatically indexed using trained sound classes in a pattern recognition framework. To this end, a generalised sound recognition system using reduced-dimension features, based on independent component analysis and a hidden Markov model classifier, is considered.
- Scene classification using audio information: Audio analysis will be used for scene classification. For instance, the audio in a sports scene is very different from that in a news report, or from an action scene, and even various sports programs may have very different background sounds, which may allow identifying them.
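Two of the features named in the audio segmentation item can be sketched in a few lines. The code below assumes one commonly used formulation: the low short-time energy ratio is the fraction of frames in a window whose energy falls below half the window mean, and the high zero-crossing ratio is the fraction of frames whose zero-crossing rate exceeds 1.5 times the window mean. The factors 0.5 and 1.5, the frame length and the function names are assumptions made for the example, not values fixed by this workpackage.

```python
import math


def frames_of(signal, frame_len):
    """Split a signal into non-overlapping frames, dropping any incomplete tail."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, frame_len)]


def short_time_energy(frame):
    return sum(s * s for s in frame) / len(frame)


def zero_crossing_rate(frame):
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def lster(signal, frame_len):
    """Fraction of frames whose energy is below half the window's mean energy."""
    energies = [short_time_energy(f) for f in frames_of(signal, frame_len)]
    mean = sum(energies) / len(energies)
    return sum(1 for e in energies if e < 0.5 * mean) / len(energies)


def hzcrr(signal, frame_len):
    """Fraction of frames whose ZCR exceeds 1.5 times the window's mean ZCR."""
    rates = [zero_crossing_rate(f) for f in frames_of(signal, frame_len)]
    mean = sum(rates) / len(rates)
    return sum(1 for r in rates if r > 1.5 * mean) / len(rates)


SR = 8000      # sample rate in Hz (assumed)
FRAME = 200    # 25 ms frames at 8 kHz
# A steady tone, and the same tone gated on and off frame by frame.
tone = [math.sin(2 * math.pi * 200 * n / SR + 0.3) for n in range(SR)]
bursty = [s if (n // FRAME) % 2 == 0 else 0.0 for n, s in enumerate(tone)]
print(lster(tone, FRAME), lster(bursty, FRAME))
```

The steady tone has no low-energy frames, while the gated signal, which loosely imitates the pauses in speech, has many. This contrast is what makes the ratio useful for separating speech from music.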
Speech analysis: Development of automatic speech analysis tools. The main work will focus on:
- Speech modelling: Use of nonlinear techniques, e.g. AM-FM modulation to model speech resonances, and/or speech dynamics (e.g. nonlinear predictors).
- Speaker segmentation, identification, recognition and verification: Robust clustering techniques are used for speaker segmentation. MPEG-7 low-level audio feature descriptors using spectral basis representations are used to model and identify different speakers. Higher levels of information are exploited, such as statistical language modelling of the speaker’s word usage, prosodic features, handling of spontaneous speech, and long-term signal measures. Verification can use the discriminating power of commonly used features and exploit relevance feedback techniques to boost the system’s performance.
- Spoken content extraction and retrieval: Automatic speech recognition systems are used to extract MPEG-7 Spoken Content Descriptors from speech inputs. These provide compact representations of speech content, consisting of recognition hypotheses lattices (possibly mixing word and phone hypotheses). The extracted MPEG-7 Descriptors can be used to index audio-visual databases. Depending on the desired application, these Descriptors are extracted either from some spoken annotations or directly from the audio stream. Spoken queries can then be supported.
Video analysis and processing of human faces
Develop a set of video analysis tools, with special emphasis on the processing of human faces. The work proposed will consolidate the integration of VISNET through an effective exchange of knowledge between the cooperating institutions, aiming at the efficient processing of human faces in video content and creating a potential basis for further integration and joint dissemination activities. To enhance the cooperation, exchanges of researchers (e.g. PhD students) are foreseen. The work developed in this workpackage will receive input from the segmentation and tracking workpackage and will be used as input for the multimodal analysis workpackage.
Face analysis: Development of video analysis tools for processing information related to human faces. The main work will focus on:
- Face detection and tracking: Probabilistic frameworks for face detection and tracking will be investigated. Probabilistic models of facial features will be built using a training stage and will be used in the subsequent detection and tracking operations. Invariance to illumination changes and robustness to occlusions will be pursued.
- Face recognition: Refinement of the algorithmic structure and performance enhancement of the advanced face recognition descriptor of MPEG-7 will be considered, namely: removing the PCA pre-processing step for linear discriminant analysis, reducing the number of channels in the Fourier pyramid for the face, refining the iterative query concept, novel pose estimation, and a novel mapping of arbitrary pose to frontal pose using an approximate 3D face model extracted online. In addition, the concept of 3D shape for face recognition will be explored and combined with the 2D approach.
- Facial expression analysis: The dynamics of the facial expressions will be used to facilitate the expression recognition task. Automatic video based facial expression recognition will be pursued using the face deformations (FACS), relating the face action units with node deformations of generic face models and learning them through statistical techniques. Model registration, exploiting video information will be studied and the derived facial animation parameters will be made available in an MPEG-4 syntax (FAPs, FDPs) for animation of 3D face models.
- Facial feature tracking: A novel algorithm for eye tracking will be developed, assuming that a face has been detected in the sequence. This is also useful for image normalisation before face recognition, where multiresolution Gabor filtering in a novel colour space will be used.
- Detection of video shots including people: This may be a preliminary processing step for some applications, such as the processing of news sequences, where a subset of shots containing people is selected to reduce the computational burden of the subsequent face processing steps.
Semantic video segmentation and tracking
Work will focus on developing intelligent semantic segmentation and tracking techniques that are robust and efficient for natural video. The segmentation and tracking results will be used as input for the facial analysis workpackage. The following tasks will be investigated:
Segmentation system: Development of generic and robust video analysis system for supervised and unsupervised segmentation and object tracking. Different advanced techniques will be integrated to obtain a robust, efficient and modular segmentation and tracking system for natural video. Novel robust segmentation/tracking algorithms will be proposed. The video analysis system will entail several independent modules. Topics that will be investigated include soft detection techniques for object segmentation, fusion of intermediate results, statistical machine learning and classification techniques for object segmentation, etc.
2-D articulated person tracking: Dynamic 2-D articulated models of the human body will be used to achieve efficient, accurate and low computational complexity person segmentation and tracking.
3-D articulated person tracking: Techniques for 3-D articulated person tracking from video sequences will be investigated. The topics that will be treated include the selection of suitable 3D models, the use of multiple image cues, the incorporation of appropriate constraints (physiological, anatomical - kinematic), etc.
Develop a set of multimodal analysis tools, with special emphasis on the detection and recognition of people in audiovisual sequences. The work proposed will consolidate the integration of VISNET through an effective exchange of knowledge between the cooperating institutions, aiming at the efficient integration of audio and visual analysis techniques and creating a potential basis for further integration and joint dissemination activities. To enhance the cooperation, exchanges of researchers (e.g. PhD students) are foreseen. The work developed in this workpackage will receive input from the audio and video analysis workpackages.
Shot selection: A module for detecting shots in which the people in the scene are speaking will be developed. This module greatly reduces the computational load of the analysis system, since the number of shots to be processed will be much smaller. The recognition results will also benefit from this module, since the correspondence between the detected speech and the faces in the scene can be verified.
Audio and video extraction for each selected shot: The audio and video information will be processed in parallel and confidence values for the speaker and face recognition techniques will be extracted for the selected shots.
People location and recognition: Based on the audio and video confidence values, it is possible to determine whether a particular person is present in the scene. Multimodal analysis for the detection of people in environments will also be conducted using generalized probability theory. This work exploits several independent information sources to achieve better classification results than when each source is considered on its own.
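One simple way to combine the audio and video confidence values is a product rule over odds, sketched below. It assumes that each confidence is a posterior probability that the person is present, that the modalities are conditionally independent, and that a common prior applies. This is ordinary Bayesian fusion used as a stand-in and does not implement the generalized probability theory mentioned above; the function names, the prior and the threshold are illustrative assumptions.

```python
def fuse_confidences(confidences, prior=0.5):
    """Combine per-modality posteriors P(person | modality) into one posterior.

    Assumes the modalities are conditionally independent given the identity
    and that every confidence lies strictly between 0 and 1.
    """
    prior_odds = prior / (1.0 - prior)
    odds = prior_odds
    for p in confidences:
        # Each modality contributes its likelihood ratio (posterior odds / prior odds).
        odds *= (p / (1.0 - p)) / prior_odds
    return odds / (1.0 + odds)


def person_present(audio_conf, video_conf, threshold=0.5):
    """Decide presence from the fused speaker and face recognition confidences."""
    return fuse_confidences([audio_conf, video_conf]) > threshold


print(round(fuse_confidences([0.8, 0.7]), 3))
print(person_present(0.6, 0.3))
```

Two moderately confident sources that agree give a fused value higher than either alone, which illustrates the benefit of combining independent sources.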
Secure transmission and encryption of AV data
This workpackage deals with the definition of protocols and the creation of software platforms for the secure distribution of multimedia material using encryption, robust watermarks and fingerprinting. Moreover, the adaptation of security techniques to real-time AV transmissions and the definition of metadata schemes suitable for representing access control information in AV content are addressed.
- Design and analysis of protocols for secure distribution, in order to identify the levels of protection afforded. Definition of encryption requirements of the protocols. Determination of the role of the various technological components (encryption, secure transmission, watermarking) in the distribution chain.
- Adaptation of security functions currently supported in general communications to AV transmission. Adaptation will address the issues of protection of multicast traffic, encryption schemes adapted to streaming with varying QoS and information loss, and efficient algorithms for securing real-time communications.
- Specification of an appropriate key distribution infrastructure for the needs of AV secure transmissions. This infrastructure, together with the corresponding security policies and certification authorities, will provide the basis for the secure transmission of audio-visual contents.
- Use of metadata to provide information related to access control. The metadata that will be defined will be used for conveying properties that will enable rights management, protection against unauthorized copy or retransmission, conditional access enforcement, authorization methods, and other security functions. These metadata will be duly protected with the appropriate security mechanisms.
- Participation in standardization bodies (mainly CEN/ETSI), contributing the audiovisual-oriented security mechanisms, distribution protocols and AV content protection metadata developed in VISNET.
Watermarking and Fingerprinting Techniques for DRM
Devise new methods and build software for robust watermarking and fingerprinting techniques for copy control, traitor/transaction tracing and anonymous purchase. The effectiveness of the new methods will be tested using suitable test procedures and benchmarking platforms.
- Design new methods and protocols for robust watermarking of AV data. Robustness to geometric attacks and collusion attacks will be investigated. The methods will be capable of embedding sufficient information (payload) to trace distributors & purchasers of pirated material. Joint watermarking of audio and video data and fusion of detection results will be pursued. Information-theoretic techniques will be used in order to increase watermark payload. The methods will be also capable of withstanding multiple watermarking so as to enable watermarking at each step in the distribution process.
- Design and implement new methods for fingerprinting of AV data. This covers the selection of robust feature vectors, the implementation of techniques for the efficient matching of fingerprints, and the organisation of the fingerprint database.
- Design protocols for anonymous purchase based on asymmetric watermarks.
- Devise and test asymmetric watermarking schemes.
- Specification of benchmarking procedures for testing the robustness, security, imperceptibility and capacity of the new audio & video watermarking techniques.
- Specification of benchmarking procedures for testing the robustness & discriminative power of fingerprinting techniques.
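For orientation, the sketch below shows the classic additive spread-spectrum scheme that robust watermarking methods are usually measured against: a key-seeded sequence of +1 and -1 values is added to the host signal, and detection correlates the received signal with the same sequence. It is a baseline illustration only, not one of the new methods to be devised here. It is symmetric (the same key embeds and detects) and carries a single bit. The function names, signal length, strength and keys are assumptions made for the example.

```python
import random


def key_sequence(key, length):
    """Pseudo-random +/-1 sequence derived from a secret key."""
    rng = random.Random(key)
    return [rng.choice((-1.0, 1.0)) for _ in range(length)]


def embed(host, key, strength=2.0):
    """Add the key sequence, scaled by `strength`, to the host samples."""
    w = key_sequence(key, len(host))
    return [x + strength * wi for x, wi in zip(host, w)]


def detect(signal, key, strength=2.0):
    """Blind detection: correlate with the key sequence, compare with strength / 2."""
    w = key_sequence(key, len(signal))
    correlation = sum(s * wi for s, wi in zip(signal, w)) / len(signal)
    return correlation > strength / 2.0


# Synthetic host signal standing in for audio samples or transform coefficients.
rng = random.Random(1)
host = [rng.gauss(0.0, 10.0) for _ in range(4096)]
marked = embed(host, key=1234)
print(detect(marked, key=1234), detect(host, key=1234), detect(marked, key=9999))
```

A benchmarking procedure of the kind listed above would repeat the detection after attacks (noise, resampling, geometric distortion, collusion) and report detection and false-alarm rates as a function of `strength`.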