In a technology space moving at an accelerated pace like the one MPEG has the task to develop standards for, it is difficult to have a clear plan for the future (MPEG has a 5-year plan, though).
Still, when MPEG was developing the Multimedia Linking Application Format (MLAF), it “discovered” that it had developed or was developing several standards – MPEG-7, Compact descriptors for visual search (CDVS), Compact descriptors for video analysis (CDVA) and Media Orchestration.
The collection of these standards (and of others in the early phases of conception or development, e.g. Neural Network Compression and Video Coding for Machines) that help create the Multimedia Linking Environment, i.e. an environment where it is possible to create a link between a given spatio-temporal region of a media object and spatio-temporal regions in other media objects.
This article explains the benfits brought by the MLAF “multimedia linking” standard also for very concrete applications.
Multimedia Linking Environment
Until a quarter of century ago, virtually the only device that could establish relationships between different media items was the brain. A very poor substitute was a note on a book to record a possible relationship of the place in the book where the note was written with content in the same or different books.
The possibility to link a place in a web page to another place in another web page, or to a media object, was the great innovation brought by the web. However, a quarter of century after a billion web sites and quadrillions of linked web pages, we must recognise that the notion of linking is pervasive one and not necessarily connected with the web.
MPEG has dedicated significant resources to the problem described by the sentence “I have a media object and I want to know which other related media objects exist in a multimedia data base” and represented in the MPEG-7 model depicted the figure below.
However, MPEG-7 is an instance of the more general problem of linking a given spatio-temporal region of a media object to spatio-temporal regions in other media objects.
These are some examples:
- A synthetic object is created out of a number of pictures of an object. There is a relationship between the pictures and the synthetic object;
- There is a virtual replica of a physical place. There is a relationship between the physical place and the virtual replica;
- A User is experiencing as virtual place in a virtual reality application. There is a relationship between the two virtual places;
- A user creates a media object by mashing up a set of media items coming from different sources. There is a relationship between the media items and the mashed up media object.
MPEG has produced MPEG-A part 16 (Media Linking Application Format – MLAF) specifies a data format called bridget that can be used to link any kinds of media. MPEG has also developed a number of standards that play an auxiliary role in the “ media linking” context outlined by the examples above.
- MPEG-7 parts 1 (Systems), 3 (Visual), 4 (Audio) and 5 (Multimedia) provide the systems elements, and the visual (image and video), audio and multimedia descriptions.
- MPEG-7 parts 13 (Compact descriptors for visual search) and 15 (Compact descriptors for video analysis) provide new generation image and video descriptors
- MPEG-B part 13 (Media Orchestration) provides the means to mash up media items and other data to create personal user experiences.
The MLAF standard
A bridget is a link between a “source” content and a “destination” content. It contains information on
- The source and the destination content
- The link between the two
- The information in the bridget is presented to the users who consume the source content.
The last information is the most relevant to the users because it is the one that enables them to decide whether the destination content is of interest to them.
The structure of the MLAF representation (points 1 and 2) is based on the MPEG-21 Digital Item Container implemented as a specialised MPEG-21 Annotation. The spatio-temporal scope is represented by the expressive power of two MPEG-7 tools and the general descriptive capability of the MPEG-21 Digital Item. They allow a bridget author to specify a wide range of possible associations and to be as precise and granular as needed.
The native format to present bridget information is based on MPEG-4 Scene description and application engine. Nevertheless, a bridget can be directly linked to any external presentation resource (e.g., an HTML page, an SVG graphics or others).
Bridgets for companion screen content
An interesting application of the MLAF standard is described in the figure below describing the entire bridget workflow.
- A TV program, scheduled to be broadcast at a future time, is uploaded to the broadcast server  and to the bridget Authoring Tool (BAT) .
- BAT computes and stores the program’s audio fingerprints to the Audio Fingerprint Server (AFS) .
- The bridget editor uses BAT to create bridgets .
- When the editor is done all bridgets of the program and the referenced media objects are uploaded to the Publishing Server .
- At the scheduled time, the TV program is broadcast .
- The end user’s app computes the audio fingerprint and sends it to the Audio Fingerprint Server .
- AFS sends to the user’s app ID and time of the program the user is watching .
- When the app alerts the user that a bridget is available, the viewer may decide to:
- Turn his eyes away from the TV set to her handset
- Play the content in the bridget 
- Share the bridget to a social network .
This is the workflow of a recorded TV program. A similar scenario can be implemented for live programs. In this case bridgets must be prepared in advance so that the publisher can select and broadcast a specific bridget when needed.
Standards are powerful tools that facilitate the introduction of new services, such as companion screen content. In this example, the bridget standard can stimulate the creation of independent authoring tools and end-user applications.
The bridget creation workflow depends on the types of media object the bridget represents.
Let’s assume that the bridget contains different media types such as an image, a textual description, an independently selectable sound track (e.g. an ad) and a video. Let’s also assume that the layout of the bridget has been produced beforehand.
This is the sequence of steps performed by the bridget editor:
- Select a time segment on the TV program timeline and a suitable layout
- Enter the appropriate text
- Provide a reference image (possibly taken from the video itself)
- Find a suitable image by using an automatic images search tool (e.g. based on the CDVS standard)
- Provide a reference video clip (possibly taken from the video itself)
- Find a suitable video clip, possibly taken from the video itself, by using an automatic video search tool (e.g. based on the CDVA standard)
- Add an audio file.
The resulting bridget will appears to the end user like this.
When all bridgets are created, the editor saves the bridgets and the media to the publishing server.
It is clear that the “success” of a bridget (in terms of number of users who open it) depends to a large extent on how the bridget is presented.
Bridget was the title of a research project funded by the 7th Framework Research Program of the European Commission. The MLAF standard (ISO/IEC 23000-16) was developped at the instigation and with participation of members of the Bridget project.
At this page you will find more information on how the TVBridge application can be used to create, publish and consume bridgets for recorded and live TV programs.