Media Analysis: Workflows and Modules
Analysis Workflows
The Press, TV and Internet workflows are described below.
The workflow depicted in Figure 1 outlines the analysis process for spot images. The ImageOCR detects text lines in the spot, and the text analysis determines whether these texts can be bound to other semantic content. The ImageFingerprint compares the image to existing ones and reports the probability that the given spot is similar to already processed ones. Based on these results, the CreativeDetector decides whether the spot has already been seen or is a new one. An operator of the AdComparer application makes the final decision on whether the spot belongs to an existing creative or is a new one. For new creatives, the ImageFingerprint extracts the spot's features and saves them in the database for future comparisons.

Figure 1: The Press Workflow
The flow depicted in Figure 2 outlines the analysis process for spots retrieved from TV broadcasters. The AudiosSegmentation creates a frame structure for the audio analysis modules. The JingleRecognition searches for jingles of known companies. The SpeechToText transcribes the audio content into text, and the TextAnalysisASR looks for further semantic binding of this text. The Wordspotting scans the audio content for known words. The VideoOCR scans the content to detect text lines in the spot, and the TextAnalysisOCR determines whether these texts can be bound to other semantic content. The VideoFingerprint compares the content to existing ones and reports the probability that the given spot is similar to already processed ones. The LogoRecognition detects known company logos in the spot. Based on these results, the CreativeDetector decides whether the spot has already been seen or is a new one. An operator of the TVComparer application makes the final decision on whether the spot belongs to an existing creative or is a new one. For new creatives, the VideoFingerprint extracts the spot’s features and saves them in the database for future comparisons.

Figure 2: The TV Workflow
The flow depicted in Figure 3 outlines the analysis process for spots downloaded from the Internet other than images (e.g. Flash, video, animated GIF). The VideoOCR scans the content to detect text lines in the spot, and the text analysis determines whether these texts can be bound to other semantic content. The VideoFingerprint compares the content to existing ones and reports the probability that the given spot is similar to already processed ones. The LogoRecognition detects known company logos in the spot. Based on these results, the CreativeDetector decides whether the spot has already been seen or is a new one. An operator of the AdComparer application makes the final decision on whether the spot belongs to an existing creative or is a new one. For new creatives, the VideoFingerprint extracts the spot's features and saves them in the database for future comparisons.

Figure 3: The Internet Workflow
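The three workflows above can be read as ordered module lists keyed by media type. The following Python sketch is only an illustration: the module names are taken from the text, while the table layout, the name "TextAnalysis" for the unnamed text analysis step, and the function modules_for are assumptions and not part of the described system.

```python
# Hypothetical workflow table. Module names follow the text above;
# "TextAnalysis" stands in for the text analysis step that the text
# does not name for the Press and Internet workflows.
WORKFLOWS = {
    "press": ["ImageOCR", "TextAnalysis", "ImageFingerprint", "CreativeDetector"],
    "tv": [
        "AudiosSegmentation", "JingleRecognition", "SpeechToText",
        "TextAnalysisASR", "Wordspotting", "VideoOCR", "TextAnalysisOCR",
        "VideoFingerprint", "LogoRecognition", "CreativeDetector",
    ],
    "internet": [
        "VideoOCR", "TextAnalysis", "VideoFingerprint",
        "LogoRecognition", "CreativeDetector",
    ],
}


def modules_for(media_type: str) -> list[str]:
    """Return a copy of the ordered module list for a media type."""
    try:
        return list(WORKFLOWS[media_type.lower()])
    except KeyError:
        raise ValueError(f"no workflow defined for media type {media_type!r}") from None
```

In every workflow the CreativeDetector comes last, because it merges the results of the preceding modules before the operator makes the final decision.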
Collection Processing Manager
The Collection Processing Manager (CPM) manages the entire workflow for the analysis modules. The main tasks of the CPM are to choose the proper workflow for each essence and to support diagnostics and state control of the analysis process for the operator. Each time the acquisition module (with EMS) notifies the CPM of a new spot (or the CPM asynchronously fetches one), the CPM issues a new job for that spot and controls its processing throughout the whole analysis workflow. For each analysis step the CPM issues a task. These tasks are submitted to the analysis modules, and when all tasks of a job have been processed the job is marked as analysed. In addition, the CPM provides a web services interface for communication with the analysis modules and the Essence Management Store.

Figure 4: Administration and status GUI of the CPM
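The job and task bookkeeping of the CPM described above can be sketched as follows. This is a minimal illustration under stated assumptions: the class names Job and Task, the two-state model and the helper functions are hypothetical, and the web services interface is not modelled.

```python
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    PENDING = "pending"
    DONE = "done"


@dataclass
class Task:
    """One analysis step of a job, handled by one analysis module."""
    module: str
    state: State = State.PENDING


@dataclass
class Job:
    """All tasks issued for a single spot."""
    spot_id: str
    tasks: list[Task] = field(default_factory=list)

    @property
    def analysed(self) -> bool:
        # A job counts as analysed once every one of its tasks is processed.
        return bool(self.tasks) and all(t.state is State.DONE for t in self.tasks)


def issue_job(spot_id: str, modules: list[str]) -> Job:
    """Create a job with one pending task per analysis module."""
    return Job(spot_id, [Task(m) for m in modules])


def complete_task(job: Job, module: str) -> None:
    """Mark the task of the given module as processed."""
    for task in job.tasks:
        if task.module == module:
            task.state = State.DONE
            return
    raise KeyError(f"job {job.spot_id} has no task for module {module!r}")
```

A job created with issue_job("spot-1", ["ImageOCR", "ImageFingerprint"]) reports analysed as true only after complete_task has been called for both modules.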
The Creative Detector is used to combine the results of the different analysis modules in all MediaCampaign workflows. To do so, the Creative Detector extracts the relevant information from all available analysis modules for an investigated spot. This information is weighted according to the quality (precision/recall) of the module before it is combined into a merged analysis result. The merged result contains the suggested advertiser and brand as well as a list of identical and similar creatives. Furthermore, summarized results of all analysis modules are included in the Creative Detector output.
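The weighting step can be illustrated with a simple weighted vote. The sketch below is an assumption about how such a combination might look, not the Creative Detector's actual rule: the function combine_results, the data layout and the use of a single weight per module are all hypothetical.

```python
from collections import defaultdict


def combine_results(module_results, module_weights):
    """Merge per-module advertiser suggestions into one ranked list.

    module_results: {module name: {advertiser: confidence in [0, 1]}}
    module_weights: {module name: weight reflecting the module's quality}
    Returns [(advertiser, weighted score)], best suggestion first.
    """
    scores = defaultdict(float)
    total_weight = 0.0
    for module, suggestions in module_results.items():
        weight = module_weights.get(module, 0.0)  # unknown modules do not vote
        total_weight += weight
        for advertiser, confidence in suggestions.items():
            scores[advertiser] += weight * confidence
    if total_weight == 0:
        return []
    ranked = [(adv, score / total_weight) for adv, score in scores.items()]
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked
```

With this scheme a module with high precision and recall dominates the merged suggestion, while a weaker module can still break ties between candidates.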
Analysis Modules
Fingerprinting
In MediaCampaign, image fingerprinting is used to compare an input spot with the database of existing advertisements (creatives) to find out whether it is a new creative or a spot that belongs to an already existing creative. More specifically, the goal of fingerprinting is to find out whether the database of existing ads contains ads identical (or similar) to the investigated input ad. This task is similar to near-duplicate detection and content-based image retrieval (CBIR). The fingerprinting algorithm uses two global image features: the MPEG-7 Color Layout feature and the Orientation Histograms texture feature. For exact matching, a pairwise image comparison is performed to find out whether ads are identical.

Figure 5: Examples of image fingerprinting results
In Figure 5, an investigated Press advertisement (upper left image) is shown together with the fingerprinting results. The best matching result is shown on the upper right and all returned ads are shown in the bottom row. The algorithm for TV ads works in the same way as described above, but on keyframes extracted at shot boundaries. Furthermore, the order and number of keyframes are taken into account to recognize identical and similar TV ads.
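To illustrate how a global texture feature can be compared between two images, the following sketch computes a simple gradient orientation histogram and an L1 distance. It is a simplified stand-in and not the feature implementation used in MediaCampaign: the image is assumed to be a grey-scale list of rows, and the bin count, the central-difference gradient and the distance measure are assumptions.

```python
import math


def orientation_histogram(image, bins=8):
    """Magnitude-weighted histogram of gradient orientations in [0, pi)."""
    hist = [0.0] * bins
    for y in range(1, len(image) - 1):
        for x in range(1, len(image[0]) - 1):
            # Central differences approximate the horizontal and vertical gradient.
            gx = image[y][x + 1] - image[y][x - 1]
            gy = image[y + 1][x] - image[y - 1][x]
            magnitude = math.hypot(gx, gy)
            if magnitude == 0:
                continue
            angle = math.atan2(gy, gx) % math.pi
            # Clamp guards against the angle rounding up to exactly pi.
            index = min(int(angle / math.pi * bins), bins - 1)
            hist[index] += magnitude
    total = sum(hist)
    return [h / total for h in hist] if total else hist


def l1_distance(a, b):
    """Sum of absolute differences between two histograms."""
    return sum(abs(x - y) for x, y in zip(a, b))
```

Two images with the same edge structure yield a distance of 0, whereas an image with only vertical edges and one with only horizontal edges yield the maximum distance of 2 for normalized histograms.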
Audio segmentation module
TV commercial spots contain speech, music, sound effects and background noises. Especially in this domain, these audio classes often overlap. Automatic Speech Recognition (ASR) and Jingle Recognition rely on knowing which audio classes are contained in the audio stream. To cope with acoustic mismatches and to optimize the output quality of the ASR and Jingle Recognition tools, an Audio Segmentation tool is needed. The segmentation objective is to determine an optimal temporal segmentation that divides the audio stream into homogeneous segments of one of the classes “speech”, “music”, “other” (sounds) and “silence”, and into segments containing mixtures of these classes. Typical mixture segments contain “speech” AND “music”, “speech” AND “other”, “music” AND “other” or even “speech” AND “music” AND “other” (sounds).
The Audio Segmentation Module consists of a feature extraction front end, probabilistic classifiers and class change detection. For feature extraction, the audio stream is divided into short overlapping frames, one extracted every 100 ms. Temporal and spectral features are computed from each frame to transform it into the feature domain.
The feature vectors are presented to four classifiers, which map these vectors to the audio classes “speech”, “music”, “other” and to mixture classes. Each classifier maps the frames containing the pure class (e.g. “speech”) and also mixtures with other classes (e.g. “speech” AND “music”) to a “positive” class and all other frames to a “negative” class (e.g. non-“speech”). “Silence” detection is based on a calculation of the psychoacoustic loudness. A threshold decision is taken to distinguish between “silence” and all other classes.
The classification outputs are combined and smoothed by a post-processing stage, which finds the most probable class assignments and optimizes the class change regions. The class change positions delimit labelled segments, which are written to an MPEG-7 document.
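The framing step can be sketched as follows. Only the 100 ms frame spacing comes from the text; the 200 ms frame length, the function names and the choice of two simple temporal features (short-time energy and zero-crossing rate) are assumptions, and spectral features are omitted for brevity.

```python
def split_into_frames(samples, sample_rate, frame_ms=200, hop_ms=100):
    """Cut a sample sequence into overlapping frames starting every hop_ms."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    if frame_len < 2 or hop_len < 1:
        raise ValueError("sample rate too low for the requested frame sizes")
    return [
        samples[start:start + frame_len]
        for start in range(0, len(samples) - frame_len + 1, hop_len)
    ]


def short_time_energy(frame):
    """Mean squared amplitude of a frame (temporal feature)."""
    return sum(s * s for s in frame) / len(frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose sign differs (temporal feature)."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)
```

With a frame length of twice the hop size, each frame overlaps its neighbour by half, so one second of audio yields nine complete frames.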
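The way binary detector outputs and the silence threshold could be turned into a single frame label is sketched below. The sketch assumes one binary decision per named class and a loudness value that has already been computed; the text does not detail the role of each of the four classifiers, so the function label_frame, its parameters and the fallback to "other" are hypothetical.

```python
def label_frame(is_speech, is_music, is_other, loudness, silence_threshold):
    """Combine binary class decisions and a loudness threshold into one label."""
    # The threshold decision separates silence from all other classes.
    if loudness < silence_threshold:
        return "silence"
    detectors = (("speech", is_speech), ("music", is_music), ("other", is_other))
    active = [name for name, flag in detectors if flag]
    if not active:
        # Assumption: audible frames that no detector claims count as "other".
        return "other"
    # Several positive decisions describe a mixture segment.
    return " AND ".join(active)
```

For example, a loud frame that both the speech and the music detector accept is labelled "speech AND music", while any frame below the loudness threshold is labelled "silence" regardless of the detector outputs.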
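A simple form of such post-processing is a sliding majority vote over the frame labels, followed by merging runs of equal labels into segments. The sketch below uses this technique as an assumption; the module's actual smoothing and class change optimization are not specified in the text, and writing the MPEG-7 document is not shown.

```python
from collections import Counter


def smooth_labels(labels, window=5):
    """Replace each label by the most frequent label in its neighbourhood."""
    half = window // 2
    smoothed = []
    for i in range(len(labels)):
        neighbourhood = labels[max(0, i - half): i + half + 1]
        smoothed.append(Counter(neighbourhood).most_common(1)[0][0])
    return smoothed


def labels_to_segments(labels, hop_seconds=0.1):
    """Merge runs of equal labels into (start, end, label) segments."""
    segments = []
    start = 0
    for i in range(1, len(labels) + 1):
        # A segment ends at a class change or at the end of the stream.
        if i == len(labels) or labels[i] != labels[start]:
            segments.append((start * hop_seconds, i * hop_seconds, labels[start]))
            start = i
    return segments
```

The default hop of 0.1 seconds corresponds to the 100 ms frame spacing, so segment boundaries are expressed in seconds. Isolated outlier frames, such as a single "music" frame inside a long "speech" run, are removed by the majority vote before segments are built.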
Jingle Recognition module
A jingle, also called a sound logo or audio logo, is a short memorable sequence of tones or a melody. Jingles are possible cues for detecting campaigns. The jingle recognition module detects known jingles in TV audio by matching fingerprints of reference jingles stored in a database with the fingerprint of the TV audio. Jingle recognition is an audio identification task similar to music or general audio fingerprinting tasks. The main difference from music fingerprinting is the short duration of a jingle, which is only 2 to 3 seconds.
The Jingle Recognition module makes use of chroma features: twelve tonal features that represent the played keys of a musical piece. They are extracted for short audio frames, with keys from different octaves mapped onto one octave. By comparing the chroma features of two different recordings, similarities in melody can be detected even if the instruments of the recordings are different or the melodies are played in a different key.
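The mapping of keys from different octaves onto one octave can be sketched as follows. The sketch assumes that spectral peaks (frequency and magnitude pairs) have already been extracted for a frame and that equal temperament with A4 = 440 Hz applies; the function names are hypothetical and the module's actual feature extraction may differ.

```python
import math


def pitch_class(frequency_hz, reference_a4=440.0):
    """Map a frequency to one of 12 pitch classes (0 = C, ..., 9 = A)."""
    # MIDI note 69 is A4; each semitone is a factor of 2 ** (1 / 12).
    midi_note = round(69 + 12 * math.log2(frequency_hz / reference_a4))
    return midi_note % 12


def chroma_vector(peaks):
    """Fold (frequency, magnitude) peaks of one frame into 12 normalized bins."""
    chroma = [0.0] * 12
    for frequency_hz, magnitude in peaks:
        chroma[pitch_class(frequency_hz)] += magnitude
    total = sum(chroma)
    return [c / total for c in chroma] if total else chroma
```

The tones at 220 Hz, 440 Hz and 880 Hz all fall into the same bin, which is why the resulting vector is largely independent of the octave and the instrument that plays the melody.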
The chroma features extracted from reference jingles are stored in the jingle database. When a new spot is presented to the module, all chroma features of the reference jingles are compared to the chroma features extracted from the spot. If the distance measure between a jingle and a part of the spot is below a learned threshold, the temporal position of the jingle in the spot is marked and metadata is added.
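The comparison of a reference jingle against all positions of a spot can be sketched as a sliding-window search. The Euclidean frame distance, the averaging over the window and the function names are assumptions; in the module the threshold is learned, whereas here it is a plain parameter.

```python
import math


def frame_distance(a, b):
    """Euclidean distance between two chroma vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def find_jingle(spot_chroma, jingle_chroma, threshold):
    """Return (start_frame, distance) of the best match below threshold, else None."""
    n = len(jingle_chroma)
    if n == 0:
        return None
    best = None
    # Slide the reference jingle over every possible start frame of the spot.
    for start in range(len(spot_chroma) - n + 1):
        window = spot_chroma[start:start + n]
        distance = sum(frame_distance(s, j) for s, j in zip(window, jingle_chroma)) / n
        if distance < threshold and (best is None or distance < best[1]):
            best = (start, distance)
    return best
```

The returned start frame corresponds to the temporal position that is marked in the spot. Running the search once per reference jingle in the database yields all jingles contained in the spot.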
Logo Recognition
The goal of logo recognition in images and videos is to find known logos that have been learned in an offline step before the analysis. This task is a specialisation of the general object recognition/identification task. Our logo recognition algorithm uses a state-of-the-art object recognition approach called SIFT (Scale Invariant Feature Transform) and additionally makes use of the temporal information contained in video ads from TV. Logos that are recognized in frame a at position (x, y) are specifically investigated in the following frames (a + 1, a + 2, etc.) to decide accurately whether they occur in the video or not.
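The use of temporal information can be illustrated by a confirmation rule that accepts a logo only if it reappears near the same position in the following frames. The sketch below is an assumption about such a rule: the SIFT matching itself is not shown, detections are assumed to be given per frame, and the function confirm_logo with its parameters lookahead, min_hits and max_shift is hypothetical.

```python
def confirm_logo(detections, frame_index, logo, position,
                 lookahead=3, min_hits=2, max_shift=20.0):
    """Confirm a detection by re-finding the logo in the following frames.

    detections: list indexed by frame, each entry a list of (logo, x, y).
    Returns True if the logo is seen within max_shift pixels of `position`
    in at least min_hits of the next `lookahead` frames.
    """
    x0, y0 = position
    hits = 0
    for frame in detections[frame_index + 1: frame_index + 1 + lookahead]:
        for name, x, y in frame:
            if name == logo and abs(x - x0) <= max_shift and abs(y - y0) <= max_shift:
                hits += 1
                break  # count each following frame at most once
    return hits >= min_hits
```

Such a rule suppresses spurious single-frame matches, at the cost of missing logos that are visible for only one or two frames.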

Figure 6: Examples of two logos on different objects
Figure 6 shows examples of two logos situated on different real-world objects in different ads. The logo recognition module should be able to recognize all logos contained in image and video ads that exceed a certain size. In practice, logos that contain a lot of texture (e.g. images and text) are easier to recognize than logos with little texture (e.g. the NIKE sign). Furthermore, it is easier to recognize logos that are shown as overlays because they are not prone to perspective distortion, occlusion, illumination changes and other light effects.
OCR
The goal of optical character recognition (OCR) is to recognize the text contained in an image as accurately as possible. This visual analysis problem is almost solved for cases where uni-coloured machine-printed text is placed on a uni-coloured background in a well-formatted way, as is usual for newspapers, books and most documents. In cases where the text is placed on a coloured background (e.g. as an overlay in a photograph) or on real-world objects, OCR is more difficult. Unfortunately, most image advertisements belong to the second group. An example Press ad is shown in the left image of Figure 7.

Figure 7: Examples of text contained in Press (left) and TV (right) ads
For video OCR, recognizing the words and characters contained in a video is even more challenging, because characters can occur anywhere in a video, in different sizes and formats, under perspective distortions or, worse, in overlays and with background clutter. Furthermore, text can be artificially displayed as an overlay; the right image of Figure 7 shows examples of different text occurrences in video frames.
