Pattern Recognition 35 (2002) 581–591

www.elsevier.com/locate/patcog

Indexing for reuse of TV news shots
M. Bertini∗, A. Del Bimbo, P. Pala
Dipartimento Sistemi e Informatica, Università di Firenze, Via S. Marta 3, 50139 Firenze, Italy
Received 16 November 2000; accepted 16 November 2000

Abstract

Broadcasters are demonstrating interest in building digital archives of their assets, for reuse of archive materials in TV programs or for on-line availability. This requires tools for video indexing and retrieval by content. Effective indexing of videos by content is based on the association of high-level information with visual data. In this paper a system is presented that enables content-based indexing and browsing of news reports; the annotation of the video stream is fully automated and is based both on visual features extracted from video shots and on textual descriptors extracted from captions and audio tracks. © 2001 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved.
Keywords: Multimedia databases; Video content analysis; Content-based video retrieval; Video shot classification

1. Introduction

Broadcasters are demonstrating interest in building large digital archives of their assets, for reuse of archive materials in TV programs or for on-line availability to other companies and the general public. To satisfy this request, systems are needed that provide efficient management of visual data in terms of storage, transmission, retrieval and browsing. Solutions to storage and transmission issues involve analysis and processing of data streams regardless of their content. Differently, effective retrieval and browsing of images and videos are based on the extraction of content-level information associated with visual data and on a compact representation of retrieved shots. While effective content-based retrieval of images is accomplished by supporting content representation through low-level image features, the same does not apply to content-based retrieval of videos, except for very limited application contexts. Instead, effective retrieval of videos must be based on high-level content descriptors.
∗ Corresponding author. Tel.: +39-055-4796540; fax: +39-055-4796363.
E-mail addresses: bertini@dsi.unifi.it (M. Bertini), delbimbo@dsi.unifi.it (A. Del Bimbo), pala@dsi.unifi.it (P. Pala).

Specific knowledge of the application context eases the extraction of high-level descriptors [1]. Recently, news videos have received great attention from the research community. This is motivated by the interest of broadcasters in building digital archives of their assets for reuse of archive materials. On the one hand, reuse of archive materials is identified as one key method of improving production quality, by bringing added depth and historical context to recent events. On the other hand, the use of stock footage allows news services to be produced faster. An example of the first case is the reuse of shots that show the scene of a crime: they can be reused later to provide historical context. An example of the second case is the reuse of "generic" shots, e.g. shots that show an airport may be used in a news service about an airport strike. However, it is not possible to reuse all the shots of a news video: the information contained in the speech of an anchorman or in the text and graphs of a computer graphics shot becomes obsolete after a short time, and can be easily and inexpensively replaced by new shots. An effective reuse of archive materials is possible if the shot description is rich enough to allow retrieval by content and the content has been classified: a thorough description of the contents allows searching for the shots that fit a video producer's request, while shot classification allows skipping those that cannot

0031-3203/01/$22.00 © 2001 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved.
PII: S0031-3203(01)00061-9


be reused. News videos have a rather definite structure and do not offer a wide variety either of edit effects, which are mainly cuts, or of shooting conditions (e.g. illumination). The definite structure of news is suitable for content analysis and has been exploited for automatic classification of news sequences in Refs. [2–6]. In all of these systems a two-stage scene classification scheme is employed. First, the video stream is parsed and video shots are extracted. Each shot is then classified according to content classes such as newscaster, report, computer graphics, weather forecast. The general approach to this type of classification relies on the definition of one or more image templates for each content class. To classify a generic shot, a key frame is extracted and matched against the image template of every content class. Other works [6,7] deal with the problem of video indexing using information sources like the text of the captions and the audio track. This is due to the fact that news video images have an ancillary function with respect to words, and video content is strongly related to the textual and audio information contained in the audio track.

1.1. Previous work

A method for shot classification based on the syntax and the structure of news videos has been proposed in Ref. [2]. Shot classification is based on the similarity match of frames against a pre-determined set of prototype anchorman images. However, as noted in Refs. [3,5], the validity of this approach is limited by the difficulty of finding a representative set of prototype anchorman images. These should account for different cases, including news editions, changes of dress and modifications of the studio layout. Furthermore, the method proposed is computation intensive, since it requires the calculation of similarity between each frame and every prototypical image of the anchorman. In order to diminish the dependency on the set of sample frames, a different approach to the definition of the anchorman frame model has been proposed in Refs. [3,5]. In this model each anchorman frame is considered a composition of distinctive regions, like the shape of the anchorman, the caption of the reporter's name, and the graphics that sometimes appear in the top third of the frame. A model of the anchorman frame is built, which accounts for the spatial distribution of basic elements and is independent of the anchorman's sex, apparel and appearance. To determine whether a shot contains an anchorman, all the frames are compared with the model; if they match, they are classified as "anchorman", thus building a set of model images for each video. Only the frames of the shots that satisfy the similarity criteria according to the spatial model are then compared with the model-image set, using a new similarity measure. One of the limits of this method is that

if the style of the news changes, the database must be updated. A different approach, based on frame statistics, is presented in Ref. [8]. The system uses hidden Markov models to classify frames of news videos. The classification process takes into account several clues, including feature vectors based on difference images, average frame color and the audio signal. Parameters of the hidden Markov model are determined in a training stage using a ground-truth database of news videos. The problem of text extraction has been investigated by several researchers. A method for the extraction of captions and scene text (e.g. street names or shop names in the scene) from movies has been presented in Ref. [9]. Techniques for the extraction and OCR of caption text for news video indexing have been examined in Ref. [7]. The first problem that must be solved for effective text extraction is to determine which frames contain captions and the position of the text in the frame. The method presented in Ref. [7] is based on the search for rectangular regions, composed of elements with sharp borders, appearing in sequences of frames; it is also based on the assumption that the captions have a high contrast against the background. For the purpose of video content annotation, speech transcriptions have been used in the CMU Informedia project as an extremely important source of information [10–12]. News-on-demand is an application within the Informedia digital video library project [6] that indexes news from TV and radio sources and allows the user to retrieve news by content. The system creates a time-aligned transcript from speech recognition and captions. The video data are segmented into news stories using the presence of silence and captions as "paragraph" boundaries, while scene breaks and key-frames are identified using algorithms based on color histograms. The CMU Sphinx-II speech recognition system is used both for the speech transcription and for the user interface of the content-based retrieval system. There is no shot classification, and the speech recognition system uses the whole audio track, obtaining variable error rates that depend on the audio source [11]. Two prototypes for the construction of personalized TV news programs have been presented in Ref. [13]. The first prototype allows category-based retrieval using manual annotation provided by the news producer. The second prototype indexes the shot content using teletext data that are provided for deaf people by a French TV channel. The indexing of news videos uses the video parsing system presented in Ref. [14]. This system detects cuts by computing the difference of color histograms of consecutive frames. Shots containing the anchorman are identified by combining shot similarity, person detection and the "high variance factor", which accounts for the "regular spot presence" of the anchorman shots.

1.2. The news indexing and annotation system

In this paper a system for content-based indexing and annotation of news videos is presented. Videos are segmented to identify video shots. On the basis of the first frame of each shot, a statistical analysis is performed to detect which shots recur throughout the video. These shots are classified as newscaster shots, and the others are classified as report shots. The content of a generic report shot is described through the use of both visual and textual information, and is further classified as computer graphics (non-realistic) or realistic, in order to improve the reuse of realistic shots, as needed by the broadcasters. Textual information is automatically extracted from the textual captions included in the video and from the speech associated with the video. Differently from Ref. [6], only anchorman shots are used for speech recognition. A retrieval engine allows the user to search by content and browse through video shots. This paper is organized as follows: in Section 2, the video segmentation technique used to identify video shots is presented. In Section 3 the shot classification system is presented, and a comparison is carried out with respect to other techniques. In Section 4, video content description is expounded with reference to the extraction of textual information from OCR and speech recognition. Finally, in Section 5 retrieval and browsing examples are provided.

2. Video segmentation

In order to perform segmentation of news videos, two problems must be dealt with: (i) avoiding incorrect identification of shot changes due to rapid motion or sudden lighting changes in the scene (false positives), (ii) identification of sharp shot transitions (cuts) as well as gradual ones (dissolves, mattes). Ref. [15] reports a thorough comparison of video segmentation algorithms. In the following, we concentrate on cuts since they are, by far, the most commonly employed edit effect in news videos. Furthermore, for the purpose of content-based indexing, it is not important to classify the edit effect, but to detect changes of visual content. Table 1 shows the number of sharp and gradual edit effects used in 4 h of news videos of the three most important Italian broadcasters. The identification of gradual as well as sharp transitions can be performed through a cut detection algorithm,
Table 1. Shot boundary statistics for news videos
Shots: 1797
Cuts: 1702 (94.7%)
Dissolves + wipes + mattes: 95 (5.3%)

provided that the video is suitably sub-sampled in time. In fact, gradual transitions become sharp if the video is sub-sampled in the time variable, since the difference between consecutive frames increases. The cut detection algorithm is developed in two distinct steps:

Preliminary cut detection: Rapid motion in the scene and sudden changes in lighting produce a low correlation between contiguous frames, especially when a high temporal sub-sampling rate is adopted. To avoid false cut detection, a metric has been studied which proves highly insensitive to such variations, while being reliable in detecting "true" cuts [16]. Each frame is partitioned into nine sub-frames. Each of these is represented by its color histogram in the HSI color space. Actually, to improve independence with respect to lighting conditions, the histogram takes into account only the hue H and saturation S properties. The HSI color space has been chosen since, as reported in Ref. [17], it is a good compromise between missed detections and computational costs. Edit effect detection is performed considering the volume of the difference of sub-frame histograms in two consecutive frames. Cuts correspond to zero crossings of the difference of the average values of the difference of the volumes. This method allows edit effect identification even when the frame color statistics remain the same but the position of the color spots is different. To keep false positive detection low, the results of the first pass are refined using a method based on video structure and shot similarity.

Cut detection refinement: The algorithm described above features a high false positive detection rate in some critical situations, such as: (i) color instabilities due to noise in the digitization process, (ii) insertion of graphics or other changes of large zones in images, (iii) news shots recorded in critical situations, or news shots featuring sudden lighting changes. Typically, lighting changes are due either to long sequences of flashes, as in press conferences, or to sudden camera movements (panning and zooming) and free-hand takes, as in reports on demonstrations or war actions. To reduce errors due to multiple and rapid variations of the visual content of the shot, knowledge of the specific structure of news videos has been exploited. In fact, unlike other types of videos, such as commercials and movies, where the editing can reach frantic levels, in news videos the duration of the shots is long enough to let the audience "understand" the subject. Thus, there is always a minimum temporal distance L between two consecutive cuts. This rule is adopted to disregard all those cuts that are less than L seconds away from the preceding cut (inter-cut time difference constraint). Furthermore, since cuts identify a change of the video content, the key-frames of shots for two consecutive cuts cannot be too similar. This rule is used to disregard all those cuts whose similarity with the preceding cut exceeds a threshold S (inter-cut frame similarity constraint).
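As an illustration of this two-pass scheme, the sketch below (Python with OpenCV and NumPy, assumed available) compares nine sub-frame hue/saturation histograms over a temporally sub-sampled frame list, then applies the two refinement constraints. The paper's zero-crossing criterion is simplified here to plain thresholding, and every numeric parameter (step, diff_thresh, fps, min_gap_s, sim_thresh) is an illustrative assumption, not a value from the paper.

```python
import cv2
import numpy as np

def hs_histograms(frame, grid=3, bins=(16, 16)):
    """Return concatenated, normalized hue/saturation histograms of the
    grid x grid sub-frames of a BGR frame (HSV used as a stand-in for HSI)."""
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    h, w = hsv.shape[:2]
    hists = []
    for r in range(grid):
        for c in range(grid):
            sub = hsv[r * h // grid:(r + 1) * h // grid,
                      c * w // grid:(c + 1) * w // grid]
            hist = cv2.calcHist([sub], [0, 1], None, list(bins), [0, 180, 0, 256])
            hists.append(hist.flatten() / max(hist.sum(), 1.0))
    return np.concatenate(hists)

def detect_cuts(frames, step=5, diff_thresh=0.5):
    """Preliminary pass: flag frame indices where the sub-frame histogram
    difference ("volume") between sub-sampled frames is large."""
    cuts, prev = [], hs_histograms(frames[0])
    for i in range(step, len(frames), step):
        cur = hs_histograms(frames[i])
        if np.abs(cur - prev).sum() > diff_thresh:
            cuts.append(i)
        prev = cur
    return cuts

def refine_cuts(cuts, frames, fps=25.0, min_gap_s=1.0, sim_thresh=0.2):
    """Refinement pass: enforce the inter-cut time difference and
    inter-cut frame similarity constraints described above."""
    accepted = []
    for c in cuts:
        if accepted:
            last = accepted[-1]
            if (c - last) / fps < min_gap_s:
                continue  # closer than L seconds to the preceding cut
            diff = np.abs(hs_histograms(frames[c]) - hs_histograms(frames[last])).sum()
            if diff < sim_thresh:
                continue  # key-frames too similar to mark a content change
        accepted.append(c)
    return accepted
```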

Table 2. Statistics of all the videos
Shots: 731
Detected shots: 765
False detections: 43 (5.9%)
Missed detections: 9 (1.2%)

The performance of the proposed technique has been evaluated with reference to a test database composed of 12 videos from 6 Italian TV channels (RAI 1, 2 and 3, Mediaset Canale 5 and Cecchi Gori TeleMonteCarlo 1), for a total time of 2 h and 42 min. Table 2 reports the number of video shots, cuts, gradual edit effects, falsely detected edit effects and missed detections. With respect to cut detection based exclusively on color histograms, the use of cut detection refinement results in a 37% improvement in false detections (from 69 to 43).

3. Shot classification

The main goals of shot classification are the separation of reusable from non-reusable shots, and the indexing of the video. For each video shot, the first frame is used as the key-frame. Video shots are classified into two main classes: anchorman and news report. Sub-classifications of the anchorman shots (like "weather forecast") are obtained by considering the speech content, as explained in Section 4. Shot classification is a two-step process: the first step separates anchorman and report shots, using a statistical approach and motion features of the anchorman shots, without requiring any model. Then news report shots are processed in order to detect those that contain computer graphics.

Classification of anchorman and computer graphics shots is important since they cannot be reused. Fig. 1 shows an example: the anchorman introduces a report about accidents in the home; then, after some realistic shots that show typical housework, there is a computer graphics shot that shows some statistics. While the realistic shots are reusable in another report that deals with housework, the anchorman and the graphics are not, and will be replaced by another anchorman and by newer computer graphics.

3.1. Classification of anchorman shots

3.1.1. Classification based on statistical features

Shots of the anchorman are repeated at intervals of variable length throughout the video. The first step for the classification of these shots stems from this assumption and is based on the computation, for each video shot S_k, of its shot lifetime L(S_k). The shot lifetime measures the shortest temporal interval that includes all the occurrences of shots with similar visual content within the video. Given a generic shot S_k, its lifetime is computed by considering the set T_k = {t_i | S(S_k, S_i) ≥ τ_s}, where S(S_k, S_i) is a similarity measure applied to the key-frames of shots S_k and S_i, τ_s is a similarity threshold, and t_i is the value of the time variable corresponding to the occurrence of the key-frame of shot S_i. The lifetime of shot S_k is defined as L(S_k) = max(T_k) − min(T_k). Shot classification is based on fitting the values of L(S_k) for all the video shots into a bimodal distribution. This is used to identify a threshold value τ_l that is used to classify shots into service and anchorman categories. In particular, all the shots S_k such that L(S_k) ≥ τ_l are classified as anchorman shots, where τ_l is determined according to the statistics of the test database and set to 4.5 s. Remaining shots are classified as news service shots.
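A minimal sketch of this lifetime-based classification, assuming a list of key-frame timestamps and a key-frame similarity function are available (both names are hypothetical; the 4.5 s default follows the value reported above, while sim_thresh stands in for the unspecified τ_s):

```python
def shot_lifetime(k, key_frame_times, similarity, sim_thresh):
    """L(S_k) = max(T_k) - min(T_k), where T_k collects the timestamps of
    the shots whose key-frames are similar enough to shot k's key-frame."""
    t_k = [key_frame_times[i] for i in range(len(key_frame_times))
           if similarity(k, i) >= sim_thresh]
    return max(t_k) - min(t_k)  # t_k is never empty: shot k matches itself

def classify_shots(key_frame_times, similarity, sim_thresh, lifetime_thresh=4.5):
    """Label each shot 'anchorman' if its lifetime reaches the threshold,
    'news service' otherwise."""
    return ["anchorman"
            if shot_lifetime(k, key_frame_times, similarity, sim_thresh) >= lifetime_thresh
            else "news service"
            for k in range(len(key_frame_times))]
```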

Fig. 1. Example of reusable (b, c, d, e, f) and not reusable (a, g) shots.

Fig. 2. Lifetime of anchorman shots.

This classification method does not rely on any pre-defined model of the anchorman shots; rather, it is based on the time structure of news videos. Fig. 2 shows the lifetimes of three different types of anchorman shots identified in a news video.

3.1.2. Classification based on motion features

Shot classification based on statistical features can sometimes lead to the erroneous classification of some news service shots as anchorman shots. This occurs mainly in correspondence with interviews and reports. In interviews, the camera alternately takes shots of the interviewer and the interviewed people; erroneous classification of interview shots has been discussed in Ref. [8]. In reports, a reporter describes the content shown in some shots; at the end of every shot (or series of shots) there is the shot of a new reporter describing the next series. This structure replicates the whole structure of a news video. An example is shown in Fig. 3, where the recurrence of shots (a), (c) and (g) leads to the erroneous classification of these shots as anchorman shots. To avoid these errors, the preliminary classification based on statistical features is refined by considering motion features of the anchorman shots. Classification refinement stems from the assumption that in an anchorman shot both the camera and the anchorman are almost motionless. In contrast, for both interview and news service shots, background objects and camera movements (persons and vehicles, free-hand shots, camera panning and zooming) cause relevant motion components throughout the shot.

Classification refinement is performed by computing an index of the quantity of motion Q_S for each candidate anchorman shot. The algorithm takes into account the frame-to-frame difference between the shot key-frame f_1 and subsequent frames f_i in the shot, according to

Q_S = Σ_{f_i ∈ S} D_i,   with   D_i = Σ_{x,y} d_RGB(f_1(x,y), f_i(x,y)),      (1)

d_RGB(f_1(x,y), f_i(x,y)) = 0   if ||f_1(x,y) − f_i(x,y)|| ≤ τ_RGB,
d_RGB(f_1(x,y), f_i(x,y)) = 1   if ||f_1(x,y) − f_i(x,y)|| > τ_RGB.      (2)

To enhance sensitivity to motion, the shot is sub-sampled in time, and the frames are compared to the key-frame f_1. Only those shots whose Q_S does not exceed a threshold τ_Q are definitively classified as anchorman shots.
By using this classification refinement, the false anchorman shots shown in Fig. 3 are eliminated. In fact, shots (a), (c) and (g) feature a relevant motion component, on account of camera zooming and panning and the movement of people and objects in the background.

3.2. Classification of computer graphics shots

Shots classified as containing news reports are processed in order to detect whether they contain computer graphics. Fig. 4 shows an example of a computer graphics shot. Usually, these types of shots show information about money exchange rates, economic indexes and other

Fig. 3. Example of “false positive” class.

Fig. 4. Example of reusable (report) and not reusable (anchorman and CG) shots.

graphs. They are not reusable, due to the fact that the information they convey is subject to fast changes and can be inexpensively replaced. The shot represented by key-frame (e) in Figs. 4 and 5 shows the sales of February 2000 compared to those of February 1999, and has little reuse value. In contrast, the shots that show workers in a factory can be reused in other reports. Unlike the anchorman shots, neither the structure of the video nor the layout can be used as a hint to detect computer graphics shots. The features used to classify these shots are based on statistical parameters and motion features. The first step calculates an index of the quantity of motion Q_CG, dividing each shot into sub-shots and taking into account the frame-to-frame difference between the sub-shot key-frame f_i and f_{i+1}, according to the previous equation. To reduce possible misclassification of still images, the preliminary classification is refined by analyzing the color histogram in the HSI color space. The histogram takes into account only the H and S components, and two indexes are calculated: N_bin is the number of histogram bins whose value is higher than a given percentage of the frame pixels; N_pix is the percentage of pixels represented by a selection of the biggest bins of the histogram. N_bin and N_pix are calculated for each key-frame of the sub-shots and are summed; if one of these values exceeds a threshold, the shot is discarded. N_bin and N_pix exploit the fact that computer graphics shots present a more "compact" color histogram than realistic shots, with a low-contrast background that improves the legibility of text and graphics. Table 4 reports the performance of the computer graphics shot classification. The algorithm accounts for the possible presence of small motion in a CG shot, due to moving text and graphics; an example is shown in Fig. 5.
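The two histogram indexes can be sketched as follows (Python/OpenCV); bin_frac and top_k are hypothetical parameters standing in for the unspecified thresholds, and a low N_bin together with a high N_pix points to a computer graphics key-frame:

```python
import cv2
import numpy as np

def cg_color_indexes(frame_bgr, bins=(16, 16), bin_frac=0.01, top_k=5):
    """N_bin: number of H-S histogram bins holding more than bin_frac of
    the pixels. N_pix: fraction of pixels covered by the top_k largest
    bins. Computer graphics key-frames concentrate color in few bins."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1], None, list(bins), [0, 180, 0, 256]).flatten()
    total = max(hist.sum(), 1.0)
    n_bin = int((hist > bin_frac * total).sum())
    n_pix = float(np.sort(hist)[-top_k:].sum() / total)
    return n_bin, n_pix
```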

3.3. Performance evaluation

The shot classification algorithms have been tested on a test database of news videos. To verify the robustness of the classification process, the database includes news videos of different broadcasters, featuring different styles and layouts both for the anchorman shots and for the computer graphics shots.
Fig. 5. Example of computer graphics text.

Fig. 6. Example of anchorman shots’ styles.

A short analysis of the most recurrent styles of anchorman shots is reported below:

• The anchorman shots are taken with a fixed-position camera, while an image in the background shows the subject of the incoming service (Fig. 6(a) and (b)). This style is adopted for all the editions of RAI TG2.
• The anchorman can be either standing or seated, with camera movements and edit effects. The background is usually static (photos or logos). An example is shown in Fig. 6(c) and (d). This style is used, for example, in the evening edition of RAI TG3.
• Two anchormen alternate with each other (Fig. 6(e) and (f)). The background is almost fixed, or the movement is confined to small regions. This style is used in the evening edition of Mediaset TG5 and in some CNN editions.
• There is a more or less uniform background, some camera movements, and a limited number of anchorman framings, for example front view and 3/4 view (Fig. 6(g) and (h)).

The results on the test set used in Section 2 are reported in Table 3. The use of the motion feature reduces the number of false detection errors from 14 to 3.

Table 3. Results of the shot classification process
Anchorman shots: 66
Detected anchorman shots: 67
False detections: 3 (4.5%)
Missed detections: 2 (3%)

Table 4. Results of the CG shot classification process
Shots: 318
CG shots: 15
False detections: 10
Missed detections: 2

Missed detections occurred with type (d) shots when the background contains motion. False detections occurred in the presence of interviews, which are similar to types (a) and (b). To reduce false detections for style (b), analysis was restricted to the central part of the frame, according to the broadcaster's style. The test set used for the computer graphics classification is a subset of the one used for the anchorman classification (see Table 4). Missed detections of computer graphics are due to fast action, like fast-moving text, while false detections occurred in the presence of still images, or shots that

featured very little motion, with a low contrast that led to color histogram distributions similar to those of non-realistic shots.

4. Video content representation

To support effective video retrieval by content, high-level information must be extracted from videos and used to perform shot sub-classification based on their content. Information additional to the shot classification is extracted from text captions and from the anchorman's speech.

4.1. Text recognition

In news videos, text captions are used to show information about the shot being broadcast, such as the site where the action takes place (in service shots) and the names of the people shown in the video (in both anchorman and service shots). Extraction of text information from video captions has been performed by integrating a traditional OCR engine within our system. The OCR engine cannot be supplied with raw video frames: a pre-filtering phase is required. This phase includes two distinct steps: caption identification and text/background separation.

Caption identification: If a shot includes a caption, it is not guaranteed that the caption is present in the first frames of the shot. Sometimes the caption appears in the middle of the shot and disappears after the last frame of the shot. Identification of frames including a caption is based on the fact that captions are always used in combination with graphic elements that improve text readability. These graphic elements follow different styles and may include opaque backgrounds and colored lines (Fig. 7). Captions are always located in the lower part of the frame. Caption identification is based on matching a pre-defined model of the graphic elements against shapes extracted from the lower part of the frames. The model accounts for the presence of horizontal stripes/long lines, either colored or opaque, according to the different broadcasters' styles.

Text/background separation: Text separation is complicated by the presence of captions featuring a poor

text/background contrast (Fig. 7(c) and (d)). This problem is dealt with by using a text/background separation method that exploits the persistence of patterns over contiguous frames. The method is based on the assumption that, for the entire display of a caption, all the pixels corresponding to the text keep more or less the same value, while the value of the pixels in the background changes. Text/background separation is performed by highlighting the pixels whose value is almost constant. Captions usually display over two or more consecutive shots. A critical instance of text/background separation occurs when a caption without an opaque background appears over a static scene (e.g. a photo or a painting) and is displayed only for the duration of a single shot. This is indeed a rare condition that we did not encounter in our test sequences. Let us assume that {f_0, ..., f_k} is a sequence of frames that has been identified as including a caption. A new sequence {f̄_0, ..., f̄_k} is computed as follows:

f̄_0(i,j) = 0,
f̄_k(i,j) = min(255, f̄_{k−1}(i,j) + δ)   if f_k(i,j) = f_{k−1}(i,j),
f̄_k(i,j) = max(0, f̄_{k−1}(i,j) − δ)    if f_k(i,j) ≠ f_{k−1}(i,j),

where f_k(i,j) is the gray level value of pixel (i,j) in frame k, and δ is a pre-defined incremental step. In this way, the sequence {f̄_0, ..., f̄_k} is characterized by a text caption that gradually fades in (Fig. 8). This method has proven to be robust even in those cases where the sequence of frames includes several captions separated by editing effects such as dissolves and cuts. Finally, the sequence {f̄_0, ..., f̄_k} is processed in order to extract some frames that are used to feed the OCR engine. For this purpose, the correlation C(f̄_{k−1}, f̄_k) is computed for every pair of contiguous frames. Frames characterized by local maxima of the correlation function are passed to the OCR engine. Graphic elements like the line and the TG2 logo are removed, since they interfere with OCR processing.

The OCR engine: To increase the separation between single characters, and to ease their segmentation by the OCR program, thresholding is applied to the images extracted in the previous step.
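A minimal sketch of the fade-in accumulation defined above, assuming the caption frames are already cropped and converted to gray-level NumPy arrays; delta plays the role of δ, with an illustrative value:

```python
import numpy as np

def accumulate_caption(gray_frames, delta=8):
    """Build the fade-in sequence: pixels whose gray value is unchanged
    between consecutive frames are raised by delta (text persists),
    changing pixels are lowered (background varies)."""
    acc = np.zeros_like(gray_frames[0], dtype=np.int32)  # f-bar_0 = 0
    prev = gray_frames[0]
    for cur in gray_frames[1:]:
        stable = (cur == prev)
        acc = np.where(stable, np.minimum(255, acc + delta),
                               np.maximum(0, acc - delta))
        prev = cur
    return acc.astype(np.uint8)  # bright pixels mark candidate text
```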

Fig. 7. Different styles of captions.

Table 5. OCR and speech recognition results
Speech recognition (anchorman shots, not trained): 57%
Speech recognition (anchorman shots, trained): 84%
Speech recognition (all audio track, trained): 52%
OCR (TextBridge): 87%
OCR (SOCR): ≈ 60%

Fig. 8. Character extraction.

Two OCR programs have been tested: (i) TextBridge OCR (a commercial Windows OCR package); (ii) SOCR (an open-source OCR developed by the University of Waikato, New Zealand, http://www.socr.org); its most recent complete version is 0.1, and its recognition rate changes according to the fonts employed. Results are shown in Table 5.

4.2. Speech recognition

During the anchorman shot, the content of the following news service is summarized, while in the news service the reporter provides detailed information on the topic. In order to improve the speech recognition rate and extract only the relevant information about the news service content, the speech recognition engine is fed with the audio track of the anchorman shots only. In fact, as reported in Ref. [12], generally there is not an exact synchronization between the speech and the objects shown in the news service. Often the content of the shots does not correspond to the reporter's description; consequently, associating the audio track content with the corresponding shots may lead to erroneous results. Furthermore, the audio track of news services is typically disturbed by background noise, and sometimes includes speech transmitted through low-quality telephone or satellite links.

Speech recognition engine: The speech recognition engine used is IBM ViaVoice 98, which features speaker independence and continuous speech processing. The speech recognition engine is based on a hidden Markov model of the language and uses the following resources: (i) a language thesaurus that can be customized

and enhanced, (ii) a customizable model of word usage, (iii) word pronunciation models. A database of audio tracks was used to train the word usage model. The database included the audio tracks of the anchormen of several broadcasters. Sentences corresponding to the speech were manually transcribed. Their content covered different topics such as sports news, politics, chronicles and gossip. The speech recognition rate was measured on a test database that did not include any of the audio tracks used for training. Results are shown in Table 5. Words extracted by the speech recognition engine were filtered in order to remove all utility words (articles, pronouns, conjunctions and prepositions), which account for approximately 50% of the speech in Latin languages. The remaining words are used to describe the content of the following news services.
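The filtering step amounts to a stop-word list; a toy version follows, where the Italian word set is a small illustrative sample rather than the system's actual resource:

```python
# Tiny illustrative sample of Italian utility words (articles, pronouns,
# conjunctions, prepositions); the real system used a fuller list.
STOP_WORDS = {"il", "lo", "la", "i", "gli", "le", "un", "una", "di", "a",
              "da", "in", "con", "su", "per", "e", "che", "non", "si"}

def filter_transcript(words):
    """Keep only content words from the recognized anchorman speech."""
    return [w for w in words if w.lower() not in STOP_WORDS]
```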

5. Video retrieval

The techniques for video segmentation, shot classification and shot content description presented in the previous sections have been integrated into a system for content-based retrieval of TV news. At archiving time, news videos are automatically processed in order to extract content descriptors for each video shot. The content descriptor of a generic shot includes:

• Shot type identifier: either anchorman, news service or computer graphics.
• TV broadcaster identifier.
• Broadcast date and time.
• Visual shot descriptor: the key-frame of the shot.
• Textual shot descriptor: the set of words extracted from shot captions and from speech recognition of the preceding anchorman shot. Manual annotations can be added.

At retrieval time, the system supports video querying and browsing. To reduce the effect of the errors of the OCR programs, the retrieval system uses the AGREP approximate text search, which allows finding words that contain errors.
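As a sketch, the descriptor could be modeled as below; the field names are hypothetical, chosen to mirror the list above rather than taken from the original system:

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List

@dataclass
class ShotDescriptor:
    shot_type: str          # "anchorman" | "news service" | "computer graphics"
    broadcaster: str        # TV broadcaster identifier
    broadcast_time: datetime
    key_frame_path: str     # visual descriptor: the shot key-frame
    words: List[str] = field(default_factory=list)        # captions + speech terms
    annotations: List[str] = field(default_factory=list)  # optional manual notes
```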

Fig. 9. (a) Specification of a fuzzy search for "President Clinton"; (b) result page with shots that match both keywords; (c) last page with shots that match only the keyword "President".

Queries formulated according to TV broadcaster, date, time, content and any Boolean combination of these are supported. One or more words can be input by the user. These are matched against the textual shot descriptors of the database videos through the use of a thesaurus, so as to support exact word and synonym matching. Matched shots are presented to the user for browsing; for each matched shot, all the information stored in its content descriptor is shown. In Fig. 9(a) a sample query by content is shown: the user enters a Boolean combination of the words 'President' or 'Clinton' to search for shots with similar content.
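AGREP itself is an external approximate-matching tool; as a stand-in, the sketch below reproduces the idea with a plain Levenshtein distance, so that OCR-corrupted descriptor words still match the query (max_errors is an illustrative tolerance):

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def fuzzy_match(query, descriptor_words, max_errors=2):
    """Return descriptor words within max_errors edits of the query, e.g.
    an OCR output like 'Presldent' still matches 'President'."""
    return [w for w in descriptor_words
            if edit_distance(query.lower(), w.lower()) <= max_errors]
```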

Retrieval results are shown in Fig. 9(b). The query also retrieves shots annotated with the Italian word "Presidente" (from speech transcription and manual annotation), since the "fuzzy" search method is used. Retrieved shots are shown in decreasing order of match. The first shots match both query keywords and show news and anchorman shots related to "President Clinton". The other retrieved shots match only the keyword "President" and show "President Milosevic" and "President Scalfaro". The "Previous" and "Next" buttons at the top of the

window allow the user to navigate through all the retrieved shots. Selecting a shot key-frame from the output interface displays the entire shot through a movie player application.

6. Conclusions

This paper has presented a system for content-based indexing and retrieval of news videos. The system features content-based shot classification of anchorman and non-realistic shots, to allow the reuse of report shots, and extraction of high-level content descriptors through caption OCR and speech recognition. Shot classification is based on statistical and motion features of the news video structure, so as to provide independence from the TV broadcaster's style.

References
[1] A. Del Bimbo, C. Colombo, P. Pala, Semantics in visual information retrieval, IEEE Multimedia 6 (3) (1999) 38–53.
[2] D. Swanberg, C. Shu, R. Jain, Knowledge guided parsing in video databases, SPIE 1908 (13) (1993) 13–24.
[3] S. Smoliar, H.J. Zhang, Y. Gong, Automatic parsing of news video, Proceedings of the IEEE Conference on Multimedia Computing and Systems, May 1994, pp. 45–54.
[4] T. Kanade, Y. Nakamura, Semantic analysis for video contents extraction: spotting by association in news video, ACM Multimedia 97 (1997) 393–401.
[5] H. Zhang, B. Furht, S.W. Smoliar, Video and Image Processing in Multimedia Systems, Kluwer Academic Publishers, Dordrecht, 1995.
[6] A.G. Hauptmann, M.J. Witbrock, Informedia: news-on-demand multimedia information acquisition and retrieval, in: M.T. Maybury (Ed.), Intelligent Multimedia Information Retrieval, AAAI Press, Cambridge MA, 1997, pp. 215–239.
[7] T. Sato, T. Kanade, E.K. Hughes, M.A. Smith, Video OCR for digital news archive, IEEE International Workshop on Content-Based Access of Image and Video Databases CAIVD '98, 1998, pp. 52–60.
[8] S. Eickeler, S. Muller, Content-based video indexing of TV broadcast news using hidden Markov models, Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), March 1999, pp. 2997–3000.
[9] R. Lienhart, Indexing and retrieval of digital video sequences based on automatic text recognition, Fourth ACM International Multimedia Conference, 1996.
[10] A.G. Hauptmann, Speech recognition in the Informedia digital video library: uses and limitations, ICTAI 95, 1995.
[11] A.G. Hauptmann, H.D. Wactlar, M.J. Witbrock, Informedia: news-on-demand experiments in speech recognition, ARPA Speech Recognition Workshop, 1996.
[12] M.J. Witbrock, A.G. Hauptmann, Speech recognition for a digital video library, JASIS, 1996.
[13] D. Luparello, B. Merialdo, K.T. Lee, J. Roudaire, Automatic construction of personalized TV news programs, Proceedings of ACM Multimedia 99, 1999, pp. 323–330.
[14] B. Merialdo, Automatic indexing of TV news, WIAMIS '97, June 1997.
[15] J.S. Boreczky, L.A. Rowe, Comparison of video shot boundary detection techniques, Technical Report, Computer Science Division-EECS, University of California, Berkeley.
[16] M. Caliani, C. Colombo, A. Del Bimbo, Commercial video retrieval by induced semantics, IEEE International Workshop on Content-Based Access of Image and Video Databases CAIVD '98, 1998, pp. 72–80.
[17] U. Gargi, R. Kasturi, An evaluation of color histogram based methods in video indexing, First International Workshop on Image Database and Multi-media Search, 1996, pp. 75–82.

About the Author: MARCO BERTINI has a research grant and carries out his research activity at the Department of Systems and Informatics of the University of Florence, Italy. He received an MS in electronic engineering from the University of Florence in 1999. His main research interest is content-based indexing and retrieval of videos.

About the Author: ALBERTO DEL BIMBO graduated in 1978 from the University of Florence, Italy, where he is presently Full Professor of Computer Engineering. He is presently the Director of the Master in Multimedia at the University of Florence and the Deputy Rector for Research and Innovation Transfer of the University of Florence. His scientific interests and activities have addressed the subjects of image technology and multimedia, with particular reference to object recognition and image sequence analysis, content-based retrieval for image and video databases, and advanced man-machine interaction. Prof. Del Bimbo is the author of over 150 publications that have appeared in the most distinguished international journals and conference proceedings, and is the author of the monograph "Visual Information Retrieval", published by Morgan Kaufmann in 1999. From 1996 to 2000 he was the President (formerly Vice-President) of the Italian Chapter of IAPR, the International Association for Pattern Recognition. He obtained the IAPR fellowship in 2000. He presently serves as Associate Editor of IEEE Transactions on Multimedia, IEEE Transactions on Pattern Analysis and Machine Intelligence, Pattern Recognition, Pattern Analysis and Applications, Journal of Visual Languages and Computing, and Multimedia Tools and Applications. Since 1999 he has been Member at Large of the IEEE Publications Board.

About the Author: PIETRO PALA is an assistant professor in the Department of Systems and Informatics at the University of Florence, Italy. He received his MS in electronic engineering from the University of Florence in 1994 and a Ph.D. in information science from the same university in 1998. His current research interests include pattern recognition, image and video retrieval by content, and related applications.

