A Generic Framework for Event Detection in Various Video Domains
Tianzhu Zhang 1,2, Changsheng Xu 1,2, Guangyu Zhu 3, Si Liu 1,2,3, Hanqing Lu 1,2

1 National Lab of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, P. R. China
2 China-Singapore Institute of Digital Media, Singapore 119615, Singapore
3 Department of Electrical and Computer Engineering, National University of Singapore, Singapore

{tzzhang, csxu, sliu, luhq}@…, elezhug@….sg

ABSTRACT
Event detection is essential for the extensively studied video analysis and understanding area. Although various approaches have been proposed for event detection, there is a lack of a generic event detection framework that can be applied to various video domains (e.g. sports, news, movies, surveillance). In this paper, we present a generic event detection approach based on semi-supervised learning and Internet vision. Concretely, a Graph-based Semi-Supervised Multiple Instance Learning (GSSMIL) algorithm is proposed to jointly explore small-scale expert labeled videos and large-scale unlabeled videos to train the event models to detect video event boundaries. The expert labeled videos are obtained from the analysis and alignment of well-structured video-related text (e.g. movie scripts, web-casting text, closed captions). The unlabeled data are obtained by querying related events from a video search engine (e.g. YouTube) in order to give more distributional information for event modeling. A critical issue of GSSMIL in constructing a graph is the weight assignment, where the weight of an edge specifies the similarity between two data points. To tackle this problem, we propose a novel Multiple Instance Learning Induced Similarity (MILIS) measure by learning instance-sensitive classifiers. We perform thorough experiments in three popular video domains: movies, sports and news. The results compared with the state-of-the-art methods are promising and demonstrate that our proposed approach is effective.
Categories and Subject Descriptors
H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing - Abstracting methods, Indexing methods.
General Terms
Algorithm, Measurement, Performance, Experimentation
Keywords
Event Detection, Graph, Multiple Instance Learning, Semi-supervised Learning, Broadcast Video, Internet, Web-casting Text
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. MM'10, October 25-29, 2010, Firenze, Italy. Copyright 2010 ACM 978-1-60558-933-6/10/10 ...$10.00.

1. INTRODUCTION
With the explosive growth of multimedia content on broadcast and the Internet, it is urgently required to make the unstructured multimedia data accessible and searchable with great ease and flexibility. Event detection is particularly crucial to understanding video semantic concepts for video summarization, indexing and retrieval purposes. Therefore, extensive research efforts have been devoted to event detection for video analysis [23, 26, 33]. Most of the existing event detection approaches rely on video features and domain knowledge, and employ labeled samples to train event models. The semantic gap between low-level features and high-level events in different kinds of videos, ambiguous video cues, background clutter and varying camera motion further complicate video analysis and impede the implementation of event detection systems. Moreover, due to the diverse domain knowledge in different video genres and insufficient training data, it is difficult to build a generic framework that unifies event detection in different video domains (e.g. sports, news, movies, surveillance) with high accuracy.

To solve these issues, most current techniques for event detection rely on video content and supervised learning in the form of labeled video clips for particular classes of events. It is necessary to label a large amount of samples in the training process to achieve good detection performance. In order to reduce the human labeling effort, one can exploit the expert supervisory information in text sources [3, 15, 23], such as movie scripts, web-casting text and closed captions, which can provide useful information to locate possible events in video sequences. However, it is very costly and time-consuming to collect large-scale training data by text analysis, and there are still many videos without corresponding text information. The Internet, nevertheless, is a rich information source with many event videos taken under various conditions and roughly annotated. For example, the surrounding text is an important clue used by search engines. Our intuition is that it is convenient to obtain a large-scale collection of videos as unlabeled data to improve the performance of event detection. To do this, we propose a semi-supervised learning framework to exploit the expert labeled and unlabeled video data together.

However, the labeled data obtained by text analysis only have weakly associated labels, which means we know the video's label but may have no precise information about the localization of the event in the video. For the unlabeled data obtained by Internet searching, we also do not know the precise localization of the event. To find the precise localization of an event, we temporally cut a video into multiple segments and find the segments corresponding to the event. Thus event localization can be considered as a typical multiple instance learning problem, where each segment is an instance and all segments of a
video clip compose a bag. With this formulation, event boundaries can be located by selecting the better segments.

In this paper, we propose a generic framework to automatically detect events from three realistic and challenging video datasets: sports, movie and news. We aggregate two sources of video data, from text analysis and from web video search, under a semi-supervised learning framework. To obtain the localization of events in video data, multiple instance learning is adopted. Therefore, we formulate event detection as a Graph-based Semi-Supervised Multiple Instance Learning problem. Compared with the existing approaches, the contributions of our work can be summarized as follows.

• We present a generic framework for event detection in various video genres. By combining multi-modality information in video data, the generic event models are constructed based on video content and are suitable for various video event analyses.

• To obtain effective event models in various video genres, we design a Graph-based Semi-Supervised Multiple Instance Learning (GSSMIL) method to integrate expert labeled data obtained by text analysis and unlabeled data collected from the Internet, which improves the detection performance and alleviates the insufficient training data problem by resorting to the Internet data source.

• To construct a discriminative graph for event model training, we introduce a Multiple Instance Learning Induced Similarity (MILIS) measure. The learned similarity considers the class structure, which is ignored by existing similarity measures [5, 7].

The rest of the paper is organized as follows. Section 2 reviews the related work. The framework of the proposed approach is described in Section 3. The technical details of video collection and graph-based label propagation are presented in Sections 4 and 5, respectively. Experimental results are reported in Section 6. We conclude the paper with future work in Section 7.
2. RELATED WORK

The work most related to our method covers event detection, semi-supervised learning and multiple instance learning; we review the state of the art of these three topics, respectively.

2.1 Event Detection

Most of the existing work on event detection focuses on one type of video, such as movie, sports or news. For event detection in movies, much work [23, 8, 12] incorporates visual information, closed-captioned text, and movie scripts to automatically annotate videos for classification and retrieval. For event detection in sports video, most of the previous work is based on audio/visual/textual features directly extracted from the video content [27, 13, 35, 31]. Some work uses text information [33, 26] such as closed captions and web text for event analysis. Similar to event detection in sports video, there is a lot of work on event analysis in news video [22, 19, 16]. By exploiting the available audio, visual and closed-caption cues, the semantically meaningful highlights in a news video are located and event boundaries are extracted. Analysis and modelling of abnormal event detection for video surveillance has also been studied [36, 4, 24]. These methods can be broadly categorized according to the type of scene representation: one very popular category is based on trajectory modeling [36, 4], and the other is based on motion and appearance representations [24]. However, these approaches are unsuitable for detecting multiple events in complex videos.

Most of the existing work uses domain knowledge [23, 33], which is difficult to transfer to other video domains. For example, the methods used for event detection in movies cannot be applied to domains such as sports videos that do not provide associated scripts, and event detection approaches in sports video that use text analysis [33] cannot be applied to the many videos without text information. Different from the previous work, we propose a generic framework for event detection in different video domains. The proposed method uses the videos with text information to learn a model and then propagates labels to videos with or without text information. Our learned model is based on video content; therefore, it can be used for more generic video event analysis.

2.2 Graph-based Semi-supervised Learning

In the past few years, the graph-based semi-supervised learning approach has attracted a lot of attention due to its elegant mathematical formulation and its effectiveness in combining labeled and unlabeled data through label propagation [21, 28]. The weight of the edge is the core component of a graph, and it is crucial to the performance of semi-supervised learning. Popular methods for the weight assignment include K-Nearest Neighbor (KNN), Gaussian Kernel Similarity (GKS) [5] and the Sparsity Induced Similarity measure (SIS) [7] based on sparse decomposition in the L1-norm sense. The main drawback of these approaches is that their performance is sensitive to parameter variation and they do not take label information into consideration. Different from the previous methods, we propose a new approach to measure similarities based on class structure information.

2.3 Multiple Instance Learning

There is little work [18] on detecting events with multiple instance representations [20, 30], which are well suited to event detection in videos. In a multiple instance representation, each segment of a video clip is an instance and all segments of a video clip form a bag. Labels (or events) are attached to the bags while the labels of instances are hidden. The bag label is related to the hidden instance labels as follows: the bag is labeled as positive if any instance in it is positive, otherwise it is labeled as negative. For our task, it is effective to distinguish a positive event instance (i.e., the particular event) from a negative instance (i.e., the background frames) with the help of multiple instance representations.
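To make the bag/instance relationship concrete, the following is a minimal sketch of the bag-labeling rule just described (a bag is positive if any of its instances is positive); the soft-score threshold of 0.5 is an illustrative assumption, not a value taken from the paper.

```python
import numpy as np

def bag_label(instance_scores, threshold=0.5):
    """A bag is positive if any of its instances is positive (the max rule)."""
    return int(np.max(instance_scores) >= threshold)

# A clip whose third segment contains the event is a positive bag.
print(bag_label([0.1, 0.2, 0.9]))  # 1
print(bag_label([0.1, 0.2, 0.3]))  # 0
```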
3. FRAMEWORK OF OUR APPROACH
The framework of our proposed approach is illustrated in Fig. 1. It contains three primary parts: video collection, GSSMIL and event detection. The video collection part includes expert labeled data collection, event-relevant data collection and de-noising. For expert labeled data collection, text information of different kinds of video is used to structure the video segments and to detect the start and end boundaries of an event, yielding a video event clip. Based on overlapped segmentation and de-noising of these video clips, a small-scale expert labeled database is collected. By querying key words on the web, we obtain raw videos; after segmentation and de-noising, a large-scale event-relevant database is constructed. In the de-noising process, a Bayesian rule is adopted to efficiently remove some noise. In the GSSMIL module, we combine the expert labeled data and unlabeled data (from the event-relevant data) for event detection. To effectively obtain
the similarity measure for the graph construction, we present an MILIS measure that considers the class structure. Finally, based on the learned event model, event recognition and localization are realized. The proposed approach is evaluated on highly challenging data from different video domains: movie, sports and news, and the experimental results are encouraging. The technical details of each module in the framework are described in the following sections.

[Figure 1: The proposed framework: (a.1) expert labeled data collection, (a.2) event-relevant data collection, (b) GSSMIL model learning with the MILIS measure, and (c) event detection (recognition and localization). For better viewing, please see the original color PDF file.]

4. VIDEO COLLECTION

In this section, we introduce how to collect annotated video samples and construct three different video datasets. To avoid manually labeling a large amount of video data, we design a smart strategy to automatically collect videos from professional broadcast service providers and the Internet. A small set of labeled videos is obtained by the analysis and alignment of well-structured video-related text (e.g. movie scripts, web-casting text, closed captions), while a large set of event-relevant videos is collected from the Internet by querying related events on a video search engine (e.g. YouTube) and filtering noise.

4.1 Expert Labeled Data by Text Analysis

Here, we introduce an automatic procedure, as shown in Fig. 1(a.1), for collecting videos of multiple types (sports, movie and news) supported by professional broadcast service providers. Different data sources come with different available text information. For movie videos, we follow [14, 23, 12] and use supervised text classification to detect events and automatically collect training samples. The OpenNLP toolbox [17] is applied for natural language processing and part-of-speech (POS) tagging to identify instances of nouns, verbs and particles, which avoids manual text annotation. In addition, we also use named entity recognition (NER) to identify people's names. Based on the results of POS and NER, we search for patterns corresponding to particular classes of events. Scripts describe events and their order in the video but usually do not provide time information. We find the temporal localization of dialogues in scripts by matching script text with the corresponding subtitles using dynamic programming. Then we estimate the temporal localizations of events by transferring time information from subtitles to scripts. Using this procedure, a set of short video clips (segmented by subtitle timestamps) with corresponding script parts, i.e., textual descriptions, is obtained.

For sports videos, we use web-casting text, which is usually available online and provided freely by almost all broadcasters [33, 9]. Keywords by which events are labeled are first predefined; then the time stamps where events happen are extracted from the well-defined syntax structure of web-casting texts by using the keywords as the input query to a commercial software package, dtSearch [32]. This software is a sophisticated text search engine that can look up word features with fuzzy, phonic, wildcard, stemming and thesaurus search options. Similar to sports videos, the methods of [22, 16] are adopted to find the temporal localizations of events using the closed captions of news videos.

4.2 Event-Relevant Data from Internet

By using text information, we can obtain expert labeled data; however, it is still very difficult to handle text analysis and time alignment to collect enough labeled video data. Moreover, there are many videos without corresponding text information. The Internet is a rich source of information, with many event videos taken under various conditions and roughly annotated. It is convenient to use such a collection of videos as event-relevant data to improve the performance of event detection. In doing so, our work tries to fuse the two lines of research, "Internet vision" and "event detection", to improve event detection performance. We query the event labels on a web video search engine such as YouTube or Google. Based on the assumption that the set of retrieved videos contains relevant videos of the queried event, we can construct a large-scale video dataset that includes videos taken from multiple viewpoints in a range of environments.
The challenge is how to use these videos, because content on the Internet is very diverse, which means the retrieved videos contain much noise. For example, for a "Basketball Shot" query, a search engine is likely to retrieve some introduction videos about basketball shots. Our method must perform well in the presence of such noise. In this work, we adopt multiple-keyword search ("Basketball Shot NBA Kobe Bryant") and propose an efficient method to remove some noise from the dataset in Section 4.3. To distinguish it from the data obtained by text analysis, we refer to this collection as the event-relevant dataset. In our semi-supervised learning algorithm, this dataset is used as unlabeled data.
4.3 Segmentation and De-noising by Bayesian Rule
By text analysis and the Internet, we can obtain small-scale expert labeled data and large-scale event-relevant video data. However, the labeled data only have weakly associated labels, which means we know the video's label, but there is no precise information about the localization of the event in the video. For the event-relevant data, we also do not know the precise localization of the event. To solve this problem, we perform temporal segmentation of video clips and obtain segments composed of contiguous frames. Our target is to jointly segment video clips containing a particular event, that is, we aim at separating what is shared within the video clips (i.e., the particular event) from what is different among them (i.e., the background frames).

Given a video clip v_i containing the event of interest at an unknown position within the clip, the clip v_i is represented by n_i temporally overlapping segments centered at frames 1, ..., n_i and represented by histograms h_i[1], ..., h_i[n_i]. Each histogram captures the l1-normalized frequency counts of quantized space-time interest points and audio features, as described in Section 6.2.

Let v_i^+ denote a positive video clip and v_i^- a negative video clip; v_{ij}^+ is the j-th segment of a positive video clip v_i^+ and v_{ij}^- denotes the j-th segment of a negative video clip v_i^-. Let {v_1^+, v_2^+, ..., v_m^+, v_1^-, v_2^-, ..., v_n^-} denote the set of m positive and n negative training video clips obtained by text analysis. l(v_i) ∈ {+1, -1} is the bag label of v_i and l(v_{ij}) ∈ {+1, -1} is the instance label of v_{ij}. For the negative video clips, all their segments are negative. For the positive video clips, their segments must contain at least one true positive segment, but they may also contain many negative segments due to noise and imprecise localization. The goal of de-noising is to identify the true positive segments in the positive video clips and remove some negative segments. We assume that, given a true positive segment s, the probability that a segment v_{ij} is positive is calculated as follows:

\Pr(l(v_{ij}) = +1 \mid s) = \exp\left(-\frac{\|s - v_{ij}\|^2}{\delta_s^2}\right),   (1)

where \|\cdot\| represents the L2-norm and \delta_s is a parameter learned from the training data. Given a true positive segment s, the probability that a video clip v_i is a positive video clip is defined as follows:

\Pr(l(v_i) = +1 \mid s) = \max_{v_{ij} \in v_i} \Pr(l(v_{ij}) = +1 \mid s) = \max_{v_{ij} \in v_i} \exp\left(-\frac{\|s - v_{ij}\|^2}{\delta_s^2}\right) = \exp\left(-\frac{d^2(s, v_i)}{\delta_s^2}\right),   (2)

where d(s, v_i) = \min_{v_{ij} \in v_i} \|s - v_{ij}\|. In other words, the distance d(s, v_i) between a segment s and all segments of a video clip v_i is simply the distance between s and the nearest segment of v_i. Then \Pr(l(v_i) = +1 \mid s) - \Pr(l(v_i) = -1 \mid s) = 2\exp(-d^2(s, v_i)/\delta_s^2) - 1. If \Pr(l(v_i) = +1 \mid s) \ge \Pr(l(v_i) = -1 \mid s), we get d(s, v_i) \le \delta_s \sqrt{\ln 2}. For a negative segment (i.e., a false positive segment), however, its distances to the positive and negative video clips do not exhibit the same distribution as those from s. Since some positive video clips may also contain negative segments, just like the negative video clips, the distances from a negative segment to the positive video clips may be as random as those to the negative video clips. This distributional difference provides an informative hint for identifying the true positive segments. Therefore, given a true positive segment s, there exists a threshold \theta_s which allows the decision function defined in Eq.(3) to label the video clips according to the Bayes decision rule:

h_s^{\theta_s}(v_i) = \begin{cases} +1 & \text{if } d(s, v_i) \le \theta_s \\ -1 & \text{otherwise,} \end{cases}   (3)

where \theta_s = \delta_s \sqrt{\ln 2} is determined from the training data as follows:

P(s) = \max_{\theta_s} P_s(\theta_s),   (4)

where P_s(\theta_s) is an empirical precision defined as follows:

P_s(\theta_s) = \frac{1}{m+n} \sum_{i=1}^{m+n} \frac{1 + h_s^{\theta_s}(v_i)\, l(v_i)}{2}.   (5)

In this way, for each segment from the labeled dataset, we can obtain P(s). Based on this value, we can remove some segments of each video clip. Note that the exact number of true positive segments for one specific positive video clip is unknown. To handle this problem, we propose that, for a video clip, a segment s is selected if P(s) > th1, where th1 is a threshold manually set to 0.5 in our experiments. Based on our experiments, this method solves the problem well. In this way, we can remove irrelevant segments and video clips obtained by text analysis and construct an expert labeled dataset. For the data obtained by web video search, we also de-noise them by using the expert labeled data. Because data obtained from the web contain more noise than data obtained by text analysis, we set th1 to 0.8 to obtain much cleaner data. Moreover, we adopt another strategy to confirm the reliability of the selected data from the web using the classifiers introduced in Section 5.3. We obtain segments for each video clip; each segment s is then classified using all classifiers trained with labeled data and receives a mean score Sc_s. If Sc_s > th2, this segment is selected and the video clip is viewed as event-relevant data. If the scores of all segments of a video clip are below th2, the video clip is not selected. In our experiments, th2 is manually set to 0.7. After de-noising with these two strategies, we collect a large-scale, much cleaner event-relevant dataset from the web, and these data are used as unlabeled data to provide distributional information.
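To make the de-noising procedure concrete, the following is a minimal sketch of the Bayes-rule selection described by Eqs.(1)-(5); the segment features, the candidate threshold grid and the default th1 value are illustrative assumptions rather than the authors' exact implementation.

```python
import numpy as np

def clip_distance(s, clip):
    """d(s, v_i): distance from segment s to the nearest segment of clip v_i (Eq. 2)."""
    return min(np.linalg.norm(s - seg) for seg in clip)

def empirical_precision(s, clips, labels, theta):
    """P_s(theta): agreement between the Bayes decision h_s and the clip labels (Eqs. 3 and 5)."""
    h = np.array([1 if clip_distance(s, c) <= theta else -1 for c in clips])
    return np.mean((1 + h * np.array(labels)) / 2.0)

def segment_score(s, clips, labels, theta_grid):
    """P(s): best empirical precision over candidate thresholds theta_s (Eq. 4)."""
    return max(empirical_precision(s, clips, labels, t) for t in theta_grid)

def denoise_positive_clip(clip, clips, labels, theta_grid, th1=0.5):
    """Keep the segments of a positive clip whose score exceeds th1 (Section 4.3)."""
    return [seg for seg in clip if segment_score(seg, clips, labels, theta_grid) > th1]
```

Here `clips` is a list of clips, each a list of segment feature vectors, and `labels` holds the corresponding bag labels in {+1, -1}.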
5. GRAPH-BASED SEMI-SUPERVISED MULTIPLE INSTANCE LEARNING (GSSMIL)
In this section, we introduce the GSSMIL algorithm, which combines labeled data (as introduced in Section 4.1) and unlabeled data (as described in Section 4.2) and adopts multiple instance learning to detect the positive event instances within the event bags: an event with precise localization in a video clip is considered a positive event instance, an event with imprecise localization a negative event instance, and the corresponding video clip is viewed as an event bag. Next, we describe the problem in Section 5.1; how to construct the graph and how to learn the similarity measure for the graph are presented in Sections 5.2 and 5.3, respectively. Finally, we introduce how to solve the objective function in Section 5.4.
5.1 Problem Description
After the video collection in Section 4, each segment of a video clip is viewed as an instance (event instance), and all segments of the video clip comprise a bag (event bag). We use the following notation throughout this paper. Let L = {(x_1, y_1), ..., (x_{|L|}, y_{|L|})} be the labeled data and let U = {x_{|L|+1}, ..., x_{|L|+|U|}} be the unlabeled data. Each bag x_b is a set of instances {x_{b,1}, x_{b,2}, ..., x_{b,n_b}}, with its label denoted by y_b ∈ {0, +1}, where +1 is used for a positive bag and 0 for a negative bag; b is the index of the bag and n_b is the number of instances in bag x_b. Each x_{b,j} ∈ R^d is a d-dimensional feature vector representing an instance. Without loss of generality, we assume that the first L_1 bags are positive and the following L_2 bags are negative (|L| = L_1 + L_2). To describe the relationship between bags and instances, we use x_{b,j} ∈ x_b to represent that x_{b,j} is an instance from bag x_b, and its label is represented as y_{b,j}.

Our task is to learn a soft label function \hat{f}: R^d → [0, +1] that predicts the label of each instance; we denote the predicted soft label of instance j by f_j = \hat{f}(x_{b,j}). Then the labels of the bags can be calculated. We define a bag x_b's label \hat{f}_b to be determined by the largest value of its instances' soft labels:

\hat{f}_b = \max_{j: x_{b,j} \in x_b} f_j.   (6)

5.2 Graph Construction

In this section, we formulate the graph-based semi-supervised multiple instance learning at the instance level and define the cost criterion based on instance labels. Consider a graph G = (V, E) with nodes corresponding to the N feature vectors and an edge for every pair of nodes. We assume that there is an N × N symmetric weight matrix W = [w_{ij}] on the edges of the graph, where N is the number of all instances. The weight of each edge indicates the similarity between the two nodes it connects. Intuitively, similar unlabeled samples should have similar labels. Thus, label propagation can be formulated as minimizing the quadratic energy function [37]:

E_1(f) = \frac{1}{2} \sum_{i,j} w_{ij} \left( \frac{f_i}{\sqrt{d_i}} - \frac{f_j}{\sqrt{d_j}} \right)^2,   (7)

where d_i is the sum of the i-th row of W and D = diag(d_1, ..., d_N). f_i is the label of instance i and should be nonnegative. Assuming instance i is the i-th instance of event bag b, f_i can be denoted as f_i = [f_{bi}^1, ..., f_{bi}^c, ..., f_{bi}^C]. We can transfer prior knowledge from a labeled bag to its instance labels: f_{bi}^c must be 0 if the bag x_b does not contain label c. In this way, bag label information is applied. Eq.(7) only controls the complexity in the intrinsic geometry of the data distribution and the smoothness of labels over the instance-level graph. For our problem, we also need to consider constraints based on the labeled bags. For a negative bag, it is straightforward to see that all instances in the bag are negative, i.e., f_j = 0 for all j: x_{b,j} ∈ x_b. Thus we have the penalty term:

E_2(f) = \sum_{b=1+L_1}^{|L|} \sum_{j: x_{b,j} \in x_b} f_j.   (8)

Meanwhile, for a positive bag, the case is more complex because a positive bag may contain negative instances as well. Actually, only one positive instance is necessary to determine a positive bag. Thus, we define the penalty term for a positive bag to be related only to the instance with the largest soft label:

E_3(f) = \sum_{b=1}^{L_1} \left( 1 - \max_{j: x_{b,j} \in x_b} f_j \right).   (9)

By combining the three terms, we have the following cost criterion:

E(f) = \frac{1}{2} \sum_{i,j} w_{ij} \left( \frac{f_i}{\sqrt{d_i}} - \frac{f_j}{\sqrt{d_j}} \right)^2 + \alpha_1 \sum_{b=1+L_1}^{|L|} \sum_{j: x_{b,j} \in x_b} f_j + \alpha_2 \sum_{b=1}^{L_1} \left( 1 - \max_{j: x_{b,j} \in x_b} f_j \right),   (10)

where \alpha_1 and \alpha_2 are two parameters used to balance the weights. In our experiments, we set \alpha_1 = \alpha_2 = 10 and obtained good performance. Once we have found the optimal labels of the instances by minimizing the cost criterion E(f), the bag-level label of any bag x_b can be calculated by taking the maximum value of its instances' labels using Eq.(6). The only two remaining problems are how to obtain an effective similarity measure W = [w_{ij}] and how to solve the optimization task in Eq.(10). For the first problem, we propose a multiple instance learning based method to learn the similarity measure W = [w_{ij}], introduced in Section 5.3. Due to the max(·) function in the loss of Eq.(10) for positive bags, E(f) is generally non-convex and cannot be directly optimized. In Section 5.4, we derive a sub-optimum solution to this problem.
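As a rough illustration of the cost criterion in Eq.(10), the sketch below evaluates E(f) for given soft instance labels in the binary, single-event case; the normalized smoothness term and the two bag penalties follow the equations above, while the α values and the binary simplification are illustrative assumptions.

```python
import numpy as np

def gssmil_cost(f, W, pos_bags, neg_bags, alpha1=10.0, alpha2=10.0):
    """Evaluate the GSSMIL cost E(f) of Eq.(10) for soft instance labels f in [0, 1].

    W        : N x N symmetric similarity matrix (the graph weights w_ij).
    pos_bags : list of index arrays, one per positive bag (the first L1 bags).
    neg_bags : list of index arrays, one per negative bag.
    """
    d = W.sum(axis=1)                                            # degrees d_i
    g = f / np.sqrt(d)                                           # f_i / sqrt(d_i)
    smooth = 0.5 * np.sum(W * (g[:, None] - g[None, :]) ** 2)    # E_1(f), Eq.(7)
    e2 = sum(f[idx].sum() for idx in neg_bags)                   # E_2(f), Eq.(8)
    e3 = sum(1.0 - f[idx].max() for idx in pos_bags)             # E_3(f), Eq.(9)
    return smooth + alpha1 * e2 + alpha2 * e3
```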
5.3 Multiple Instance Learning Induced Similarity (MILIS) Measure
One main drawback of most existing similarity measures, such as the Euclidean distance and the Gaussian Kernel Similarity measure, is that the measurement completely ignores the class structure. For example, in Fig. 2, given an event instance s belonging to category c, it is possible that some event instances from the same class as s are less similar to it than instances from other classes when a predefined, heuristic distance metric is adopted. To tackle this problem, we attempt a discriminative solution that obtains the true similarity by learning classifiers. Here, we formulate similarity measure learning as a Multiple Instance Learning (MIL) problem [10] and employ mi-SVM [2] to solve it. Next, we introduce how to train a classifier for an event instance s from category c; this training process is repeated for all classifiers of the different kinds of event instances.

[Figure 2: Learning similar event instances to event instance s. (a) Event instance s, which is one instance of a kissing event bag. (b) Learning a classifier to describe the similarities of other event instances to s, separating similar event instances of the same kind of event as s from unrelated event instances.]

For the event instance s, denoted by the feature vector I_s, its classifier is trained in the hyper-sphere centered at I_s with radius r_s in the feature space (as shown in Fig. 2). The training samples are the samples in class c, denoted as positive bags, and those in other categories, denoted as negative ones. This strategy filters out the instances that are very different from s in each bag and enables the classifier to be learned only in the local feature space. Therefore, it is very efficient in reducing the computational burden and learning a discriminative classifier. Define the distance from event instance I_s to event bag x_b as d_{b,j,s} = \min_j \|I_s - x_{b,j}\|, where \|\cdot\| represents the L2-norm. In practice, this is found to be quite robust, and in the majority of cases the positive instance in the positive bag x_b is x_{b,j^*} with j^* = \arg\min_j \|x_{b,j} - I_s\|. Based on this observation, r_s is set as follows: r_s = \mathrm{mean}_{b \in pos}(d_{b,j,s}) + \beta \times \mathrm{std}_{b \in pos}(d_{b,j,s}), where \beta is a trade-off between efficiency and accuracy. The larger \beta, the lower the probability that a positive instance will be filtered out before training, and the more event bags will be involved in solving the mi-SVM. In our experiments, \beta is manually set to 2.5 by experience. It is observed in experiments that too many event instances may fall into the hyper-sphere. To be more efficient, within the hyper-sphere, at most k_p nearest event instances to I_s are selected for each positive event bag, and k_n for each negative event bag. In our experiments, k_p = 5 and k_n = 2 are used by experience. Experiments show that this strategy can significantly reduce the computational burden.

Denote by y_{b,j} the instance label of event instance x_{b,j} and by y_b the label of event bag x_b, where x_{b,j} is the feature of the event instance j in the event bag x_b. mi-SVM is formulated as follows:

\min_{\{y_{b,j}\}} \min_{w, b_0, \xi} \frac{1}{2}\|w\|^2 + C \sum_{b,j} \xi_{x_{b,j}}
\text{s.t.} \sum_j \frac{y_{b,j} + 1}{2} \ge 1, \ \forall x_b \ \text{s.t.}\ y_b = 1,
\quad y_{b,j} = -1, \ \forall x_b \ \text{s.t.}\ y_b = -1,
\quad \forall j: y_{b,j}(\langle w, x_{b,j} \rangle + b_0) \ge 1 - \xi_{x_{b,j}}, \ \xi_{x_{b,j}} \ge 0, \ y_{b,j} \in \{-1, 1\}.   (11)

Denote by mi-SVM_s the trained classifier corresponding to event instance s. Based on this classifier, all event instances can be projected to a real value. For simplicity, the projection function is defined as follows:

g_s(x_{b,j}) = \begin{cases} mi\text{-}SVM_s(x_{b,j}) & \forall x_{b,j} \ \text{s.t.}\ \|x_{b,j} - I_s\| \le r_s \\ 0 & \text{otherwise,} \end{cases}   (12)

where mi-SVM_s(x_{b,j}) ∈ R is the output of the classifier mi-SVM_s for the input x_{b,j}. Based on this score, the similarity between instance s and x_{b,j} can be simply defined as follows:

w_{sj} = \begin{cases} g_s(x_{b,j}) & \text{if } g_s(x_{b,j}) > 0 \\ 0 & \text{otherwise.} \end{cases}   (13)

Therefore, for each event instance s in the labeled dataset, we can obtain its corresponding classifier and its similarities to the other event instances. Although some instances in positive event bags may still be negative instances after de-noising, our experimental results show that it is still effective to learn the similarity measure in this way. Based on the L labeled data and U unlabeled data, the similarity measure W can be split into labeled and unlabeled sub-matrices:

W = \begin{pmatrix} W_{LL} & W_{LU} \\ W_{UL} & W_{UU} \end{pmatrix}, \quad \text{where } W_{LU} = W_{UL}^T.

W_{LL} and W_{LU} can be obtained by Eq.(13) using the learned classifiers. For the unlabeled data, we adopt the Euclidean distance to measure the similarity W_{UU} between data points.
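The following sketch illustrates the MILIS idea of Eqs.(12)-(13) with a plain linear SVM standing in for mi-SVM [2]; the sklearn classifier, the treatment of bags as flat instance sets and the radius rule are simplifying assumptions rather than the authors' exact training procedure.

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_local_classifier(I_s, pos_instances, neg_instances, beta=2.5):
    """Train a classifier for event instance I_s inside a hyper-sphere of radius r_s."""
    d_pos = np.array([np.linalg.norm(I_s - x) for x in pos_instances])
    r_s = d_pos.mean() + beta * d_pos.std()          # r_s = mean + beta * std (Section 5.3)
    X = np.vstack([pos_instances, neg_instances])
    y = np.hstack([np.ones(len(pos_instances)), -np.ones(len(neg_instances))])
    keep = np.array([np.linalg.norm(I_s - x) <= r_s for x in X])  # local feature space only
    clf = LinearSVC().fit(X[keep], y[keep])          # assumes both classes survive the filtering
    return clf, r_s

def milis_similarity(clf, r_s, I_s, x):
    """w_sj = g_s(x) if the classifier score is positive, else 0 (Eqs. 12-13)."""
    if np.linalg.norm(x - I_s) > r_s:
        return 0.0
    score = clf.decision_function(x.reshape(1, -1))[0]
    return float(score) if score > 0 else 0.0
```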
5.4 Iterative Solution Using CCCP

Because E_3(f) defined by Eq.(9) is non-convex, E(f) can be viewed as a convex function plus a concave function. Therefore, we adopt the constrained concave-convex procedure (CCCP) to find a sub-optimum solution. CCCP was proposed in [29] as an extension of [34] and is theoretically guaranteed to converge. It works in an iterative way: at each iteration, the first-order Taylor expansion is used to approximate the non-convex functions, and the problem is thus approximated by a convex optimization problem. The sub-optimum solution is given by iteratively optimizing the convex subproblem until convergence.

Note that max(·) is not differentiable at all points. To use CCCP, we have to replace the gradients by subgradients. Let l = [f_{b1}^c, ..., f_{bj}^c, ..., f_{bn_b}^c]^T, where f_{bj}^c denotes the probability that the j-th instance in the b-th bag belongs to the c-th class, c ∈ {1, ..., C}, C is the total number of classes and n_b is the number of instances in the b-th bag. We pick the subgradient ρ, an n_b × 1 vector whose j-th element is given by

\rho_j = \begin{cases} \frac{1}{\tau} & \text{if } l_j^{(t)} = \max(l_b^{(t)}) \\ 0 & \text{otherwise,} \end{cases}   (14)

where max(l_b^{(t)}) represents the largest label value of bag b and τ is the number of instances with the label value max(l_b^{(t)}). At the (t+1)-th iteration, we estimate the current l based on l^{(t)} and the corresponding ρ_j. As ρ^T l^{(t)} = Σ_j ρ_j l_j^{(t)} = max(l^{(t)}) Σ_{ρ_j ≠ 0} ρ_j = max(l^{(t)}), the first-order Taylor expansion of the function max(l) at l^{(t)} is approximated as max(l)|_{l^{(t)}} ≈ ρ^T l. For the t-th iteration of CCCP, the objective function in Eq.(10) is rewritten in matrix form as follows:

\min_F \; Tr(F^T L F) + \alpha_1 \sum_b \sum_c (1 - Y_{bc})\, h_c F^T q_b + \alpha_2 \sum_b \sum_c Y_{bc} (1 - h_c \beta U_b F h_c^T)
\text{s.t.} \; F \ge 0, \; F e_1 = e_2,   (15)

where L is the Laplacian matrix L = D - W, with D the degree matrix, Tr(·) is the matrix trace operator, F = [f_{11}, ..., f_{1n_1}, ..., f_{B1}, ..., f_{Bn_B}]^T and F ∈ R^{N×C}. B is the number of all bags and N = Σ_{b=1}^{B} n_b is the number of all instances.
Each row of F corresponds to the posterior probability distribution of an instance and hence should be (1) positive and (2) l1-normalized; therefore, the constraint in Eq.(15) is necessary. Y = [Y_{bc}], where Y_{bc} = 1 if bag b belongs to the c-th class and Y_{bc} = 0 otherwise. h_c is a 1 × C indicator vector whose c-th element is one and whose other elements are zero, and q_b = [0, ..., 0, 1, ..., 1, 0, ..., 0]^T is an N × 1 vector whose elements are all zero except those corresponding to the b-th bag. β is a C × N matrix, β = [β_1^T, ..., β_b^T, ..., β_B^T], where each β_b = [β_{b1}, ..., β_{bc}, ..., β_{bC}]^T is a C × n_b matrix corresponding to bag b and β_{bc} = η^T. e_1 = 1_{C×1} and e_2 = 1_{N×1} are all-one vectors, and α_1 and α_2 are the two balancing parameters. U_b = diag(u_1, ..., u_b, ..., u_B) is an N × N diagonal block matrix, where u_k = 0_{n_k × n_k} for k = 1, ..., b-1, b+1, ..., B and u_b = I_{n_b × n_b}, with I an identity matrix.

The subproblem in Eq.(15) is a standard quadratic programming (QP) [6] problem and can be solved by any state-of-the-art QP solver. In our work, it is solved efficiently to a global optimum using existing convex optimization packages such as Mosek [1]. Running CCCP iteratively until convergence, we obtain the sub-optimum solution for the instance labels. The label of each bag is then calculated as the largest label of all its instances using Eq.(6).

Out-of-sample Extension: For a new testing instance t, its label is given as:

f_t = \frac{\sum_j w(j, t)\, f_j / \sqrt{d_j}}{\sqrt{d_t}\, \sum_j w(j, t)},   (16)

where w(j, t) represents the similarity between instance j and t, and d_j and d_t have the same meaning as in Eq.(7). d_t is an unknown constant for a particular testing instance t, and we can ignore it when making the decision. f_j is the instance label obtained from Eq.(10). Given a testing bag with multiple instances, we use Eq.(16) to obtain the instance labels and then apply Eq.(6) to classify the bag. Because we obtain a label for each instance of a bag, the localization of the event can also be obtained. Here, Gaussian kernel based temporal filtering is conducted to smooth the event instances of a video stream, taking account of the temporal consistency of events.
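A minimal sketch of the out-of-sample rule in Eq.(16) follows, dropping the constant sqrt(d_t) as the text suggests; the inputs are assumed to be the learned instance labels f and the similarities w(j, t) to the new instance.

```python
import numpy as np

def out_of_sample_label(f, d, w_t):
    """Predict the soft label of a new instance t from Eq.(16), ignoring the sqrt(d_t) constant.

    f   : labels of the training instances obtained by minimizing Eq.(10).
    d   : degrees d_j of the training instances in the graph.
    w_t : similarities w(j, t) between every training instance j and the new instance t.
    """
    return np.sum(w_t * f / np.sqrt(d)) / np.sum(w_t)

def bag_label_from_instances(instance_labels):
    """Classify a testing bag by the largest instance label, as in Eq.(6)."""
    return np.max(instance_labels)
```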
6. EXPERIMENTAL RESULTS

In this section, we present extensive experimental results on the movie, sports and news video datasets in order to validate the proposed approach.

6.1 Dataset Introduction

Because there is little work handling different kinds of video, no generic public dataset is available. Using the method introduced in Section 4, we obtain three different video datasets. For movie videos, we select 6 different kinds of representative events: AnswerPhone, DriveCar, Kissing, FightPerson, Run and Applauding. 981 unlabeled clips are obtained from the web and 311 expert labeled clips are obtained by text-video alignment; the dataset comes from about 60 hours of video. For sports videos, we select 6 different kinds of events: Basketball Foul (B_Foul), Basketball Free Throw (B_FreeThrow), Basketball Run (B_Run), Basketball Shot (B_Shot), Soccer Foul (S_Foul) and Soccer Shoot (S_Shoot). 913 unlabeled clips are obtained from the web and 316 expert labeled clips are obtained by text-video alignment, from about 25 hours of video. For news videos, we select 4 kinds of events: Eating, PlayingInstrument, Demonstration and Dancing. 863 unlabeled clips are obtained from the web and 311 expert labeled clips are collected by text-video alignment from about 20 hours of video. The numbers of test clips are 253, 211 and 203 for the three datasets, respectively. Note that the three kinds of videos are very challenging for video analysis due to their loose and dynamic structure, as shown in Fig. 3.

[Figure 3: Realistic samples for the three different video types (movie, sports and news) and their corresponding events. All samples have been automatically retrieved by text analysis.]

6.2 Video Feature Extraction

Visual features and audio features are complementary and both important for video event detection. For example, Basketball Free Throw and Basketball Shot are very similar using only visual features, whereas audio features, such as the referee's whistle, are discriminative. Conversely, visual features are very important to distinguish dancing from playing a guitar, because the two have similar background music but different motion features. In the following subsections, we introduce the two kinds of features, respectively.

6.2.1 Spatio-temporal Features

Sparse space-time features have recently shown good performance for video analysis [11]. They provide a compact video representation and tolerance to background clutter, occlusions and scale changes. We detect interest points using a space-time extension of the Harris operator. To characterize the motion and appearance of local features, we compute histogram descriptors of space-time volumes in the neighborhood of the detected points. The 3D video patch around each detected space-time interest point (STIP) is partitioned into a grid with 3×3×2 spatio-temporal blocks; 4-bin HOG and 5-bin HOF descriptors are then computed for all blocks and concatenated into 72-element and 90-element descriptors, respectively. The details can be found in [23]. For each volume we compute coarse histograms of oriented gradient (HoG)
and optical flow (HoF). Normalized histograms are concatenated into HoG and HoF descriptor vectors, which are similar in spirit to the well-known SIFT descriptor. HoF is based on local histograms of optical flow; it describes the motion in a local region. HoG is a 3D histogram of 2D (spatial) gradient orientations; it describes the static appearance over space and time. The two descriptors are then concatenated into one 162-dimensional vector, which is reduced by PCA to 60 dimensions.
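As a rough sketch of the bag-of-features step used on top of these local descriptors (see Section 6.2.3), the code below builds a k-means vocabulary and turns the descriptors of one segment into a normalized histogram; the vocabulary sizes follow the text, while the sklearn implementation and everything else are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(descriptors, k=400, seed=0):
    """Cluster a subset of training descriptors into k visual words (k=400 visual, k=100 audio)."""
    return KMeans(n_clusters=k, random_state=seed, n_init=10).fit(descriptors)

def bof_histogram(vocab, segment_descriptors):
    """Assign each descriptor to its nearest word and return an l1-normalized histogram."""
    words = vocab.predict(segment_descriptors)
    hist = np.bincount(words, minlength=vocab.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)

# A segment is then represented by concatenating its visual and audio histograms (400 + 100 = 500 dims).
```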
6.2.2 Audio Features

Mel-frequency cepstral coefficients (MFCCs) [25] have proved efficient [38] for audio recognition; therefore, we adopt MFCCs to represent audio. MFCCs are based on a short-term spectrum, where audio signals are decomposed on a Fourier basis into a superposition of a finite number of sinusoids. The power spectrum bins are grouped and smoothed according to the perceptually motivated Mel-frequency scaling. The spectrum is then segmented by means of a filter bank that typically consists of overlapping triangular filters. Finally, a discrete cosine transform applied to the logarithm of the filter bank outputs yields vectors of decorrelated MFCC features. In our experiments, we use 13-dimensional MFCC features.

6.2.3 Bag of Features

For the two different modality features, we build bag-of-features (BoF) representations separately, which requires the construction of a vocabulary. In our experiments, we cluster a subset of 400k features sampled from the training videos with the k-means algorithm for the visual features. The number of clusters is set to k = 400 for visual features and to k = 100 for audio features (clustered from a subset of 100k features sampled from the training data), which has empirically been shown to give good results. The BoF representation then assigns each feature to the closest vocabulary word (using the Euclidean distance) and computes the histogram of word occurrences over the space-time volume corresponding to the segments obtained by over-segmentation of the entire video clip. For each segment, the two histograms are then concatenated into one 500-dimensional vector and normalized.

6.3 Recognition Evaluation of Different Similarity Measurements

Table 1: Comparison of the average accuracy of different similarity measures on the three datasets.

Dataset    KNN       GKS       SIS       MILIS
Movie      39.76%    42.82%    44.23%    51.27%
News       47.48%    50.10%    55.79%    63.87%
Sports     43.35%    45.58%    49.41%    57.71%

[Figure 4: The accuracy of different similarity measures on (a) the movie video dataset, (b) the sports video dataset and (c) the news video dataset.]

This experiment compares the proposed MILIS measure with three similarity measures, GKS, the sparsity induced similarity measure (SIS) [7], and KNN, on the three datasets. We compare our approach with these three methods because they are popularly and extensively used to measure similarity. For GKS, we use d_{ij} = exp(-\|p_i - p_j\|^2 / \delta^2) to measure similarity, and the variance \delta is set to 1.5, 1.5 and 1.2, which achieved the best performance on the three datasets, respectively. For KNN, we use the inner-product similarity to find the K nearest neighbors, where the number of nearest neighbors K is tuned by cross-validation; we found that 30, 30 and 20 work best on the three datasets, respectively. The similarity values between a sample and its K nearest neighbors are their correlation coefficients, while those between the sample and the remaining points are set to 0. As for SIS, we normalize all feature vectors so that their L2 norms are 1 before computing the weight matrix.

Fig. 4 shows the propagation accuracies of the four different similarity measures: KNN, GKS, SIS, and MILIS. The x-axis lists the categories of different events. The y-axis is the label propagation accuracy for individual classes, plus the mean accuracy over all categories, denoted as "ALL". We can see that GKS (labeled 'GKS') works better than KNN (labeled 'KNN'), SIS (labeled 'SIS') works better than GKS, and MILIS (labeled 'MILIS') works best. From Fig. 4, we notice that the mean accuracies over all concepts of our method outperform the other methods on all three datasets. The average precision values of the different similarity measures on the three datasets are shown in Table 1; our method gives an improvement of at least 7%.

Fig. 4 also shows that there are some classes where our approach does not outperform the other approaches, such as AnswerPhone, Soccer Foul (S_Foul) and Dancing. To explain the reason, we give an example on the news video dataset, where two similarity measures (GKS and KNN) outperform our method (MILIS) for "Dancing" recognition. This is because all four classes are very prone to being classified as "Dancing" using GKS and KNN with our video features, which leads to high performance for "Dancing" recognition together with a high false alarm rate. However, our proposed similarity measure considers the class structure, which improves the discrimination of the feature points and reduces the false alarm rate. Based on the mean accuracies over all concepts on the three datasets, we can confirm that our approach obtains the best performance among the compared methods. This result is expected, because the existing similarity measures such as KNN, SIS and GKS completely ignore the class structure, whereas our approach considers the distribution of feature points in the feature space and adopts a classifier to improve the similarity measure using discriminative class information.
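For reference, here is a small sketch of the two baseline weight assignments compared above (GKS and KNN); the δ and K values mirror the ones reported in the text, while the implementation details (e.g. using the self-point exclusion below) are assumptions.

```python
import numpy as np

def gks_weights(X, delta=1.5):
    """Gaussian Kernel Similarity: w_ij = exp(-||x_i - x_j||^2 / delta^2)."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / delta ** 2)

def knn_weights(X, K=30):
    """KNN weights: correlation coefficients kept only for the K nearest neighbors (inner product)."""
    inner = X @ X.T
    corr = np.corrcoef(X)
    W = np.zeros_like(corr)
    for i in range(X.shape[0]):
        nn = np.argsort(-inner[i])[1:K + 1]   # skip the point itself, keep the K nearest neighbors
        W[i, nn] = corr[i, nn]
    return W
```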
6.4 Recognition Evaluation of Different Learning Strategies

Table 2: Comparison of different learning strategies on the three datasets.

Dataset    SVM       mi-SVM    GSSMIL
Movie      44.99%    47.85%    51.27%
News       53.23%    58.51%    63.87%
Sports     51.30%    53.31%    57.71%

Three experiments with different learning strategies are performed to validate the effectiveness of our method. First, we do not formulate event detection as a multiple instance learning problem; instead, we use the entire video as a training sample and train an SVM classifier. The results are shown in Fig. 6 (labeled 'SVM'). In the second experiment, we do not use the unlabeled data from the Internet and only employ the expert labeled data; the results are shown in Fig. 6 (labeled 'mi-SVM'). The last experiment is our 'GSSMIL' method, which formulates event detection as a multiple instance learning problem and combines labeled and unlabeled data under a semi-supervised framework. The average precisions of the three learning strategies on the three datasets are shown in Table 2. From Table 2, we can see that the mean accuracies over all concepts of our approach outperform the other methods on the three datasets, with an improvement of about 4%. Fig. 6 also shows that certain results are worse for certain types of content, for example AnswerPhone; this is because the other five classes are easily classified as this type of event, which leads to a high false alarm rate using 'SVM'. From the mean accuracies over all concepts, we can see that our GSSMIL performs better than the other methods. The reasons can be summarized as follows: (1) The entire video contains not only the events of interest but also clutter and noise, which harms classifier training and results in bad performance; the improved performance of GSSMIL clearly illustrates the importance of temporal event localization in the training data, and it is very suitable to formulate event analysis as a multiple instance learning problem. (2) By incorporating large-scale unlabeled data, the GSSMIL algorithm effectively mines useful information.

[Figure 6: Comparison of different learning strategies on (a) the movie video dataset, (b) the sports video dataset and (c) the news video dataset. The result of using an entire video as a training sample without segmentation is shown as 'SVM'; the result of using only labeled training data without unlabeled data is denoted as 'mi-SVM'; the result of our method is labeled 'GSSMIL'.]

6.5 Event Localization Evaluation in Video Sequence

In this section, we apply the GSSMIL algorithm described above to temporally localize event boundaries on the three video datasets. The quality of the segmentation is evaluated in terms of localization accuracy. The GSSMIL algorithm is evaluated on a set of 117, 114 and 105 events for the movie, news and sports videos, respectively. Our testing and training videos do not share the same scenes or actors. For both the training and the test set, the ground truth event boundaries were obtained manually. The temporal localization accuracy is measured by the percentage of clips whose relative temporal overlap with the ground truth event segment is greater than 0.3. This relatively loose threshold of 0.3 is used in order to compensate for the fact that the temporal boundaries of events are somewhat ambiguous and not always accurately defined.

Using this performance measure, we conduct the experiments for videos without text information alignment. For the news video dataset, the results are shown in Table 3. For videos without text information, GSSMIL correctly localizes 102 out of 114 clips, which corresponds to an accuracy of 61.6% (labeled 'GSSMIL'). However, if we use mi-SVM, the precision is only 55.6% (labeled 'mi-SVM'). The results show that the localization performance is improved by the use of unlabeled data. The average precision of event localization is shown in Table 4, which shows that our algorithm can effectively localize event boundaries. Some automatically localized segments are shown in Fig. 5.

Table 3: Comparison of the localization performance on the news dataset.

Event                # Positive Samples    mi-SVM    GSSMIL
Demonstration        31                    51.6%     61.2%
Dancing              28                    53.6%     60.7%
Eating               23                    60.8%     65.2%
PlayingInstrument    32                    56.3%     59.3%

Table 4: Comparison of the average precision of localization on the three datasets.

Dataset    # Positive Samples    mi-SVM    GSSMIL
Movie      117                   43.6%     46.2%
News       114                   55.6%     61.6%
Sports     105                   47.6%     53.3%

[Figure 5: Examples of temporal localization of events by the proposed algorithm on the three datasets. Each row shows example frames from an entire video clip; frames of automatically localized events within the clips are shown in red. The four rows represent "Kissing" and "Applauding" on the movie video dataset, "B_Shot" on the sports video dataset, and "Demonstration" on the news video dataset, respectively.]
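The localization criterion described above can be made concrete with a short sketch; the interval representation and the 0.3 threshold follow the text, while the overlap definition (intersection over union) is an assumption about how "relative temporal overlap" is computed.

```python
def temporal_overlap(pred, gt):
    """Relative temporal overlap between two [start, end] intervals (intersection over union)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def localization_accuracy(predictions, ground_truths, thresh=0.3):
    """Percentage of clips whose predicted event overlaps the ground truth by more than `thresh`."""
    hits = sum(temporal_overlap(p, g) > thresh for p, g in zip(predictions, ground_truths))
    return hits / float(len(ground_truths))
```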
7. CONCLUSIONS
Video event detection is very important for content-based video indexing and retrieval. This paper presents an effective event detection method using the GSSMIL strategy. To tackle the insufficient labeled data problem and alleviate the human labeling effort, text information is mined and serves as labels to assist model training. Besides the expert-level labels, the Internet is also able to provide a huge amount of event-relevant data (used as unlabeled data). Our GSSMIL method exploits both datasets together. To handle the ambiguity of event boundaries in our labeled and unlabeled datasets, the GSSMIL algorithm incorporates a multiple instance learning module. Moreover, the GSSMIL algorithm employs an effective method to describe sample affinity, which is shown to boost the event recognition and localization performance significantly. In the future, we will extend the generic method to broader categories and more video domains. Moreover, we will combine the three datasets and study the performance of our proposed method for detecting all kinds of events.
8. ACKNOWLEDGMENTS
This work was partially supported by the National Natural Science Foundation of China (Project No. 60835002 and 90920303), and National Basic Research Program (973) of China (Project No. 2010CB327905).
9. REFERENCES
[1] The Mosek optimization software.
[2] S. Andrews, I. Tsochantaridis, and T. Hofmann. Support vector machines for multiple-instance learning. In NIPS, 2002.
[3] N. Babaguchi, Y. Kawai, and T. Kitahashi. Event based indexing of broadcasted sports video by intermodal collaboration. IEEE Transactions on Multimedia, 2002.
[4] A. Basharat, A. Gritai, and M. Shah. Learning object motion patterns for anomaly detection and improved object detection. In CVPR, 2008.
[5] M. Belkin and P. Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering. In NIPS, 2002.
[6] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2003.
[7] H. Cheng, Z. Liu, and Z. Liu. Sparsity induced similarity measure for label propagation. In ICCV, 2009.
[8] T. Cour, C. Jordan, E. Miltsakaki, and B. Taskar. Movie/script: Alignment and parsing of video and text transcription. In ECCV, 2008.
[9] M.-S. Dao and N. Babaguchi. Sports event detection using temporal patterns mining and web-casting text. In AREA '08: Proceedings of the 1st ACM Workshop on Analysis and Retrieval of Events/Actions and Workflows in Video Streams, 2008.
[10] T. Dietterich, R. Lathrop, and T. Lozano-Perez. Solving the multiple-instance problem with axis-parallel rectangles. Artificial Intelligence, 1997.
[11] P. Dollar, V. Rabaud, G. Cottrell, and S. Belongie. Behavior recognition via sparse spatio-temporal features. In PETS, 2005.
[12] O. Duchenne, I. Laptev, J. Sivic, F. Bach, and J. Ponce. Automatic annotation of human actions in video. In ICCV, 2009.
[13] A. Ekin, A. M. Tekalp, and R. Mehrotra. Automatic soccer video analysis and summarization. IEEE Trans. on Image Processing, 2003.
[14] M. Everingham, J. Sivic, and A. Zisserman. Hello! My name is... Buffy - automatic naming of characters in TV video. In BMVC, 2006.
[15] M. Fleischman and D. Roy. Grounded language modeling for automatic speech recognition of sports video. In Proceedings of ACL-08: HLT, 2008.
[16] A. G. Hauptmann and M. J. Witbrock. Story segmentation and detection of commercials in broadcast news video. Advances in Digital Libraries, 1998.
[17] The OpenNLP toolbox.
[18] Y. Hu, L. Cao, F. Lv, S. Yan, Y. Gong, and T. S. Huang. Action detection in complex scenes with spatial and temporal ambiguities. In ICCV, 2009.
[19] C. Huang, W. Hsu, and S. Chang. Automatic closed caption alignment based on speech recognition transcripts. Tech. Rep. 007, Columbia University, 2003.
[20] Y. Jia and C. Zhang. Instance-level semi-supervised multiple instance learning. In AAAI'08: Proceedings of the 23rd National Conference on Artificial Intelligence, 2008.
[21] T. Kato, H. Kashima, and M. Sugiyama. Robust label propagation on multiple networks. IEEE TNN, 2009.
[22] J. G. Kim, H. S. Chang, K. Kang, M. Kim, J. Kim, and H. M. Kim. Summarization of news video and its description for content-based access. International Journal of Imaging Systems and Technology, 13:267-274, 2003.
[23] I. Laptev, M. Marszałek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. In CVPR, 2008.
[24] V. Mahadevan, W. Li, V. Bhalodia, and N. Vasconcelos. Anomaly detection in crowded scenes. In CVPR, 2010.
[25] M. Müller. Information Retrieval for Music and Motion. Springer, page 65, 2007.
[26] B. N., K. Y., O. T., and K. T. Personalized abstraction of broadcasted American football video by highlight selection. IEEE Trans. Multimedia, 6:575-586, 2004.
[27] Y. Rui, A. Gupta, and A. Acero. Automatically extracting highlights for TV baseball programs. In Proc. of ACM Multimedia, Los Angeles, 2000.
[28] A. Singh, R. D. Nowak, and X. Zhu. Unlabeled data: now it helps, now it doesn't. In NIPS, 2008.
[29] A. Smola, S. Vishwanathan, and T. Hofmann. Kernel methods for missing variables. In Proc. International Workshop on Artificial Intelligence and Statistics, 2005.
[30] C. Wang, L. Zhang, and H.-J. Zhang. Graph-based multiple-instance learning for object-based image retrieval. In MIR '08: Proceedings of the 1st ACM International Conference on Multimedia Information Retrieval, 2008.
[31] J. Wang, C. Xu, E. Chng, K. Wan, and Q. Tian. Automatic generation of personalized music sports video. In Proc. of ACM International Conference on Multimedia, 2005.
[32] The dtSearch text retrieval engine.
[33] C. Xu, J. Wang, K. Kwan, Y. Li, and L. Duan. Live sports event detection based on broadcast video and web-casting text. In MM'06 Conference Proceedings, 2006.
[34] A. Yuille and A. Rangarajan. The concave-convex procedure. Neural Computation, 15(4):915-936, 2003.
[35] D. Zhang and S. Chang. Event detection in baseball video using superimposed caption recognition. In Proc. of ACM International Conference on Multimedia, 2002.
[36] T. Zhang, H. Lu, and S. Li. Learning semantic scene models by object classification and trajectory clustering. In CVPR, 2009.
[37] X. Zhu. Semi-supervised learning literature survey. Computer Sciences Technical Report 1530, University of Wisconsin-Madison, 2008.
[38] S. B. Davis and P. Mermelstein. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. ASSP, 1980.