Efficient Action Spotting Based on a Spacetime Oriented Structure Representation
Konstantinos G. Derpanis, Mikhail Sizintsev, Kevin Cannons and Richard P. Wildes
Department of Computer Science and Engineering
York University
Toronto, Ontario, Canada
{kosta,sizints,kcannons,wildes}@cse.yorku.ca
Abstract

This paper addresses action spotting, the spatiotemporal detection and localization of human actions in video. A novel compact local descriptor of video dynamics in the context of action spotting is introduced based on visual spacetime oriented energy measurements. This descriptor is efficiently computed directly from raw image intensity data and thereby forgoes the problems typically associated with flow-based features. An important aspect of the descriptor is that it allows for the comparison of the underlying dynamics of two spacetime video segments irrespective of spatial appearance, such as differences induced by clothing, and with robustness to clutter. An associated similarity measure is introduced that admits efficient exhaustive search for an action template across candidate video sequences. Empirical evaluation of the approach on a set of challenging natural videos suggests its efficacy.
1. Introduction

This paper addresses the problem of detecting and localizing spacetime patterns, as represented by a single query video, in a reference video database. Specifically, patterns of current concern are those induced by human actions. This problem is referred to as action spotting. The term "action" refers to a simple dynamic pattern executed by a person over a short duration of time (e.g., walking and hand waving). Potential applications of the presented approach include video indexing and browsing, surveillance, visually-guided interfaces and tracking initialization.

A key challenge in action spotting arises from the fact that the same underlying pattern dynamics can yield very different image intensities due to spatial appearance differences, as with changes in clothing and live action versus animated cartoon content. Another challenge arises in natural imaging conditions where
scene clutter requires the ability to distinguish relevant

Figure 1. Overview of approach to action spotting. (top) A template (query) and search (database) video serve as input; both the template and search videos depict the action of "jumping jacks" taken from the Weizmann action dataset [11]. (middle) Application of spacetime oriented energy filters decomposes the input videos into a distributed representation according to 3D, (x, y, t), spatiotemporal orientation. (bottom) In a sliding window manner, the distribution of oriented energies of the template is compared to the search distribution at corresponding positions to yield a similarity volume. Finally, significant local maxima in the similarity volume are identified.
pattern information from distractions. Clutter can be of two types: Background clutter arises when actions are depicted in front of complicated, possibly dynamic, backdrops; foreground clutter arises when actions are depicted with distractions superimposed, as with dynamic lighting, pseudo-transparency (e.g., walking behind a chain-link fence), temporal aliasing and weather effects (e.g., rain and snow). It is proposed that the choice of representation is key to meeting these challenges: A representation that is invariant to purely spatial pattern allows actions to be recognized independent of actor appearance; a representation that supports fine delineations of spacetime structure makes it possible to tease action information from clutter.
For present purposes, local spatiotemporal orientation is of fundamental descriptive power, as it captures the first-order correlation structure of the data irrespective of its origin (i.e., irrespective of the underlying visual phenomena), even while distinguishing a wide range of image dynamics (e.g., single motion, multiple superimposed motions, temporal flicker). Correspondingly, visual spacetime will be represented according to its local 3D, (x, y, t), orientation structure: Each point of spacetime will be associated with a distribution of measurements indicating the relative presence of a particular set of spatiotemporal orientations. Comparisons in searching are made between these distributions. Figure 1 provides an overview of the approach.
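The distribution-to-distribution search just described can be sketched in outline. The following is a minimal stand-in, not the paper's implementation: the per-point oriented-energy distributions are assumed to be precomputed (here as arbitrary arrays), and histogram intersection is used as a placeholder pointwise similarity measure.

```python
import numpy as np

def match_template(search, template):
    """Slide a template of per-point orientation distributions over a
    search volume and score each placement.

    search   : (T, Y, X, K) array; at each spacetime point, a
               distribution over K orientation channels (sums to 1).
    template : (t, y, x, K) array in the same format.

    Histogram intersection (sum of pointwise minima) stands in for the
    similarity measure; a score of 1.0 means a perfect distribution
    match at every template point.
    """
    T, Y, X, K = search.shape
    t, y, x, _ = template.shape
    sim = np.zeros((T - t + 1, Y - y + 1, X - x + 1))
    for i in range(sim.shape[0]):
        for j in range(sim.shape[1]):
            for k in range(sim.shape[2]):
                window = search[i:i + t, j:j + y, k:k + x]
                sim[i, j, k] = np.minimum(window, template).sum()
    return sim / (t * y * x)   # normalize by template size
```

Significant local maxima in the resulting similarity volume mark candidate action locations; producing the orientation distributions themselves is the subject of the representation developed in the paper.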
A wealth of work has considered the analysis of human actions from visual data [26]. A brief survey of representative approaches follows.
Tracking-based methods begin by tracking body parts and/or joints and classify actions based on features extracted from the motion trajectories (e.g., [28, 19, 1]). General impediments to fully automated operation include tracker initialization and robustness. Consequently, much of this work has been realized with some degree of human intervention.
Other methods have classified actions based on features extracted from 3D spacetime body shapes as represented by contours or silhouettes, with the motivation that such representations are robust to spatial appearance details [3, 11, 15]. This class of approach relies on figure-ground segmentation across spacetime, with the drawback that robust segmentation remains elusive in uncontrolled settings. Further, silhouettes do not provide information on the human body limbs when they are in front of the body (i.e., inside the silhouette) and thus yield ambiguous information.
Recently, spacetime interest points have emerged as a popular means for action classification [22, 8, 17, 16]. Interest points typically are taken as spacetime loci that exhibit variation along all spatiotemporal dimensions and provide a sparse set of descriptors to guide action recognition. Sparsity is appealing as it yields a significant reduction in computational effort; however, interest point detectors often fire erratically on shadows and highlights [14], and along object occluding boundaries, which casts doubt on their applicability to cluttered natural imagery. Additionally, for actions substantially comprised of smooth motion, important information is ignored in favour of a small number of possibly insufficient interest points.
Most closely related to the approach proposed in the present paper are others that have considered dense templates of image-based measurements to represent actions (e.g., optical flow, spatiotemporal gradients and other filter responses selective to both spatial and temporal orientation), typically matched to a video of interest in a sliding window formulation. Chief advantages of this framework include avoidance of problematic localization, tracking and segmentation preprocessing of the input video; however, such approaches can be computationally intensive. Further limitations are tied to the particulars of the image measurement used to define the template.
Optical flow-based methods (e.g., [9, 14, 20]) suffer as dense flow estimates are unreliable where their local single flow assumption does not hold (e.g., along occluding boundaries and in the presence of foreground clutter). Work using spatiotemporal gradients has encapsulated the measurements in the gradient structure tensor [23, 15]. This tack yields a compact way to characterize visual spacetime locally, with template video matches via dimensionality comparisons; however, the compactness also limits its descriptive power: Areas containing two or more orientations in a region are not readily discriminated, as their dimensionality will be the same; further, the presence of foreground clutter in a video of interest will contaminate dimensionality measurements to yield match failures. Finally, methods based on filter responses selective for both spatial and temporal orientation (e.g., [5, 13, 18]) suffer from their inability to generalize across differences in spatial appearance (e.g., different clothing) of the same action.
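To make the dimensionality limitation concrete, the following illustrative sketch (not the formulation of [23, 15]) counts the significant eigenvalues of a 3D gradient structure tensor, which is the "dimensionality" such a representation reduces local spacetime structure to.

```python
import numpy as np

def st_dimensionality(volume, tol=1e-6):
    """Count significant eigenvalues of the gradient structure tensor
    of a (t, y, x) intensity volume. Illustrative sketch only.
    """
    gt, gy, gx = np.gradient(volume.astype(float))
    # Trim the boundary, where np.gradient falls back to one-sided
    # differences that distort the orientation estimates.
    g = np.stack([gx, gy, gt], axis=-1)[1:-1, 1:-1, 1:-1].reshape(-1, 3)
    S = g.T @ g                      # 3x3 structure tensor
    w = np.linalg.eigvalsh(S)
    return int(np.sum(w > tol * w.max()))
```

For example, a pattern translating purely along x, I(x, y, t) = f(x - t), yields one significant eigenvalue, while two superimposed motions, f(x - t) + h(x + t), yield two. A single translating 2D texture also yields two, so the eigenvalue count alone cannot separate these qualitatively different cases, which is the discrimination limit noted above.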
The spacetime features used in the present work derive from filter techniques that capture dynamic aspects of visual spacetime with robustness to purely spatial appearance, as developed previously (e.g., in application to motion estimation [24] and video segmentation [7]). The present work appears to be the first to apply such filtering to action analysis. Other notable applications of similar spacetime oriented energy filters include pattern categorization [27], tracking [4] and spacetime stereo [25].
In the light of previous work, the major contributions of the present paper are as follows. (i) A novel compact local oriented energy feature set is developed for action spotting. This representation supports fine delineations of visual spacetime structure to capture the rich underlying dynamics of an action from a single query video. (ii) An associated computationally efficient similarity measure and search method are proposed that leverage the structure of the representation. The approach does not require preprocessing in the form of person localization, tracking, motion estimation, figure-ground segmentation or learning. (iii) The approach can accommodate variable appearance of the same action, rapid dynamics, multiple actions in the