Pattern Recognition Letters 30 (2009) 827–837

Contents lists available at ScienceDirect: Pattern Recognition Letters. Journal homepage: www.elsevier.com/locate/patrec

Robust human tracking based on multi-cue integration and mean-shift

Hong Liu *, Ze Yu, Hongbin Zha, Yuexian Zou, Lin Zhang
National Lab on Machine Perception, Shenzhen Graduate School, Peking University, Beijing 100871, PR China

Article history: Available online 1 November 2008

Keywords: Mean-Shift; Multi-cue tracking; Adaptive integration

Abstract

Multi-cue integration has been researched extensively for robust visual tracking. Researchers aim to use multiple cues under probabilistic methods such as Particle Filtering and Condensation. On the other hand, color-based Mean-Shift has been shown to be an effective and fast algorithm for tracking color blobs. However, this deterministic searching method suffers when objects have low-saturation color, when backgrounds contain color clutter, and when the object is completely occluded for several frames. This paper integrates multiple cues into the Mean-Shift algorithm to extend the application areas of this fast and robust deterministic searching method. A direct multi-cue-integration method with an occlusion handler is proposed to solve the common problems of color-based deterministic methods. Moreover, motivated by the idea of tuning the weight of each cue adaptively to overcome the rigidity of the direct integration method, an adaptive multi-cue-integration-based Mean-Shift framework is proposed. A novel quality function is introduced to evaluate the reliability of each cue. With the adaptive integration method, the problem of changing appearance caused by object rotation can be solved. Extensive experiments show that this method adapts the weight of each individual cue efficiently: when the tracked color blob becomes invisible due to the rotation of the human body, the color cue is compensated by the motion cue.
When the color blob becomes visible again, the color cue becomes dominant again. Furthermore, the direct-cue-integration method with an occlusion handler is combined with the adaptive integration method to extend the adaptive method to full-occlusion cases.

© 2008 Published by Elsevier B.V. 0167-8655/$ - see front matter. doi:10.1016/j.patrec.2008.10.008

* Corresponding author. Tel.: +86 10 62755569. E-mail address: hongliu@pku.edu.cn (H. Liu).

1. Introduction

Tracking objects in complex environments is a challenging task in the intelligent surveillance field (Haritaoglu et al., 2000; Wren and Pentland, 1997). A good tracking algorithm should work well in various difficult situations, such as varying illumination, background clutter, and occlusion. There are two technical trends in the computer-vision tracking community: one is to develop more inherently robust algorithms, and the other is to employ multiple cues to enhance tracking robustness. To increase the robustness and generality of tracking, various image features must be employed. Every single cue has its own advantages and disadvantages (Tao et al., 2000; Hayman and Eklundh, 2002). For example, the shape cue is suitable for tracking rigid objects that seldom change their shape in video sequences, such as human heads; however, shape-based methods perform poorly when backgrounds are rich in texture and edges. The color feature is widely used in tracking (Vermaak et al., 2002) because it is easy to extract and robust to partial occlusion; unfortunately, it is vulnerable to sudden lighting changes and to backgrounds with similar colors. As a result, using a single cue for tracking is insufficient because of the complexity and time-varying properties of environments. Various complementary features can be combined to obtain more robust tracking results. Our interest is to employ multiple cues under a robust tracking framework.
The tracking problem can be viewed as a state-estimation problem for dynamic systems. From this point of view, algorithms can be divided into two categories. The first category is the probabilistic methods, which view tracking as dynamic state estimation under the Bayesian framework, given that the system model and the measurement model introduce uncertainty (Sherrah and Gong, 2001; Toyama and Horvitz, 2000). Representative methods are the Kalman Filter and its derivatives, and multi-hypothesis tracking algorithms such as Condensation (Isard and Blake, 1998), Particle Filtering (Arulampalam et al., 2002; Nummiaro et al., 2002), and Monte Carlo tracking (Perez et al., 2002). The second category is the deterministic methods, which compare a model with the current frame and find the most promising region. Mean-Shift (Bradski, 1998; Comaniciu et al., 2000, 2003) and Trust Region (Liu and Chen, 2004) are two typical examples. Deterministic methods are hard pressed to handle complete occlusion, since tracking is based on previous tracking results: if the tracked object is lost or completely occluded, deterministic searching fails. However, they are usually more accurate than the probabilistic multi-hypothesis tracking algorithms.

Mean-Shift is a non-parametric method that climbs the density gradient to find the peak of a distribution, and it belongs to the deterministic category. Generally, Mean-Shift converges fast and is robust to small distractors in distributions. It was first applied to color tracking by Bradski (1998) and Comaniciu et al. (2000), respectively. The well-known Continuously Adaptive Mean-Shift (CAMSHIFT) was developed by using color histograms to model the object's color; Bradski and Comaniciu adopted different methods for calculating the color distribution and the kernel scale.
Recent work on Mean-Shift algorithms (Liu and Chen, 2004; Collins, 2003; Zivkovic and Krose, 2004) has mainly focused on the window-scale problem. How to handle the problems caused by background color clutter and complete occlusion has not been addressed in the related literature. In this paper, integrating a motion cue with the color cue is proposed to solve these problems.

Many researchers focus on establishing a multi-cue-integration mechanism under the probabilistic framework, including Dynamic Bayesian Networks (Wang et al., 2004), Monte Carlo methods (Wu and Huang, 2001), and Particle Filters (Spengler and Schiele, 2003). In these methods, multiple cues are tightly coupled with the tracking model and the Bayesian tracking algorithm, which makes them difficult to use in deterministic tracking methods. Another kind of multi-cue integration is pixel-wise integration, in which tracking is considered a pixel-classification problem: whether a pixel belongs to the foreground or the background is determined by all the cues. Every cue has a saliency map, and these maps are combined according to a certain principle. One representative method is the adaptive democratic integration proposed by Triesch and Malsburg (2000), in which each cue votes for the final combined saliency map and the voting-like integration scheme is adaptive. Spengler and Schiele (2003) use this adaptive integration method to integrate cues in human-face tracking. This pixel-wise integration method is suitable for use in deterministic tracking methods.

Up to now, most literature on deterministic searching methods employs only a single color probability distribution. This leaves the tracking results vulnerable to complex conditions, such as similarly colored backgrounds and low-saturation objects.
We try to solve these common problems of color-based deterministic approaches by using multi-cue integration, similar to its use in figure-ground segmentation. "Multi-cue" in our method means combining features of the same object, and the multiple feature cues are used to detect objects automatically. In Mean-Shift tracking, the color cue is easy to compute; however, it may include similarly colored background areas that distract tracking. Moreover, when the tracked color has low saturation, the color blob will soon be lost because of heavy noise. On the other hand, the motion cue gained from background subtraction holds all moving objects, some of which are not tracking targets. Furthermore, motion detection usually cannot obtain a complete and clean silhouette of the moving objects. Combining the motion cue with the color cue eliminates the uninteresting regions in both cues' maps as much as possible. This motivates us to develop a cue-integration method that integrates the motion cue with the color cue.

In summary, there are four reasons to integrate both cues under the Mean-Shift framework. First, integrating the motion and color cues can eliminate noise and uninteresting areas in both cues. Second, motion-detection results can be regarded as a motion probability distribution map and thus be integrated with the color distribution naturally. Third, as Mean-Shift is robust to small distractors, we can employ a preliminary motion-detection algorithm, which reduces the computational complexity. Last, Mean-Shift is a fast mode-seeking algorithm, which saves computational resources for the cue-integration methods.

A deterministic algorithm is vulnerable to full occlusion over a few frames because the present iteration is initialized according to the previous one. Once the tracked object is lost, deterministic methods normally cannot recover when the object reappears.
Based on the color-motion integration mechanism, an occlusion handler is introduced that detects full-occlusion cases and reinitializes the tracking window automatically when the object reappears.

Our work can be summarized as follows. First, the multiple-cue-integration technique is brought into the framework of deterministic searching methods to improve tracking robustness: Mean-Shift is a fast and robust tracking algorithm, inherently well suited to building a real-time tracking system, and cue integration enhances its robustness under various conditions. Second, based on the motion-color integration, an occlusion handler is employed to tackle the full-occlusion problem in deterministic Mean-Shift; experiments show that it handles occlusion reasonably well. Third, we apply the Mean-Shift algorithm with an adaptive cue-integration method and propose a more robust quality function to evaluate cue reliability. With this cue-evaluation mechanism, the method overcomes the rigidity of direct integration. To the best of our knowledge, this is a novel attempt to employ the adaptive integration mechanism under the Mean-Shift framework with a quality function suited to evaluating cue reliability. In principle, when the color cue is reliable and visible, it has a higher weight in the combined probability distribution; otherwise, it is compensated by the motion cue. Last, the direct-cue-integration method with the occlusion handler is integrated with the adaptive cue integration to extend the application areas of the adaptive cue-integration method.

The rest of this paper is organized as follows. Section 2 presents the direct color-motion cue-integration method incorporating the occlusion handler. Section 3 illustrates the strategy of adaptive multi-cue integration.
We also extend the adaptive integration method to full-occlusion cases in that section by integrating the direct-cue-integration-based occlusion handler. Experimental results and conclusions are given in Sections 4 and 5, respectively.

2. Integrating multiple cues

2.1. Deficiency of color-based Mean-Shift

To use Mean-Shift iterations, a probabilistic distribution map indicating the tracked object must first be calculated. A color probabilistic map is calculated by histogram back-projection: first, the histogram of the object's color is calculated and stored in a look-up table. When a new frame comes in, the table is consulted for each pixel's color, and a probability value is assigned to each pixel. Hence, a probabilistic distribution map is obtained, and the Mean-Shift procedure can then find the nearby dominant distribution peak.

In color-based tracking, RGB video images are generally converted into the HSV color space to gain robustness against illumination variations. It is tempting to build the color model from the hue channel only, which would further improve the robustness of the color model against lighting changes. However, this brings in a new problem: when a pixel's saturation is near zero, the RGB channels have similar values and the hue channel is not well defined, or is inaccurate (Swain and Ballard, 1991). Basically, we have

$$s = \begin{cases} 0, & \text{if } \max = 0 \\ \dfrac{\max-\min}{\max}, & \text{otherwise} \end{cases} \qquad (1)$$

$$v = \max \qquad (2)$$

$$h = \begin{cases} 60\,\dfrac{g-b}{\max-\min}, & \text{if } \max = r \text{ and } g \ge b \\ 60\,\dfrac{g-b}{\max-\min}+360, & \text{if } \max = r \text{ and } g < b \\ 60\,\dfrac{b-r}{\max-\min}+120, & \text{if } \max = g \\ 60\,\dfrac{r-g}{\max-\min}+240, & \text{if } \max = b \\ \text{undefined}, & \text{if } \max = \min \end{cases} \qquad (3)$$

To illustrate this case, assume a color vector (R, G, B) satisfying R > G > B > 0. From Eqs. (1)–(3) we obtain

$$V = R, \qquad S = 1-\frac{B}{R}, \qquad H = 60\,\frac{G-B}{RS}. \qquad (4)$$

Suppose only G has changed, to G + ΔG, and H has changed to H′. From Eq. (4) we have

$$\Delta H = H' - H = 60\,\frac{\Delta G}{RS}. \qquad (5)$$

From Eq. (5) it can be seen that small changes in G cause wild swings in the hue value when S → 0. In this case the hue value cannot represent the original RGB color reliably, which results in inaccuracy and noise in the back-projection image. This is the first deficiency of using a single color cue. Moreover, Mean-Shift is robust to small distractors, but if a distractor is larger than the object's color area, the object may be lost when it moves near the similar distractor. This is the second deficiency.

Although increasing the color model's dimensionality and the number of bins can yield a cleaner color back-projection image, there are three reasons why we do not choose this way. First, in some cases a 2D HS histogram or a 3D HSV histogram still cannot produce satisfying results: as shown above, when saturation is low, the hue value is corrupted, and this cannot be solved by increasing the dimensionality of the histogram. Second, it is unclear how many color components to use and into how many bins each component should be discretized; it is difficult to get the right answer under varying circumstances. Third, computational resources are limited: increasing the dimensionality and the number of bins increases the computational complexity, which should be avoided in real-time tracking applications. We therefore seek an alternative way to improve tracking robustness.

In color-based Mean-Shift tracking, it should be noticed that the distractors all come from the background. When the camera is static, the background can be assumed fixed, which can be used as a prior to eliminate noisy areas in the back-projection image. Therefore, motion information can be employed to solve the above deficiencies.

First, the motion cue is calculated according to a background model. We assume that the intensity value I of each pixel follows a Gaussian distribution

$$p(I) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{(I-\mu)^2}{2\sigma^2}\right). \qquad (6)$$

This can be viewed as the background model M_{m,B}. The foreground model M_{m,F} is difficult to calculate directly; however, we can calculate the observation likelihood p_motion(Z_i | M_{m,F}) through M_{m,B}:

$$p_{motion}(Z_i \mid M_{m,F}) = 1 - \frac{1}{\sqrt{2\pi}\,\sigma_i}\exp\!\left(-\frac{(I_i-\mu_i)^2}{2\sigma_i^2}\right). \qquad (7)$$

This represents the likelihood of pixel x_i belonging to the foreground M_{m,F} using only the motion cue.

The background model needs to be updated to deal with illumination changes. The difference image D_i is calculated from the mean values of the background model and the incoming image:

$$D_i = |I_i - \mu_i|, \qquad (8)$$

and D_i is then binarized to get a motion mask B_i according to

$$B_i = \begin{cases} 1, & D_i > l\sigma_i \\ 0, & D_i \le l\sigma_i, \end{cases} \qquad (9)$$

where l is a constant. Then B_i is used to update the background model using the following equations:

$$\mu_i(t+1) = \begin{cases} (1-\alpha)\mu_i(t) + \alpha I_i(t+1), & B_i(t)=1 \\ \mu_i(t), & B_i(t)=0 \end{cases} \qquad (10)$$

$$\sigma_i^2(t+1) = \begin{cases} (1-\alpha)\bigl(\sigma_i^2(t) + (\mu_i(t+1)-\mu_i(t))^2\bigr) + \alpha\bigl(I_i(t+1)-\mu_i(t+1)\bigr)^2, & B_i(t)=1 \\ \sigma_i^2(t), & B_i(t)=0, \end{cases} \qquad (11)$$

where σ_i is the corresponding standard deviation. For each image, let p_m(x_i, t) denote the motion probability of pixel x_i at time t:

$$p_m(x_i, t) = p_{motion}(Z_i \mid M_{m,F}), \qquad (12)$$

where p_m(x_i, t) can be viewed as a distribution representing the probability of motion for each pixel.
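As a minimal sketch, the per-pixel Gaussian background model of Eqs. (6)–(12) might be implemented as follows, with the update applied where the mask fires, following Eqs. (9)–(11) as printed. The function names and the default values of `l` and `alpha` are our own illustrative choices, not from the paper.

```python
import numpy as np

def motion_probability(I, mu, sigma):
    """Eq. (7): per-pixel likelihood of belonging to the moving
    foreground: one minus the background Gaussian density value."""
    return 1.0 - np.exp(-(I - mu) ** 2 / (2.0 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

def motion_mask(I, mu, sigma, l=2.5):
    """Eqs. (8)-(9): binarize the difference image D = |I - mu|."""
    D = np.abs(I - mu)
    return (D > l * sigma).astype(np.uint8)

def update_model(I, mu, var, B, alpha=0.05):
    """Eqs. (10)-(11): selective running-average update of the
    per-pixel mean and variance, applied where B == 1."""
    new_mu = np.where(B == 1, (1 - alpha) * mu + alpha * I, mu)
    new_var = np.where(B == 1,
                       (1 - alpha) * (var + (new_mu - mu) ** 2)
                       + alpha * (I - new_mu) ** 2,
                       var)
    return new_mu, new_var
```

All three routines are vectorized over whole frames, so one call per frame suffices for the motion PDM and the model update.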
2.2. Direct-cue integration

A probabilistic distribution map (PDM) is a monochromatic image whose pixels p_j(x_i, t) satisfy

$$p_j(x_i, t) \propto p_j(Z_i \mid M_{j,F}), \qquad (13)$$

where Z_i is the observation at pixel i, M_{j,F} is the foreground model of cue j, and p_j(Z_i | M_{j,F}) represents the observation likelihood of pixel i given the foreground model M_{j,F} of cue j. The higher a pixel's value in p_j(x_i, t), the higher the likelihood that pixel i belongs to the foreground target.

The direct-cue-integration method takes the minimum pixel value over the PDMs of the different cues. If a pixel's value in a certain cue is 0, the value of the corresponding pixel in the combined PDM p(x_i, t) will also be 0. Since each pixel is examined by all cues, direct integration is very strict: if any cue assigns a pixel a high probability of belonging to the background, regardless of the other cues' results, the pixel will have a probability lower than 0.5 in the combined PDM. Therefore, this integration method can eliminate the scattered noise present in the PDM of a single cue. With c cues, the combined PDM is

$$p(x_i, t) = \min_j\, p_j(x_i, t), \qquad j = 1,\ldots,c. \qquad (14)$$

The color observation likelihood p_color(Z_i | M_{c,F}) is calculated through back-projection. The color model M_{c,F} is represented by the histogram of the object's color, saved as a look-up table, and p_color(Z_i | M_{c,F}) is read from this table. Let

$$p_c(x_i, t) = p_{color}(Z_i \mid M_{c,F}). \qquad (15)$$

Then p_m(x_i, t) is integrated into the original color probability distribution p_c(x_i, t). The color-motion-integration-based Mean-Shift algorithm is illustrated in Table 1. In step 5, k is a constant, and the mean location P̂ (the centroid of the area) is given as

$$\hat{P} = \frac{M_{01}}{M_{00}} = \frac{\sum_i x_i\, p(x_i, t)}{\sum_i p(x_i, t)}. \qquad (16)$$

Table 1. Algorithm of Mean-Shift based on the color-motion integration.
1. Calculate the color PDM: compute the color probabilistic distribution map p_c(x_i, t) by back-projection.
2. Calculate the motion PDM: compute the motion probabilistic distribution map p_m(x_i, t) by motion detection.
3. Cue integration: integrate the two maps using Eq. (14).
4. Initialize the Mean-Shift iteration: choose a search-window scale s_0 and an initial location P_0 on the combined distribution map p(x_i, t).
5. Mean-Shift iteration: compute the moments M_00 = Σ_i p(x_i, t) and M_01 = Σ_i x_i p(x_i, t) of the region in the search window (P, s), and calculate the mean location P̂ using Eq. (16). Set the new window parameters as P = P̂, s = k√M_00. Repeat step 5 until convergence.

Fig. 1. Direct color-motion-integration-based Mean-Shift occlusion handler (emphasized by the dotted box).

The motion and color cues are employed explicitly. Motion continuity is used implicitly, since the iteration is initialized from the tracking result of the last frame. Note that the integration scheme in Eq. (14) is open, and more cues can be integrated.

2.3. Occlusion handling

The Mean-Shift algorithm is vulnerable to full occlusion over a few frames because the present Mean-Shift iteration is initialized from the result of the previous one. If the object is totally occluded for a couple of frames, the tracking window drifts away and the algorithm has no mechanism to continue tracking. Based on the direct color-motion integration mechanism, an occlusion-handling approach is helpful: it detects full-occlusion cases and reinitializes tracking automatically when the lost object reappears. The direct-cue integration yields a distribution map with little background noise, which makes it possible to search for larger non-zero regions on the distribution map to find the reappeared object.
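As a sketch, the direct integration of Eq. (14) and the window update of step 5 in Table 1 might be implemented as follows. The function names, the default `k`, and the convergence tolerance `eps` are our own illustrative choices.

```python
import numpy as np

def direct_integration(pdms):
    """Eq. (14): per-pixel minimum over the cue PDMs."""
    return np.minimum.reduce(pdms)

def mean_shift(p, center, scale, k=1.5, max_iter=20, eps=0.5):
    """Step 5 of Table 1: move the window to the centroid of the
    probability mass it covers until convergence; the window side
    is updated as k * sqrt(M00)."""
    h, w = p.shape
    cx, cy = center
    for _ in range(max_iter):
        half = scale / 2.0
        x0, x1 = max(0, int(cx - half)), min(w, int(cx + half) + 1)
        y0, y1 = max(0, int(cy - half)), min(h, int(cy + half) + 1)
        win = p[y0:y1, x0:x1]
        m00 = win.sum()
        if m00 <= 0:                 # empty window: possible occlusion
            return (cx, cy), scale, False
        ys, xs = np.mgrid[y0:y1, x0:x1]
        nx = (xs * win).sum() / m00  # Eq. (16), x component
        ny = (ys * win).sum() / m00  # Eq. (16), y component
        moved = np.hypot(nx - cx, ny - cy)
        cx, cy = nx, ny
        scale = k * np.sqrt(m00)
        if moved < eps:
            break
    return (cx, cy), scale, True
```

The `False` return on an empty window is where the occlusion handler of Section 2.3 would take over.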
Without the direct-cue integration, background noise may cause the occlusion handler to fail. Fig. 1 shows the flow chart of the occlusion handler using direct-cue integration. The handler is composed of an occlusion-detecting part and an occlusion-recovering part. If the object's color is fully occluded by other objects, the tracking window shrinks. When the window area or the density of non-zero pixels in the window becomes smaller than preset thresholds, a full occlusion is declared. In that case, larger non-zero regions are searched for in the object's probabilistic distribution map near the place where the object disappeared; if such regions are found, the largest one is used to reinitialize the tracking window. A projection-based region-segmentation method is used to search for large regions after the full occlusion happens. To minimize the possibility of misclassifying background clutter as the reappeared object, and to save computational resources, the search is limited to the region near the place where the object disappeared. Suppose the person disappeared at x_d; the person is then expected to reappear at an x satisfying x_d − r < x < x_d + r, where r is an empirical search radius. Fig. 2 shows the principle of recovering the reappeared target, and the algorithm is summarized in Table 2. Since Mean-Shift converges fast and is robust to small distractors, a coarse region-segmentation result is sufficient; the fast projection-based region-segmentation method is therefore suitable for finding the large region after the full occlusion.

Fig. 2. Principle of discovering reappearing targets. A region larger than a threshold is searched for in the interval [x_d − r, x_d + r] after the full occlusion. The found region (white rectangular box) is then used to reinitialize the Mean-Shift iterations.
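The projection-based search summarized in Table 2 might look like the following sketch. The threshold on the projection profiles and the minimal region size `min_run` are illustrative parameters of our own; the horizontal-then-vertical projection logic follows the table.

```python
import numpy as np

def find_runs(profile, thresh):
    """Maximal index runs (start, stop) where profile > thresh."""
    mask = profile > thresh
    runs, start = [], None
    for i, m in enumerate(mask):
        if m and start is None:
            start = i
        elif not m and start is not None:
            runs.append((start, i)); start = None
    if start is not None:
        runs.append((start, len(mask)))
    return runs

def find_lost_target(p, xd, r, thresh=1.0, min_run=3):
    """Steps 1-2 of Table 2: project the slice [xd-r, xd+r] of the
    combined PDM onto the axes and return the largest region
    (l, r, t, b), or None if nothing large enough is found (the
    caller would then widen r and retry, step 3)."""
    h, w = p.shape
    x0, x1 = max(0, xd - r), min(w, xd + r)
    col = p[:, x0:x1].sum(axis=0)              # horizontal projection
    runs = [rn for rn in find_runs(col, thresh) if rn[1] - rn[0] >= min_run]
    if not runs:
        return None
    l, rgt = max(runs, key=lambda rn: rn[1] - rn[0])
    l, rgt = l + x0, rgt + x0
    row = p[:, l:rgt].sum(axis=1)              # vertical projection
    vruns = [rn for rn in find_runs(row, thresh) if rn[1] - rn[0] >= min_run]
    if not vruns:
        return None
    t, b = max(vruns, key=lambda rn: rn[1] - rn[0])
    return l, rgt, t, b
```

Because the result only seeds a new Mean-Shift iteration, the coarse boundaries returned here are accurate enough in practice.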
With the color-motion integration method and the occlusion handler, we can deal with color background clutter and with full occlusion over a few frames, which are said to be deficiencies of the deterministic methods (Perez et al., 2002). Furthermore, the occlusion handler can cope with long-lasting complete occlusion, or with the object's departure from the camera's field of view (FOV) for a couple of frames, which is difficult to handle even for multi-hypothesis probabilistic tracking methods such as the Particle Filter.

Table 2. Algorithm for finding a lost target. Save x_d, the horizontal coordinate of the point where the target disappeared.
1. Find the left and right boundaries: the part of the combined distribution p(X, t) within [x_d − r, x_d + r] is projected onto the horizontal axis to find the region's left and right boundaries (l, r).
2. Find the top and bottom boundaries: the part of p(X, t) within [l, r] is projected onto the vertical axis to find the region's top and bottom boundaries (t, b).
3. Tune the search range or return: if no large region is found, extend the search range, r = r + Δ, and go to step 1; otherwise, return the found region (l, r, t, b).

3. Mean-Shift adaptive multi-cue integration

Although the direct multi-cue integration can enhance the tracking performance of the color-based Mean-Shift algorithm, it may erode the color probabilistic image because of the inevitable holes in the motion-detection results. This is a disadvantage of direct integration when an object's color has a sufficiently high saturation component and its color probabilistic map alone is good enough for tracking. In addition, the direct multi-cue-integration method assumes that the contributions of all cues are the same, regardless of their reliabilities. Hence, we employ an adaptive multi-cue-integration technique.
Our work differs from the adaptive multi-cue integration suggested in Spengler and Schiele (2003) and Triesch and Malsburg (2000) mainly in that we introduce a new quality function suited to blob tracking.

3.1. Adaptive multi-cue integration

Suppose p_j(x_i, t) is the probability distribution map of cue j and p(x_i, t) is the combined probability distribution map. The cues are integrated as a weighted sum of probability distributions:

$$p(x_i, t) = \sum_j \omega_j(t)\, p_j(x_i, t), \qquad (17)$$

$$\sum_j \omega_j(t) = 1. \qquad (18)$$

The adaptive integration method changes each cue's weight according to its reliability in the previous frame. Suppose the performance of an individual cue j can be evaluated by a quality function q_j(t). The normalized quality of cue j is

$$\bar{q}_j(t) = \frac{q_j(t)}{\sum_j q_j(t)}. \qquad (19)$$

In Spengler and Schiele (2003) and Triesch and Malsburg (2000), the authors take the maximum point of the PDM as the estimated position of the target. This simplified estimation is not robust for finding the region that the target occupies: when the target changes its shape and size in the image, it may fail to segment the target accurately, and if there is more than one maximum point on the combined PDM, how to determine the position of the target is not addressed in these works. In this paper, the Mean-Shift method is employed to find the human body in images, for two reasons. First, the adaptive Mean-Shift tracking algorithm finds not only the position of the human body but also its area in the image. Second, Mean-Shift is an iterative method that finds the nearest mode of the distribution, which means that it employs the motion-continuity cue implicitly. Therefore, it is unnecessary to calculate a separate motion-continuity cue as in Spengler and Schiele (2003) and Triesch and Malsburg (2000), which is computationally expensive.
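A minimal sketch of the weighted-sum integration of Eqs. (17)–(19): the combined PDM is a convex combination of the cue PDMs, with the weights kept normalized. The function names are ours.

```python
import numpy as np

def normalize_qualities(q):
    """Eq. (19): normalized quality of each cue."""
    q = np.asarray(q, dtype=float)
    return q / q.sum()

def combine(pdms, weights):
    """Eq. (17): p(x,t) = sum_j w_j(t) p_j(x,t), with the weights
    renormalized to satisfy Eq. (18)."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    return sum(w * p for w, p in zip(weights, pdms))
```

Unlike the minimum of Eq. (14), a zero in one cue's map only attenuates, rather than vetoes, the combined probability.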
The relation between the quality and the weight of cue j is defined as

$$\tau\,\dot{\omega}_j(t) = \bar{q}_j(t) - \omega_j(t). \qquad (20)$$

Eq. (20) is used to update the individual weight of each cue. As the quality \bar{q}_j(t) is normalized, the weight ω_j(t) remains normalized as well. The parameter τ is a time constant controlling the updating speed: if \bar{q}_j(t) > ω_j(t), ω_j(t) tends to increase. Essentially, q_j(t) represents the feedback of the tracking results, so Eq. (20) can be regarded as a running average, and ω_j(t) is adapted according to q_j(t), which brings in information about the performance of cue j in the last frame. The remaining work is the choice of an appropriate quality function q_j(t). The quality function can be viewed as feedback from the tracking result X(t) = (P, s), where P and s are the estimated center and scale of the tracking window, respectively; each cue's weight is adjusted according to the quality of the cue in the last frame.

In Spengler and Schiele (2003) and Triesch and Malsburg (2000), the authors use the following cue quality function:

$$q_j(t) = \begin{cases} p_j(\hat{x}, t) - \bar{p}_j(x_i, t), & \text{if } p_j(\hat{x}, t) > \bar{p}_j(x_i, t) \\ 0, & \text{if } p_j(\hat{x}, t) \le \bar{p}_j(x_i, t), \end{cases}$$

where \bar{p}_j(x_i, t) is the mean value over the PDM of cue j and \hat{x} is the estimated target position. This quality function depends heavily on the single point \hat{x}, which may be distracted by noise. To avoid the influence of noise, the PDM of each cue is smoothed in Spengler and Schiele (2003) and McKenna et al. (2000), at the price of high computational complexity; to make the algorithm tractable, the image is subsampled, which sacrifices image resolution. This paper presents a new quality function based on statistics over a region. The new quality function is much more robust to noise, and in our approach images are processed at the original resolution, without subsampling.

3.2. Quality function based on region statistics

In this paper, the quality function is defined as the ratio between the non-zero probability mass inside the tracking window and that in the surrounding background in the individual probability distribution map. The sum of the probabilities within a window W is defined as

$$f(p(x_i), W) = \sum_{x_i \in W} p(x_i). \qquad (21)$$

To eliminate the effect of pixels far from the object, a center-surround approach is used: the background is defined as the area between the tracking box X(t) and a larger window X′(t) that shares the same center as the tracking window and includes background pixels. The quality function q_j(t) is then defined as

$$q_j(t) = \frac{f(p_j(x_i, t),\, X(t))}{f(p_j(x_i, t),\, X'(t) - X(t))}. \qquad (22)$$

Each cue's reliability is evaluated by this quality function, and the weights are adapted accordingly. When an object with low-saturation color is tracked, the quality value of the color cue is much lower than that of the motion cue, so the weight of the motion cue is increased. When the object changes its color appearance, a purely color-based algorithm fails because the tracked color is invisible, but the adaptive color-motion-cue-based Mean-Shift is able to keep tracking the person by using information from the motion cue.

Sometimes, when the motion cue dominates, the tracking window X(t) is expanded by the motion cue, because the color-cue area is always smaller than the motion-cue area. This may make the tracking window much larger than the object's color area; according to Eq. (22), the color cue may then have a small weight even when it is reliable, which makes it difficult for the color cue to regain the dominant position after the motion cue has taken over. To avoid this, Mean-Shift is also applied to the individual color cue. We compare the tracking window on the color cue with the tracking window on the combined distribution and choose the smaller window as the final tracking window. With this improvement, when the reliable color cue becomes visible again, its weight automatically increases, and the color cue regains dominance if it is reliable. The weight can also reveal the orientation of the tracked person relative to the camera.

The adaptive weighted-sum integration clearly differs from the direct integration method. In direct integration, if a pixel's value in the motion probabilistic distribution map is zero, its combined probability is set to 0 no matter what its value in the color probabilistic distribution map is. In weighted-sum integration, the combined probability is always decided by both the color and the motion probabilities. Considering the possible detection holes in the cue-extraction process, the adaptive weighted-sum integration is more robust to such holes than the direct integration method.

After the combined probability distribution map is obtained, a region-detection algorithm must operate on the combined map to locate the object. Spengler and Schiele (2003) use the projection of the combined distribution p(x_i, t) onto the coordinate axes to find the estimated position; in this paper, the more robust Mean-Shift is used together with the new quality function. The motion cue is inherently well suited to integration into the Mean-Shift framework by the adaptive method discussed above. Background-subtraction results usually contain holes and patches of noise. The holes can be remedied by the other cues through the weighted-sum integration, and since Mean-Shift is robust to small distractors, noise from the background subtraction has little influence on the tracking results.
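The region-statistics quality function of Eqs. (21)–(22) and the weight update of Eq. (20), discretized as one Euler step per frame, might be sketched as follows. The surround `margin` defining X′(t), the small constant guarding against an empty surround, and the default `tau` are illustrative values of our own.

```python
import numpy as np

def quality(pdm, box, margin=10):
    """Eq. (22): summed probability inside the tracking window X(t)
    divided by that in the surrounding band X'(t) - X(t)."""
    x0, y0, x1, y1 = box
    h, w = pdm.shape
    gx0, gy0 = max(0, x0 - margin), max(0, y0 - margin)
    gx1, gy1 = min(w, x1 + margin), min(h, y1 + margin)
    inner = pdm[y0:y1, x0:x1].sum()
    outer = pdm[gy0:gy1, gx0:gx1].sum() - inner
    return inner / (outer + 1e-6)

def update_weights(weights, qualities, tau=5.0):
    """Eq. (20), tau * dw/dt = q_bar - w, one Euler step per frame,
    followed by renormalization to keep Eq. (18) satisfied."""
    q = np.asarray(qualities, dtype=float)
    qbar = q / q.sum()                      # Eq. (19)
    w = np.asarray(weights, dtype=float)
    w = w + (qbar - w) / tau
    return w / w.sum()
```

A clean, compact blob scores far higher than a map where probability mass leaks into the surround, which is exactly the behavior the center-surround ratio is meant to reward.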
The flow chart of our proposed adaptive multi-cue-integration based Mean-Shift tracking is illustrated in Fig. 3. Note that the cue-performance evaluation forms a feedback loop, which is the key difference from the direct multi-cue integration: each cue's reliability is evaluated in the cue-evaluation phase by the new quality function. Another point worth noting is that Mean-Shift iterations are used in two places, the first on the combined PDM and the second on the color PDM. The aim is to favor the color cue when it is reliable; when the color cue is occluded by the target itself and then reappears, the second Mean-Shift iteration helps the tracker focus on the color cue again.

The adaptive integration method of Fig. 3 works well on single-person sequences, but it may fail on sequences in which the tracked person is fully occluded. We therefore combine it with the direct integration method and its occlusion handler (Section 2) to increase the robustness of the adaptive cue integration in full-occlusion cases. The principle of combining the direct and the adaptive integration methods is shown in Fig. 4. Direct cue integration is performed in every frame, in addition to the adaptive integration, and the probability density map from the direct cue integration is used to detect occlusion and to search for the occluded person. When occlusion is detected, the occlusion handler is called to search for the lost target; once the target is found, the recovered region is used to reinitialize the adaptive cue-integration based Mean-Shift iterations.

Fig. 4. Integration of the adaptive cue-integration and the direct cue-integration methods. This extends the adaptive integration to the full-occlusion cases.
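The two Mean-Shift passes described above (one on the combined PDM, one on the color PDM alone) might be organized per frame as in the following sketch. Again this is a hedged illustration under stated assumptions, not the paper's code: `mean_shift` is a plain centroid iteration on a probability map with a fixed-size window, and because the window size is fixed here, the paper's "choose the smaller window" rule is replaced by a simple stand-in that prefers the color window whenever the color cue has support under it.

```python
import numpy as np

def mean_shift(pdm, box, n_iter=20):
    """Iteratively move the window to the centroid of the probability
    mass it covers (zeroth and first moments), CAMSHIFT-style."""
    x, y, w, h = box  # hypothetical layout: top-left corner, width, height
    for _ in range(n_iter):
        win = pdm[y:y + h, x:x + w]
        m00 = win.sum()
        if m00 == 0:
            break  # no probability mass under the window
        ys, xs = np.mgrid[0:h, 0:w]
        cx = int((xs * win).sum() / m00)  # centroid inside the window
        cy = int((ys * win).sum() / m00)
        nx = min(max(x + cx - w // 2, 0), pdm.shape[1] - w)
        ny = min(max(y + cy - h // 2, 0), pdm.shape[0] - h)
        if (nx, ny) == (x, y):
            break  # converged
        x, y = nx, ny
    return (x, y, w, h)

def track_frame(color_pdm, motion_pdm, weights, box):
    """One frame of the adaptive integration: Mean-Shift on the combined
    PDM, then on the color PDM alone, preferring the color window."""
    combined = weights[0] * color_pdm + weights[1] * motion_pdm
    box_comb = mean_shift(combined, box)
    box_color = mean_shift(color_pdm, box)
    x, y, w, h = box_color
    # stand-in for the smaller-window rule: keep the color window
    # when the color cue is actually present under it
    if color_pdm[y:y + h, x:x + w].sum() > 0:
        return box_color
    return box_comb
```

The second pass is what lets the tracker snap back onto the color blob once it reappears, instead of staying locked on the larger motion region.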
It is noted that Mean-Shift iterations are performed in three places in the proposed algorithm (one Mean-Shift is used in the occlusion handler). Fortunately, Mean-Shift is very efficient, and the adaptive cue integration with the occlusion handler can run in real time.

4. Experiments

Fig. 3. Flow chart of the adaptive color-motion integration based Mean-Shift tracking algorithm.

To evaluate the effectiveness of the proposed algorithm, an experimental system is set up. Experiments are carried out on a PC with a 1.8 GHz Pentium 4 CPU and 512 MB memory. Pixels with saturation lower than 30 and brightness lower than 10 are discarded. The direct color-motion integration method is tested first, then the occlusion handler based on the direct cue-integration method, and finally the advantages of the adaptive multi-cue integration are demonstrated. Video sequences are 640 x 480 in resolution, without the subsampling used in Spengler and Schiele (2003) and Triesch and Malsburg (2000). S1 and S2 are sequences with low saturation objects. S3 and S4 are sequences in which a color cue with medium saturation is tracked against similar background color clutter. S5 and S6 are multi-people sequences in which a good color cue is tracked and the object is never occluded. S7 to S11 are sequences with two persons that contain full-occlusion cases, and S12 and S13 contain three persons. In total, the video sequence database has over 3000 frames. All results are obtained in real-time runs.

Table 3
The video sequences used in our experiments.

Sequence           Human          Sequence characteristics
S1 (200 frames)    Single         Low saturation color
S2 (200 frames)    Single         Low saturation color
S3 (200 frames)    Single         Background distractor
S4 (200 frames)    Single         Background distractor
S5 (350 frames)    Single         Occlusion by object
S6 (384 frames)    Single         Reliable color
S7 (384 frames)    2              Reliable color
S8 (110 frames)    2              Once occlusion
S9 (160 frames)    2              Once occlusion
S10 (200 frames)   2              Once occlusion
S11 (255 frames)   2              Twice occlusion
S12 (350 frames)   More than 2    Three times occlusion
S13 (350 frames)   More than 2    Three times occlusion

4.1. Direct motion cue integration

Algorithms are tested on the video sequences summarized in Table 3. A 1D hue histogram with 16 bins is used. With only the color cue, objects were lost soon after initialization. Table 4 shows the tracking results on the four video sequences S1 to S4; the original color-based Mean-Shift algorithm failed on all of them. Tracking is counted as failed when the tracking window converges to a wrong region or becomes much larger than the object's color area.

Table 4
Tracking performance: S1 and S2 are sequences with low saturation objects; S3 and S4 are sequences with similar background color clutter.

Video sequence     Integrate motion info.   Model initialization (frame)   Failure (frame)   Success rate (%)
S1 (200 frames)    N                        60                             175               82
                   Y                        60                             (success)         100
S2 (200 frames)    N                        30                             112               48
                   Y                        30                             (success)         100
S3 (200 frames)    N                        30                             151               71
                   Y                        30                             (success)         100
S4 (200 frames)    N                        40                             83                27
                   Y                        40                             (success)         100

Fig. 5 shows the tracking results of the method of Bradski (1998) and of the proposed direct integration algorithm in frame 68 of sequence S1. It can be seen that integrating the motion information helps the color-based Mean-Shift algorithm overcome the difficulty of tracking objects with low saturation color.

Fig. 5. Tracking results in frame 68 of video sequence S1 (tracking an object with low saturation color): (a) tracking results and probability distribution maps using only the color cue; (b) tracking results using color-motion cue integration.

Even when the saturation of the object color is not low, background color clutter may also cause tracking failure; integrating the motion cue enables the color-based Mean-Shift algorithm to track objects in this case as well. Fig. 6 shows the results on frame 90 of sequence S3. The lighting in this sequence is dark, which makes the background pixels' hue values inaccurate and forms a large distractor area. The motion cue helps eliminate the distractor at the door and enables the color-based Mean-Shift method to converge to the correct color area.

Fig. 6. Tracking results in frame 90 of video sequence S3 (tracking a human against a similarly colored background): (a) tracking result and corresponding probability distribution map using only the color cue; (b) results after color-motion cue integration.

4.2. Occlusion handling

The occlusion handler is tested on the multi-person video sequences. In these sequences the occluded person is tracked, and the algorithm of Bradski (1998) fails in all of them. With the occlusion handler, when the occluded person reappears from the occlusion, the tracking is reinitialized and the tracker recovers from the loss. Fig. 7 shows a full-occlusion case in sequence S8. The color model is initialized on the boy's red coat before the occlusion occurs. In frame 66 the full occlusion occurs; in frame 67 the occlusion handler recovers from it and reinitializes the tracking.

Fig. 7. Full occlusion case from video sequence S8: (a) frame 64, (b) frame 66, (c) frame 67, and (d) frame 68. In frame 67 the tracked human reappears; the algorithm successfully detects him and reinitializes the Mean-Shift iterations.

Fig. 8 shows the performance of the occlusion handler in sequence S13, which contains two occlusion cases. The color model is initialized on the boy's blue shirt before the occlusion. In frame 204 the boy is totally occluded by another boy; in frame 206 the boy in the blue shirt becomes visible again and the algorithm continues tracking him. Later, in frame 219, the girl in red occludes the boy completely, and the occlusion handler again works successfully.

4.3. Mean-Shift with adaptive color-motion integration

In the adaptive multi-cue-integration experiments, a 2D histogram is used: the hue and saturation components are discretized into 16 and 10 bins, respectively. The other experimental conditions are unchanged. The adaptive multi-cue-integration strategy is tested on sequences S1 to S6, and the object is tracked successfully throughout all of them.

Fig. 9 shows a representative result of tracking a low saturation color in video sequence S1. The color model is selected according to the color of the boy's shirt. As the color cue has a low quality-function value, its weight is diminished and the weight of the motion cue (the lighter curve) increases. It can be seen from Fig. 10 that both cues affect the combined probability distribution map, with the motion cue dominating.

We compared the quality functions of Spengler and Schiele (2003) and Triesch and Malsburg (2000) with our proposed quality function. Fig. 11 shows the weight-adaptation results using their quality function. As that quality function is based on the value at only a single estimated point, there is a great possibility that the result is distracted by noise. Compared with the result of our proposed quality function shown in Fig. 9, the quality functions used in (Spengler and Schiele, 2003; Triesch and Malsburg, 2000) cannot adjust the weights as smoothly as ours.
There are even unreasonable cases in which the motion cue has a lower weight than the color cue around frame 100. When the color cue is reliable, i.e. there is little distraction on the color probability map, the adaptive integration mechanism should not suppress the color cue; on the contrary, it should give it a higher weight.

Fig. 8. Tracking results in a full-occlusion case from video sequence S13. This figure shows two successive full occlusion cases in a video involving three persons.

Fig. 9. Adaptive color-motion integration for tracking a low saturation color. The lighter curve is the weight of the motion cue, and the darker one is the weight of the color cue. The vertical dotted line indicates the time of Fig. 10.

Sequences S3 to S6 are sequences in which good color features are tracked. Table 5 shows the tracking results: in all these sequences, the color cue has a higher average weight than the motion cue. Fig. 12 shows a typical result on video sequence S4. The color model is initialized on the boy's orange T-shirt, and the color cue becomes dominant after initialization, which can also be seen from Fig. 13.

Fig. 10. Tracking results of the adaptive color-motion integration for a low saturation color. The result is from frame 120 of sequence S1: (a) combined probability density maps and (b) the corresponding tracking result. The motion cue is dominating (refer to Fig. 9).

Fig. 11. Weight-time curves of the quality functions in (Spengler and Schiele, 2003; Triesch and Malsburg, 2000); compare with ours in Fig. 9.

Table 5
Results of tracking a reliable color cue using adaptive cue integration. In all sequences, the color cue has the higher average weight.

Video sequence     Model initialization (frame)   Average weight (color)   Average weight (motion)
S3 (200 frames)    67                             0.91                     0.09
S4 (200 frames)    47                             0.86                     0.14
S5 (384 frames)    80                             0.69                     0.30
S6 (384 frames)    84                             0.52                     0.48

With the adaptive cue integration, the problem of changing appearance caused by human rotation can be handled as well, as demonstrated in Figs. 14 and 15. The color model is initialized with the bright blue pattern on the boy's T-shirt while he is facing the camera. For the first few frames, the weight of the color cue tends to increase because of its high quality. In frame 100 the boy begins to turn left and walk toward the right of the image. The blue pattern becomes invisible, a case in which a purely color-based tracking algorithm fails; however, the object can still be tracked because the weight of the motion cue increases and becomes dominant in the combined map. The failed color cue is compensated by the motion cue. In frame 120 the boy begins to turn back and the blue pattern becomes visible again, so the weight of the color cue starts to increase once more.

Fig. 12. Adaptive color-motion integration: tracking a reliable color. The lighter curve is the weight of the motion cue, and the darker one is the weight of the color cue. The vertical dotted line indicates the time of Fig. 13.

Fig. 13. Tracking result of the adaptive color-motion integration: tracking a reliable color. The result is from frame 82 of sequence S4: (a) combined probability density maps and (b) the corresponding tracking result.

Fig. 14. Tracking results of the adaptive color-motion integration: object changes appearance. The lighter curve is the weight of the motion cue, and the darker one is the weight of the color cue. The vertical dotted line indicates the time when Fig. 15 is taken.

Fig. 15. Tracking results of the adaptive color-motion integration: object changes appearance. (a, c, e) are combined probability density maps and (b, d, f) are the corresponding tracking results, taken from frames 100, 120 and 140 of sequence S1 (refer to Fig. 14).

This experiment also demonstrates that the motion probability distribution map is well suited to the Mean-Shift framework. Note the distractor on the left of Fig. 15c, which is introduced by the motion probability distribution map; Mean-Shift is robust to this kind of small distractor.

As mentioned in the previous section, the adaptive algorithm performs very well in the single-human case and can automatically adjust the cues' weights soundly. However, it encounters difficulties in full-occlusion cases, as shown in Fig. 16: when the target person is occluded by another, the adaptive integration cannot realize that the target is occluded, mistakes the occluder for the target, and the tracking fails. Therefore, we also extend the adaptive algorithm to the full-occlusion cases by performing the direct cue-integration method in every frame in order to detect and handle full occlusion.

Fig. 16. Tracking failure caused by the full-occlusion problem in the adaptive color-motion integration. The result is taken from sequence S10: (a, b and d) are tracking results; (c) is the combined PDM corresponding to (b).

The combined algorithm is tested on all the multi-person sequences. Fig. 17 shows one result in a full-occlusion case, taken from sequence S10; compare it with Fig. 16. The direct cue integration is performed in every frame, and its result is used to detect and handle the full occlusion. In frame 96 full occlusion is detected, and the occlusion handler begins to search a larger region near the disappearing point on the direct cue-integration PDM. In frame 102 the tracked person is detected when he reappears.

5. Conclusions

This paper has demonstrated that the motion cue can be integrated with the color cue to solve the problems that the color-based Mean-Shift algorithm encounters when tracking an object with low saturation color or against background color clutter.
Based on the direct cue-integration approach, an occlusion handler is proposed that handles full occlusion lasting a couple of frames, likewise using the Mean-Shift algorithm. We also applied the Mean-Shift algorithm within the adaptive multi-cue-integration method to locate the region containing the target person. A novel quality function is proposed to evaluate the reliability of each cue smoothly and soundly: when the color cue is more reliable, its weight becomes higher than that of the motion cue, and when it is less reliable it is compensated by the motion cue. Extensive experiments demonstrate that the proposed adaptive color-motion integration algorithms perform well in difficult tracking situations, such as images with low saturation color, objects whose color is similar to the background, and objects that are partially or fully occluded.

Acknowledgements

This work is supported by the National Natural Science Foundation of China (NSFC, Nos. 60675025, 60875050) and the National High Technology Research and Development Program of China (863 Program, No. 2006AA04Z247).

Fig. 17. Adaptive color-motion integration: using the direct cue integration with the occlusion handler to handle full occlusion. (a, b, d, e and f) are tracking results; (c) is the adaptively combined PDM in frame 96, corresponding to (b) (refer to Fig. 16).

References

Arulampalam, M.S., Maskell, S., Gordon, N., Clapp, T., 2002. A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking. IEEE Trans. Signal Process. 50.
Bradski, G.R., 1998. Computer vision face tracking for use in a perceptual user interface. IEEE Workshop Appl. Comput. Vision, 214–219.
Collins, R.T., 2003. Mean-shift blob tracking through scale space. IEEE Conf. Comput. Vision Pattern Recognition Proc. 2, 234–240.
Comaniciu, D., Ramesh, V., Meer, P., 2000. Real-time tracking of non-rigid objects using mean shift. IEEE Conf. Comput. Vision Pattern Recognition 2, 142–149.
Comaniciu, D., Ramesh, V., Meer, P., 2003. Kernel-based object tracking. IEEE Trans. Pattern Anal. Machine Intell. 25, 564–577.
Haritaoglu, I., Harwood, D., Davis, L.S., 2000. W4: Real-time surveillance of people and their activities. IEEE Trans. Pattern Anal. Machine Intell. 22.
Hayman, E., Eklundh, J.-O., 2002. Probabilistic and voting approaches to cue integration for figure-ground segmentation. Proc. 7th Eur. Conf. Comput. Vision.
Isard, M., Blake, A., 1998. CONDENSATION – conditional density propagation for visual tracking. Int. J. Comput. Vision, 5–28.
Liu, T.-L., Chen, H.-T., 2004. Real-time tracking using trust-region methods. IEEE Trans. Pattern Anal. Machine Intell. 26 (3), 397–402.
McKenna, S., Jabri, S., Duric, Z., Rosenfeld, A., Wechsler, H., 2000. Tracking groups of people. Computer Vision and Image Understanding 80, 42–56.
Nummiaro, K., Koller-Meier, E., Van Gool, L., 2002. Object tracking with an adaptive color-based particle filter. Image Vision Comput., 99–111.
Perez, P., Hue, C., Vermaak, J., Gangnet, M., 2002. Color-based probabilistic tracking. Eur. Conf. Comput. Vision, 661–675.
Sherrah, J., Gong, S., 2001. Continuous global evidence-based Bayesian modality fusion for simultaneous tracking of multiple objects. Int. Conf. Comput. Vision, 42–49.
Spengler, M., Schiele, B., 2003. Towards robust multi-cue integration for visual tracking. Machine Vision Appl. 14, 50–58.
Swain, M., Ballard, D., 1991. Color indexing. Int. J. Comput. Vision, 11–32.
Tao, H., Sawhney, H.S., Kumar, R., 2000. Dynamic layer representation with applications to tracking. Proc. Comput. Vision Pattern Recognition, II: 134–141.
Toyama, K., Horvitz, E., 2000. Bayesian modality fusion: Probabilistic integration of multiple vision algorithms for head tracking. Proc. Asian Conf. Comput. Vision.
Triesch, J., Malsburg, C., 2000. Self-organized integration of adaptive visual cues for face tracking. IEEE Int. Conf. Automatic Face Gesture Recognition, 102–107.
Vermaak, J., Perez, P., Gangnet, M., Blake, A., 2002. Towards improved observation models for visual tracking: Selective adaptation. Eur. Conf. Comput. Vision, 645–660.
Wang, T., Diao, Q., Zhang, Y., Song, G., Lai, C., Bradski, G., 2004. A dynamic Bayesian network approach to multi-cue based visual tracking. Proc. Int. Conf. Pattern Recognition, 167–170.
Wren, C.R., Pentland, A.P., 1997. Pfinder: Real-time tracking of the human body. IEEE Trans. Pattern Anal. Machine Intell. 19 (7), 780–785.
Wu, Y., Huang, T.S., 2001. A co-inference approach to robust visual tracking. Proc. Int. Conf. Comput. Vision 2, 26–33.
Zivkovic, Z., Krose, B., 2004. An EM-like algorithm for color-histogram-based object tracking. IEEE Conf. Comput. Vision Pattern Recognition Proc., 798–803.