There are several problems in doing effective tracking. It is a deceptively hard problem: the human eye and brain track any object effortlessly at any waking moment, but for a machine it is much harder, since you can only vaguely define mathematically who is tutoring / lecturing. Apart from that, any good tracker can be confused in many situations, because an automated tracker might find it extremely hard to segment a visual scene into interesting objects versus background noise. As an example, a lecture hall where the lecturer's video stream is being webcast and simultaneously shown on a projector might lead some trackers to believe that there is no lecturer at all.
Intuitively thinking about this problem, one might put his money on a motion tracker, since in a classroom environment significant motion should only originate from the lecturer himself. You might have caught me on that last sentence, thinking of the ever-irritating moving students / audience, but did you come up with other scenarios? Did you think of animation on slides, camera motion, flickering lights, global light changes (e.g. when the lights are switched off during a slide show), changes on the board, and the movement of boards themselves (see OCW lectures to find the ever-useful sliding boards)? Apart from these things, all trackers have to deal with the contrast of the lecturer's body against the background (see the following comparison):
|Figure 1. Contrast the distinctness of the lecturer from the background in the two frames|
To make the problem simpler, we rely heavily on a heuristic: our aim is to keep the lecturer inside the camera frame, even if the tracking itself occasionally misbehaves. All our trackers return a confidence measure indicating the likelihood that the tracker's output is correct. Because of the inherent problems with motion tracking, AVRiL relies on two sets of trackers. One is an advanced motion tracker (with heuristics which I can't explain effectively on a page). The other is a Haar classifier, used only now and then with a profile human-body template, to reinforce and reassure the motion tracker that it is tracking the right "human". The Haar classifier can't be run on every frame since it is resource-intensive compared to the motion tracker. To find out more about the Haar classifier technique, read Viola and Jones' seminal paper on real-time object detection and OpenCV's object recognition documentation.

The motion tracker we built works on the technique of motion histograms, which will be explained in detail below. Motion tracking is basically a method which involves frame differencing to get the motion since the last frame, followed by a detection step that pinpoints the location of the biggest cluster of motion pixels. It is easy to implement and a good choice where motion can be localized to the object being tracked, but of course, given our scenario, we need some optimizations. One of them is to avoid tracking the audience area, which can be done by initially detecting the lower region of the frame where motion happens in diffused patches.
Motion tracking can essentially be done in two ways: background subtraction or frame differencing. In background subtraction, the wide-angle camera first takes an initial shot (or a set of shots) before the lecture starts, in order to build a background model of the lecture hall. Then every subsequent frame in the lecture is subtracted from that initial background. This is not a very effective method, as any change in the lecture environment shows up in the subtracted image, e.g. any new object coming in, like the lecturer opening her laptop. The concept of background subtraction is shown in the image below (Figure 2). Even though the background model is built from a Gaussian over a set of background images, the final subtracted output still contains a lot of noise.
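As a minimal sketch of the background subtraction idea, here is a pure-Python version that compares a tiny grayscale "frame" (a nested list of 0-255 intensities) against a background shot, pixel by pixel. The threshold value is an assumed tuning parameter, not something prescribed by AVRiL.

```python
# Assumed tuning value: minimum intensity change to count as foreground.
THRESHOLD = 30

def subtract_background(background, frame, threshold=THRESHOLD):
    """Return a binary mask: 1 where the frame differs from the background."""
    return [
        [1 if abs(f - b) > threshold else 0 for f, b in zip(frow, brow)]
        for frow, brow in zip(frame, background)
    ]

# Empty hall vs. a frame where the lecturer (bright pixels) has appeared.
background = [[10, 10, 10, 10],
              [10, 10, 10, 10]]
frame      = [[10, 200, 210, 10],
              [10, 190, 205, 10]]

mask = subtract_background(background, frame)
# The two middle columns light up as foreground.
```

Real frames would of course come from a camera via something like OpenCV, but the per-pixel logic is exactly this simple, which is also why any background change (a moved chair, an opened laptop) pollutes the mask forever.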
|Figure 2. Background Subtraction run on a background image and a frame taken during a lecture|
Rather than doing background subtraction, which is bad when the background has changed by a great degree, we do frame differencing. Instead of subtracting each frame from the background model, you subtract each frame from the previous frame (Figure 3). Hence for every frame the following formula is used to obtain its foreground image:

foreground_t(x, y) = | frame_t(x, y) − frame_t−1(x, y) |
There are also better ways to do frame differencing. You can build a better background model by using a weighted average of the previous few frames, as follows:

background_t = α · frame_t + (1 − α) · background_t−1
The value of α is typically set close to 0.05. The foreground is computed in the same way as before, by simply subtracting the background from every frame. The effect of this equation is to create a background built by some percentage from the previous frame and by an increasingly smaller percentage from every frame before it.
|Figure 3. Frame differencing operation illustrated|
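The running-average background model above can be sketched in a few lines. This is an illustrative toy, with α = 0.05 as the text suggests and an assumed foreground threshold; the frames are tiny made-up grids.

```python
ALPHA = 0.05  # weight of the newest frame, as suggested in the text

def update_background(background, frame, alpha=ALPHA):
    """Blend the new frame into the background:
    background = alpha * frame + (1 - alpha) * background."""
    return [
        [alpha * f + (1 - alpha) * b for f, b in zip(frow, brow)]
        for frow, brow in zip(frame, background)
    ]

def foreground(background, frame, threshold=30):
    """Binary mask of pixels that differ from the current background model."""
    return [
        [1 if abs(f - b) > threshold else 0 for f, b in zip(frow, brow)]
        for frow, brow in zip(frame, background)
    ]

bg = [[10.0, 10.0], [10.0, 10.0]]
# A pixel that stays bright fades into the background over many frames...
for _ in range(100):
    bg = update_background(bg, [[10, 200], [10, 10]])
# ...so a static object eventually stops registering as foreground.
```

This is exactly the behaviour the equation promises: recent frames dominate the model, so a chair moved ten minutes ago is forgotten, while the constantly moving lecturer keeps showing up in the difference.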
Gaussian masking is used to denote at least two things when we talk about tracking from images. The first is in the context of building a Gaussian background model. In this method, a group of background images is taken and the mean and variance of each pixel are calculated. Once that is done, every new frame's pixels are compared against the mean and variance, with a specific threshold, to decide whether the pixel belongs to the foreground or not.
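A sketch of that per-pixel Gaussian background model, on tiny made-up frames: fit a mean and standard deviation per pixel from a stack of empty-hall shots, then flag a pixel as foreground when it lies more than k standard deviations from its mean. The value of k and the noise floor on the standard deviation are assumptions for illustration.

```python
import math

K = 3.0  # assumed threshold: how many standard deviations counts as "changed"

def fit_background(shots):
    """Per-pixel (mean, std) over a list of equally sized grayscale frames."""
    n = len(shots)
    height, width = len(shots[0]), len(shots[0][0])
    model = []
    for y in range(height):
        row = []
        for x in range(width):
            vals = [shot[y][x] for shot in shots]
            mean = sum(vals) / n
            var = sum((v - mean) ** 2 for v in vals) / n
            row.append((mean, math.sqrt(var)))
        model.append(row)
    return model

def is_foreground(model, frame, k=K, min_std=2.0):
    """Binary mask: 1 where |pixel - mean| > k * std (std floored at min_std
    so that perfectly still pixels don't make the test infinitely strict)."""
    return [
        [1 if abs(frame[y][x] - m) > k * max(s, min_std) else 0
         for x, (m, s) in enumerate(row)]
        for y, row in enumerate(model)
    ]

shots = [[[10, 12]], [[11, 11]], [[9, 13]]]   # noisy empty-hall shots, 1x2 px
model = fit_background(shots)
mask = is_foreground(model, [[10, 200]])      # lecturer appears at pixel 1
```

The advantage over a plain threshold is that pixels which are naturally noisy (flickering projector glow, for instance) get a wider tolerance band, learned from their own variance.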
AVRiL specifically uses a Gaussian mask when subtracting two frames. With a Gaussian mask, the foreground calculation for a specific pixel depends not only on the pixel values at that location over the frames, but also on a weighted region of pixels around it. The weights over this region are modeled on a 2D Gaussian curve, as shown in the figure below:
|Figure 4. Shape of a 2D Gaussian for taking weighted average of pixels|
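A minimal sketch of the Gaussian mask idea: build a small 2D Gaussian kernel and smooth the difference image with it, so each pixel's foreground score becomes a weighted average over its neighbourhood and lone noise pixels are damped. The kernel size and sigma are assumed values, and a real implementation would use an optimized routine (e.g. OpenCV's Gaussian blur) rather than these nested loops.

```python
import math

def gaussian_kernel(size=3, sigma=1.0):
    """Normalised size x size kernel sampled from a 2D Gaussian."""
    c = size // 2
    k = [[math.exp(-((x - c) ** 2 + (y - c) ** 2) / (2 * sigma ** 2))
          for x in range(size)] for y in range(size)]
    total = sum(sum(row) for row in k)
    return [[v / total for v in row] for row in k]

def smooth(image, kernel):
    """Convolve the image with the kernel (zero padding at the borders)."""
    height, width = len(image), len(image[0])
    size = len(kernel)
    c = size // 2
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            acc = 0.0
            for dy in range(size):
                for dx in range(size):
                    yy, xx = y + dy - c, x + dx - c
                    if 0 <= yy < height and 0 <= xx < width:
                        acc += image[yy][xx] * kernel[dy][dx]
            out[y][x] = acc
    return out

kernel = gaussian_kernel()
# A lone noise pixel in a 5x5 difference image gets spread out and weakened.
diff = [[0] * 5 for _ in range(5)]
diff[2][2] = 255
smoothed = smooth(diff, kernel)
```

After smoothing, a single hot pixel (camera noise) falls well below a sensible threshold, while a solid blob of motion pixels (the lecturer) keeps most of its strength.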
It is important to note that a lecturer cannot be detected just by doing some form of background subtraction or frame differencing. A detection step is necessary to complete the process. Think of it this way: once frame differencing has been done, the system has figured out the foreground. The story doesn't end there, as the foreground might contain a lot of things other than the lecturer himself, since other things in the lecture hall can also be moving. So from the foreground, the system needs to detect a boundary where the lecturer probably stands. By some complex means you could perhaps carve out the exact boundary of the lecturer, but since this process is run tens of times per second, it needs to be as computationally cheap as possible. Hence, just computing a rectangular fit suits our purpose best; intuitively, marking a human silhouette would be much more difficult than marking a rectangular bounding region. The whole lecturer detection process is explained in the instructive figure below:
|Figure 5. Steps of Lecturer Detection|
So how is this rectangular bounding region marked? After frame differencing (to extract the foreground) and Gaussian masking (to enhance it), a Motion Histogram (1) is built from the frame. The motion histogram method is extremely simple; it was used by the MSR team to build iCam/iCam2. It divides the frame into multiple vertical columns, each around 5 pixels wide. We build the histogram by counting the non-zero values in each column, i.e. the histogram has as many values as there are columns, and each value is the number of non-zero pixels in that column. If you add up all the values in the histogram, you get the total number of non-zero pixels in the frame, which we will call 'Total Motion'. Note that we are not talking about the number of non-zero values in the original frame, but about the non-zero values remaining after frame differencing.
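The histogram construction above can be sketched in a few lines. The column width of 5 pixels follows the text; the tiny binary "foreground" frame is illustrative.

```python
COLUMN_WIDTH = 5  # column width in pixels, as described in the text

def motion_histogram(fg, column_width=COLUMN_WIDTH):
    """One count of non-zero pixels per vertical band of a binary
    foreground image (the output of frame differencing + masking)."""
    width = len(fg[0])
    n_cols = (width + column_width - 1) // column_width
    hist = [0] * n_cols
    for row in fg:
        for x, px in enumerate(row):
            if px:
                hist[x // column_width] += 1
    return hist

# 2 rows x 15 px frame: motion concentrated in the middle band.
fg = [[0] * 5 + [1] * 5 + [0] * 5,
      [0] * 5 + [1] * 4 + [0] * 6]
hist = motion_histogram(fg)
total_motion = sum(hist)   # summing the histogram gives Total Motion
```

Collapsing the 2D foreground into a 1D histogram is what makes the rest of the pipeline cheap: every later decision works on a short list of counts instead of the full frame.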
The value of Total Motion is an important one. Before the histogram of a given frame is used, Total Motion is checked to see whether it lies within certain limits. If it exceeds a certain threshold, the frame is not processed and the lecturer is assumed not to have moved. This is done because a large number of motion pixels can be generated by changes in lighting, animation on slides, camera movement and many other things. Translation of the lecturer during camera movement is dealt with through other methods.
After building the motion histogram, all that remains is to find the biggest cluster of motion values. To calculate this we use a method called the "Thinnest Mountain" method: we look for the narrowest bunch of consecutive columns which contains 60% (or some other high percentage) of all motion pixels. Figure 5 illustrates the Thinnest Mountain method (see the Thinking Code Monkey page for a discussion of this problem and how to solve it). In the figure, the dark gray bars show the part where the lecturer stands, and the two vertical markers show the end result of the Thinnest Mountain method. The diagram points out that it is not always possible to get completely correct results: depending on the motion over the last frame, some portions of the lecturer might be left out of the computed bounding region. As for computing the horizontal boundaries (across the lecturer's height), we can use the same method, but less frequently, as we don't expect the lecturer to move towards or away from the camera as much as he moves across the frame.
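The Thinnest Mountain step can be sketched as a sliding window over the histogram: find the narrowest run of consecutive columns whose counts sum to at least 60% of Total Motion. This is my own reading of the method, with a made-up sample histogram; the original implementation may differ in details.

```python
def thinnest_mountain(hist, fraction=0.6):
    """Return (start, end) column indices (inclusive) of the narrowest
    window of consecutive columns containing at least `fraction` of all
    motion pixels in the histogram."""
    total = sum(hist)
    target = fraction * total
    best = (0, len(hist) - 1)           # fall back to the whole frame
    left = 0
    window = 0
    for right, count in enumerate(hist):
        window += count
        # Shrink from the left while the window still holds enough motion.
        while left < right and window - hist[left] >= target:
            window -= hist[left]
            left += 1
        if window >= target and right - left < best[1] - best[0]:
            best = (left, right)
    return best

# Lecturer's motion sits around the middle; stray audience motion elsewhere.
hist = [1, 0, 2, 20, 35, 15, 0, 2]
bounds = thinnest_mountain(hist)
```

Multiplying the resulting column indices by the column width gives the pixel x-range of the bounding region; the same routine run on a row-wise histogram gives the vertical bounds.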
I hope you enjoyed the article, and gained some insight into how to do simple tracking :)
(1) C. Zhang, Y. Rui, L. He, M. Wallick, "Hybrid Speaker Tracking in an Automated Lecture Room," IEEE International Conference on Multimedia and Expo, 2005