Motion Tracking with Python

Download from github:


My daughter, Alex, was in the 6th grade this year. For her science fair project, Alex wanted to do something involving animals. I had read about an experiment where lab rats were recorded on video, and their motion was analyzed by a computer (to determine the effects of a neurotoxin). We owned 2 gerbils, Havoc and Zoom, so I suggested to Alex we do a similar experiment with the gerbils. For some reason, she didn’t want to test a neurotoxin on her pets, so instead she decided to test the effects of full spectrum lighting on their movement.

Full spectrum light has all visible colors in it, just like natural sunlight. In contrast, regular lightbulbs only emit light in a narrow color spectrum, usually with a yellow or sometimes blue-ish tint to it. Like sunlight, full spectrum lighting has a somewhat magical effect on people: it improves work performance in schools and offices, elevates mood, and is used medically to treat depression. Here in the Pacific Northwest, where we have a lot of cloud cover during the winter, full spectrum lamps are a popular way to ward off the winter blues.

So we mounted a web cam over the gerbil cage. I wrote a Python script to analyze the incoming video stream, in real time, and detect movement. My script only logged the timestamp and movement coordinates. Alex then wrote her own, separate Python script that took the Cartesian coordinate data, computed the distance and velocity, and then wrote out a CSV file with the computed motion data. She then used OpenOffice to graph the data from her script. We ran the experiment twice: first, with a regular Halogen lamp over the cage, and second, with a full spectrum lamp. In the end, the gerbils traveled 21% more distance (and with 22% higher velocity) when the full spectrum light bulb was over their cage.

This was Alex’s first real-world Python script. It was a great chance for us to work together on a project, and she learned a lot. She learned how to parse a log file, loop over the lines, calculate the basic formulas for distance and velocity, and how to write out a file. I’m convinced that Python is the best language currently available for teaching kids how to program.
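
Alex's script is not reproduced here, but the calculation it performed can be sketched roughly as follows. The log format (one "timestamp,x,y" entry per line), the function names, and the sample data are all assumptions made for illustration, and the sketch is written for Python 3.

```python
import csv
import io
import math

def compute_motion(log_lines):
    """Turn 'timestamp,x,y' lines into (timestamp, distance, velocity) rows."""
    rows = []
    previous = None
    for line in log_lines:
        line = line.strip()
        if not line:
            continue  # Skip blank lines
        t, x, y = (float(field) for field in line.split(","))
        if previous is not None:
            prev_t, prev_x, prev_y = previous
            # Straight-line (Euclidean) distance between consecutive points
            distance = math.hypot(x - prev_x, y - prev_y)
            elapsed = t - prev_t
            # Velocity is distance over elapsed time; guard against zero
            velocity = distance / elapsed if elapsed > 0 else 0.0
            rows.append((t, distance, velocity))
        previous = (t, x, y)
    return rows

def write_csv(rows, out_file):
    """Write the computed rows, with a header line, as CSV."""
    writer = csv.writer(out_file)
    writer.writerow(["timestamp", "distance", "velocity"])
    writer.writerows(rows)

# Hypothetical sample log: the target moves once, then sits still.
sample_log = ["0.0,10,10", "0.5,13,14", "1.0,13,14"]
motion_rows = compute_motion(sample_log)
buffer = io.StringIO()
write_csv(motion_rows, buffer)
```

In a real run the lines would come from the log file and the output would go to a file opened for writing; the in-memory buffer just keeps the sketch self-contained.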

Step 1: Detecting Motion

The Python script for motion detection was an interesting project all its own. Detecting movement was the first goal, and surprisingly, it was the easiest part. The OpenCV library (which has Python bindings) has all the functions you need to detect motion in a video feed.

To detect motion, we first create a running average of the incoming video frames, of the last ~0.25 seconds worth of frames. This “blurs” together a sliding time window in the video, so that the noise from the CCD (web cam) is blurred out and only significant motion stands out above the noise. OpenCV provides a function called RunningAvg() for this purpose. The running average is roughly equal to the previous one quarter second of video, all averaged together. Additionally, before we feed the video images into the running average, we apply a smoothing filter with OpenCV’s Smooth() function, to reduce pixel noise from the CCD and create more tolerance against false positives.

To search for motion, we then take the current camera frame and subtract it from the running average, using OpenCV’s AbsDiff() function. If there was no motion, then the resulting difference image will be pure black (that is, the equivalent of zero). However, if there was a fresh, new set of colors someplace in the frame (i.e., because something moved) then those colors will stand out against a black background of non-motion.

			# Smooth to get rid of false positives
			cv.Smooth( color_image, color_image, cv.CV_GAUSSIAN, 19, 0 )

			# Use the Running Average as the static background			
			# a = 0.020 leaves artifacts lingering way too long.
			# a = 0.320 works well at 320x240, 15fps.  (1/a is roughly num frames.)
			cv.RunningAvg( color_image, running_average_image, 0.320, None )

			# Convert the scale of the moving average.
			cv.ConvertScale( running_average_image, running_average_in_display_color_depth, 1.0, 0.0 )

			# Subtract the current frame from the moving average.
			cv.AbsDiff( color_image, running_average_in_display_color_depth, difference )

At this point, we have basically detected motion. If you display the difference image, it will appear pure black, unless something moves. Anything moving appears visible, while anything static quickly fades to black. (Watching this on video makes me feel like a Velociraptor.)
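
For readers who want to see the arithmetic behind those two calls, here is a minimal sketch using plain numpy arrays instead of OpenCV. The helper names and the tiny 4x4 test image are made up for illustration; the update rule, (1 - a) * average + a * frame, is the weighted accumulation that RunningAvg() performs.

```python
import numpy as np

def update_running_average(average, frame, alpha):
    """Blend the new frame into the average: (1 - alpha)*average + alpha*frame."""
    return (1.0 - alpha) * average + alpha * frame.astype(np.float32)

def difference_image(average, frame):
    """Per-pixel absolute difference between the frame and the background."""
    diff = np.abs(frame.astype(np.float32) - average)
    return diff.astype(np.uint8)

# A tiny 4x4 single-channel "video": a static black scene...
background = np.zeros((4, 4), dtype=np.uint8)
average = background.astype(np.float32)
for _ in range(10):
    average = update_running_average(average, background, alpha=0.320)

# ...then one bright pixel appears (something moved).
frame = background.copy()
frame[1, 2] = 200
diff = difference_image(average, frame)
```

Only the pixel that changed is non-zero in diff. If that pixel then stays put, each update pulls the average toward it and the difference fades, which is the "fades to black" effect described above.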

Step 2: Detecting Moving Blobs

Once we have a difference image, the next step is to find out where in the frame the motion is occurring. One method would be to look at every pixel of every frame in the difference image, and thus know which pixels are non-black (and hence, were part of the detected movement). I implemented that method with the help of the Numeric Python (numpy) library. Another method would be to determine the “blobs” of changed pixel space, and get the geometry of those blobs. I implemented that method using OpenCV’s FindContours() function. For the purpose of finding moving targets, the results were nearly identical between the two methods.

But before I could try either method, I had to do some more image processing on the difference image. I needed to change the partially-colored areas of the difference image (that is, the new movement areas) into pure white, so that I had a definitive image mask that is pure black for non-motion, and pure white for detected motion. This gives me a boolean yes-or-no answer to the question, “Was movement detected at this pixel?” This was done by converting the difference image to greyscale, and then applying a thresholding function that keeps the black pixels pure black, and converts any grey pixel to pure white. (At this stage, it also helps to apply another round of smoothing to eliminate the few random single-pixel “sparkles” that get through.)
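
The greyscale-and-threshold step can be sketched in numpy like this. This is an illustration, not the OpenCV code I used: the function name and the threshold value are assumptions, and the greyscale conversion here is a plain channel average rather than OpenCV's weighted conversion.

```python
import numpy as np

def motion_mask(difference_image, threshold=2):
    """Return a mask that is 255 where motion was detected and 0 elsewhere."""
    # Simple greyscale conversion: average the three colour channels.
    grey = difference_image.astype(np.float32).mean(axis=2)
    # Anything brighter than the threshold becomes pure white.
    return np.where(grey > threshold, 255, 0).astype(np.uint8)

# A 3x3 colour difference image that is black except for one pixel.
difference = np.zeros((3, 3, 3), dtype=np.uint8)
difference[0, 1] = (30, 10, 5)   # a faintly coloured "moved" pixel
mask = motion_mask(difference)
```

A small non-zero threshold, rather than zero, lets the faintest residual noise stay black.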

To find the “blobs” in the new black-and-white difference image, I simply used the OpenCV function FindContours(), followed by ApproxPoly() to get polygon coordinates. That returns a list of coordinates that represent polygons surrounding the white area(s). For the other method (examining every pixel of every frame), I converted the image object into a numpy array, and then used numpy’s where() function to find the coordinates of the white pixels efficiently.
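
As a small illustration of the numpy route, the sketch below pulls the white-pixel coordinates out of a mask. The 4x5 mask is made-up sample data, and the sketch is written for Python 3.

```python
import numpy as np

# A small black mask with two white (motion) pixels.
mask = np.zeros((4, 5), dtype=np.uint8)
mask[1, 3] = 255
mask[2, 0] = 255

# where() returns one array of row indices and one of column indices.
rows, cols = np.where(mask == 255)

# Pair them up as (x, y) points: x is the column, y is the row.
points = list(zip(cols.tolist(), rows.tolist()))
```

Note the swap: numpy indexes arrays as [row, column], so the column index becomes x and the row index becomes y.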

Step 3: Detecting Targets

Next, we want to use the white pixels (where motion was detected) to define an entity: a single moving object or person in the frame. These moving entities will become the targets that we track over time. This turned out to be the most difficult problem to solve.

One might (naively) think that each solid blob detected by FindContours() would represent a single entity in the video frame, but that is unfortunately not the case. When a person walks by, or moves their arm, or a gerbil moves, the motion shows up as chunks near the perimeter of the object. I like to call them “puddles”, because they dynamically splash around the area of motion. This happens because single-color objects, such as a red sweater, appear nearly the same in small regions. The red pixels in the middle of the red sweater fall into the running average, because red cloth looks just like the red cloth next to it, so the only place you can tell that the red cloth has moved is near the edge, where red stands out against the static background. Furthermore, when something stops moving and thus falls into the running average (becoming part of the static background), there is a short-lived explosion of small “puddles” around its perimeter, as it fades into the background.

Solving this problem for one entity was difficult enough. But we had two gerbils that we wanted to track simultaneously. Tracking multiple targets is considerably more difficult, because the software needs to distinguish between one, two, or zero moving gerbils. The solution I came up with was based on a combination of axis-aligned bounding boxes (AABBs), and the k-means algorithm.

AABBs are widely used in video games as a fast way to compute collision detection. But here I used AABBs to convert disconnected blobs around the perimeter of a moving object into a single entity that could be tracked. I found the AABB of each blob, and then merged all overlapping AABBs together recursively. This results in a box that grows in size until it encompasses all of the nearby “puddles” of white motion pixels into a single box.

# BBoxes must be in the format:
# ( (topleft_x, topleft_y), (bottomright_x, bottomright_y) )
top = 0
bottom = 1
left = 0
right = 1

def merge_collided_bboxes( bbox_list ):
	# For every bbox...
	for this_bbox in bbox_list:

		# Collision detect every other bbox:
		for other_bbox in bbox_list:
			if this_bbox is other_bbox: continue  # Skip self

			# Assume a collision to start out with:
			has_collision = True

			# These coords are in screen coords, so > means 
			# "lower than" and "further right than".  And < 
			# means "higher than" and "further left than".

			# We also inflate the box size by 10% to deal with
			# fuzziness in the data.  (Without this, there are many times a bbox
			# is short of overlap by just one or two pixels.)
			if (this_bbox[bottom][0]*1.1 < other_bbox[top][0]*0.9): has_collision = False
			if (this_bbox[top][0]*.9 > other_bbox[bottom][0]*1.1): has_collision = False

			if (this_bbox[right][1]*1.1 < other_bbox[left][1]*0.9): has_collision = False
			if (this_bbox[left][1]*0.9 > other_bbox[right][1]*1.1): has_collision = False

			if has_collision:
				# merge these two bboxes into one, then start over:
				top_left_x = min( this_bbox[left][0], other_bbox[left][0] )
				top_left_y = min( this_bbox[left][1], other_bbox[left][1] )
				bottom_right_x = max( this_bbox[right][0], other_bbox[right][0] )
				bottom_right_y = max( this_bbox[right][1], other_bbox[right][1] )

				new_bbox = ( (top_left_x, top_left_y), (bottom_right_x, bottom_right_y) )

				bbox_list.remove( this_bbox )
				bbox_list.remove( other_bbox )
				bbox_list.append( new_bbox )

				# Start over with the new list:
				return merge_collided_bboxes( bbox_list )

	# When there are no collisions between boxes, return that list:
	return bbox_list
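
The merge function expects each blob to already be reduced to a box. A minimal sketch of that step, assuming each blob arrives as a list of (x, y) polygon points (the helper name and sample polygon are hypothetical):

```python
def bbox_from_points(points):
    """Return ((left, top), (right, bottom)) enclosing all (x, y) points."""
    xs = [x for x, y in points]
    ys = [y for x, y in points]
    # In screen coordinates the smallest y is the top edge.
    return ((min(xs), min(ys)), (max(xs), max(ys)))

# One blob outline, as a list of polygon vertices.
polygon = [(12, 40), (30, 35), (25, 60)]
box = bbox_from_points(polygon)
```

The result is in the ((topleft_x, topleft_y), (bottomright_x, bottomright_y)) format described in the comment at the top of the listing above.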

But AABBs alone were not quite good enough for our gerbil tracker. Gerbils have long, narrow tails that would confuse the simple AABB method described above, and the “puddles” of detected motion are too unstable; there are many times an AABB is detected on the outlying perimeter, and thus, does not get properly merged. So I pulled out the heavy artillery and used the k-means clustering algorithm. K-means takes as input a bunch of points, and then groups each point to its nearest natural cluster. It is usually used to find patterns in statistical data. In this case, instead of grouping data points into clusters, I used it to group motion pixels into targets.
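
To make the idea concrete, here is a bare-bones k-means (Lloyd's algorithm) in plain Python 3.8+. It is a sketch for illustration only, not the library routine I used; the function name, starting centroids, and sample pixels are all assumptions.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Basic k-means (Lloyd's algorithm) on 2-D points.

    `centroids` is the starting guess; its length sets the number of clusters.
    """
    centroids = list(centroids)
    for _ in range(iterations):
        # Assignment step: group each point with its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            distances = [math.dist(point, c) for c in centroids]
            clusters[distances.index(min(distances))].append(point)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                mean_x = sum(x for x, _ in cluster) / len(cluster)
                mean_y = sum(y for _, y in cluster) / len(cluster)
                new_centroids.append((mean_x, mean_y))
            else:
                new_centroids.append(old)  # Keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # Converged
        centroids = new_centroids
    return centroids

# Two clumps of motion pixels, one per gerbil.
pixels = [(10, 10), (12, 10), (10, 12), (12, 12),
          (80, 50), (82, 50), (80, 52), (82, 52)]
targets = kmeans(pixels, centroids=[(0, 0), (100, 100)])
```

Each returned centroid sits at the middle of one clump of motion pixels, and that centroid is what gets tracked as a target.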

One major drawback of k-means is that you need to tell it how many clusters (or in this case, targets) you are looking for. For the gerbil tracker, this wasn’t a big issue; we have (had) two gerbils, so I told k-means we were looking for two clusters. But I wanted a more generalized solution, one which could detect new objects moving into the frame and automatically start targeting them. So I combined the AABBs with k-means. First, I use the number of recursively merged AABBs to get an estimate of the number of targets. The number of AABBs tells me how many non-overlapping regions of motion are in the frame, which is a rough approximation of how many individual entities there are. Then the k-means function gives a more accurate analysis of the actual number of clusters that were found.

Other Notes

The above description is a somewhat simplified explanation. There are a handful of tricks that I used to make the detection more reliable.

  • The estimate using AABBs only happens when new targets are detected. If a moving target is persistent from one frame to the next, I just pass k-means the list of centroids it gave me on the last frame, which provides more consistent (and reliable) results.
  • When building the AABBs, I filter out any boxes that are less than 10% the size of the average box size. This eliminates tiny AABBs that appear around things like fingers or hair.
  • I clip the rate at which new targets can appear or disappear, which eliminates a lot of boxes popping in and out of existence when targets stop moving.
  • I inflate the size of the AABBs by 10% when doing the merging, because otherwise, there are a lot of small boxes that incorrectly miss merging by just one or two pixels.
  • When using numpy’s where() function to generate pixel coordinates, I had to convert the output format to the one expected by kmeans using zip().
  • There were several variables that needed experimentation and tweaking to provide good results. These include the alpha argument to OpenCV’s RunningAvg() function, and the options for Smooth(). The values I came up with may be unique to my camera.
  • I tested feeding both the numpy pixel coordinates and the (much shorter) list of polygon coordinates to k-means. The results were nearly identical. There is less processing for k-means to do on the polygon coordinates, but at the resolutions I was using (640×480 and 320×240) there was no perceptible performance difference between the two.
  • I did run into a couple of hardware limitations. When recording a video, the MPEG encoding uses up enough CPU to make my laptop drop frames. That’s why the frame rate in the sample video above is inconsistent – if I’m not saving a video, I get a solid 15 fps with medium CPU load. Also, my Logitech web cam has built-in white balance correction. This feature sometimes causes full-screen flashes of “motion”. In normal usage, where the camera is fixed, this is not an issue.
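
As an example of one of these tricks, the small-box filter might look something like the sketch below. It reads "size" as box area, which is an assumption, and the function names and sample boxes are made up for illustration.

```python
def box_area(bbox):
    """Area of a ((left, top), (right, bottom)) box."""
    (left, top), (right, bottom) = bbox
    return (right - left) * (bottom - top)

def drop_tiny_boxes(bbox_list, min_fraction=0.10):
    """Discard boxes smaller than min_fraction of the average box area."""
    if not bbox_list:
        return []
    average_area = sum(box_area(b) for b in bbox_list) / len(bbox_list)
    return [b for b in bbox_list if box_area(b) >= min_fraction * average_area]

# Two gerbil-sized boxes and one tiny stray box.
boxes = [((0, 0), (100, 100)), ((200, 50), (280, 150)), ((150, 10), (155, 14))]
kept = drop_tiny_boxes(boxes)
```

Here the two large boxes survive and the 5x4 stray box is dropped.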

I noticed that other people doing motion detection were using some different ideas. One blogger used the OpenCV functions Erode() and Dilate() to blob the motion together, but I found this method to be too coarse for gerbil detection. Another blogger used color tracking (looking at red pixels only), which would have required attaching some kind of marker to the gerbils.

Finally, I discovered that between Ubuntu 10.04 and 10.10, the included numpy library underwent an API change that broke the where() function. The computer we used for tracking the gerbils was running Ubuntu 10.10, so I was forced to use the polygon coordinates instead of pixel coordinates (which, luckily, was almost as accurate as using the pixel coordinates directly).

Further Work

I’ve played around with a few other algorithms in an effort to improve the reliability of the tracking. I tried using the AABBs as the primary entity, using k-means only to find the centroid within the box. However, that turned out to be far less accurate than letting k-means find the centroids over the entire image.

As a separate project, I also added face detection, using OpenCV’s Haar classifier. This worked amazingly well, even for low resolutions and small faces that were off in the distance. However, the Haar classifier is computationally very expensive, and it reduced my frame rate to about 10 frames per second on my Intel(R) Core(TM)2 Duo CPU T7800 @ 2.60GHz. I think face detection could make a useful addition anywhere tracking people was desired; it gives a rough idea of scale (and hence, distance) to the moving targets. It would be fun to see how much functionality of the Xbox Kinect could be replicated, using only motion detection plus scale analysis on a standard web cam.

Finally, I added the ability to process raw video from a file, in addition to input from the camera, and also the ability to save the processed video stream to a file. That is how I created the videos you see here.


This project implemented motion tracking using a web cam and a low-power laptop computer. The project took about a week of research, plus a week of development. It proved to be reliable enough for a scientific research project with gerbils. The main error we encountered occurred when the gerbils were moving very near each other. We compensated for this shortcoming by averaging the motion of both gerbils in our data analysis.
