A Kinect hack that performs body detection in real-time and cuts an individual person from the Kinect video feed, distorts them using GLSL shaders and pastes them back into the image using OpenGL multitexturing, blending them seamlessly with other people in the image.

It’s a straightfoward concept, but the possibilities are endless. Pixelate your naked body and taunt your boyfriend over video chat. Turn yourself into a “hologram” and tell the people around you that you’ve come from the future and demand beer. Using only your Kinect and a pile of GLSL shaders, you can create a wide array of effects.

This hack relies on the PrimseSense framework, which provides the scene analysis and body detection algorithms used in the XBox. I initially wrote my own blob-detection code for use in this project, but it was slow and placed constraints on the visualization. It required that people’s bodies intersected the bottom of the frame, and it could only detect the front-most person. It assumed that the user could be differentiated from the background in the depth image, and it barely pulled 30 fps. After creating implementations in both Processing (for early tests) and OpenFrameworks (for better performance), I stumbled across this video online: The video shows the PrimeSense framework tracking several people in real-time, providing just the kind of blob identification I was looking for. Though PrimeSense was originally licensed to Microsoft for a hefty fee, it’s since become open-source and I was able to download and compile the library off the PrimeSense website. Their examples worked as expected, and I was able to get the visualization up and running on top of their high-speed scene analysis algorithm in no time.

However, once things were working in PrimeSense, there was still a major hurdle. I wanted to use the depth image data as a mask for the color image and “cut” a person from the scene. However, the depth and color cameras on the Kinect aren’t perfectly calibrated and the images don’t overlap. The depth camera is to the right of the color camera, and they have different lens properties. It’s impossible to assume that pixel (10,10) in the color image represents the same point in space as pixel (10, 10) in the depth image. Luckily, Max Hawkins let me know that OpenNI can be used to perform corrective distortions, aligning the image from the Kinect’s color camera with the image from the depth camera and adjusting for the lens properties of the device. Luckily, OpenNI performs all of the adjustments necessary to perfectly overlay one image on the other. I struggled for days to get it to work, but Max was a tremendous help and pointed me toward these five lines of code, buried deep inside one of the sample projects (and commented out!)

1
2
3
4
5
6
7
     // Align depth and image generators
     printf("Trying to set alt. viewpoint");
     if( g_DepthGenerator.IsCapabilitySupported(XN_CAPABILITY_ALTERNATIVE_VIEW_POINT) )
     {
         printf("Setting alt. viewpoint");         g_DepthGenerator.GetAlternativeViewPointCap().ResetViewPoint();
         if( g_ImageGenerator ) g_DepthGenerator.GetAlternativeViewPointCap().SetViewPoint( g_ImageGenerator );
     }

Alignment problem, solved. After specifying an alternative view point, I was able to mask the color image with a blob from the depth image and get the color pixels for the users’ body. Next step, distortion! Luckily, I started this project with a fair amount of OpenGL experience. I’d never worked with shaders, but I found them pretty easy to pick up and pretty fun (since they can be compiled at run-time, it was easy to write and test the shaders iteratively!) I wrote shaders that performed pixel averaging and used sine functions to re-map texcoords in the cut-out image, producing interesting wave-like effects and blockiness. I’m no expert, and I think these shaders could be improved quite a bit by using multiple passes and optimizing the order of operations.

Since many distortions and image effects turn the user transparent or move their body parts, I found that it was important to fill in the pixels behind the user in the image. I accomplished this using a “deepest-pixels” buffer that keeps track of the furthest color at each pixel in the image. These pixels are substituted in where the image is cut out, and updated anytime deeper pixels are found.

Here’s a complete breakdown of the image analysis process:

The color and depth images are read off the Kinect. OpenNI is used to align the depth and color images, accounting for the slight difference in the lenses and placement that would otherwise cause the pixels in the depth image to be misaligned with pixels in the color image.
The depth image is run through the PrimeSense Scene Analyzer, which provides an additional channel of data for each pixel in the depth buffer, identifying it as a member of one or more unique bodies in the scene. In the picture at left, these are rendered in red and blue.
One of the bodies is selected and the pixels are cut from the primary color buffer into a separate texture buffer.
The depth of each pixel in the remaining image is compared to the furthest known depth, and deeper pixels are copied into a special “most-distant” buffer. This buffer contains the RGB color of the furthest pixel at each point in the scene, effectively keeping a running copy of the scene background.
The pixels in the body are replaced using pixels from the “most-distant” buffer to effectively erase the individual from the scene.
A texture is created from the cut-out pixels and passed into a GLSL shader along with the previous image.
The GLSL shader performs distortions and other effects on the cut-out image before recompositing it onto the background image to produce the final result.
Final result!

Here’s a video of the Kinect Fun House Mirror at the IACD 2011 Showcase: