Posts from the “Portfolio” Category

Kinect Fun House Mirror


A Kinect hack that performs body detection in real time, cuts an individual person out of the Kinect video feed, distorts them using GLSL shaders, and pastes them back into the image using OpenGL multitexturing, blending them seamlessly with the other people in the scene.

It’s a straightforward concept, but the possibilities are endless. Pixelate your naked body and taunt your boyfriend over video chat. Turn yourself into a “hologram” and tell the people around you that you’ve come from the future and demand beer. Using only your Kinect and a pile of GLSL shaders, you can create a wide array of effects.

This hack relies on the PrimeSense framework, which provides the scene analysis and body detection algorithms used in the Xbox. I initially wrote my own blob-detection code for use in this project, but it was slow and placed constraints on the visualization. It required that people’s bodies intersect the bottom of the frame, and it could only detect the front-most person. It assumed that the user could be differentiated from the background in the depth image, and it barely pulled 30 fps. After creating implementations in both Processing (for early tests) and OpenFrameworks (for better performance), I stumbled across this video online:

The video shows the PrimeSense framework tracking several people in real-time, providing just the kind of blob identification I was looking for. Though PrimeSense was originally licensed to Microsoft for a hefty fee, it’s since become open-source, and I was able to download and compile the library from the PrimeSense website. Their examples worked as expected, and I was able to get the visualization up and running on top of their high-speed scene analysis algorithm in no time.

However, once things were working in PrimeSense, there was still a major hurdle. I wanted to use the depth image data as a mask for the color image and “cut” a person from the scene, but the depth and color cameras on the Kinect aren’t perfectly calibrated and the images don’t overlap. The depth camera is to the right of the color camera, and the two have different lens properties, so pixel (10, 10) in the color image doesn’t represent the same point in space as pixel (10, 10) in the depth image. Luckily, Max Hawkins let me know that OpenNI can perform the corrective distortions needed to align the image from the Kinect’s color camera with the image from the depth camera, adjusting for the lens properties of the device so that one image can be overlaid perfectly on the other. I struggled for days to get it to work, but Max was a tremendous help and pointed me toward these five lines of code, buried deep inside one of the sample projects (and commented out!)

    // Align depth and image generators
    printf("Trying to set alt. viewpoint");
    if( g_DepthGenerator.IsCapabilitySupported(XN_CAPABILITY_ALTERNATIVE_VIEW_POINT) )
    {
        printf("Setting alt. viewpoint");
        g_DepthGenerator.GetAlternativeViewPointCap().ResetViewPoint();
        if( g_ImageGenerator ) g_DepthGenerator.GetAlternativeViewPointCap().SetViewPoint( g_ImageGenerator );
    }

Alignment problem, solved. After specifying an alternative view point, I was able to mask the color image with a blob from the depth image and get the color pixels for the users’ body. Next step, distortion! Luckily, I started this project with a fair amount of OpenGL experience. I’d never worked with shaders, but I found them pretty easy to pick up and pretty fun (since they can be compiled at run-time, it was easy to write and test the shaders iteratively!) I wrote shaders that performed pixel averaging and used sine functions to re-map texcoords in the cut-out image, producing interesting wave-like effects and blockiness. I’m no expert, and I think these shaders could be improved quite a bit by using multiple passes and optimizing the order of operations.
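
As a rough illustration (not the project’s actual code), a sine-based texcoord remapping shader of the kind described above might look like the sketch below, with the GLSL embedded as a C++ string; the uniform names and constants are assumptions.

    // Hypothetical sketch of a wave-distortion fragment shader, embedded as a
    // C++ string the way run-time-compiled shaders often are. Names are illustrative.
    #include <string>

    const std::string kWaveFragmentShader = R"(
        uniform sampler2D cutoutTex;   // the cut-out body texture
        uniform float time;            // advances each frame to animate the wave
        uniform float amplitude;       // horizontal displacement, in texcoord units
        uniform float frequency;       // number of waves across the image

        void main() {
            vec2 uv = gl_TexCoord[0].st;
            // Re-map the x texcoord with a sine of the y coordinate, producing
            // the wave-like effect described above.
            uv.x += amplitude * sin(uv.y * frequency * 6.28318 + time);
            gl_FragColor = texture2D(cutoutTex, uv);
        }
    )";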

Since many distortions and image effects turn the user transparent or move their body parts, I found that it was important to fill in the pixels behind the user in the image. I accomplished this using a “deepest-pixels” buffer that keeps track of the furthest color at each pixel in the image. These pixels are substituted in where the image is cut out, and updated anytime deeper pixels are found.
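
A rough sketch of that “deepest-pixels” update, using plain vectors and assumed names rather than the project’s real buffer types:

    // Keep the color seen at the farthest depth ever observed at each pixel;
    // this becomes the background used to fill the hole left by a cut-out body.
    // Assumes deepestDepth starts zero-filled and all buffers share one resolution.
    #include <cstdint>
    #include <vector>

    struct RGB { uint8_t r, g, b; };

    void updateBackgroundCache(const std::vector<uint16_t>& depth,     // depth frame, mm per pixel
                               const std::vector<RGB>& color,          // aligned color frame
                               std::vector<uint16_t>& deepestDepth,    // running max depth per pixel
                               std::vector<RGB>& deepestColor)         // color at that max depth
    {
        for (size_t i = 0; i < depth.size(); ++i) {
            if (depth[i] != 0 && depth[i] > deepestDepth[i]) {   // 0 = no Kinect reading
                deepestDepth[i] = depth[i];
                deepestColor[i] = color[i];
            }
        }
    }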

Here’s a complete breakdown of the image analysis process:

  1. The color and depth images are read off the Kinect.
  2. OpenNI is used to align the depth and color images, accounting for the slight difference in lens properties and placement that would otherwise leave the pixels in the depth image misaligned with the pixels in the color image.
  3. The depth image is run through the PrimeSense scene analyzer, which provides an additional channel of data for each pixel in the depth buffer, identifying it as a member of one of the unique bodies in the scene. In the picture at left, these are rendered in red and blue.
  4. One of the bodies is selected, and its pixels are cut from the primary color buffer into a separate texture buffer.
  5. The depth of each pixel in the remaining image is compared to the furthest known depth, and deeper pixels are copied into a special “most-distant” buffer. This buffer contains the RGB color of the furthest pixel seen at each point in the scene, effectively keeping a running copy of the scene background.
  6. The pixels in the body are replaced with pixels from the “most-distant” buffer, effectively erasing the individual from the scene.
  7. A texture is created from the cut-out pixels and passed into a GLSL shader along with the background image.
  8. The GLSL shader performs distortions and other effects on the cut-out image before recompositing it onto the background image to produce the final result.
Final result!
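
To make steps 4 through 6 of the breakdown concrete, here is a condensed, hypothetical sketch of the cut-and-fill pass; the names, types, and buffer layout are assumptions rather than the project’s actual code.

    // Cut the selected user's pixels into their own buffer and patch the hole
    // with the cached "most-distant" background.
    #include <cstdint>
    #include <vector>

    struct RGB { uint8_t r, g, b; };

    void cutOutUser(uint16_t targetLabel,
                    const std::vector<uint16_t>& userLabels,   // per-pixel body IDs from scene analysis
                    std::vector<RGB>& colorFrame,              // aligned color image, modified in place
                    const std::vector<RGB>& backgroundCache,   // "most-distant" pixels
                    std::vector<RGB>& cutout)                  // pre-sized output texture buffer
    {
        for (size_t i = 0; i < userLabels.size(); ++i) {
            if (userLabels[i] == targetLabel) {
                cutout[i] = colorFrame[i];          // copy the body pixel into its own texture
                colorFrame[i] = backgroundCache[i]; // erase the body from the scene
            } else {
                cutout[i] = RGB{0, 0, 0};           // outside the body (an RGBA buffer would use alpha here)
            }
        }
    }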

Here’s a video of the Kinect Fun House Mirror at the IACD 2011 Showcase:

GLSL & The Kinect – Part 2

For the last couple of weeks, I’ve been working on a Kinect hack that performs body detection and extracts individuals from the scene, distorts them using GLSL shaders, and pastes them back into the scene using OpenGL multitexturing. The concept is relatively straightforward. Blob detection on the depth image determines the pixels that are part of each individual. The color pixels within each body are copied into a texture, and the non-interesting parts of the image are copied into a second background texture. Since distortions are applied to bodies in the scene, the holes in the background image need to be filled. To accomplish this, the most distant pixel at each point is cached from frame to frame and substituted in when the body blobs are cut out.

It’s proved difficult to pull out the bodies in color. Because the depth camera and the color camera in the Kinect do not align perfectly, using a depth image blob as a mask for the color image does not work. On my Kinect, the mask region was off by more than 15 pixels, and color pixels flagged as belonging to a blob might actually be part of the background.

To fix this, Max Hawkins pointed me in the direction of a Cinder project which used OpenNI to correct the perspective of the color image to match the depth image. Somehow, that impressive feat of computer imaging is accomplished with these five lines of code:

    // Align depth and image generators
    printf("Trying to set alt. viewpoint");
    if( g_DepthGenerator.IsCapabilitySupported(XN_CAPABILITY_ALTERNATIVE_VIEW_POINT) )
    {
        printf("Setting alt. viewpoint");
        g_DepthGenerator.GetAlternativeViewPointCap().ResetViewPoint();
        if( g_ImageGenerator ) g_DepthGenerator.GetAlternativeViewPointCap().SetViewPoint( g_ImageGenerator );
    }

I hadn’t used Cinder before, but I decided to migrate the project to it since it seemed to be a much more natural environment for GLSL shaders. Unfortunately, the Kinect OpenNI drivers in Cinder seemed to be crap compared to the ones in OpenFrameworks et al. The console often reported that the “depth buffer size was incorrect” and that the “depth frame is invalid”. Onscreen, the image from the camera flashed, and frames occasionally appeared misaligned or half missing.

I continued fighting with Cinder until last night, when at 10PM I found this video in an online forum:

This video is intriguing because it shows real-time detection and unique identification of multiple people with no configuration. AKA it’s hot shit. It turns out the video was made with PrimeSense, the technology used for hand / gesture / person detection on the Xbox.

I downloaded PrimeSense and compiled the samples. Behavior in the above video achieved. The scene analysis code is incredibly fast and highly robust. It kills the blob detection code I wrote performance-wise, and it doesn’t require that people’s legs intersect with the bottom of the frame (the technique I was using assumed that the nearest blob intersecting the bottom of the frame was the user).

I re-implemented the project on top of the PrimeSense sample in C++. I migrated the depth + color alignment code over from Cinder, built a background cache, and rebuilt the display on top of a GLSL shader. Since I was just using Cinder to wrap OpenGL shaders, I decided it wasn’t worth linking it into the sample code. It’s 8 source files and it compiles on the command line. It was ungodly fast. I was in love.

Rather than apply an effect to all the individuals in the scene, I decided it was more interesting to distort one. Since the PrimeSense library assigns each blob a unique identifier, this was an easy task. The video below shows the progress so far. Unfortunately, it doesn’t show off the frame rate, which is a cool 30 or 40fps.

My next step is to try to improve the edge of the extracted blob and create more interesting shaders that blur someone in the scene or convert them to “8-bit”. Stay tuned!

Generative Art in Processing

I threw around a lot of ideas for this assignment. I wanted to create a generative art piece that was static and large–something that could be printed on canvas and placed on a wall. I also wanted to revisit the SMS dataset I used in my first assignment, because I felt I hadn’t sufficiently explored it. I eventually settled on modeling something after this “Triangles” piece on OpenProcessing. It seemed relatively simple and it was very abstract.

I combined the concept from the Triangles piece with code that scored characters in a conversation based on the likelihood that they would follow the previous characters. This was accomplished by generating a Markov chain and a character frequency table using combinations of two characters pulled from the full text of 2,500 text messages. The triangles generated to represent the conversation were colorized so that more likely characters were shown inside brighter triangles.
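
The scoring works roughly like the sketch below. The piece itself was written in Processing, so this C++ fragment is only an illustration of the idea, with assumed names.

    // Bigram model: count how often each character follows another, then score
    // a character by the fraction of times it followed the previous one.
    #include <map>
    #include <string>
    #include <utility>

    struct BigramModel {
        std::map<std::pair<char, char>, int> pairCounts; // how often `next` follows `prev`
        std::map<char, int> charCounts;                  // how often `prev` appears at all

        void train(const std::string& corpus) {          // e.g. the full text of the SMS messages
            for (size_t i = 0; i + 1 < corpus.size(); ++i) {
                pairCounts[{corpus[i], corpus[i + 1]}]++;
                charCounts[corpus[i]]++;
            }
        }

        // Likelihood that `next` follows `prev`; this value drives the triangle brightness.
        double score(char prev, char next) const {
            auto p = pairCounts.find({prev, next});
            auto c = charCounts.find(prev);
            if (p == pairCounts.end() || c == charCounts.end()) return 0.0;
            return static_cast<double>(p->second) / c->second;
        }
    };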

Process:

I started by printing out part of an SMS conversation, with each character drawn within a triangle. The triangles were colorized based on whether the message was sent or received, and the individual letter brightnesses were modulated based on the likelihood that the characters would be adjacent to each other in a typical text message.

In the next few revisions, I decided to move away from simple triangles and make each word in the conversation a single unit. I also added some code that seeds the colors used in the visualization based on properties of the conversation, such as its length.

Final output:

Kinect & GLSL Shaders = Fun!

I’m revisiting the Kinect for my final project. I’m separating the background of an image from the foreground and using OpenGL GLSL Multitexturing Shaders to apply effects to the foreground.

GLSL shaders work in OpenFrameworks, which is cool. However, there’s a trick that took me about three days to find. By default, ofTextures use an OpenGL extension that allows for non-power-of-two textures. Even if you use a power-of-two texture, the extension is still enabled and the texture is allocated in a way that can’t be referenced from a GLSL sampler2D. FML.
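
For reference, one way to avoid the problem in openFrameworks is to disable ARB rectangle textures before allocating, so the texture is a plain GL_TEXTURE_2D that a sampler2D can reference; a minimal sketch along those lines (not necessarily the exact fix used here):

    // Minimal sketch: allocate a foreground texture that GLSL can sample.
    #include "ofMain.h"

    void setupForegroundTexture(ofTexture& tex, int w, int h) {
        ofDisableArbTex();            // use GL_TEXTURE_2D instead of GL_TEXTURE_RECTANGLE_ARB
        tex.allocate(w, h, GL_RGB);   // now referenceable from a GLSL sampler2D
        ofEnableArbTex();             // restore the openFrameworks default afterwards
    }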

The first GLSL shader I wrote distorted the foreground texture layer on the Y axis using a sine wave to adjust the image fragments that are mapped onto a textured quad.

I wrote another shader that blurs the foreground texture using texel averaging. You can see that the background is unaffected by the filter!

Looking Outwards – Final Project

I’m still tossing around ideas for my final project, but I’d like to do more experimentation with the Kinect. Specifically, I think it’d be fun to do some high-quality background subtraction and separate the user from the rest of the scene. I’d like to create a hack in which the user’s body is distorted by a fun house mirror, while the background in the scene remains entirely unaffected. Other tricks, such as pixelating the user’s body or blurring it while keeping everything else intact, could also be fun. The basic idea seems manageable, and I think I’d have some time left over to polish it and add a number of features. I’d like to draw on the auto-calibration code I wrote for my previous Kinect hack so that it’s easy to walk up and interact with the “circus mirror.”

I’ve been searching for about an hour, and it doesn’t look like anyone has done selective distortion of the RGB camera image off the Kinect. I’m thinking something like this:

Imagine how much fun those Koreans would be having if the entire scene looked normal except for their stretched friend. It’s crazy mirror 2.0.

I think background subtraction (and then subsequent filling) would be important for this sort of hack, and it looks like progress has been made on this in OpenFrameworks. The video below shows someone cutting themselves out of the Kinect depth image and then hiding everything else in the scene.

To achieve the distortion of the user’s body, I’m hoping to do some low-level work in OpenGL. I’ve done some research in this area and it looks like using a framebuffer and some bump mapping might be a good approach. This article suggests using the camera image as a texture and then mapping it onto a bump mapped “mirror” plane:

Circus mirror and lens effects. Using a texture surface as the rendering target, render a scene (or a subset thereof) from the point of view of a mirror in your scene. Then use this rendered scene as the mirror’s texture, and use bump mapping to perturb the reflection/refraction according to the values in your bump map. This way the mirror could be bent and warped, like a funhouse mirror, to distort the view of the scene.

At any rate, we’ll see how it goes! I’d love to get some feedback on the idea. It seems like something I could get going pretty quickly, so I’m definitely looking for possible extensions / features that might make it more interesting!

Kinect Hand-tracking Visualization

What if you could use hand gestures to control an audio visualization? Instead of relying on audio metrics like frequency and volume, you could base the visualization on the user’s interpretation of perceivable audio qualities. The end result would be a better reflection of the way that people feel about music.

To investigate this, I wrote an OpenFrameworks application that uses depth data from the Kinect to identify hands in a scene. The information about the users’ hands – position, velocity, heading, and size – is used to create an interactive visualization with long-exposure motion trails and particle effects.

There were a number of challenges in this project. I started with Processing, but it was too slow to extract hands and render the point sprite effects I wanted. I switched to OpenFrameworks and started using OpenNI to extract a skeleton from the Kinect depth image. OpenNI worked well and extracted a full skeleton with wrists that could be tracked, but it was difficult to test because the skeletal detection took nearly a minute every time I ran the visualization. It got frustrating pretty quickly, and I decided to do hand detection manually.

Detecting Hands in the Depth Image

I chose a relatively straightforward approach to finding hands in the depth image. I made three significant assumptions that made realtime detection possible:

  1. The user’s body intersects the bottom of the frame.
  2. The user is the closest thing in the scene.
  3. The user’s hands are extended (at least slightly) in front of their body.

Assumption 1 is important because it allows for automatic depth thresholding. By assuming that the user intersects the bottom of the frame, we can scan the bottom row of depth pixels to determine the depth of the user’s body. The hand detection then ignores anything further away than the user.
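
A minimal sketch of that thresholding step, with the buffer layout and names assumed:

    // Scan the bottom row of the depth image to find the user's body depth and
    // return a cutoff; anything deeper is ignored by the hand detection.
    #include <cstdint>
    #include <limits>
    #include <vector>

    uint16_t computeDepthThreshold(const std::vector<uint16_t>& depth,  // row-major depth frame
                                   int width, int height,
                                   uint16_t margin)                     // extra depth allowed behind the body
    {
        uint16_t nearest = std::numeric_limits<uint16_t>::max();
        const int bottomRow = (height - 1) * width;
        for (int x = 0; x < width; ++x) {
            uint16_t d = depth[bottomRow + x];
            if (d != 0 && d < nearest) nearest = d;   // 0 = no Kinect reading
        }
        // Everything deeper than the body (plus a small margin) counts as background.
        return nearest + margin;
    }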

Assumptions 2 and 3 are important for the next step in the process. The application looks for local minima in the depth image and identifies the points nearest the camera. It then uses a breadth-first search to repeatedly expand each blob to neighboring points and find the boundaries of the hands. Each pixel is scored based on its depth and distance from the source. Pixels that are scored as part of one hand cannot be scored as part of another, which prevents nearby points in the same hand from generating multiple blobs.
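
Sketched below is roughly how that breadth-first blob growth could look; the scoring here is a simplification of what the app actually does, and every name and threshold is an assumption.

    // Grow a hand blob outward from a local depth minimum, claiming pixels so
    // no other hand can reuse them.
    #include <cstdint>
    #include <cstdlib>
    #include <queue>
    #include <vector>

    std::vector<int> growHandBlob(const std::vector<uint16_t>& depth,  // row-major depth frame
                                  std::vector<bool>& claimed,          // pixels already assigned to a hand
                                  int width, int height,
                                  int seed,                            // index of a local depth minimum
                                  int depthTolerance,                  // how far from the seed depth a pixel may be
                                  int maxRadius)                       // rough hand-size limit, in pixels
    {
        std::vector<int> blob;
        std::queue<int> frontier;
        frontier.push(seed);
        claimed[seed] = true;

        const int seedX = seed % width, seedY = seed / width;
        const int seedDepth = depth[seed];

        while (!frontier.empty()) {
            int idx = frontier.front();
            frontier.pop();
            blob.push_back(idx);

            const int x = idx % width, y = idx / width;
            const int dx[4] = { -1, 1, 0, 0 };
            const int dy[4] = { 0, 0, -1, 1 };
            for (int k = 0; k < 4; ++k) {
                int nx = x + dx[k], ny = y + dy[k];
                if (nx < 0 || nx >= width || ny < 0 || ny >= height) continue;
                int n = ny * width + nx;
                if (claimed[n] || depth[n] == 0) continue;   // 0 = no Kinect reading

                // Stay near the seed in both depth and image space so one hand
                // doesn't bleed into the forearm or a neighboring hand.
                bool closeInDepth = std::abs((int)depth[n] - seedDepth) < depthTolerance;
                bool closeInImage = std::abs(nx - seedX) < maxRadius &&
                                    std::abs(ny - seedY) < maxRadius;
                if (closeInDepth && closeInImage) {
                    claimed[n] = true;
                    frontier.push(n);
                }
            }
        }
        return blob;
    }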

Interpreting Hands

Once pixels in the depth image have been identified as hands, a bounding box is created around each one. The bounding boxes are compared to those found in the previous frame and matched together, so that the user’s two hands are tracked separately.

Once each blob has been associated with the left or right hand, the algorithm determines the heading, velocity and acceleration of the hand. This information is averaged over multiple frames to eliminate noise.

Long-Exposure Motion Trails

The size and location of each hand are used to extend a motion trail from the user’s hand. The motion trail is stored in an array. Each point in the trail has an X and Y position and a size. To render the motion trail, overlapping, alpha-blended point sprites are drawn along the entire length of the trail. A Catmull-Rom spline algorithm is used to interpolate between the points in the trail and create a smooth path. Though it might seem best to append a point to the motion trail every frame, this tends to cause noise. In the version below, a point is added to the trail every three frames. This increases the distance between the points in the trail and allows for more smoothing using Catmull-Rom interpolation.
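
For reference, the Catmull-Rom blend between four consecutive trail points looks like this; the Point type and the way segments are sampled are assumptions, not the project’s code.

    // Standard Catmull-Rom interpolation: t runs from 0 (at p1) to 1 (at p2),
    // with p0 and p3 shaping the curve's entry and exit.
    struct Point { float x, y; };

    Point catmullRom(const Point& p0, const Point& p1,
                     const Point& p2, const Point& p3, float t)
    {
        float t2 = t * t, t3 = t2 * t;
        auto blend = [&](float a, float b, float c, float d) {
            return 0.5f * ((2.0f * b) +
                           (-a + c) * t +
                           (2.0f * a - 5.0f * b + 4.0f * c - d) * t2 +
                           (-a + 3.0f * b - 3.0f * c + d) * t3);
        };
        return Point{ blend(p0.x, p1.x, p2.x, p3.x),
                      blend(p0.y, p1.y, p2.y, p3.y) };
    }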

Hand Centers

One of the early problems with the hand tracking code was that the centers of the blob bounding boxes were used as the input to the motion trails. When the user held up their forearm perpendicular to the camera, the entire length of the arm was recognized as a hand. To better determine where the center of the hand was, I wrote a midpoint finder based on iterative erosion of the blobs. This provided much more accurate hand centers for the motion trails.
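
A rough sketch of an erosion-based midpoint finder along these lines; the stopping condition and names are assumptions.

    // Repeatedly peel away the blob's boundary pixels until only a small core
    // remains, then return the centroid of the core as the hand center.
    #include <vector>

    struct Pixel { int x, y; };

    Pixel erodeToCenter(std::vector<std::vector<bool>> mask)   // blob mask, mask[y][x], copied so we can erode it
    {
        if (mask.empty() || mask[0].empty()) return Pixel{0, 0};
        const int h = (int)mask.size(), w = (int)mask[0].size();
        auto isSet = [&](int x, int y) {
            return x >= 0 && x < w && y >= 0 && y < h && mask[y][x];
        };

        while (true) {
            // Boundary pixels are blob pixels with at least one neighbor outside the blob.
            std::vector<Pixel> boundary;
            int remaining = 0;
            for (int y = 0; y < h; ++y)
                for (int x = 0; x < w; ++x)
                    if (mask[y][x]) {
                        ++remaining;
                        if (!isSet(x - 1, y) || !isSet(x + 1, y) ||
                            !isSet(x, y - 1) || !isSet(x, y + 1))
                            boundary.push_back({x, y});
                    }

            // Stop before the blob disappears entirely; the survivors form the core.
            if (remaining == 0 || boundary.size() >= (size_t)remaining) break;
            for (const Pixel& p : boundary) mask[p.y][p.x] = false;
        }

        long sx = 0, sy = 0;
        int count = 0;
        for (int y = 0; y < h; ++y)
            for (int x = 0; x < w; ++x)
                if (mask[y][x]) { sx += x; sy += y; ++count; }
        return count ? Pixel{ (int)(sx / count), (int)(sy / count) } : Pixel{0, 0};
    }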

Particle Effects

After the long-exposure motion trails were working properly, I decided that more engaging visuals were needed to create a compelling visualization. It seemed like particles would be a good solution because they could augment the feeling of motion created by the user’s gestures. Particles are created when the hand blobs are in motion, and more particles are created based on the hand velocity. The particles stream off the motion trail in the direction of motion, and curve slightly as they move away from the hand. They fade and disappear after a set number of frames.

Challenges and Obstacles

This is my first use of the open-source ofxKinect framework and OpenFrameworks. It was also my first attempt to do blob detection and blob midpoint finding, so I’m happy those worked out nicely. I investigated Processing and OpenNI but chose not to use them because of performance and debug time implications, respectively.

Live Demo

The video below shows the final visualization. It was generated in real-time from improv hand gestures I performed while listening to “Dare you to Move” by the Vitamin String Quartet.

HexDefense

Intense, arcade-style tower defense for Android

The Story

HexDefense started as a class project for a mobile prototyping lab I took while at Carnegie Mellon. The lab required that apps be written in Java on the Android platform, and I figured it’d be a good opportunity to try writing a game. I’m a big fan of the tower defense genre and I’ve been heavily influenced by games on the iPhone like Field Runners and GeoDefense Swarm. From the outset, I wanted the game to have arcade style graphics reminiscent of Geometry Wars. That way, I figured, I wouldn’t have to find an artist to create the sprites, and I could focus on explosive OpenGL particle effects and blend-based bloom.

During the fall semester, I collaborated with Paul Caravelli and Tony Zhang on the first iteration of the game. I had the strongest graphics and animation background, so I focused on the gameplay and wrote all of the OpenGL code behind the game. I also created most of the game model, implementing the towers and creeps and creating actions with game logic for tower targeting, attacks, projectile motion, explosions, implosions and other effects. Paul contributed path finding code for the creeps based on breadth-first-search and created interfaces for implementing in-game actions based on the command pattern. He also contributed the original implementation of the grid model and worked on abstract base classes in the game model. Tony created the app’s settings screen and linked together activities for the different screens of the application.

At the end of the fall semester, the game was functional but unrefined. There were no sounds, no levels, and I’d only created one type of enemy. After the class ended, I talked with Paul and decided to finish the game over my Christmas break. Paul was too busy to continue working on the app, so I continued development independently. I worked full-time for four weeks to deliver the level of polish I was accustomed to on the iPhone. I refined the graphics, tested the app across a variety of phones, and added fifteen levels. I also added 3D directional sound and boss creeps, and wrapped everything in a completely new look and feel. People say that the last 10% is 90% of the work, and I think that’s particularly true on Android – there are minor differences across devices that make writing a solid game a lot more work than I expected.

The game was released at the end of January and has been well received so far. I created a lot of promotional art and set up a website with gameplay footage and press resources, and the game has garnered quite a bit of attention. It’s been featured on the front page of the Android Marketplace and holds a 4 1/2 star rating. It’s rising in the “Paid Apps” rankings and is currently the 16th most popular game on the Android platform!

Lessons Learned:

I’ve learned a lot about the Android platform while developing HexDefense. A couple of tips and takeaways:

  1. Let the OpenGL view run in CONTINUOUS mode. Nothing else (timers, threads that trigger redraws) will give performance close to this.
  2. Write all of the game logic so that it can advance the model by an arbitrary number of milliseconds. Because multitasking can cause hiccups in the game framerate, this is _really_ important for a smooth game (see the sketch after this list).
  3. OpenGL textures are not numbered sequentially on all devices. The original DROID will choose random integer values each time you call glGenTextures.
  4. There are numerous drawbacks to using the Java OpenGL API. If your game needs to modify vertex or texcoord buffers every frame, you’ll have to accept a performance hit. The deformation of the grid in HexDefense is achieved by modifying the texcoords on a sub-segmented plane, and passing the data through a ByteBuffer to OpenGL is not cool.
  5. The iPhone’s OpenGL implementation is at least 2.5x faster, even on devices with half the processor speed. An iOS port of HexDefense is in progress, and the game runs twice as fast on an original iPod Touch as it does on a Nexus One. There are a lot of reasons for this, but it seems that drawing large textured quads has greater speed implications on Android devices.
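
As a rough illustration of tip 2 (HexDefense itself is Java on Android; this C++ outline just shows the shape of the idea, with assumed names):

    // Advance the game model by however many milliseconds actually elapsed,
    // so a framerate hiccup produces one larger step instead of a frozen frame.
    #include <cstdint>

    struct GameModel {
        float creepX = 0.0f;        // example state: a creep moving along the x axis
        float creepSpeed = 0.02f;   // units per millisecond

        void advance(int64_t dtMillis) {
            creepX += creepSpeed * static_cast<float>(dtMillis);
            // ...tower targeting, projectiles, and explosions all scale by dtMillis too...
        }
    };

    void onDrawFrame(GameModel& model, int64_t& lastFrameMillis, int64_t nowMillis)
    {
        int64_t dt = nowMillis - lastFrameMillis;
        lastFrameMillis = nowMillis;

        // Clamp very long gaps (e.g. returning from the home screen) so the
        // simulation doesn't leap wildly after a pause.
        if (dt > 100) dt = 100;

        model.advance(dt);
        // ...render the frame...
    }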