Real-Time Vision-based 3D Motion Capture
While motion capture is typically solved using magnetic systems and
optical systems, there exist mass market applications in which such
solutions are untenable either due to cost or because it is
impractical for people entering an environment to be suited up
with active devices or special reflectors. Magnetic systems
also suffer from electro-magnetic interference caused by
ferrous metal objects nearby, magnets in audio speakers or
monitor and TV radiation. Wires are another drawback
of magnetic motion capture systems: they encumber the performer and limit
the space in which a performer can move. Only during the past
two years have wireless magnetic systems been introduced.
Nevertheless, the performer still has to carry a
backpack containing an electronics unit that is connected to the
sensors/receivers. Also, the space limitation is not completely
eliminated by the wireless system as the signal becomes too weak when
the distance between receiver and transmitter becomes too great.
Unlike magnetic systems, optical motion capture systems do
not face wire and electro-magnetic interference problems.
However, their costs are considerably higher. Furthermore, most
optical systems do not operate in real time, because they require
post-processing calculation and some manual point registration.
That said, the first real-time
optical system was recently introduced in
Due to these restrictions of existing systems, a vision-based motion
capture system which does not rely on contact devices would have
significant advantages. In such vision-based systems, interference and
encumbrance will no longer be a problem.
Our system demonstrates the application of an inexpensive and completely
unencumbered computer vision system to the motion capture problem.
The system runs on a network of Dual-Pentium 400 PCs at 20-30 frames
per second (depending on the size of the person being observed).
We recently demonstrated the system at
SIGGRAPH98's Emerging Technologies.
The project, called "Shall We Dance?", is the result of collaboration amongst
ATR's Media Integration Research Laboratory,
the University of Maryland's Computer Vision Laboratory, and
Massachusetts Institute of Technology's Media Laboratory.
Below is a snapshot of our SIGGRAPH98 demonstration. I was dancing in the dancing area, which was surrounded
by six cameras, and the CG character in the tuxedo was animated by my motion.
The sumo character was animated by another studio next door from
ATR's MIC lab.
See... how much people like it.
How it works
A set of color CCD cameras observes a person. Any number of cameras
greater than two can be used; our current system uses six cameras.
Each camera is attached to a PC running the W4 system. W4 detects
people, and locates and tracks body parts. It performs
silhouette analysis and
template matching to locate the
2D positions of salient body parts, e.g., head, torso, hands, and feet,
in the image. A central controller obtains the 3D positions of these
body parts by triangulation and optimization processes.
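
The triangulation step can be illustrated with a small sketch. It is not the system's actual code: the text above does not say which triangulation or optimization method the central controller uses. The sketch assumes that each camera's 2D detection of a body part has already been converted, through camera calibration, into a 3D ray (a camera center plus a viewing direction), and it returns the point that minimizes the summed squared distance to all rays. The names closest_point_to_rays and solve3, and the camera positions in the usage lines, are hypothetical.

```python
import math

def solve3(A, b):
    """Solve a 3x3 linear system by Gauss-Jordan elimination with partial pivoting."""
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("rays are (nearly) parallel; the point is not determined")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(3):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * p for a, p in zip(M[r], M[col])]
    return [M[i][3] / M[i][i] for i in range(3)]

def closest_point_to_rays(origins, directions):
    """Least-squares 3D point nearest to a set of rays (one ray per camera).

    Each ray is origin + t * direction. The squared distance from a point x
    to a ray is |(I - d d^T)(x - c)|^2 for unit direction d and origin c,
    so the minimizer solves  sum(I - d d^T) x = sum((I - d d^T) c).
    """
    A = [[0.0] * 3 for _ in range(3)]
    b = [0.0] * 3
    for c, d in zip(origins, directions):
        norm = math.sqrt(sum(v * v for v in d))
        d = [v / norm for v in d]  # normalize the viewing direction
        for i in range(3):
            for j in range(3):
                m = (1.0 if i == j else 0.0) - d[i] * d[j]
                A[i][j] += m
                b[i] += m * c[j]
    return solve3(A, b)

# Usage with three hypothetical cameras that all see a hand at (1, 2, 3).
target = (1.0, 2.0, 3.0)
centers = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (0.0, 5.0, 1.0)]
rays = [tuple(t - c for t, c in zip(target, ctr)) for ctr in centers]
hand = closest_point_to_rays(centers, rays)  # approximately [1.0, 2.0, 3.0]
```

With noisy 2D detections the rays no longer intersect exactly, and the same function returns the best compromise point. A real system would likely also weight cameras by detection confidence, which this sketch does not do.
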
Kalman filters are also used to smooth the motion trajectories
as well as to predict the body part locations
for the next frame. We then feed back this prediction to each
instance of W4 to help in 2D localization. The graphic reproduction
system uses the body posture output to render and animate the