• SENYO APEWOKIN Georgia Institute of Technology, U.S.A.
• BRIAN VALENTINE Georgia Institute of Technology, U.S.A.
• LINDA M. WILLS Georgia Institute of Technology, U.S.A.
• SCOTT WILLS Georgia Institute of Technology, U.S.A.


DMA, background modeling, embedded computer vision, mobile vision systems, multicore, parallel processing


The emergence of multicore platforms has tremendous potential for achieving real-time performance of complex computer vision algorithms. However, these applications must run on embedded, mobile platforms with stringent size, weight, power, and cost constraints. High utilization of local storage on execution cores and low-latency, high-bandwidth data transfers between this storage and main memory are critical for real-time mobile system performance. General-purpose processors employ hardware techniques, such as high-speed bus architectures and efficient data arbitration schemes, to address the memory bandwidth gap. However, these techniques are insufficient to meet mobile system requirements; concurrent algorithmic and architectural optimizations are necessary. This paper uses concurrency to minimize data transfer latency when executing video surveillance algorithms on multicore embedded architectures. It introduces cat-tail DMA, a technique that provides low-overhead, globally ordered, non-blocking DMA transfers. Using this technique, data transfer latencies are reduced by over 30% for background modeling applications, while local core storage utilization is increased by 60% over existing techniques.
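The abstract does not describe how cat-tail DMA is implemented, so the sketch below does not reproduce it. Instead, it is a minimal Python model, built on stated assumptions, of the two general ideas the abstract names. The first is a DMA queue that completes requests strictly in issue order, so a single completion counter answers "is request t done?". The second is a core that issues the fetch for the next tile before computing on the current one, which is conventional double buffering. `InOrderDMAQueue`, `issue`, `wait`, and `process_tiles` are hypothetical names introduced for illustration. The "engine" is simulated and only advances when the core waits, whereas real DMA hardware runs concurrently with the core.

```python
from collections import deque
from typing import Callable, List


class InOrderDMAQueue:
    """Hypothetical model of a DMA engine that completes requests strictly in issue order."""

    def __init__(self, main_memory: List[int]) -> None:
        self.mem = main_memory
        self.pending = deque()          # queued (tag, src, length, dest) requests, oldest first
        self.next_tag = 0               # tag handed out to the next issued request
        self.completed = -1             # highest finished tag; monotonic because service is in order
        self.completion_log: List[int] = []

    def issue(self, src: int, length: int, dest: List[int]) -> int:
        """Non-blocking: enqueue a main-memory -> local-buffer copy and return its tag at once."""
        tag = self.next_tag
        self.next_tag += 1
        self.pending.append((tag, src, length, dest))
        return tag

    def _step(self) -> None:
        """Simulated engine progress: finish the oldest pending transfer."""
        tag, src, length, dest = self.pending.popleft()
        dest[:length] = self.mem[src:src + length]
        self.completed = tag
        self.completion_log.append(tag)

    def wait(self, tag: int) -> None:
        """Block until `tag` is done; in-order service means one counter comparison suffices."""
        if tag >= self.next_tag:
            raise ValueError("tag was never issued")
        while self.completed < tag:
            self._step()


def process_tiles(frame: List[int], tile: int, kernel: Callable[[int], int]) -> List[int]:
    """Hypothetical double-buffered loop: fetch tile i+1 while computing on tile i."""
    assert len(frame) % tile == 0, "sketch assumes the frame splits into whole tiles"
    dma = InOrderDMAQueue(frame)
    bufs = [[0] * tile, [0] * tile]     # two local-store buffers used in ping-pong fashion
    tags = [0, 0]
    n_tiles = len(frame) // tile
    out: List[int] = []
    if n_tiles == 0:
        return out
    tags[0] = dma.issue(0, tile, bufs[0])
    for i in range(n_tiles):
        cur, nxt = i % 2, (i + 1) % 2
        if i + 1 < n_tiles:
            # Buffer `nxt` held tile i-1, which was fully consumed in the previous iteration.
            tags[nxt] = dma.issue((i + 1) * tile, tile, bufs[nxt])
        dma.wait(tags[cur])             # the core stalls only here, and only for tile i
        out.extend(kernel(p) for p in bufs[cur])
    return out
```

For example, `process_tiles(list(range(10)), 2, lambda p: 2 * p)` streams five two-element tiles through the two buffers and returns the doubled values in frame order. In-order completion is what lets `wait` be a single comparison against a counter rather than a per-request status lookup. Whether cat-tail DMA relies on such a counter is not stated in the abstract.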




