CAT-TAIL DMA: EFFICIENT IMAGE DATA TRANSPORT FOR MULTICORE EMBEDDED MOBILE SYSTEMS
Keywords: DMA, background modeling, embedded computer vision, mobile vision systems, multicore, parallel processing
The emergence of multicore platforms has tremendous potential for achieving real-time performance of complex computer vision algorithms. However, these applications must run on embedded, mobile platforms with stringent size, weight, power, and cost constraints. High utilization of local storage on execution cores and low-latency, high-bandwidth data transfers between this storage and main memory are critical for real-time mobile system performance. General-purpose processors employ hardware techniques, such as high-speed bus architectures and efficient data arbitration schemes, to address the memory bandwidth gap. However, these techniques alone are insufficient to meet mobile system requirements; concurrent algorithmic and architectural optimizations are also necessary. This paper uses concurrency to minimize data transfer latency when executing video surveillance algorithms on multicore embedded architectures. It introduces cat-tail DMA, a technique that provides low-overhead, globally ordered, non-blocking DMA transfers. With this technique, data transfer latencies are reduced by over 30% for background modeling applications, and local core storage utilization is increased by 60% over existing techniques.