Notice: The documents distributed by this server have been provided by the contributing authors as a means to ensure timely dissemination of scholarity and technical work on a noncommercial basis. Copyright and all rights therein are maintained by the authors or by other copyright holders, notwithstanding that they have offered their works here electronically. It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author's copyright. These works may not be reposted without the explicit permission of the copyright holder.

An Embedded Accelerator for Real Time Image Processing

Reiner Hartenstein, Jürgen Becker, Rainer Kress

Universitaet Kaiserslautern

Erwin-Schroedinger-Straße, D-67663 Kaiserslautern, Germany

Fax: ++49 631 205 2640, e-mail:

The paper presents an embedded reconfigurable accelerator, called Xputer, comprising a novel kind of sequencer hardware (Data Sequencer). For many real time signal processing, multimedia, and other high-performance applications this new data-driven architecture increases the performance of a single processor system enourmously by integrating it as a co-processor for accelerating computation-intensive parts of an application. Here the reconfigurable architecture incl. programming environment is described. Its use is illustrated with an automotive application requiring real time image processing.

1. Introduction

Many applications in signal and image processing require fast access to structured data. Most of the time, the data structures are arrays, either of simple scalar values or arrays of records. In many algorithms (e.g. FFT, or FIR filtering), the access sequence to the data is already known at compile time, because the index computations to the array elements are done by nested loops, which do not depend on the actual data values. This is the reason, why most digital signal processors [16] [17] [18] [19] contain independent address generating ALUs. They are used to compute the addresses of data for future use in parallel to the data manipulations on already fetched data. Typically, these address generators are capable to produce a linear address sequence with constant offsets between subsequent addresses, cyclic or modulo addressing for cyclic buffers, and bit-reversed addressing for FFT-type applications. In addition to linear address sequences, it is capable to access data in two-dimensional arrays from a single set of parameters. All of the address generators described above are capable to provide addresses for scalar values only. The frequent cases, where an algorithm requires access to structured data (e.g. a subarray moving as a sliding window across the whole data array in two-dimensional FIR filtering) require a new setup of the address sequence parameters quite often. This is because these address generators cannot make use of the hierarchy information, where a simple data access sequence (the accesses to the subarray or record elements) is repeated in a regular manner to access a large amount of structured data.

The new paradigm of Xputers [1] [6] [7] is based on reconfigurable logic supporting structural programming like the rALU concept (reconfigurable ALU) [6] or the rDPA approach (reconfigurable data path array) [12] [13]. This data-driven paradigm provides an address generating device to support structured data access for multi-dimensional arrays. We call this device Generic Address Generator, because it is used to produce generic address sequences ahead of the data manipulations without consuming instruction or memory bandwidth apart from a few parameter transfers. Multiple Generic Address Generators combine to a Data Sequencer, which controls data manipulations by the sequence in which data arrives at the rALU. An Xputer can be integrated as accelerator within a workstation environment for computation-intensive parts of an application.

For this new class of hardware platforms a new class of compilers is needed, which generate both, sequential and structural code: i. e. partitioning compilers. This paper introduces a parallelizing compilation method with two levels of partitioning: host/accelerator partitioning and a structural/sequential partitioning (second level).

The following section introduces the target hardware. In the second chapter the methods of the programming environment are sketched. The real time use of the Xputer as embedded accelerator is demonstrated by an application example from high-speed image analysis for automotive applications.

2. The Target Hardware

The target hardware consists of a host workstation that uses the Xputer [5] [6] [7] as a universal hardware accelerator (see figure 1). With Xputers the highest speed-up is obtained for algorithms, which iterate the same set of operations over a large amount of data. Such operations are mapped onto a reconfigurable parallel arithmetic-logic unit (rALU). All input and output data to the complex rALU operations is buffered in a scan window (SW), which is a kind of smart register file, the location sequence of which in memory space is determined by a data sequencer. All data in the scan windows can be accessed in parallel by the rALU

Figure 1. Host with several Xputer modules.

Each of up to seven Xputer modules run independently from each other and the host until a task is completed. Reconfigurations can be performed independently from other running tasks. Each module generates an interrupt to the host when the task is finished. The host may run concurrently to such tasks. The basic structure of an Xputer module consists of three major parts (Figure 1):

Within iterations the access sequence to data, typically organized in arrays, often shows regularity which allows to describe address sequences generically from a few parameters (see section 3.1). The sequence of memory locations (examples in figure 3) reached by such generic address sequences (GAS) we call scan patterns (SP). By this basic sequencing mechanism Xputers are deterministically data-driven (non-von Neumann): i. e. one or multiple (generic address generators) GAGs are used instead of an instruction sequencer, generating addresses only by hardware without eating memory cycles. During a SP operation execution is transport-triggered. The GAG operates in a two-stage pipeline. The first stage computes the position of the scan window. The second stage computes out of this position the relative addresses of the variables in the scan window (Figure 2).

Figure 2. Generic Address Generator (GAG)

The data sequencer is implemented with reconfigurable logic (the control logic (Figure 2) by a Xilinx XC4013). rDPA components will be used for GAG datapaths. Thus in addition to the pipeline a fine grained parallelism is achieved. The rDPA is also the data manipulator of the data-driven rALU. The rDPA was designed to evaluate any arithmetic and logic expression from a high level description. Therefore the granularity of the basic operation is increased from the bitwise level to wordlevel with possible operations such as addition or multiplication. This greatly simplifies the automatic mapping onto the architecture by the DPSS (Data Path Synthesis System) [8]. A regular structure like in systolic arrays is required for the scalability, also across chip boundaries.The rDPA is an integrated array [8] [12] [13] of reconfigurable ALU-like 32-bit processing elements (PEs) running in parallel. The rDPA is faster and more area efficient than commercial FPGAs.

For more details about the Xputer hardware components see [1] [6] [7] [10].

Figure 3. Scan pattern examples a) video scan b) JPEG zig-zag scan c) linear scan: scan window shown

3. The Programming Environment

For the above hardware platform a partitioning and parallelizing compilation environment CoDe-X is being implemented which accepts X-C source programs (Figure 4. ). X-C (Xputer-C) is a C dialect. CoDe-X consists of a 1st level partitioner (partially implemented), a GNU C compiler, and an X-C compiler (fully implemented). The X-C source input is partitioned in a first level into a part for execution on the host (host tasks, also permitting dynamic structures and operating system calls) and a part for execution on the Xputer (Xputer tasks). Parts for Xputer execution are expressed in a X-C subset, which lacks only dynamic structures, but includes language features to express scan patterns [2], [21]. At second level this input is partitioned by the X-C compiler in a sequential part for the DS, and a structural part for the rDPA. Experienced users may use special MoPL library functions [2] to take full advantage of the high acceleration factors possible by the Xputer paradigm (see figure 4).Profiling-driven Partitioning (1st level)

The Xputer tasks are the candidates for the performance analysis on the host and on the Xputer. The user may advise the partitioner about parts to be Xputer-executed in any case. For library functions a list of their performance data is available. First, program parts using MoPL extensions or containing loops which can be transformed by the X-C compiler, are routed for Xputer execution. Such an Xputer job is divided into the following Xputer tasks, which can be computed on a single Xputer module:

Figure 4. Overview on the CoDe-X environment.

The restriction to seven nested loops is due to limited resources in the current Xputer prototype. The data sequencer of this prototype contains seven GAGs, which may run in parallel. An Xputer job is represented by a task graph, on which a data dependency analysis based on the GCD-test [26] is carried out. Thus tasks which may run in parallel are found. In this phase parameter-driven transformations are carried out in order to exploit best hardware utilization. For example, single nested loops of large tasks are transformed into double nested loops (called strip mining [14]), which permits parallel execution on different Xputer modules. The parameters describe the given hardware resources in order to achieve an optimized performance/area trade-off.

This technique can be used e.g. in image processing applications by dividing an image in stripes of equal sizes in order to manipulate these stripes concurrently [9]. The optimization techniques [14] of loop distribution (splitting of one loop into two loops computing the same), of loop fusion (transforming two adjacent loops into a single loops) and of loop interchanging (switching inner and outer loop) are also performed by the 1st level partitioner in order to exploit potential parallelism and to optimize the utilization of the Xputer hardware resources.

Profiling. For each Xputer task candidate, the worst case execution time on the accelerator as well as on the host is computed. The performance data for the accelerator is received from the X-C compiler by analyzing the execution time of the rALU operators and their iterations [8], [12].

Host performance is approximated by code examination. For each host type, another model is required. The timing behavior of the underlying hardware and operating system has to be deterministic and known. The operating system must provide static memory management, system calls should have a computable timing behavior, and no asynchronous interrupts should be allowed [20]. The techniques used are similar to these from Malik et. al. [15].

The profiler is also computing the overall execution time by using the Xputer as accelerator. The time includes delay times for synchronization, possible reconfigurations and memory re-mappings of the Xputer during run time (see equation (1)).


: sum of execution times of tasks executed on the host (HT)

: sum of execution times of tasks executed on the Xputer (XT)

: sum of delay times for reconfiguring the Xputer during run time

: sum of delay times for re-mapping the 2- dimensional organized Xputer data map during run time

: sum of delay times for synchronizing host/Xputer (operating system calls)

: sum of overlapping execution times between tasks executed simultaneously on host/Xputer

The overall execution time is determined by delay times of the tasks (host and Xputer), plus the above mentioned penalty times (see , equation (1)). Memory re-mappings are necessary, when two tasks use the same data onto the Xputer, but they are needing a different distribution of the data within the two-dimensional organized memory. Then a re-ordering of the data items must be performed before the execution of the second task starts, which must be also considered in equation (1). The value of is used as COST-function in the simulated annealing process [23] and has to be optimized by computing different partitionings of Xputer tasks in varying their task allocation between host and Xputer.

3.1 Resource-driven Partitioning (2nd level)

The X-C compiler realizes the 2nd level of partitioning and translates an X-C program into code which can be executed on the Xputer. The compiler performs a data and control flow analysis. First the control flow of the program graph is partitioned according to the algorithm of Tarjan [25] resulting in a partial execution order. This algorithm is partitioning the program graph into subgraphs, which are called Strongly Connected Components. These components correspond to connected statement sequences like fully nested loops for example, which are possibly parallelizable. The Xputer hardware resources are exploited in an optimized way by analyzing, if the structure of statement sequences can be mapped well to the Xputer hardware avoiding reconfigurations or idle hardware resources [21]. Xputers provide best parallelism at statement or expression level. So we try to vectorize the statement blocks in nested for-loops according to the vectorization algorithm of J. R. Allen and K. Kennedy [11], after a data dependency analysis has been performed [26].

The access sequences of up to 4 fully nested loops contenting variables with linear index functions can be handled by transforming them into parameters for configuring the GAGs (see figure 1). A configuration file for the X-C compiler gives the current restrictions of the hardware, e.g. number of GAGs available, size of rALU subnets etc., which allows a flexible handling of the compiler for future up-grades of the Xputer hardware. Further details and examples about the X-C compiler can be found in [8], [21], [22].

4. Application: Automotive Collision Detection

The availability of high-dynamic-range CMOS image sensors (HDRC) with random access to the pixel array serves as a prerequisite to automotive applications like automatic collision prevention [4]. A sensor with a high dynamic range is necessary to avoid the saturation known from CCD cameras in the case when a car leaves a tunnel, where the sensors have to be able to detect an approaching car against the bright sunlight. The real time analysis of the incoming data can be done more efficiently, if a random access to the sensor array allows to concentrate on the interesting portions of the image, which greatly reduces the data transfer rates required between the sensor and the image processing hardware. The design of a hardware to perform such a collision detection can be seen in figure 5

Figure 5. Automotive collision detection using the Xputer as embedded accelerator.

The Xputer data sequencer consists of several parallel generic address generators (GAGs). One GAG computes the address sequence for the rALU, which performs an edge detection by applying a two-dimensional FIR filter with appropriate weights. The filter equation is shown in figure 6. The kij are the coefficients of the filter and the wij contain the data at the corresponding scan window positions. Figure 7a shows equation mapped into the rDPA. The input registers of the DPUs named with wij represent the scan window positions. The input registers named with kij represent the registers with coefficients of the FIR filter loaded at configuration. For more details see [10]

Figure 6. Equation of 3 by 3 FIR filter.

The pixel array of the HDRC image sensors is accessed directly by one GAG (figure 5). First the whole pixel array is scanned with a video scan to detect the borders of the street. Dependent on their location data dependent scans are performed, which corresponds to local video scans along these borders. So only the relevant parts of the image are read by the first GAG. The rALU also transfers data to the GAG for controling these data dependent scans (residual control, figure 5) [1] [6] [7]. Always after 5 frames of scanning the relevant street borders, a video scan over the complete image is performed by the GAG and the edge detection algorithm is applied. This frame is necessary for identifying approaching cars etc. (see image analysis task of the host below) A second GAG (figure 5) reads all resulting edge enhanced images from the rALU and stores it into the Xputer data memory, which is accessible by the host for data exchanges with the embedded accelerator. The host CPU performs then the image analysis tasks, where the edges have to be combined to objects. These are identified to detect obstacles, like an approaching car, or the borders of the street, for example. Since these algorithms do not reveal address patterns known at compile time, they can be handled with less overhead by a von Neumann processor than by an address generator. If the car would go out of the borders (street), for example a controler connected to the steering mechanism of the car could correct the wrong way. Also if an approaching car is identified, this controler could react in a suitable way.

The hardware for the image preprocessing might be a reconfigurable hardware like the rDPA [12] [13], or the reconfigurable data-driven multiprocessor architecture proposed in [27]. The rDPA hardware for the image preprocessing is superior over special purpose hardware with regard to flexibility in the choice of algorithms to be run. The required data rates of 3.2 MBytes per second to process the contents of the complete sensor array of 128 by 256 pixels 100 times a second can be handled by this device. The 3 by 3 FIR-filter operation of a 1280 by 1024 8-bit grayscale image is computed in 0.108 seconds (manually optimized code), and in 0.954 seconds (code generated from programming environment) by the current prototype [10]. This allows to process even larger sensor arrays built from multiple devices, or higher frame processing rates than 100 per second.

Using two independent Generic Address Generator and a Host allows to perform the image analysis in a pipelined fashion, where the image acquisition, the image preprocessing, and the image analysis are done in parallel in different stages of the pipeline. This does not even increase the latency from the image acquisition to the collision detection, because all these steps would have to be done sequentially as well in a non-pipelined fashion, only at lower overall data rates or higher bandwidth requirements. The versatility of the scan patterns allows to run a wide variety of image preprocessing algorithms in an endless loop without requiring any download of parameters after the initial power-up configuration

Figure 7. a) Two dimensional FIR filter mapped into the rDPA, b) equivalent operation of the multiply-route operation, c) the corresponding scan window.

5. Conclusions

The use of the data driven Xputer architecture incl. programming environment as an embedded reconfigurable accelerator for real time image processing has been presented. The data driven mechanism is realized by a novel kind of sequencer hardware. This Data Sequencer is superior to the address generation units found in digital signal processors as well as other separately available address generation devices on the market. It provides the support to run complete image and signal processing algorithms on structured data without requiring an update of parameters after an initial configuration. Multiple Generic Address Generators can operate as a Data Sequencer in a system, either on shared memory buses or on separate buses, and automatically synchronizing for the highest collision-free data transfer rates. The usefulness of the approach has been illustrated by an example of automatic collision prevention for a car leaving a tunnel.

The flexibility of the programming environment and the reconfigurable hardware makes the Xputer an ideal candidate to implement several real time image or signal processing applications. One of the new features is the two level partitioning approach. C-programs are accepted and the profiling-driven host/accelerator partitioning for performance optimization as well as the resource-parameter-driven sequential/structural partitioning for accelerator programming are carried out.


[1] A. Ast, R. Hartenstein, H. Reinig, K. Schmidt, M. Weber: A General Purpose Xputer Architecture derived from DSP and Image Processing. In Bayoumi, M.A. (Ed.): VLSI Design Methodologies for Digital Signal Processing Architectures, Kluwer Academic Publishers 1994.

[2] A. Ast, J. Becker, R. W. Hartenstein, R. Kress, H. Reinig, K. Schmidt: Data-procedural Languages for FPL-based Machines; 4th Int. Workshop on Field Programmable Logic and Appl., FPL'94, Prague, Sept. 7-10, 1994, Lecture Notes in Computer Science, Springer, 1994

[3] U. Banerjee: Dependence Analysis for Supercomputing; Kluwer, 1988.

[4] H.G. Graf, B. Höfflinger, U. Seger, A. Siggelkow: Elektronisch sehen; Elektronik, Vol. 3/95, Franzis-Verlag, 1995

[5] R. W. Hartenstein, J. Becker, R. Kress, H. Reinig, K. Schmidt: A Reconfigurable Machine for Applications in Image and Video Compression; Conf. on Compression Techniques & Standards for Image & Video Compression, Amsterdam, Netherlands, March 1995

[6] R.W. Hartenstein, A.G. Hirschbiel, M. Weber: A Novel Paradigm of Parallel Computation and its Use to Implement Simple High Performance Hardware; InfoJapan'90- International Conference memorizing the 30th Anniversary of the Computer Society of Japan, Tokyo, Japan, 1990;

[7] see [6]: also in: Future Generation Computer Systems no. 7, pp. 181-198 (North-Holland, 1991/92)

[8] R. W. Hartenstein, R. Kress: A Datapath Synthesis System for the Reconfigurable Datapath Architecture; ASP-DAC'95, Nippon Convention Center, Makuhari, Chiba, Japan, Aug. 29 - Sept. 1, 1995

[9] Reiner W. Hartenstein, Jürgen Becker, Michael Herz, Rainer Kress, Ulrich Nageldinger: A Partitioning Programming Environment for a Novel Parallel Architecture; 10th International Parallel Processing Symposium (IPPS), Honolulu, Hawaii, April 1996

[10] Reiner W. Hartenstein, Jürgen Becker, Rainer Kress, Helmut Reinig: A Novel Machine Paradigm to Accelerate Scientific Computing; Special issue on Scientific Computing of Computer Science and Informatics journal, Computer Society of India, 1996

[11] K. Kennedy: Automatic Translation of Fortran Programs to Vector Form; Rice Technical Report 476-029-4, Rice University, Houston, October 1980

[12] R. Kress: A fast reconfigurable ALU for Xputers; Ph. D. dissertation (submitted), Kaiserslautern University, 1996

[13] R. Kress: Computing with reconfigurable ALU arrays; ITpress (planned for 1996)

[14] D. B. Loveman: Program Improvement by Source-to-Source Transformation; Journal of the Association for Computing Machinery, Vol. 24,No. 1, pp.121-145, January 1977

[15] S. Malik, W. Wolf, A. Wolfe, Y.-T. Li, T.-Y. Yen: Performance Analysis of Embedded Systems; NATO/ASI, Tremezzo, Italy, 1995, Kluwer Acad. Publishers, 1995

[16] N.N.: ADSP-21020 User's Manual, Analog Devices, 1991

[17] N.N.: DSP56000/DSP56001 Digital Signal Processor User's Manual, Motorola, 1990

[18] N.N.: DSP96002 IEEE Floating-Point Dual-Port Processor User's Manual, Motorola, 1989

[19] N.N.: TMS320C3x User's Guide, Texas Instruments, 1990

[20] P. Puschner, Ch. Koza: Calculating the Maximum Execution Time of Real-Time Programs; Journal of Real-Time Systems p. 159-176, Kluwer Academic Publishers 1989

[21] K. Schmidt: A Program Partitioning, Restructuring, and Mapping Method for Xputers; Ph.D. Thesis, University of Kaiserslautern, 1994

[22] K. Schmidt: Xputers: a high performance computing paradigm; ITpress (planned for 1996)

[23] N. A. Sherwani: Algorithms for Physical Design Automation; Kluwer Academic Publishers, Boston 1993

[24] Z. Shen, Z. Li, P.-C. Yew: An Empirical Study of Fortran Programs for Parallelizing Compiler; IEEE Trans. on Parallel and Distrib. Syst., Vol.1, No.3, pp. 356-364, July 90.

[25] R. E. Tarjan: Testing Flow Graph Reducibility; Journal of Computer and System Sciences 9, pp. 355-365, 1974

[26] M. Wolfe, C.-W. Tseng: The Power Test for Data Dependence; IEEE Transactions on Parallel and Distributed Systems, Vol. 3, No. 5, Sept. 1992

[27] Alfred K.W. Yeung, Jan M. Rabaey: A Reconfigurable Data-Driven Multiprocessor Architecture for Rapid Prototyping of High Throughput DSP Algorithms; 26th Hawaii Int'l. Conf. on System Sciences, Vol. 1, IEEE Computer Society Press, 1993

For a printed version, please contact