A number of widely used contemporary processors have instruction-set
extensions for improved performance in multi-media applications. The aim
is to allow operations to proceed on multiple pixels each clock cycle.
Such instruction sets have been incorporated both in specialist DSP chips
such as the Texas C62xx (Texas Instruments, 1998) and in general-purpose
CPU chips like the Intel IA32 (Intel, 2000) or the AMD K6 (Advanced
Micro Devices, 1999). These instruction-set extensions are typically
based on the Single Instruction-stream Multiple Data-stream (SIMD)
model in which a single instruction causes the same mathematical
operation to be carried out on several operands, or pairs of operands,
at the same time. The level of parallelism supported ranges from two
floating point operations at a time on the AMD K6 architecture to 16
byte operations at a time on the Intel P4 architecture. Whereas
processor architectures are moving towards greater levels of
parallelism, the most widely used programming languages such as C, Java
and Delphi are structured around a model of computation in which
operations take place on a single value at a time. This was appropriate
when processors worked this way, but has become an impediment to
programmers seeking to make use of the performance offered by
multi-media instruction sets. The introduction of SIMD instruction sets
(Peleg et al.