THE DATA DELUGE: Looking to hardware for high-performance computing solutions

The past decade has witnessed explosive growth in biological data, driven by genome projects, proteomics, protein structure determination, studies of cellular regulatory mechanisms and the rapid digitization of patient biological data. But while the number of transistors that can be integrated on a chip doubles roughly every 18 months, as predicted by "Moore's Law," the genomic data at GenBank is doubling every 6 months. Post-genomic-era bioinformatics challenges are expected to require high-performance computing power of the order of petaflops or more.
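To make the gap concrete, the two doubling periods can be compared with a few lines of arithmetic. This is only a sketch: the three-year window and the function name are illustrative assumptions, and both trends are idealized as clean exponentials.

```python
def growth_factor(months: float, doubling_period_months: float) -> float:
    """Return how many times a quantity multiplies over `months`,
    assuming it doubles every `doubling_period_months` (idealized)."""
    return 2.0 ** (months / doubling_period_months)


# Illustrative window (an assumption, not a figure from the article).
window_months = 36

transistor_growth = growth_factor(window_months, 18)  # Moore's Law pace
genbank_growth = growth_factor(window_months, 6)      # GenBank pace
gap = genbank_growth / transistor_growth

print(f"Transistors: x{transistor_growth:.0f}")
print(f"GenBank data: x{genbank_growth:.0f}")
print(f"Data outgrows transistors by x{gap:.0f}")
```

Over three years, then, the data grows 64-fold while transistor counts grow 4-fold, a 16-fold shortfall under these assumptions.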

The scale of data produced and analysis performed in most large labs is too great for conventional PC memory, so scientists have traditionally responded by parallelizing their search jobs or analytical methods across a processor cluster. The costs of cluster acquisition and maintenance are high, however, and inherent communication delays in cluster systems mean that when demand on the cluster doubles, the cluster must more than double in size to deliver the same level of performance, leading to diminishing returns on investment.

Hard(ware) solutions…

The vast majority of today's high-performance computing applications are built around fixed-architecture, microprocessor-based systems. This approach is highly sequential at its core in that each processing unit delivers a single data path moving across a memory array mapped onto one or two data ports. Because many bioinformatics tasks involve simple repeatable operations, custom hardware is a more effective solution. One such solution is field-programmable gate arrays (FPGAs), which are essentially programmable hardware processors that can be designed to maximize the memory bandwidth for the search algorithms being implemented, yielding dramatic performance improvements. Thus, FPGA-based systems work much more efficiently than traditional sequential microprocessor-based systems.

For example, the internal memory in a single FPGA can have 888 independent data ports, providing over 7 terabytes/sec of memory bandwidth, compared with a single data port and of the order of 100 Gbytes/sec on a high-end microprocessor, a significant difference in performance. In terms of computation performance, a Pentium 4 processor operating at 2.4 GHz can typically generate 0.2 sustained Gigaflops in real applications and 4.8 theoretical peak Gigaflops, while an FPGA card can provide 19 sustained Gigaflops and 38 theoretical peak Gigaflops.
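The ratios implied by these figures are easy to work out. The sketch below uses only the numbers quoted above; the variable names are illustrative, decimal units are assumed (1 terabyte/sec = 1,000 Gbytes/sec), and "over 7 terabytes/sec" is treated as a lower bound.

```python
# Figures quoted in the article (decimal units assumed).
FPGA_BANDWIDTH_GBPS = 7000.0  # "over 7 terabytes/sec", used as a lower bound
CPU_BANDWIDTH_GBPS = 100.0    # "of the order of 100 Gbytes/sec"

P4_SUSTAINED_GFLOPS = 0.2
P4_PEAK_GFLOPS = 4.8
FPGA_SUSTAINED_GFLOPS = 19.0
FPGA_PEAK_GFLOPS = 38.0

# How much more memory bandwidth and sustained throughput the FPGA offers.
bandwidth_ratio = FPGA_BANDWIDTH_GBPS / CPU_BANDWIDTH_GBPS
sustained_ratio = FPGA_SUSTAINED_GFLOPS / P4_SUSTAINED_GFLOPS

# Fraction of theoretical peak that each device actually sustains.
p4_efficiency = P4_SUSTAINED_GFLOPS / P4_PEAK_GFLOPS
fpga_efficiency = FPGA_SUSTAINED_GFLOPS / FPGA_PEAK_GFLOPS

print(f"Memory bandwidth advantage: about {bandwidth_ratio:.0f}x")
print(f"Sustained Gigaflops advantage: about {sustained_ratio:.0f}x")
print(f"Pentium 4 sustains {p4_efficiency:.1%} of peak")
print(f"FPGA card sustains {fpga_efficiency:.1%} of peak")
```

On these figures the FPGA card sustains roughly 95 times the throughput of the Pentium 4, and it reaches half of its theoretical peak where the microprocessor reaches about 4 percent.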

But what does all of this power and efficiency mean in practical terms?

…for hard problems

If we compare the low-level devices typically used in a special-purpose supercomputing system, which provides a fixed function at very high performance, with those used in a reconfigurable supercomputing system, we can get an idea of what performance figures future systems might achieve.

Recently, we demonstrated how FPGA technology could be implemented in bioinformatics to solve real-world problems. In particular, we applied FPGAs to a protein analysis algorithm used to compare images of markers on 2-D gels, where the cross-referencing of differences in the positions and composition of the markers can be used to detect the presence of a disease or disorder. This process is computationally intensive and presents a major bottleneck to life science research.

Using Mitrionics' Mitrion software to port a protein analysis algorithm to Nallatech's FPGA computing platforms hosted in a Linux environment, we were able to show a 10- to 30-fold improvement in performance, with a smaller footprint, lower weight, and lower power requirements than a conventional computing system. The demonstration system, based around the Xilinx Virtex 2VP70 FPGA, is scalable: each FPGA added to the system delivers an order-of-magnitude increase in processing throughput, reducing analysis runtime from days to hours for large data sets.

This demonstration also showed how an FPGA hardware platform can augment the capabilities of the type of traditional equipment used in labs all over the world and how that hardware can be programmed by algorithm designers using high-level software.

This latter point is critical, as programmability will be key to the adoption of FPGA computing in scientific research applications. Researchers in life sciences don't necessarily want to learn low-level hardware design skills just to make their analysis algorithm run faster on a particular processing platform; they want to be able to use their existing computing skills and still access the computing power available from FPGAs.

Another example of how FPGA technology can be implemented in bioinformatics applications is the recent work that Nallatech did with researchers at the University of North Carolina-Charlotte to improve sequence alignments using the Smith-Waterman algorithm. Using the FPGA system, the researchers were able to complete their analyses of various sequences found in GenBank in a matter of seconds, rather than the several hours required using a SunFire processor. This represents a 256-fold speedup in performance, meaning that researchers could carry out in a day large batches of searches that would previously have taken weeks or months using a traditional microprocessor-based solution. This acceleration in data analysis means that scientists should be able to draw more timely conclusions from their experiments, potentially improving the bottom lines of companies by increasing success rates and reducing costs.
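For readers unfamiliar with Smith-Waterman, the following is a minimal software sketch of its scoring step for local alignment. It is not the FPGA implementation described above: the function name, the linear gap penalty and the scoring values (match +2, mismatch -1, gap -1) are all illustrative assumptions.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -1) -> int:
    """Return the best local alignment score between sequences a and b.

    Uses a linear gap penalty and keeps only two rows of the scoring
    matrix in memory. Scoring values are illustrative assumptions.
    """
    prev = [0] * (len(b) + 1)  # row i-1 of the scoring matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)  # row i, column 0 stays at zero
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            diagonal = prev[j - 1] + substitution  # align a[i-1] with b[j-1]
            up = prev[j] + gap                     # gap in b
            left = curr[j - 1] + gap               # gap in a
            # Scores never drop below zero: that is what makes it local.
            curr[j] = max(0, diagonal, up, left)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("TTACGTTT", "GGACGTGG"))
```

Each cell depends only on its upper, left and upper-left neighbors, so all cells along an anti-diagonal of the matrix can be computed independently of one another. That regular, repeatable structure is the kind of workload that maps well onto parallel hardware.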

Asking hard questions

However, not all mathematical algorithms are suitable for FPGAs. Lower clock speeds, memory technology constraints, limited communication bandwidth to data storage and deep pipelines make FPGAs ineffective at implementing some functions. When deciding whether an algorithm can be ported onto FPGAs, several factors must be taken into account.

Does the algorithm require continuous accessing of data from large data sets? If yes, it will be difficult to keep the FPGA pipelines stoked, as the data will need to be supplied over slow host interfaces. However, if the data sets are small enough to fit in local memory on the FPGA or in memory on the same board as the FPGA, pipelines stand a better chance of remaining stoked. FPGAs have a massive number of individual memory elements that can be accessed in parallel. Arranging the algorithm to process data in small chunks accessed via hundreds of internal memory banks allows most pipelines to remain stoked for short bursts of time. The limiting factor then becomes how quickly you can feed the individual memories with new data.
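A rough way to reason about this is to compare the rate at which a pipeline consumes data with the rate at which it can be supplied. The model and every number below are hypothetical, chosen only to illustrate the point; real figures depend on the board, the interface and the algorithm.

```python
def pipeline_utilization(clock_hz: float, bytes_per_cycle: float,
                         supply_bytes_per_s: float) -> float:
    """Fraction of cycles a pipeline can be kept busy (0.0 to 1.0),
    assuming it stalls whenever input data is not available."""
    demand_bytes_per_s = clock_hz * bytes_per_cycle
    return min(1.0, supply_bytes_per_s / demand_bytes_per_s)


# Hypothetical pipeline: 100 MHz clock, consuming 16 bytes every cycle.
CLOCK_HZ = 100e6
BYTES_PER_CYCLE = 16

# Hypothetical data sources.
HOST_INTERFACE_BPS = 0.4e9  # slow host link (assumed value)
ON_CHIP_MEMORY_BPS = 7e12   # internal memory, per the bandwidth quoted earlier

host_fed = pipeline_utilization(CLOCK_HZ, BYTES_PER_CYCLE, HOST_INTERFACE_BPS)
on_chip_fed = pipeline_utilization(CLOCK_HZ, BYTES_PER_CYCLE, ON_CHIP_MEMORY_BPS)

print(f"Fed from host interface: {host_fed:.0%} busy")
print(f"Fed from on-chip memory: {on_chip_fed:.0%} busy")
```

In this illustration the pipeline fed over the host interface sits idle three-quarters of the time, while the same pipeline fed from internal memory never starves.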

Is the algorithm recursive? Recursive routines port very poorly to FPGAs. Typical pipeline latency for a complicated algorithm can be several hundred clock cycles. Recursion allows only one pass of data through a pipeline at any one time. Hence the pipeline latency will be incurred for each calculation, making the routine hundreds of times less efficient than if it were pipelined. Often a recursive algorithm can be rewritten to remove the recursion. This can seem counterintuitive, as recursion is often a way of optimizing code for microprocessors. However, FPGAs can perform many operations in parallel, and the overall calculation time for a more complicated pipelined algorithm is usually significantly less than that of its recursive counterpart.
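The cost of recursion can be estimated with a simplified cycle count. The sketch below assumes an idealized pipeline that accepts one new item per clock cycle when inputs are independent, and that must wait out the full latency when each calculation needs the previous result. The 300-cycle latency and one million items are illustrative values, and the function names are hypothetical.

```python
def pipelined_cycles(n_items: int, latency: int) -> int:
    """Cycles to process n independent items: fill the pipeline once,
    then one result emerges every cycle."""
    if n_items == 0:
        return 0
    return latency + n_items - 1


def recursive_cycles(n_items: int, latency: int) -> int:
    """Cycles when each item depends on the previous result, so only
    one item can be in the pipeline at a time."""
    return n_items * latency


LATENCY = 300        # "several hundred clock cycles" (illustrative value)
N_ITEMS = 1_000_000  # illustrative workload size

fast = pipelined_cycles(N_ITEMS, LATENCY)
slow = recursive_cycles(N_ITEMS, LATENCY)

print(f"Pipelined: {fast:,} cycles")
print(f"Recursive: {slow:,} cycles")
print(f"Slowdown:  about {slow / fast:.0f}x")
```

For large workloads the slowdown approaches the pipeline latency itself, which is why a recursive routine can be hundreds of times less efficient.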

Is double precision required? Double precision floating-point calculations require two to four times the resources required for a single precision calculation. The more double precision calculations required, the fewer operators that can run in parallel, reducing the performance improvement.
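The effect on parallelism follows directly from the resource ratio. In the sketch below, the resource budget and the per-operator cost are hypothetical units, not figures for any real device; only the two-to-four-times range comes from the text above.

```python
def parallel_operators(budget_units: int, single_cost_units: int,
                       precision_factor: int = 1) -> int:
    """Number of operators that fit in a fixed resource budget.

    precision_factor is 1 for single precision, or 2 to 4 for double
    precision (the range quoted in the article).
    """
    return budget_units // (single_cost_units * precision_factor)


BUDGET = 1200     # hypothetical resource units available on the device
SINGLE_COST = 10  # hypothetical cost of one single precision operator

single = parallel_operators(BUDGET, SINGLE_COST)
double_best = parallel_operators(BUDGET, SINGLE_COST, precision_factor=2)
double_worst = parallel_operators(BUDGET, SINGLE_COST, precision_factor=4)

print(f"Single precision operators: {single}")
print(f"Double precision operators: {double_worst} to {double_best}")
```

Under these assumptions a design that fits 120 single precision operators has room for only 30 to 60 double precision ones, and parallel throughput falls accordingly.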

Programmability

One of the problems with using FPGAs is the fact that they are hard to program. Managing the complexity of FPGA hardware design in a predominantly software-driven application sector is a particular challenge for the development of FPGA computing applications.

FPGA computing is not currently at the maturity level of traditional microprocessor design flows. While major benefits can be leveraged for very challenging applications, lower performance applications that can be implemented on a single microprocessor, or a very small number of them, are often best left there. This lack of maturity means that FPGAs are generally more difficult to code.

Additionally, programming expertise is less widely available, and hardware is not abstracted from users to the same extent as in microprocessor systems. However, the situation is improving significantly.

Many EDA tool providers are developing high-level development tools for high-performance, real-time and embedded systems applications running in FPGAs. Examples include Xilinx's System Generator, The Mathworks' MATLAB, and Mitrionics' Mitrion-C. These tools are making application development easier, as is a growing base of FPGA programming expertise within the industry. Furthermore, for complex high-performance applications, the benefits of using FPGA computing outweigh the additional challenges of developing applications using the available toolsets and languages.

FPGA futures

From any viewpoint—cost, flexibility, capacity, efficiency and performance—FPGA technology is becoming the new "best practice" in the design and development of high-end, computer-intensive systems. In addition, design engineers see improved customization capability and faster development times with FPGAs, further contributing to lower system costs and improved application performance compared with conventional processor approaches.

Recognized worldwide as an industry expert on FPGA technologies, Dr. Malachy Devlin joined Nallatech as CTO in 1996. His specialties also include DSP algorithms, high-performance computing, high-performance embedded computing, embedded software, and distributed computing. He has also worked at the National Engineering Laboratory, Telia in Sweden and Hughes Microelectronics (now part of Raytheon). Devlin obtained his Ph.D. in signal processing from Strathclyde University.