
University of Maryland and Intel
BioBench/BioParallel: A Benchmark Suite for Bioinformatics Applications

Maryland Students: Kursad Albayraktaroglu
Maryland Professors: Manoj Franklin, Bruce Jacob, Chau-Wen Tseng, and Donald Yeung
Intel Researchers: Carole Dulong, Aamer Jaleel, Yimin Zhang

Brought to you by Maryland Memory-Systems Research


Overview of Data-Mining Applications and Their Characterization

Recent advances in bioinformatics and the significant increase in computational power available to researchers have made it possible to make better use of the vast amounts of genetic data that have been collected over the last two decades. As the uses of genetic data expand to include drug discovery and development of gene-based therapies, bioinformatics is destined to take its place in the forefront of scientific computing application domains. Despite the clear importance of this field, common bioinformatics applications and their implications for microarchitectural design have received scant attention from the computer architecture community so far. The availability of a common set of bioinformatics benchmarks could be the first step toward motivating further research in this crucial area.

This website presents a collection of benchmark suites, each covering a different application area within the larger domain of data mining. The collection began in 2005 with the release of BioBench, a suite of bioinformatics workloads collected by students and faculty at the University of Maryland. The suite expanded in 2006 with the addition of parallel bioinformatics workloads collected by researchers at Intel (distinguished from the original BioBench suite as the BioParallel suite).

So far we have published the following reports on the benchmarks contained within the larger suite:

We are continuing to expand the suite and will typically release new benchmark sets once they have been intensively characterized.


BioBench: A Benchmark Suite for Bioinformatics Applications

BioBench is a bioinformatics suite assembled by members of the Systems and Computer Architecture Lab (SCAL) in the Department of Electrical and Computer Engineering at the University of Maryland. This research work is directed by Profs. Manoj Franklin, Bruce Jacob, Chau-Wen Tseng, and Donald Yeung, and the various applications are being characterized as part of ongoing research activity on memory systems (e.g., DRAM systems).

The BioBench benchmarks are presented and first characterized in the following publication (should you wish to cite a source):

The application classes and the individual BioBench benchmarks selected to represent them are listed below.

Sequence Similarity Searching -- These applications are typically used to identify similarities between DNA or protein sequences, or to search for certain subsequences in large sequence databases. The similarity between two sequences (or the lack of it) can often reveal important clues about structural or functional relationships between them, and in some cases about common evolutionary roots of organisms. BioBench contains programs from both the BLAST [3] and FASTA [12] suites for sequence similarity searching.
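
As a rough illustration of what a similarity search computes, the following is a minimal sketch of a Smith-Waterman local alignment score used to rank database entries. It is not the algorithm used by BLAST or FASTA, which rely on faster heuristics; the function names, the scoring values, and the toy database are assumptions made for this example only.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score between strings a and b."""
    prev = [0] * (len(b) + 1)  # previous row of the dynamic-programming table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def search(query, database):
    """Rank database entries (name -> sequence) by local alignment score."""
    scores = {name: local_alignment_score(query, seq)
              for name, seq in database.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    toy_database = {"hit": "GGACGTGG", "miss": "CCCCCCCC"}  # hypothetical data
    print(search("ACGT", toy_database))
```

The exhaustive table fill shown here costs time proportional to the product of the two sequence lengths, which is one reason production tools use heuristics when the database is gigabytes in size.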

Phylogenetic Analysis -- This technique aims to discover how a group of related protein sequences was derived from common origins during the process of evolution. This information is frequently displayed as a hierarchical diagram called a phylogenetic tree. The discovery and visualization of such relationships between proteins offer important clues on how certain traits were passed from species to species.
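
One common way to compare candidate trees is parsimony: prefer the tree that explains the observed sequences with the fewest substitutions. The sketch below uses the textbook Fitch algorithm on a rooted binary tree. It is a simplification and not the protein-specific method implemented in PROTPARS; the tree encoding (nested tuples of leaf names) and the toy alignment are assumptions made for this example.

```python
def fitch(tree, states):
    """Return (state_set, substitutions) for one site on a rooted binary tree.

    tree is either a leaf name (str) or a (left, right) tuple; states maps
    each leaf name to the residue observed at this site.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch(tree[0], states)
    right_set, right_cost = fitch(tree[1], states)
    common = left_set & right_set
    if common:
        # The children can agree on a state, so no substitution is needed here.
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, alignment):
    """Sum Fitch substitution counts over all columns of an alignment."""
    length = len(next(iter(alignment.values())))
    return sum(
        fitch(tree, {name: seq[i] for name, seq in alignment.items()})[1]
        for i in range(length)
    )


if __name__ == "__main__":
    # Hypothetical three-residue alignment of four sequences.
    alignment = {"A": "MKV", "B": "MKV", "C": "MRV", "D": "MRL"}
    print(parsimony_score((("A", "B"), ("C", "D")), alignment))
    print(parsimony_score((("A", "C"), ("B", "D")), alignment))
```

A full analysis would also have to search over tree shapes, which is where most of the computation goes; this sketch only scores trees that are handed to it.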

Multiple Sequence Alignment -- This is the process of aligning more than two sequences to find regions of similarity. This kind of analysis gives a deeper understanding of similarity patterns among the sequences, which might suggest common origins of the proteins they code for.

Sequence Profile Searching -- When an evolutionarily diverse set of proteins is under investigation to find remotely related proteins, searching a sequence database for the consensus of a sequence family (a common signature of the family) can be more effective than searching the same database for individual sequences. This analysis approach is called sequence profile searching.
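
To make the idea of a family "signature" concrete, here is a minimal sketch that builds a position-specific scoring profile from a small set of aligned sequences and scans a candidate sequence with it. This is a much simpler model than the profile hidden Markov models used by HMMER (no insertions or deletions are modelled); the function names, the uniform background, the pseudocount, and the toy family are assumptions made for illustration.

```python
import math
from collections import Counter


def build_profile(aligned_family, alphabet, pseudocount=1.0):
    """Per-column log-odds scores from equal-length, gap-free aligned sequences."""
    background = 1.0 / len(alphabet)  # assumed uniform background frequency
    profile = []
    for column in zip(*aligned_family):
        counts = Counter(column)
        total = len(column) + pseudocount * len(alphabet)
        profile.append({
            residue: math.log(((counts[residue] + pseudocount) / total) / background)
            for residue in alphabet
        })
    return profile


def best_window_score(profile, sequence):
    """Highest profile score over all ungapped windows of the sequence.

    Every character of sequence must belong to the profile's alphabet.
    """
    width = len(profile)
    scores = [
        sum(profile[k][sequence[start + k]] for k in range(width))
        for start in range(len(sequence) - width + 1)
    ]
    return max(scores) if scores else float("-inf")


if __name__ == "__main__":
    family = ["ACGT", "ACGA", "ACCT"]  # hypothetical aligned family
    profile = build_profile(family, "ACGT")
    print(best_window_score(profile, "TTACGTTT"))
    print(best_window_score(profile, "TTTTTTTT"))
```

Because the profile pools evidence from the whole family, a sequence can score well even if it does not closely match any single family member.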

Genome-level Alignment -- Genome-level alignment algorithms and tools are used to align complete genomes of related species. Due to the sheer number of nucleotides in a complete genome, multiple-sequence alignment algorithms and tools (which are geared more toward aligning single proteins or simple nucleotide sequences) cannot be used effectively for this task. Genome-level alignment tools employ algorithms specifically developed for the pairwise alignment of very large nucleotide sequences.
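
One way such tools avoid filling a full alignment table is to first find short exact matches that occur only once in each genome and use them as anchors. The sketch below does this with a hash index of fixed-length k-mers. It is an illustrative simplification and not the approach implemented in MUMMER; the function names and the choice of fixed-length k-mers are assumptions made for this example.

```python
from collections import defaultdict


def kmer_positions(sequence, k):
    """Map each k-mer to the list of positions where it starts."""
    positions = defaultdict(list)
    for i in range(len(sequence) - k + 1):
        positions[sequence[i:i + k]].append(i)
    return positions


def unique_anchors(genome_a, genome_b, k):
    """Return sorted (pos_a, pos_b) pairs for k-mers unique in both genomes."""
    index_a = kmer_positions(genome_a, k)
    index_b = kmer_positions(genome_b, k)
    anchors = [
        (hits_a[0], index_b[kmer][0])
        for kmer, hits_a in index_a.items()
        # Keep only k-mers that occur exactly once in each genome.
        if len(hits_a) == 1 and len(index_b.get(kmer, ())) == 1
    ]
    return sorted(anchors)


if __name__ == "__main__":
    # Hypothetical toy "genomes"; real inputs are millions of bases long.
    print(unique_anchors("ACGTTGCA", "GGACGTTG", 4))
```

The work here grows roughly linearly with genome length, whereas a full dynamic-programming alignment grows with the product of the two lengths.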

Sequence Assembly -- These tools are used to generate sequence data from many small, overlapping partial sequences obtained by DNA sequencing hardware. Sequence assembly is a crucial step in using shotgun sequencing to obtain complete sequence data from physical DNA sequences.


BioParallel: A Suite of Parallel Bioinformatics Workloads

BioParallel is a parallel bioinformatics suite assembled by researchers at Intel, including Carole Dulong and Yimin Zhang, who have graciously allowed us to release the suite to the community. These benchmarks are part of a larger work within Intel to collect, observe, and characterize a new and growing class of applications termed Recognition Mining Synthesis (RMS). This suite is a work in progress (i.e., there are more benchmarks to come).

The first members of the BioParallel benchmark suite are characterized in the following publication (should you wish to cite a source):

The individual BioParallel benchmarks are listed below.


Download Benchmarks and Related Information

Publications are in PDF format; benchmarks are provided as gzipped tarballs.


    PAPERS:
    BioBench - "BioBench: A benchmark suite of bioinformatics applications." K. Albayraktaroglu, A. Jaleel, X. Wu, M. Franklin, B. Jacob, C.-W. Tseng, and D. Yeung. Proc. 2005 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS 2005), pp. 2-9. Austin TX, March 2005.
    BioParallel - "Last-level cache (LLC) performance of data-mining workloads on a CMP--A case study of parallel bioinformatics workloads." Aamer Jaleel, Matthew Mattina, and Bruce Jacob. Proc. 12th International Symposium on High Performance Computer Architecture (HPCA 2006), Austin TX, February 2006.

    BENCHMARKS:
    BioBench (If you use these benchmarks for any studies, please reference the ISPASS-2005 paper above -- Thank you.)
    001.mummer - MUMMER v. 3.14 (S. Kurtz, et al., Center for Bioinformatics Research, University of Hamburg DE)
    MUMMER is a genome-level alignment tool that has been used to assemble complete genomes.
    002.tigr - TIGR v. 2 (The Institute for Genomic Research, Rockville MD)
    The TIGR Assembler suite is in the class of sequence assembly applications.
    003.clustalw - CLUSTAL W (J. D. Thompson, et al., European Molecular Biology Lab, Heidelberg DE)
    CLUSTAL W builds on the CLUSTAL package and is currently the most commonly used application for multiple-sequence alignment.
    004.hmmer - HMMER v. 2.3 (National Institutes of Health, Bethesda MD)
    HMMER is a sequence profile searching package that uses profiles based on hidden Markov models to conduct searches against protein databases. This benchmark searches the SwissPROT protein database against the consensus of a small selection of protein sequences.
    005.blast 005.blast.input - BLAST v. 1.3 (National Institutes of Health, Bethesda MD)
    BLASTN and BLASTP are the most commonly used sequence-searching applications, used for DNA and protein sequence searching, respectively. The DNA and protein databases used by the benchmark are NCBI's NT (11GB) and NR (945MB) databases containing the full set of non-redundant DNA and protein sequences submitted to NCBI.
    Note: due to size constraints (12 GB), the databases are not available in the tarball.
    006.phylip - PHYLIP v. 3.5c (J. Felsenstein, University of Washington)
    PHYLIP is the most widely used phylogenetic analysis package and contains several programs to conduct different types of phylogenetic analysis. From this package, we use PROTPARS, a protein parsimony computation application.
    007.fasta - FASTA v. 3.4t21 (B. Pearson, University of Virginia)
    FASTA is the main search utility from University of Virginia's FASTA suite v3.4t21, an important sequence searching suite. To reflect the difference between protein and nucleotide (DNA) searches, our test cases use the FASTA application for searching against a DNA database and a protein database with suitable search sequences. The DNA database used in our study is a daily update file to the NCBI GenBank data repository (190MB), and the protein database used is the entire SwissPROT protein database (70MB).
    BioParallel (benchmarks made available March 24th 2006)
    101.genenet - GeneNet (Y. Chen, et al., Intel)
    GeneNet is used to measure the regulatory relationship between genes. The algorithm is written in C++ with some details implemented using Intel's open-source Probabilistic Networks Library (PNL). The training data input is the cell cycle data of Yeast (173 sequences). The total memory working set size of this application is 350MB.
    102.snp - SNP (X. Ma, et al., Intel)
    SNP is used to measure and understand the patterns of Single Nucleotide Polymorphisms (SNPs). The algorithm is written in C++ with some details implemented using the Probabilistic Networks Library (PNL). The training input is a 30MB freely downloadable data set from HGBASE (Human Genic Bi-Allelic Sequences), a database of SNPs. There are a total of 616,179 SNP sequences in the training data set and each sequence has a length of 50. The total memory working set size of this application is 170MB.
    103.semphy - SEMPHY (N. Friedman, et al., Hebrew University, Jerusalem)
    SEMPHY is a tool for constructing phylogenetic trees, used to represent the relationship among different species and possibly describe the course of evolution. The algorithm is written in C++ and handles both DNA and protein sequences. The input data set consists of sequences from the Pfam database. The total memory working set size of this application is 90MB.
    104.svm - SVM (Y. Chen, et al., Intel)
    Support Vector Machines Recursive Feature Elimination (SVM-RFE) is used to eliminate gene redundancy from a given input data set in order to provide compact gene subsets. The algorithm is written in C++ and uses the Intel Math Kernel Library (MKL) to enhance performance. The input data set to the application is a microarray data set involving ovarian cancer. The ovarian data set contains 253 (tissue samples) x 15154 (genes) expression values. The total memory working set size of this application is 300MB.
    105.plsa - PLSA (Y. Chen, et al., Intel)
    Parallel Linear Space Alignment (PLSA) is used to identify the similarities or differences between two genetic sequences, e.g. DNA/protein sequences. The application is written in C++ and takes as inputs two sequences each 30,000 letters long. The total memory working set size of this application is 14MB.
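
To illustrate why linear-space techniques matter for inputs of this size: a full dynamic-programming table for two 30,000-letter sequences would hold roughly 900 million cells, whereas the alignment score can be computed while keeping only two rows in memory. The sketch below shows that idea for a global alignment score. It is a sequential simplification and not the PLSA implementation; the function name and scoring values are assumptions, and recovering the alignment itself (rather than just its score) in linear space requires an additional divide-and-conquer step that is not shown.

```python
def alignment_score_linear_space(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment score of a and b using O(len(b)) memory."""
    # Row 0: aligning the empty prefix of a against prefixes of b costs gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(diag, prev[j] + gap, curr[j - 1] + gap)
        prev = curr  # only the most recent row is kept
    return prev[-1]


if __name__ == "__main__":
    print(alignment_score_linear_space("ACGT", "AGT"))
```

Only the previous and current rows are live at any time, so memory grows with the length of one sequence rather than with the product of both lengths.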

Contact Information

The people that brought you these benchmarks can be contacted via email.

BioParallel

BioBench

Traditional correspondence can be sent to

Prof. Bruce Jacob
Dept. of Electrical & Computer Engineering
University of Maryland
College Park, MD 20742


Acknowledgments

This work is supported by Intel and the National Science Foundation.

References

[1] SGI Bioinformatics Performance Report. http://www.sgi.com/industries/sciences/chembio/pdf/bioperf01.pdf

[2] http://perfsuite.ncsa.uiuc.edu

[3] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman, "Basic local alignment search tool", Journal of Molecular Biology, vol. 215, no. 3, pp. 403-410, October 1990.

[4] S. Browne, J. Dongarra, N. Garner, K. London, and P. Mucci. "A scalable cross-platform infrastructure for application performance tuning using hardware counters" in Proceedings of the 2000 ACM/IEEE Conference on Supercomputing, page 42, 2000.

[5] U. Catalyurek, E. Stahlberg, R. Ferreira, T. Kurc, and J. Saltz, "Improving performance of multiple sequence alignment analysis in multi-client environments" in Online Proceedings of the 1st International Workshop on High Performance Computational Biology (HICOMB 2002).

[6] A.L. Delcher, S. Kasif, R.D. Fleischmann, J. Peterson, O. White, and S.L. Salzberg, "Alignment of whole genomes", Nucleic Acids Research, vol. 27, no. 11, pp. 2369-2376, 1999.

[7] S. R. Eddy, "Profile hidden Markov models", Bioinformatics, vol. 14, no. 9, pp. 755-763, 1998.

[8] J. Felsenstein, "PHYLIP-phylogeny inference package (version 3.2)", Cladistics, vol. 5, pp. 164-166, 1989.

[9] D. G. Higgins and P. M. Sharp, "CLUSTAL: a package for performing multiple sequence alignment on a microcomputer", Gene, vol. 73, pp. 237-244, 1988.

[10] C. Lee, M. Potkonjak, and W. H. Mangione-Smith, "Mediabench: A tool for evaluating and synthesizing multimedia and communications systems", in International Symposium on Microarchitecture (MICRO), pp. 330-335, 1997.

[11] J. M. May, "MPX: Software for multiplexing hardware performance counters in multithreaded programs" in Proceedings of the 15th International Parallel & Distributed Processing Systems Symposium (IPDPS), 2001.

[12] W. R. Pearson and D. J. Lipman, "Improved tools for biological sequence comparison", Proc. Natl. Acad. Sci. USA, no. 8, pp. 2444-2448, April 1988.

[13] SPEC. SPEC Benchmark Suite Release 1.0. SPEC, 1989.

[14] B. Sprunt, "Managing the complexity of performance monitoring hardware: The brink and abyss approach", http://www.eg.bucknell.edu/~bsprunt/emon/brink_abyss/brink_abyss.shtm

[15] G. G. Sutton, O. White, M.D. Adams, and A.R. Kerlavage, "TIGR assembler: A new tool for assembling large shotgun sequencing projects", Genome Science and Technology, vol. 1, no. 2, pp. 9-19, 1995.

[16] J. D. Thompson, D. G. Higgins, and T. J. Gibson, "CLUSTAL W: improving the sensitivity of progressive multiple-sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice", Nucleic Acids Research, vol. 22, pp. 4673-4680, June 1994.

[17] TPC. TPC Benchmark A. Itom International Co., 1989.

[18] T. Wolf and M. Franklin, "Commbench - a telecommunications benchmark for network processors" in Proceedings of IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pp. 154-162, 2000.

[19] T. K. Yap, O. Frieder, and R. L. Martino, "Parallel computation in biological sequence analysis", IEEE Transactions on Parallel and Distributed Systems, vol. 9, no. 3, pp. 283-294, 1998.