Lossless Compression Tools for Genomics Data

DNA is essential for all living beings, since it carries the information blueprint and the mechanisms that translate the DNA code into instructions for how organisms live, reproduce, metabolize, mature, and ultimately die. DNA consists of chemical building blocks called nucleotides. Each nucleotide contains a phosphate group, a sugar group, and a nitrogen base. The four types of nitrogen bases are adenine (A), thymine (T), guanine (G), and cytosine (C). The complete DNA instruction book, or genome, for humans is around 3 billion bases, and more than 99 percent of those bases are the same in all people. The order of these bases, or sequence, is what determines DNA's instructions, or genetic code. The process of determining the precise order of DNA bases is called DNA sequencing.

The rapid advancement in sequencing technologies has significantly accelerated biomedical research and discovery. Nowadays, next-generation sequencing technologies are producing genomic data at ever-increasing rates. Storing, transmitting, and processing this massive quantity of data has become a challenge, creating a vital need for solutions to deal with such a data deluge. Compression is essential to addressing this deluge, as it reduces storage space, transmission bandwidth, and processing costs. A key factor is that the generated next-generation sequencing data are highly redundant and thus can be compressed efficiently. Next-generation sequencing data come in many different file formats, such as FASTA (reference genomes), FASTQ (raw reads), and SAM (alignments); the latter two contain quality scores and metadata in addition to the biological sequences. Currently, data centers typically adopt one of two general-purpose compressors, gzip or bzip2, to compress these data. Being general purpose, these compressors do not take advantage of genomic data properties such as the small alphabet, the many repeats, and the high redundancy.
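To make the small-alphabet property concrete, the following minimal Python sketch (not part of the dissertation's tools) packs each of the four bases into 2 bits before applying DEFLATE, the algorithm behind gzip. The sequence used is a hypothetical repetitive fragment chosen purely for illustration.

```python
import zlib

# Hypothetical, highly repetitive DNA-like sequence for illustration only.
seq = ("ACGTACGGTTAACCGG" * 4096).encode("ascii")  # 65,536 bases

# Baseline: general-purpose DEFLATE (as used by gzip) on the raw ASCII text.
ascii_compressed = zlib.compress(seq, 9)

# Exploit the 4-symbol alphabet: pack each base into 2 bits (4 bases/byte).
code = {"A": 0, "C": 1, "G": 2, "T": 3}
packed = bytearray()
for i in range(0, len(seq), 4):
    byte = 0
    for base in seq[i:i + 4].decode("ascii"):
        byte = (byte << 2) | code[base]
    packed.append(byte)

# DEFLATE on the packed representation, now a quarter of the original size.
packed_compressed = zlib.compress(bytes(packed), 9)

print(len(seq), len(packed), len(ascii_compressed), len(packed_compressed))
```

Packing alone guarantees a 4:1 reduction before any entropy coding runs, which is one simple way a genomics-aware front end can help a downstream compressor.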
In this dissertation, we propose different compression tools for the FASTA/Multi-FASTA and FASTQ file formats. We first worked on improving the performance of the well-known compression technique, Huffman Tree Encoding. We built an Unbalanced Huffman Tree designed to exploit the properties of genomic data; this method shows significant improvement over the standard Huffman Tree in terms of compression ratio. Furthermore, we were able to construct a nongreedy Huffman Tree, which is an optimal Huffman Tree, that further improves the performance of our Unbalanced Huffman Tree Encoding. We also built a compressor based on multiple Huffman Trees, which proves to outperform the single Huffman Tree. All three methods above are designed to compress FASTA and Multi-FASTA files only.

Lastly, we developed a specialized FASTQ compressor that achieves the best compression ratio on the LS454, PacBio, and MinION data sets when compared with other state-of-the-art FASTQ compressors. This compressor works by first pre-processing the FASTQ file, splitting it into three main streams, and then assigning each stream to a different compressor so that each is compressed independently.
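The stream-splitting step can be sketched as follows. This is a simplified illustration, not the dissertation's implementation: it assumes the three streams are read identifiers, nucleotide sequences, and quality scores, and the choice of zlib and bz2 as per-stream back ends is purely illustrative.

```python
import bz2
import zlib

# Hypothetical two-record FASTQ fragment (identifier, bases, '+', qualities).
fastq = (
    "@read1\nACGTACGTAC\n+\nIIIIHHHHGG\n"
    "@read2\nTTGGCCAATT\n+\nFFFFIIIIHH\n"
)

ids, seqs, quals = [], [], []
lines = fastq.splitlines()
for i in range(0, len(lines), 4):   # each FASTQ record spans four lines
    ids.append(lines[i])            # @identifier line
    seqs.append(lines[i + 1])       # nucleotide sequence
    quals.append(lines[i + 3])      # quality scores (the '+' line is dropped)

# Compress each stream independently; in principle each stream gets the
# compressor best suited to its statistics (illustrative choices here).
streams = {
    "ids": zlib.compress("\n".join(ids).encode()),
    "seqs": bz2.compress("\n".join(seqs).encode()),
    "quals": bz2.compress("\n".join(quals).encode()),
}
for name, blob in streams.items():
    print(name, len(blob))
```

Separating the streams matters because identifiers, sequences, and quality scores have very different alphabets and statistics, so a compressor tuned to one stream does poorly on the others.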