Researchers from Carnegie Mellon University have developed a new database search technique that finds DNA sequences in minutes instead of days. The method specializes in searching the short reads generated by high-throughput sequencing. The findings were published in the journal Nature Biotechnology.
The amount of genetic data with potentially important clinical information is growing very quickly. Among other public repositories, the National Institutes of Health (NIH) maintains the Sequence Read Archive, a database of three petabases that contains useful information for basic and applied research. The database is heavily used even though it is difficult to search. A typical search, in which short reads of 50-200 base pairs are assembled into a gene of approximately 10,000 base pairs, takes days.
Sequence Bloom Trees method to speed up searches
To improve database search speed, the Carnegie Mellon researchers developed Sequence Bloom Trees (SBT), a method that queries thousands of short-read experiments by sequence 162 times faster than current methods. An SBT searches data archives for all experiments whose reads contain a given sequence. More than 2,600 human blood, breast and brain RNA-seq experiments were analyzed using SBT, searching for more than 214,000 known transcripts. The test took less than four days using a single CPU and 293 MB of RAM. Most searches were completed in 20 minutes, whereas with SRA-BLAST and STAR they would take an estimated 2.2 days and 921 days, respectively. The method can also run 200,000 queries simultaneously, further speeding up the process.
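The general idea behind a Sequence Bloom Tree can be illustrated with a toy sketch. This is not the authors' software: the names (`leaf`, `parent`, `query`), the k-mer length, the filter size, the hash scheme and the 0.8 threshold are all assumptions chosen for readability. Each leaf stores a Bloom filter of the k-mers found in one experiment, each internal node stores the union of its children's filters, and a query only descends into subtrees whose filter contains a large enough fraction of the query's k-mers.

```python
import hashlib

K = 5        # k-mer length (assumed; kept tiny for this toy example)
M = 4096     # bits per Bloom filter (assumed)
H = 2        # hash functions per k-mer (assumed)


def kmers(seq, k=K):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def positions(kmer):
    """Map a k-mer to H bit positions derived from one SHA-256 digest."""
    digest = hashlib.sha256(kmer.encode()).digest()
    return [int.from_bytes(digest[4 * i:4 * i + 4], "big") % M for i in range(H)]


class Node:
    def __init__(self, name, bits, children=()):
        self.name = name          # experiment name (leaves) or label (internal)
        self.bits = bits          # Bloom filter stored as a Python int bitset
        self.children = children


def leaf(name, reads):
    """Build a leaf whose filter holds every k-mer of one experiment's reads."""
    bits = 0
    for read in reads:
        for km in kmers(read):
            for p in positions(km):
                bits |= 1 << p
    return Node(name, bits)


def parent(name, left, right):
    """Build an internal node whose filter is the union of its children's."""
    return Node(name, left.bits | right.bits, (left, right))


def contains(bits, kmer):
    """Bloom filter membership test (false positives are possible)."""
    return all((bits >> p) & 1 for p in positions(kmer))


def query(node, query_kmers, theta=0.8):
    """Return names of leaves that hold at least theta of the query k-mers."""
    hits = sum(contains(node.bits, km) for km in query_kmers)
    if hits < theta * len(query_kmers):
        return []                 # prune: nothing below this node can match
    if not node.children:
        return [node.name]
    return [name for child in node.children
            for name in query(child, query_kmers, theta)]


# Three made-up "experiments" with invented reads.
blood = leaf("blood", ["ACGTACGTTAGC", "GGGTTTAAACCC"])
brain = leaf("brain", ["TTGACCAGTACA", "CCCCGGGGAAAA"])
breast = leaf("breast", ["ACGTACGTTAGC", "ATATATCGCGCG"])
root = parent("root", parent("inner", blood, brain), breast)

print(query(root, kmers("ACGTACGTTAGC")))
```

In this sketch the query sequence occurs in the invented blood and breast reads, so those two leaves should be reported while the brain leaf is pruned. Because Bloom filters can return false positives, a real system would have to tune the filter size and threshold to the data; the values here are only illustrative.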
The authors have released the software as open-source code.