Public databases such as the NIH Sequence Read Archive (SRA) now contain hundreds of thousands of short-read sequencing experiments. A major challenge is making that raw data accessible and useful for biological analysis: researchers must be able to find the relevant and related experiments on which to perform their analyses. A fundamental computational problem in that effort is searching short-read experiments by sequence. Specifically, given a query string Q and a very large collection of short-read sequencing experiments, we want to quickly find the experiments containing reads that make it likely that Q was among the sequences present, and we want to do this without appealing to a reference sequence (to support searching metagenomic and cancer experiments, for example). I will present a new approach to this problem, called Sequence Bloom Trees (Nature Biotechnology, 34, 300–302, 2016), and a subsequent refinement called Split Sequence Bloom Trees (to appear in RECOMB 2017). These approaches allow terabytes of raw short-read sequencing experiments to be searched in minutes using a single thread on a desktop-class computer.
This is joint work with Brad Solomon.