How to Download NGS Data - Using prefetch SRA and fastq-dump for Sequencing Reads
Many publications that generate next-generation sequencing (NGS) data deposit it in NCBI's Sequence Read Archive (SRA). This sharing lets other scientists re-examine the data, learn from it, or reuse it in their own studies to explore new questions.
All publicly available SRA records can be searched at https://www.ncbi.nlm.nih.gov/sra. Here, you can look for a specific record, for a species' whole-genome sequencing data, or even for RNA-seq data. Once you have chosen a record, you need its run accession (an identifier starting with SRR), which is listed on the record's page.
NCBI provides a command-line toolkit, the SRA Toolkit, for interacting with the database and with individual SRA records. On Debian or Ubuntu, it can be installed by running the following command:
$ sudo apt install sra-toolkit
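To confirm that the installation worked, you can check that both commands are on your PATH. The sketch below is only one way to do it; `check_sra_tools` is a made-up helper name, not part of the SRA Toolkit, and it relies solely on the shell built-in `command -v`.

```shell
#!/usr/bin/env bash
# check_sra_tools is a hypothetical helper name, not part of the SRA Toolkit.
# It reports, for each command name given, whether it can be found on the PATH.
# Returns 0 if all were found, 1 if at least one is missing.
check_sra_tools() {
    local missing=0
    local tool
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found: $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}
```

Calling `check_sra_tools prefetch fastq-dump` should print one line per tool, so a failed install is easy to spot before starting a long download.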
The two most used SRA Toolkit commands are prefetch and fastq-dump. The prefetch command downloads the compressed run archives (.sra files) from SRA to local storage. Here is an example you can follow, corresponding to a whole-genome sequencing sample of a Quercus lobata tree:
$ prefetch SRR14546180 -O path/to/srr/output
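If you need more than one run, the same prefetch call can be repeated for each accession. The following is a minimal sketch, assuming a plain text file with one run accession per line; the function name `prefetch_list` and the file layout are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# prefetch_list is a hypothetical helper, not an SRA Toolkit command.
# $1: text file with one run accession per line (assumed layout)
# $2: output directory passed to prefetch via -O
prefetch_list() {
    local list=$1
    local outdir=$2
    local acc
    # The "|| [ -n "$acc" ]" part keeps a last line that lacks a trailing newline.
    while IFS= read -r acc || [ -n "$acc" ]; do
        # Skip blank lines.
        [ -z "$acc" ] && continue
        # Redirect stdin so the download cannot consume the accession list.
        prefetch "$acc" -O "$outdir" < /dev/null
    done < "$list"
}
```

For example, `prefetch_list accessions.txt path/to/srr/output` would download every run listed in `accessions.txt`, one after the other.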
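After the download, you can extract the fastq files from the .sra file using fastq-dump. Recent versions of prefetch place the archive in a subdirectory named after the accession; if yours does not, adjust the path accordingly:
$ fastq-dump --split-3 --gzip --outdir /path/to/fq/output path/to/srr/output/SRR14546180/SRR14546180.sra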
Here, --split-3 writes the two mates of paired-end reads to separate files (ending in _1 and _2) and any reads without a mate to a third file, and --gzip compresses each output file.
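A quick sanity check after extraction is to count the reads in each file: for a paired-end run, the _1 and _2 files should contain the same number of reads. The sketch below assumes standard four-line FASTQ records; `count_reads` is a made-up helper name.

```shell
#!/usr/bin/env bash
# count_reads is a hypothetical helper. It assumes each FASTQ record spans
# exactly 4 lines, so the read count is the line count divided by 4.
# $1: path to a gzip-compressed FASTQ file
count_reads() {
    local lines
    lines=$(gzip -dc "$1" | wc -l)
    echo $(( lines / 4 ))
}
```

Running `count_reads /path/to/fq/output/SRR14546180_1.fastq.gz` and then the same for the `_2` file should give two identical numbers; a mismatch usually points to an interrupted extraction.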
After obtaining the fastq.gz files, the next step is quality control with FastQC.
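As a starting point for that step, here is a small sketch of how FastQC could be run on the extracted files. It assumes FastQC is installed and available as `fastqc`; the wrapper name `run_fastqc` and the output path in the usage note are made up for illustration.

```shell
#!/usr/bin/env bash
# run_fastqc is a hypothetical wrapper name.
# $1: directory for the FastQC reports
# remaining arguments: the fastq.gz files to check
run_fastqc() {
    local outdir=$1
    shift
    # FastQC expects the output directory to exist, so create it first.
    mkdir -p "$outdir"
    fastqc --outdir "$outdir" "$@"
}
```

For the example run, this could be called as `run_fastqc /path/to/qc/output /path/to/fq/output/SRR14546180_1.fastq.gz /path/to/fq/output/SRR14546180_2.fastq.gz`, which should leave one HTML report per input file in the output directory.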