How to Download NGS Data - Using prefetch SRA and fastq-dump for Sequencing Reads
All SRA records are available at https://www.ncbi.nlm.nih.gov/sra. There you can search for a specific record, for a species' whole-genome sequencing data, or for RNA-seq data. Once you have chosen a record, note its run accession (an ID starting with SRR), which is listed on the record's page.
NCBI provides a command-line toolkit, the SRA Toolkit, that lets users interact with the database and with individual SRA records. On Debian or Ubuntu, it can be installed by running the following command:
$ sudo apt install sra-toolkit
The two most commonly used SRA Toolkit commands are prefetch and fastq-dump. The prefetch command downloads the compressed run archives (the .sra files) from SRA to local storage. Here is an example you can follow, using a whole-genome sequencing sample of a Quercus lobata tree:
$ prefetch SRR14546180 -O path/to/srr/output
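When several runs are needed, the same prefetch call can be wrapped in a loop. The following is only a sketch under stated assumptions: the function names `is_run_accession` and `download_runs` and the `DRY_RUN` variable are hypothetical helpers, not part of the SRA Toolkit, and the accession check (SRR, ERR, or DRR followed by digits) is a simplified pattern.

```shell
# Sketch: download several SRA runs with prefetch.
# is_run_accession, download_runs and DRY_RUN are illustrative names only.

# Succeed if the argument looks like a run accession (SRR/ERR/DRR + digits).
is_run_accession() {
    case $1 in
        [SED]RR*[!0-9]*) return 1 ;;   # non-digit after the prefix
        [SED]RR[0-9]*)   return 0 ;;   # prefix followed by digits only
        *)               return 1 ;;
    esac
}

# download_runs OUTDIR ACCESSION...
# With DRY_RUN set, the prefetch commands are printed instead of executed.
download_runs() {
    dl_outdir=$1
    shift
    for dl_acc in "$@"; do
        if ! is_run_accession "$dl_acc"; then
            echo "skipping '$dl_acc': not a run accession" >&2
            continue
        fi
        if [ -n "${DRY_RUN:-}" ]; then
            echo "prefetch $dl_acc -O $dl_outdir"
        else
            prefetch "$dl_acc" -O "$dl_outdir"
        fi
    done
}
```

For example, `download_runs path/to/srr/output SRR14546180` would fetch the run used above; setting `DRY_RUN=1` first shows the commands without downloading anything.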
After the download completes, you can extract the FASTQ files from the .sra file using fastq-dump:
$ fastq-dump --split-3 --gzip --outdir /path/to/fq/output path/to/srr/output/SRR14546180/SRR14546180.sra
Here, --split-3 writes the two mates of paired-end reads to separate files (suffixed _1 and _2) and any unpaired reads to a third file, and --gzip compresses each output file.
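The extraction step can also be wrapped in a small shell function. This is a minimal sketch, not a definitive implementation: `sra_path`, `dump_run`, and `DRY_RUN` are hypothetical names, and the assumed layout `<output dir>/<accession>/<accession>.sra` matches how recent prefetch versions store downloads when -O is given. Older versions may place the file elsewhere, so check where the archive actually landed.

```shell
# Sketch: run fastq-dump on an archive downloaded with prefetch -O.
# sra_path, dump_run and DRY_RUN are illustrative names only.

# sra_path SRR_DIR ACCESSION
# Print the path where prefetch -O is ASSUMED to have stored the archive.
sra_path() {
    printf '%s/%s/%s.sra\n' "$1" "$2" "$2"
}

# dump_run SRR_DIR FQ_DIR ACCESSION
# With DRY_RUN set, the fastq-dump command is printed instead of executed.
dump_run() {
    dr_file=$(sra_path "$1" "$3")
    if [ -n "${DRY_RUN:-}" ]; then
        echo "fastq-dump --split-3 --gzip --outdir $2 $dr_file"
    else
        fastq-dump --split-3 --gzip --outdir "$2" "$dr_file"
    fi
}
```

Calling `dump_run path/to/srr/output /path/to/fq/output SRR14546180` would run the same fastq-dump command as above on the downloaded archive.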
After obtaining the fastq.gz files, the next step is quality control with FastQC.