How to easily obtain a VCF file based on BAM files using bcftools
The VCF (Variant Call Format) is a widely used file format in bioinformatics for storing genetic variation data. It was developed as a standardized way to represent genetic variants such as SNPs (Single Nucleotide Polymorphisms), indels (insertions and deletions), and other types of genetic alterations. VCF files are typically the output of variant calling processes, which analyse DNA sequence data to identify differences between a sample and a reference genome.
Each entry in a VCF file represents a specific position in the genome where variation has been observed. The file format includes essential information such as:
- Chromosome
and Position: The genomic location of the variant.
- Reference
and Alternate Alleles: The reference base(s) from the reference genome
and the alternate base(s) observed in the sample.
- Genotype
Information: For each sample, the file includes genotype data that
indicate which alleles are present at that position.
VCF files are highly flexible, supporting the storage of
additional information like variant quality scores, depth of sequencing
coverage, and functional annotations. This makes the VCF format invaluable for
tasks ranging from clinical genomics to large-scale population studies.
By using tools like bcftools or GATK, researchers can generate and manipulate VCF files to analyse genetic variations, making them a cornerstone in modern genomic research.
In this tutorial we’ll be using bcftools to pileup, call, normalize and filter the variants based on BAM files. The command goes as following:
$ bcftools mpileup -Ou -f <path_to_ref_fasta> <sample1.bam>
<sample2.bam> … <samplen.bam> | bcftools call -mv -Ov | bcftools
norm -m -any | bcftools filter -e 'QUAL<20 || DP<10' -Ov > output.vcf
1. bcftools mpileup -Ou -f <path_to_ref_fasta>
<sample1.bam> <sample2.bam> ... <samplen.bam:
This command generates the mpileup (a format that lists all
bases mapped to a given position in the genome) for multiple BAM files.
-Ou: This option produces an uncompressed BCF output, which
is useful for piping the data to the next command.
-f <path_to_ref_fasta>: This specifies the path to the
reference genome FASTA file. The BAM files will be aligned against this
reference.
<sample1.bam> <sample2.bam> ... <samplen.bam>: This is the list of BAM files you want to analyze. You can input multiple BAM files at once to perform a joint variant calling analysis on multiple samples.
2. bcftools call -mv -Ov:
This command performs variant calling based on the mpileup
data, identifying positions in the genome where variants (e.g., SNPs, indels)
are present.
-m: This option is for multiallelic calling, meaning it can
call sites with more than two alleles.
-v: This filters out sites that have no variants (i.e., it
will only output variant sites).
-Ov: This specifies that the output should be in VCF
(Variant Call Format) in plain text.
3. bcftools norm -m -any
This command normalizes the variants in the VCF file, particularly useful for handling multiallelic variants.
-m -any: This option splits multiallelic variants into multiple biallelic variants. It ensures that each line of the VCF represents a single variant allele, which is helpful for downstream analysis.
4. bcftools filter -e 'QUAL<20 || DP<10' -Ov
This command applies a filter to the VCF file, removing
variants based on quality and depth criteria.
-e 'QUAL<20 || DP<10': This is an exclusion filter. It
filters out variants that have a quality score (QUAL) less than 20 or a depth
of coverage (DP) less than 10. Variants with low quality or insufficient
coverage are considered unreliable.
-Ov: This outputs the final result as a VCF in text format.
Comments