How to easily obtain a VCF file based on BAM files using bcftools

    The VCF (Variant Call Format) is a widely used file format in bioinformatics for storing genetic variation data. It was developed as a standardized way to represent genetic variants such as SNPs (Single Nucleotide Polymorphisms), indels (insertions and deletions), and other types of genetic alterations. VCF files are typically the output of variant calling processes, which analyse DNA sequence data to identify differences between a sample and a reference genome.

    Each entry in a VCF file represents a specific position in the genome where variation has been observed. The file format includes essential information such as:

  • Chromosome and Position: The genomic location of the variant.
  • Reference and Alternate Alleles: The reference base(s) from the reference genome and the alternate base(s) observed in the sample.
  • Genotype Information: For each sample, the file includes genotype data that indicate which alleles are present at that position.

    VCF files are highly flexible, supporting the storage of additional information like variant quality scores, depth of sequencing coverage, and functional annotations. This makes the VCF format invaluable for tasks ranging from clinical genomics to large-scale population studies.

    By using tools like bcftools or GATK, researchers can generate and manipulate VCF files to analyse genetic variations, making them a cornerstone in modern genomic research.

    In this tutorial we’ll be using bcftools to pileup, call, normalize and filter the variants based on BAM files. The command goes as following:


$ bcftools mpileup -Ou -f <path_to_ref_fasta> <sample1.bam> <sample2.bam> … <samplen.bam> | bcftools call -mv -Ov | bcftools norm -m -any | bcftools filter -e 'QUAL<20 || DP<10' -Ov > output.vcf

    

    1. bcftools mpileup -Ou -f <path_to_ref_fasta> <sample1.bam> <sample2.bam> ... <samplen.bam:

This command generates the mpileup (a format that lists all bases mapped to a given position in the genome) for multiple BAM files.

-Ou: This option produces an uncompressed BCF output, which is useful for piping the data to the next command.

-f <path_to_ref_fasta>: This specifies the path to the reference genome FASTA file. The BAM files will be aligned against this reference.

<sample1.bam> <sample2.bam> ... <samplen.bam>: This is the list of BAM files you want to analyze. You can input multiple BAM files at once to perform a joint variant calling analysis on multiple samples.

    

    2. bcftools call -mv -Ov:

This command performs variant calling based on the mpileup data, identifying positions in the genome where variants (e.g., SNPs, indels) are present.

-m: This option is for multiallelic calling, meaning it can call sites with more than two alleles.

-v: This filters out sites that have no variants (i.e., it will only output variant sites).

-Ov: This specifies that the output should be in VCF (Variant Call Format) in plain text.

 

    3. bcftools norm -m -any

This command normalizes the variants in the VCF file, particularly useful for handling multiallelic variants.

-m -any: This option splits multiallelic variants into multiple biallelic variants. It ensures that each line of the VCF represents a single variant allele, which is helpful for downstream analysis.


    4. bcftools filter -e 'QUAL<20 || DP<10' -Ov

This command applies a filter to the VCF file, removing variants based on quality and depth criteria.

-e 'QUAL<20 || DP<10': This is an exclusion filter. It filters out variants that have a quality score (QUAL) less than 20 or a depth of coverage (DP) less than 10. Variants with low quality or insufficient coverage are considered unreliable.

-Ov: This outputs the final result as a VCF in text format.

Comments

Popular posts from this blog

Welcome!