How to plot a PCA from BAM files using ngsTools
Principal Component Analysis (PCA) is a statistical method used to reduce the dimensions of large datasets, increasing interpretability while minimizing information loss. When applied to BAM files, PCA can help identify patterns of genetic variation across samples. By transforming complex, multidimensional data from BAM files into a set of orthogonal (independent) variables known as principal components, you can more easily discern genetic similarities and differences among samples. This process is particularly useful in population genomics to study genetic diversity and population structure.
To plot a PCA based on BAM files, you firstly need to install ngsTools (you can follow the instructions here).
In the following tutorial we will rely on genotype likelihoods rather than hard genotype calls. Genotype likelihoods have the advantage of incorporating the uncertainty associated with genotype calls from sequence data, allowing for more accurate and robust genetic analyses. This is particularly useful in low coverage sequencing or in the presence of sequencing errors, as it avoids making definitive genotype assignments prematurely.
Sample data used in this tutorial can be downloaded here.
First, we need to obtain the genotype probabilities (.geno file, in binary format) and the allele frequencies (.mafs file):
$ ngsTools/angsd/./angsd -P 4 -b <path_to_bamlist>/bam.list -ref <path_to_reference_genome>/chr1.fna -out <path_to_geno>/samples -minInd 6 -doCounts 1 -GL 1 -doMajorMinor 1 -doMaf 1 -skipTriallelic 1 -doGeno 32 -doPost 1
In this case:
-P 4: Specifies the number of threads to use for the analysis, allowing parallel processing to speed up the computation. Here, it is set to use 4 threads.
-b: Points to a file that contains a list of BAM files to be analyzed. Each file path in this list should be on a separate line.
-ref: Specifies the reference genome file in FASTA format against which the BAM files will be compared.
-out: Defines the output directory and prefix for the files generated by ANGSD.
-minInd: Sets the minimum number of individuals that must have data for a position to be analyzed.
-doCounts 1: Enables the counting of the number of reads supporting each allele. This is crucial for downstream allele frequency estimation.
-GL 1: Specifies the method for calculating genotype likelihoods. The value 1 usually denotes the use of the SAMtools approach.
-doMajorMinor 1: Instructs ANGSD to infer the major and minor alleles based on allele frequency.
-doMaf 1: Activates minor allele frequency estimation based on the genotype likelihoods calculated.
-skipTriallelic 1: This option tells ANGSD to skip sites that appear to have more than two alleles (triallelic or more), simplifying the analysis to biallelic sites only.
-doGeno 32: Outputs genotype probabilities. The number 32 specifically instructs the program to write out posterior probabilities of genotypes in beagle format.
-doPost 1: Directs the calculation of posterior probabilities of genotypes under the Hardy-Weinberg equilibrium assumption.
We extract the genotype probabilities:
$ gunzip <path_to_geno>/samples.geno.gz
We will utilize ngsCovar to estimate the covariance matrix between individuals based on genotype probabilities. Subsequently, this matrix will be decomposed into principal components for investigating population structure. But first, we will need to obtain the number of sites we have. For that we have the allele frequencies file:
$ NSITES=`zcat <path_to_geno>/samples.mafs.gz | tail -n+2 | wc -l`
$ echo $NSITES
Now we can compute the principal components:
$ ngsTools/ngsPopGen/./ngsCovar -probfile <path_to_geno>/samples.geno -outfile <path_to_pc>/samples.covar -nind 6 -nsites $NSITES -call 0 -norm 0
In this case:
-probfile: Specifies the input file containing genotype probabilities.
-outfile: Defines the output file for the covariance matrix that ngsCovar will produce.
-nind: Indicates the number of individuals (samples) included in the analysis
-nsites: Specifies the number of genomic sites (positions) considered in the analysis.
-call 0: This option determines how genotype calls are used in the covariance matrix calculation. A setting of 0 typically means that no hard genotype calling is performed; instead, genotype probabilities are used directly. This approach can be more robust in datasets with varying quality or depth.
-norm 0: Dictates whether to normalize the data matrix before the covariance calculation. When set to 0, it indicates that no normalization will be performed. Normalization can affect the scaling of principal components and might be skipped based on specific analytical needs or dataset characteristics.
Finally, we conduct eigenvector decomposition and visualize the results on a map. Initially, we must generate a Plink cluster-like file that specifies the population labels for each sample. Here is an example for the sample data we are using (samples.clst):
FID IID CLUSTER
1 1 sample_1
2 1 sample_2
3 1 sample_3
4 1 sample_4
5 1 sample_5
6 1 sample_6
We plot the PCA in PDF format using the following command, in this case the first and second components (-c 1-2):
$ Rscript ngsTools/Scripts/plotPCA.R -i <path_to_pc>/samples.covar -c 1-2 -a <path_to_clst>/samples.clst -o <path_to_output>/samples_1_2.pdf
Comments