Posts

Nucleotide diversity - π

Nucleotide diversity (π) measures the mean number of nucleotide differences per site between two sequences chosen at random from a population. In simpler terms, it is the probability that two randomly chosen alleles differ at a given nucleotide position. The formula for π is the following:

\[ \pi = \frac{1}{\binom{n}{2}} \sum_{i<j} \frac{d_{ij}}{L} \]

where:
n is the number of sequences or individuals in the population,
d_ij is the number of nucleotide differences between sequences i and j,
L is the alignment length,
\binom{n}{2} = n(n-1)/2 is the number of possible pairs of sequences.

Nucleotide diversity values of 0.01-0.05 are considered high (found in Drosophila, for example), while 0.001 is considered low (found in humans).
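As a quick worked example, take π to be the average of d_ij / L over all pairs of sequences, as the definitions above describe. The numbers below are made up purely for illustration: three sequences (n = 3) aligned over L = 1000 sites, with pairwise differences d_12 = 1, d_13 = 2 and d_23 = 3.

```latex
\[
\binom{3}{2} = 3, \qquad
\pi = \frac{1}{3}\left(\frac{1}{1000} + \frac{2}{1000} + \frac{3}{1000}\right)
    = \frac{6}{3000} = 0.002
\]
```

On the scale given above, a value of 0.002 would sit near the low end, close to the figure quoted for humans.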

How to Download NGS Data - Using prefetch SRA and fastq-dump for Sequencing Reads

Sequencing data generated for a publication is typically deposited in NCBI's Sequence Read Archive (SRA). This sharing allows other scientists to test the data, learn from it, or reuse it in their own studies to reach new conclusions. All available SRA records can be searched at https://www.ncbi.nlm.nih.gov/sra . Here, you can search for a specific SRA record, a species' whole-genome sequencing data, or even RNA-seq data. After you choose a record, you need its run accession ID (SRR), which is easily found on the SRA page you selected. NCBI provides a command-line toolkit for interacting with the database and with each SRA record: the SRA Toolkit. It can be installed by running the following command:
$ sudo apt install sra-toolkit
The two most used sra-toolkit commands are prefetch and fastq-dump. The prefetch command is used to download the compressed archives from SRA – the SRR archives...
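A minimal sketch of how the two commands are commonly chained, assuming sra-toolkit is installed and on your PATH. The function name download_srr and the DRY_RUN switch are hypothetical helpers introduced here for illustration; the --split-files and --gzip options are a common choice for paired-end data, not necessarily the ones the full post uses.

```shell
# Sketch: download one run and convert it to FASTQ.
# Set DRY_RUN=1 to print the commands instead of running them.
download_srr() {
    local srr="$1"   # run accession (the SRR ID taken from the SRA page)
    local run=""
    if [ -n "${DRY_RUN:-}" ]; then
        run="echo"
    fi
    # Fetch the compressed .sra archive for this accession.
    $run prefetch "$srr" || return 1
    # Convert the archive to FASTQ: --split-files writes paired reads to
    # separate _1/_2 files, --gzip compresses the output.
    $run fastq-dump --split-files --gzip "$srr"
}
```

Calling `DRY_RUN=1 download_srr SRR000001` (a placeholder accession; substitute your own) only prints the two commands, which is a cheap way to check them before starting a large download.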

How to obtain an Admixture bar plot using ANGSD (ngsTools)

Admixture bar plots are used to visualize the genetic structure of populations by assigning proportions of each individual's genome to different ancestral populations (K). Here are some key concepts:
Ancestral Populations (K): The number of distinct genetic populations assumed in the analysis. The user must choose this value.
Individuals: Each individual in the dataset is represented as a vertical bar in the plot.
Ancestry Proportions: The colors within each bar indicate the proportion of an individual's genome that comes from each ancestral population.
To obtain an Admixture bar plot, you first need to install ngsTools (instructions here). This software uses genotype likelihoods rather than hard genotype calls. The analysis is based on BAM files. Sample data used in this tutorial can be downloaded here. Admixture proportions can be estimated from genotype likelihoods using ngsTools. Here are instructions for...

How to plot a PCA from BAM files using ngsTools

Principal Component Analysis (PCA) is a statistical method used to reduce the dimensions of large datasets, increasing interpretability while minimizing information loss. When applied to BAM files, PCA can help identify patterns of genetic variation across samples. By transforming complex, multidimensional data from BAM files into a set of orthogonal (uncorrelated) variables known as principal components, you can more easily discern genetic similarities and differences among samples. This process is particularly useful in population genomics to study genetic diversity and population structure. To plot a PCA based on BAM files, you first need to install ngsTools (you can follow the instructions here). In the following tutorial we will rely on genotype likelihoods rather than hard genotype calls. Genotype likelihoods have the advantage of incorporating the uncertainty associated with genotype calls from sequence data, allowing for more accurate and robust genetic analyses. Th...

How to install ngsTools

To install ngsTools, you must first install several dependency libraries and software packages:
$ sudo apt update
$ sudo apt install git gsl-bin libgsl-dbg libgsl-dev libgslcblas0 gcc zlib1g-dev libbz2-dev liblzma-dev libcurl4-openssl-dev coreutils samtools perl r-base
$ sudo cpan Getopt::Long && sudo cpan Graph::Easy && sudo cpan Math::BigFloat && sudo cpan IO::Zlib
$ R -e "install.packages(c('optparse', 'tools', 'ggplot2', 'reshape2', 'plyr', 'gtools', 'LDheatmap', 'ape', 'grid', 'methods', 'phangorn', 'plot3D'))"
Now you can install ngsTools. Ensure that your terminal is directed to the folder where you want ngsTools to be installed:
$ git clone --recursive https://github.com/mfumagalli/ngsTools.git
$ cd ngsTools
$ make

How to install Samtools using a Package Manager and Building from Source

Option 1) Using a Package Manager:
On Ubuntu/Debian:
$ sudo apt-get update
$ sudo apt-get install samtools
On CentOS/RedHat:
$ sudo yum install samtools

Option 2) Building from Source:
Make sure you have all the dependencies installed. You can install them by running:
$ sudo apt-get update
$ sudo apt-get install build-essential zlib1g-dev libncurses-dev libbz2-dev liblzma-dev libcurl4-openssl-dev
To install Samtools:
$ cd path/to/installation/directory
$ git clone --recurse-submodules https://github.com/samtools/samtools.git
$ git clone --recurse-submodules https://github.com/samtools/htslib.git
$ cd samtools
$ make
$ sudo make install

How to easily obtain a VCF file based on BAM files using BCFtools

The VCF (Variant Call Format) is a widely used file format in bioinformatics for storing genetic variation data. It was developed as a standardized way to represent genetic variants such as SNPs (Single Nucleotide Polymorphisms), indels (insertions and deletions), and other types of genetic alterations. VCF files are typically the output of variant calling processes, which analyse DNA sequence data to identify differences between a sample and a reference genome. Each entry in a VCF file represents a specific position in the genome where variation has been observed. The file format includes essential information such as:
Chromosome and Position: The genomic location of the variant.
Reference and Alternate Alleles: The reference base(s) from the reference genome and the alternate base(s) observed in the sample.
Genotype Information: For each sample, the file includes genotype data that indicate which alleles are present at that position.
VCF files are highly flexible, supporting the stor...
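To make the fields listed above concrete, here is a small sketch that pulls them out of a VCF data line with awk. It relies on the standard tab-separated column order (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT, then one column per sample) and assumes a single sample whose FORMAT field is just GT. The function name vcf_core_fields and the record itself are invented for illustration.

```shell
# Print the core fields of each VCF data line; header lines start with '#'
# and are skipped. Assumes one sample column whose FORMAT is only GT.
vcf_core_fields() {
    awk -F'\t' '!/^#/ { print "CHROM=" $1, "POS=" $2, "REF=" $4, "ALT=" $5, "GT=" $10 }'
}

# Made-up record: an A>G SNP at chr1:12345, heterozygous (0/1) in the sample.
printf 'chr1\t12345\t.\tA\tG\t50\tPASS\t.\tGT\t0/1\n' | vcf_core_fields
```

This prints `CHROM=chr1 POS=12345 REF=A ALT=G GT=0/1`. A genotype of 0/1 means the sample carries one reference allele (A) and one alternate allele (G) at that position.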