Software

UPDATE: CheckM version 2 is now available pre-publication at https://www.biorxiv.org/content/10.1101/2022.07.11.499243v1 and https://github.com/chklovski/CheckM2

CheckM provides a set of tools for assessing the quality of genomes recovered from isolates, single cells, or metagenomes. It provides robust estimates of genome completeness and contamination by using collocated sets of genes that are ubiquitous and single-copy within a phylogenetic lineage. Genome quality can also be examined using plots of key genomic characteristics (e.g., GC content, coding density), which highlight sequences that fall outside the expected distributions of a typical genome. CheckM also provides tools for identifying genome bins that are likely candidates for merging based on marker set compatibility, similarity in genomic characteristics, and proximity within a reference genome tree.

Please see https://ecogenomics.github.io/CheckM/ or http://genome.cshlp.org/content/25/7/1043.full
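
To make the idea of marker-set-based estimates concrete, the following is a minimal Python sketch, not CheckM's implementation. It assumes the marker sets and per-gene hit counts are already known; the function name `estimate_quality` and its inputs are hypothetical. Completeness is taken as the average fraction of each collocated set that is present, and contamination as the average number of extra copies per set.

```python
from typing import Dict, List, Set, Tuple


def estimate_quality(marker_sets: List[Set[str]],
                     counts: Dict[str, int]) -> Tuple[float, float]:
    """Return (completeness %, contamination %) for one genome bin.

    marker_sets: collocated sets of single-copy marker genes (assumed input).
    counts: number of times each marker gene was found in the bin.
    """
    if not marker_sets:
        raise ValueError("at least one marker set is required")

    completeness = 0.0
    contamination = 0.0
    for marker_set in marker_sets:
        # Fraction of this set's markers seen at least once.
        present = sum(1 for gene in marker_set if counts.get(gene, 0) >= 1)
        # Each copy beyond the first counts as evidence of contamination.
        extra = sum(counts.get(gene, 0) - 1
                    for gene in marker_set if counts.get(gene, 0) > 1)
        completeness += present / len(marker_set)
        contamination += extra / len(marker_set)

    n_sets = len(marker_sets)
    return 100.0 * completeness / n_sets, 100.0 * contamination / n_sets
```

Averaging per set rather than per gene means that a cluster of neighbouring markers lost or duplicated together is not counted many times over, which is the motivation for grouping genes into collocated sets.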

Aviary is an end-to-end genome-centric metagenomics workflow. Novel and established methods are used to assemble long read, short read or hybrid sequence datasets. Resulting contigs are binned using a large suite of primary metagenomic binners (including Rosella) and ensemble binning. Finalized bins are then assessed for quality using CheckM, and assigned taxonomic ranks using GTDB-tk. Various other statistics are also reported, such as recovered diversity at various steps using SingleM and coverage statistics using CoverM.

Please see https://github.com/rhysnewell/aviary

CoverM aims to be a configurable, easy to use and fast read coverage calculator focused on metagenomics applications. It calculates coverage by read mapping, so its input can be either BAM files sorted by reference, or raw reads and reference FASTA sequences. CoverM calculates the coverage of either genomes/MAGs (coverm genome) or individual contigs (coverm contig).

Please see https://github.com/wwood/CoverM
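
As a rough illustration of what a per-contig coverage value means, here is a small Python sketch. It is not how CoverM computes its statistics; it assumes the alignments have already been reduced to half-open (start, end) intervals on a single contig, and the function name `mean_coverage` is hypothetical.

```python
from typing import Iterable, Tuple


def mean_coverage(contig_length: int,
                  alignments: Iterable[Tuple[int, int]]) -> float:
    """Mean read depth over a contig.

    alignments: half-open (start, end) intervals of aligned reads,
    in 0-based contig coordinates (an assumed, simplified input).
    """
    if contig_length <= 0:
        raise ValueError("contig_length must be positive")

    aligned_bases = 0
    for start, end in alignments:
        # Clip each alignment to the contig boundaries.
        start = max(start, 0)
        end = min(end, contig_length)
        if end > start:
            aligned_bases += end - start

    # Total aligned bases divided by contig length gives the mean depth.
    return aligned_bases / contig_length
```

For example, two 50 bp alignments on a 100 bp contig give a mean coverage of 1.0, regardless of whether they overlap.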

Rosella is a binning algorithm that groups contigs from a metagenomic assembly into draft genomes. As of early 2021, it often outperforms other currently available methods on both simulated and real datasets.

Please see https://github.com/rhysnewell/rosella

Lorikeet predicts variants directly from metagenomic data, and links them together into strain-level genotypes.

Please see https://github.com/rhysnewell/Lorikeet

Kingfisher is a quick and flexible program for procurement of sequence files from public data sources, including the European Nucleotide Archive (ENA), NCBI SRA, Amazon AWS and Google Cloud.

It tries a series of download methods in order until one works. The downloaded data is then converted to the required output format (SRA / FASTQ / FASTA / GZIP). Both the download and extraction phases are usually quicker than using the NCBI’s SRA toolkit.

Please see https://github.com/wwood/kingfisher-download
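
The "try each method in order until one works" behaviour described for Kingfisher can be sketched in a few lines of Python. This is only an illustration of the pattern, not Kingfisher's code: `download_with_fallback` and the method callables passed to it are hypothetical names, and each callable is assumed to return a local file path or raise an exception on failure.

```python
from typing import Callable, List, Sequence, Tuple

# A download method takes a run identifier and returns a local file path.
DownloadMethod = Callable[[str], str]


def download_with_fallback(
    run_id: str,
    methods: Sequence[Tuple[str, DownloadMethod]],
) -> Tuple[str, str]:
    """Try each (name, method) pair in order; return the first success.

    Returns (method name, downloaded path). Raises RuntimeError if
    every method fails, listing the individual errors.
    """
    errors: List[str] = []
    for name, method in methods:
        try:
            return name, method(run_id)
        except Exception as exc:  # any failure moves on to the next method
            errors.append(f"{name}: {exc}")
    raise RuntimeError(
        f"all download methods failed for {run_id}: " + "; ".join(errors)
    )
```

Keeping the methods as an ordered list means the preferred sources can be reordered or dropped without touching the fallback logic.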

Galah aims to be a more scalable metagenome-assembled genome (MAG) dereplication method. That is, it clusters microbial genomes together based on their average nucleotide identity (ANI), and chooses a single member of each cluster as the representative.

Please see https://github.com/wwood/galah
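
One simple way to picture ANI-based dereplication is a greedy pass over the genomes, sketched below in Python. This is an assumption-laden illustration rather than Galah's algorithm: the genomes are assumed to be pre-sorted from most to least preferred, the `ani` callable is a stand-in for a real ANI calculation, and the 95% threshold is just an example value.

```python
from typing import Callable, Dict, List, Sequence


def dereplicate(
    genomes: Sequence[str],
    ani: Callable[[str, str], float],
    threshold: float = 95.0,
) -> Dict[str, List[str]]:
    """Greedy clustering: map each representative to its cluster members.

    genomes: genome names, best candidates first (assumed ordering).
    ani: function returning percent ANI between two genomes (assumed).
    """
    clusters: Dict[str, List[str]] = {}
    for genome in genomes:
        for representative in clusters:
            # Join the first existing cluster whose representative is close enough.
            if ani(representative, genome) >= threshold:
                clusters[representative].append(genome)
                break
        else:
            # No representative was similar enough: start a new cluster.
            clusters[genome] = [genome]
    return clusters
```

Because each genome is compared only against current representatives, the number of ANI calculations stays well below an all-versus-all comparison when many genomes are near-duplicates.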

SingleM is a tool to find the abundances of discrete operational taxonomic units (OTUs) directly from shotgun metagenome data, without heavy reliance on reference sequence databases. It is able to differentiate closely related species even if those species are from lineages new to science.

Please see https://github.com/wwood/singlem

Sandpiper is a website which catalogues the diversity of microbial life on Earth, through application of SingleM to publicly available metagenomes.

Please see https://sandpiper.qut.edu.au

GraftM is a meta-omic tool that identifies and classifies marker genes in short read datasets (metagenomes and metatranscriptomes), as well as assembled contigs, whole genomes and protein sequences. GraftM outputs a taxonomic/functional summary table, a krona plot, and various other run statistics. Both unaligned and aligned “hit” sequences are provided. GraftM is designed for speed and accuracy: it is able to find marker genes in 200 Mb of assembled metagenome in under 20 seconds, and compares favourably with similar tools in accuracy benchmarking.

Please see http://geronimp.github.io/graftM or https://doi.org/10.1093/nar/gky174

OrfM is a simple and not slow open reading frame (ORF) caller. No bells or whistles like frameshift detection, just a straightforward goal of returning a FASTA file of open reading frames over a certain length from a FASTA/Q file of nucleotide sequences.

Please see https://github.com/wwood/OrfM or http://bioinformatics.oxfordjournals.org/content/early/2016/06/02/bioinf…
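
The goal described for OrfM, returning open reading frames over a certain length from nucleotide sequences, can be sketched in plain Python as follows. This is a slow illustrative version, not OrfM's implementation. It assumes an ORF is any stop-codon-free stretch of whole codons in one of the six reading frames, with no start codon required; the function names and the `min_length` default are choices made for this sketch.

```python
from typing import Iterator, List

STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an upper-case DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def orfs_in_frame(seq: str, min_length: int) -> Iterator[str]:
    """Yield stop-free runs of whole codons at least min_length nt long."""
    start = 0
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    for i in range(0, usable, 3):
        if seq[i:i + 3] in STOP_CODONS:
            if i - start >= min_length:
                yield seq[start:i]
            start = i + 3  # next ORF begins after the stop codon
    if usable - start >= min_length:
        yield seq[start:usable]


def find_orfs(seq: str, min_length: int = 96) -> List[str]:
    """Collect ORFs from all six reading frames of a nucleotide sequence."""
    seq = seq.upper()
    orfs: List[str] = []
    for strand in (seq, reverse_complement(seq)):
        for offset in range(3):
            orfs.extend(orfs_in_frame(strand[offset:], min_length))
    return orfs
```

The returned ORFs are nucleotide strings; writing them out as FASTA records, or translating them to protein, would be a separate step.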