Algorithm development for microbial datasets

The development of novel algorithms is central to many aspects of our work. We have developed a series of bioinformatic tools that are freely available to the community.

Genome-centric metagenomics

The development of techniques to recover genomes from microbial communities has been revolutionary for the field of microbiology, allowing insight into the function and physiology of microorganisms that cannot be cultured. However, the efficacy of these tools remains variable. Our goal is to increase the accuracy of genome recovery by improving techniques for grouping assembled contigs into draft genomes (‘binning’, implemented in Rosella), assessing the quality of draft genomes (implemented in CheckM) and estimating relative abundance (implemented in CoverM and SingleM). We also seek to extend these methods to genomic replicons other than microbial genomes, such as plasmids and viruses.
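As a rough illustration of what relative abundance estimation involves, the sketch below converts per-genome mean read coverage into percentages of the community. It is a minimal example under stated assumptions, not the method implemented in CoverM or SingleM: the function name `relative_abundance`, the `unmapped_fraction` parameter and the genome names are all hypothetical and chosen only for illustration.

```python
def relative_abundance(mean_coverage, unmapped_fraction=0.0):
    """Convert per-genome mean read coverage into relative abundance (%).

    mean_coverage: dict mapping a genome name to its mean read coverage.
    unmapped_fraction: assumed fraction of reads that map to no genome
        (hypothetical parameter; 0.0 means every read was assigned).
    """
    if not 0.0 <= unmapped_fraction < 1.0:
        raise ValueError("unmapped_fraction must be in the range [0, 1)")

    total = sum(mean_coverage.values())
    if total == 0:
        # No coverage anywhere: report zero abundance for every genome.
        return {name: 0.0 for name in mean_coverage}

    # Share of the community that the recovered genomes account for.
    mapped_share = 1.0 - unmapped_fraction
    return {
        name: 100.0 * mapped_share * cov / total
        for name, cov in mean_coverage.items()
    }


# Hypothetical coverage values for two draft genomes.
coverage = {"genome_A": 30.0, "genome_B": 10.0}
abundance = relative_abundance(coverage)
abundance_with_unmapped = relative_abundance(coverage, unmapped_fraction=0.2)
```

Using coverage rather than raw read counts means a large genome is not counted as more abundant simply because it attracts more reads. The `unmapped_fraction` term is one simple way to account for community members that have no recovered genome.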

Strain-level resolution

The abundance and activity of specific microbial strains have important implications for the entire community (e.g. methane production rate or relationship with host immune response), but until now we have lacked the tools to fully understand these mechanisms. We aim to increase the resolution of metagenome-based studies from the species level to the strain level (e.g. through Lorikeet), develop scalable methods of characterising plasmids and other non-chromosomal replicons (e.g. viruses and phage), and integrate transcriptomic, proteomic and metabolic analyses to develop a standard approach for in-depth characterisation of complex microbial communities. Each of these tasks will result in general-purpose software tools for the field of microbiology and drive applied outcomes across the centre’s themes.

The global microbiome and big data


The use of metagenomics to study microbial communities is growing exponentially (see figure, left), and so increasingly requires software tools that can operate at scale. We are developing scalable tools for analysing metagenomic datasets, including Kingfisher to procure metagenome datasets and SingleM to profile them. SingleM will provide searchable community profiles of all public metagenomes, both host-associated and environmental. This massive dataset covers more than 700 trillion base pairs (>700 Tbp) of metagenomic sequence data and is enabling many ecological and evolutionary questions to be answered for the first time using cloud-based workflows, machine learning and custom-designed sequence search algorithms.

 

Image credit: https://pair-code.github.io/understanding-umap/

Chief Investigators

Team

  • Samuel Aroney
  • Peter Sternes
  • Alexei Chklovski
  • Rhys Newell
  • Rossen Zhao
  • Brett Babec

Funding / Grants

  • USA National Science Foundation Biology Integration Institutes #2022070
  • ARC Future Fellowship #FT210100521 (Dr Woodcroft)

