Bioinformatics

From ActinoBase
Revision as of 09:25, 15 November 2024 by Matt Hutchings (talk | contribs)

2024 SSAMM - Bioinformatics Tools and Online Resources

Strains

Streptomyces.org Links to resources including StrepDB and a free download of the Streptomyces manual.

BacDive (The Bacterial Diversity Metadatabase) The world’s largest database for standardized bacterial information. Look up your strain, its morphology, culture and growth conditions, DNA sequence accessions, chemotaxonomic features etc.

LPSN (List of Prokaryotic names with Standing in Nomenclature) Formerly PNU (Prokaryotic Nomenclature Up-to-date). Used to check the nomenclature/taxonomic standing of your strain.

MediaDive The world's largest collection of cultivation media. Your own media can be added, which could be a first step towards standardising the reporting of bioassay data.

StrainInfo Resolves microbial strain identifiers by storing culture collection numbers, the relations between them across strain collections, and culture-associated data.

Synthetic biology

De Novo DNA Used for calculating RBS and promoter strengths; also provides predictions of mRNA stability (note: now requires payment to use).

DNA Chisel Python library for optimising DNA sequences with respect to a set of constraints and optimization objectives.
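
To make the idea of sequence "constraints" concrete, here is a minimal standard-library sketch. It is not DNA Chisel's API; the function names, the default thresholds and the choice of the BsaI recognition site (GGTCTC and its reverse complement GAGACC) are assumptions made for illustration only. It reports forbidden sites and the first sliding window whose GC content falls outside a chosen range.

```python
def gc_content(seq: str) -> float:
    """Fraction of G and C bases in seq (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def check_constraints(seq, forbidden_sites=("GGTCTC", "GAGACC"),
                      gc_min=0.3, gc_max=0.7, window=50):
    """Return a list of human-readable constraint violations for seq.

    Hypothetical helper for illustration; defaults are arbitrary examples.
    """
    seq = seq.upper()
    violations = []
    if not seq:
        return violations

    # Constraint 1: no forbidden restriction sites (overlapping hits included).
    for site in forbidden_sites:
        start = seq.find(site)
        while start != -1:
            violations.append(f"forbidden site {site} at position {start}")
            start = seq.find(site, start + 1)

    # Constraint 2: GC content within [gc_min, gc_max] in every sliding window.
    # Sequences shorter than the window are checked as a single window.
    last_start = max(len(seq) - window, 0)
    for start in range(last_start + 1):
        gc = gc_content(seq[start:start + window])
        if not gc_min <= gc <= gc_max:
            violations.append(
                f"GC content {gc:.2f} outside [{gc_min}, {gc_max}] "
                f"in window starting at {start}"
            )
            break  # report only the first offending window

    return violations


if __name__ == "__main__":
    candidate = "ATGC" * 10 + "GGTCTC" + "ATGC" * 10
    for problem in check_constraints(candidate):
        print(problem)
```

A checker like this only detects problems. An optimiser such as DNA Chisel additionally rewrites the sequence (for example with synonymous codon changes) until the constraints are satisfied.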

Genomics

KEGG (Kyoto Encyclopedia of Genes and Genomes) Gene and pathway information of multiple organisms.

Pyteomics Collection of Python tools for working with proteomics data.

CheckM Assesses the quality (contamination and completeness) of genome bins in metagenomics. Also useful for conventionally assembled (isolate) genomes.

MAUVE Tool to align whole genomes.

MeDuSa Draft genome scaffolding.

QUAST Quality Assessment Tool for Genome Assemblies

Tablet Genome viewer (especially good for visualising mapped reads).

Ideel Quick test for interrupted ORFs in microbial genomes.

Prokka Tool to annotate bacterial, archaeal and viral genomes.

Proksee Genome assembly, annotation and visualization, featuring interactive circular and linear genome maps.

Roary Pan-genome pipeline to calculate the pan-genome of an input set of genomes.

Codoff A program to measure the irregularity of the codon usage for a single genomic region (e.g. a BGC, phage, etc.) relative to the full genome.

GenoVi An automated customizable circular genome visualizer for bacteria and archaea.
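
The codon-usage comparison described for Codoff above can be sketched with the standard library. This is a simplified illustration, not Codoff's implementation: the function names are hypothetical, and cosine distance between codon count vectors is an assumed stand-in for whatever statistic the tool actually reports.

```python
from collections import Counter
from math import sqrt


def codon_counts(cds_list):
    """Count codons across in-frame coding sequences.

    Trailing partial codons and codons with ambiguous bases are skipped.
    """
    counts = Counter()
    for cds in cds_list:
        cds = cds.upper()
        for i in range(0, len(cds) - 2, 3):
            codon = cds[i:i + 3]
            if set(codon) <= set("ACGT"):
                counts[codon] += 1
    return counts


def cosine_distance(a, b):
    """1 - cosine similarity between two codon count vectors (Counters)."""
    codons = set(a) | set(b)
    dot = sum(a[c] * b[c] for c in codons)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("no codons counted in at least one input")
    return 1.0 - dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Toy sequences, made up for the example.
    genome_cds = ["ATGGCCGCCGCCTAA", "ATGGCGGCCGACTGA"]
    region_cds = ["ATGAAAAAAATTTAA"]
    distance = cosine_distance(codon_counts(region_cds), codon_counts(genome_cds))
    print(f"codon usage distance: {distance:.3f}")
```

A distance near 0 means the region uses codons much like the rest of the genome; larger values suggest atypical usage, which is one hint of horizontal acquisition. Judging significance would need a null distribution, which this sketch does not attempt.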

Biosynthetic gene cluster tools

antiSMASH Very well-supported tool for BGC identification in bacterial, fungal and plant genomes.

antiSMASH-db Precomputed antiSMASH results for >200,000 BGC regions. Enables targeted searches and downloading of BGC regions.

MultiSMASH Pipeline for large-scale antiSMASH across multiple genomes.

BiG-SCAPE Tool to generate and visualise sequence similarity networks of BGCs.

BiG-FAM Searchable database linked to antiSMASH annotations of over 1.2 million BGCs. Can search classes for individual BGCs or "Gene Cluster Families" (GCFs).

BiG-SLiCE A scalable tool to map the diversity of BGCs.

BiG-MEx A tool for the mining of BGC domains and classes in metagenomic data.

cblaster Tool for finding clusters of co-located homologous sequences in BLAST searches.

clinker Comparison and publication-quality visualisation of synteny between BGCs. Can visualise the output of cblaster (above).

CAGECAT Online CompArative GEne Cluster Analysis Toolbox. An online implementation of cblaster and clinker.

EFI-GNT Assesses gene neighbourhoods - tool associated with EFI-EST (below).

WebFlaGs Provides information about gene neighbourhoods associated with a set of protein accessions.

DeepBGC Detects BGCs in bacterial and fungal genomes using deep learning. You can train the tool on your own data.

GECCO BGC identification using a machine learning approach.

GATOR-GC Flexible tool for targeted genome mining of BGCs, genomic islands, resistance islands, and more. Allows for a customised approach to exploring and comparing gene clusters across any chosen genomic database.

2ndFind Tool to find specialised metabolite BGCs in bacterial or eukaryotic genome sequences. It identifies candidate proteins by searching for Pfam domains characteristic of specialised metabolism. Useful for rapid Pfam analysis of BGCs.

BGCFlow A systematic workflow for the analysis of biosynthetic gene clusters across large collections of genomes. Integrates many bioinformatic tools.

lsaBGC Suite for pan-BGC-omics analysis

Zol (& fai) Large-scale targeted detection and evolutionary investigation of gene clusters (e.g. BGCs, phages, etc.).

Non-ribosomal peptide and polyketide tools

DeepT2 Prediction of KSβ sequences and discovery of novel type II polyketide synthases (T2PKSs).

NRPS-PKS NRPS-PKS domain identification and specificity code prediction.

NaPDoS2 The Natural Product Domain Seeker rapidly detects and classifies ketosynthase and condensation domains from sequence data. It uses a phylogeny-based classification scheme to predict specific domain functions.

RiPP tools

RODEO and RODEO advanced Online tool for RiPP identification; can also be used to retrieve and analyse gene neighbourhoods.

BAGEL Online tool for the detection of bacteriocins and RiPP BGCs in bacterial (meta-)genomes.

RiPPMiner Prediction of RiPP BGCs, RiPP families and RiPP product structures.

seq2ripp Identifies RiPPs from paired genomic and mass spectrometry data.

HypoRiPPAtlas Database of hypothetical RiPPs from RefSeq microbial genomes and plant transcriptomes. Data acquired using seq2ripp.

decRiPPter Data-driven Exploratory Class-independent RiPP TrackER. Machine learning tool for the detection of RiPP precursor peptides and associated BGCs.

Protein structure and sequence analysis

ColabFold (https://colab.research.google.com/github/sokrypton/ColabFold/blob/main/AlphaFold2.ipynb) User-friendly protein structure and complex prediction using AlphaFold2 and AlphaFold2-multimer.

AlphaFold3 Online server for acquiring AlphaFold3 structure and complex predictions.

FoldSeek Online server that enables fast protein structure alignments against large protein structure databases.

EFI-EST Generates protein sequence similarity networks (SSNs).

NCBI Conserved Domain Search Search for conserved domains in proteins. Can analyse batches of protein sequences.

Antibiotic resistance

ARTS A tool for looking for potential resistance genes/duplicated housekeeping genes in BGCs, as well as potentially unique metabolism.

The Comprehensive Antibiotic Resistance Database (CARD) A rigorously curated collection of characterised resistance determinants and associated antibiotics. It also incorporates a tool called the Resistance Gene Identifier, which can predict resistance genes present in an input genome based on homology and SNP models.

Systematics/Phylogeny

Automated Multi-Locus Species Tree (autoMLST) Online tool for automatic generation of species phylogeny with reference organisms.

autoMLST command line Command line version for local installations of autoMLST.

PhyloPhlAn An integrated pipeline for large-scale phylogenetic profiling of genomes and metagenomes.

iTOL Editing and visualization of phylogenetic/phylogenomic trees.

Type (Strain) Genome Server Assesses whether a submitted strain is a novel species based on whole genome analysis. It also creates a whole genome phylogeny with closely related strains.

Genome Taxonomy Database (GTDB) Species classification

raxmlGUI 2.0 Graphical interface to RAxML for maximum-likelihood phylogenetic inference (personal experience is that it can struggle with large trees on a personal computer, which is simply a limit of processing power).

GGDC (Genome-to-Genome Distance Calculator) Calculates phylogenetic trees and pairwise gene sequence similarities. Also includes a tool for Single-Gene Trees.

Natural product databases

The Natural Product Atlas Searchable community-led directory of NPs.

LOTUS Community-curated open source project for the storage, search and analysis of natural products data.

Dictionary of Natural Products Searchable directory of NPs (full version requires payment but some functions available for free).

Natural Product Magnetic Resonance Database Project (NP-MRD) Online database for community deposition of NMR data relating to natural products.

Metabolomics

GNPS Analyse and compare tandem mass spectrometry (MS/MS) data within your own dataset, against standards, and against community datasets deposited in the MassIVE database. There are many integrated tools, some of which are listed below: ReDU (lets you reanalyse your data), MS2LDA (below), feature-based molecular networking (allows you to characterise your data into molecular types), DEREPLICATOR+ (in silico annotation), etc.

SIRIUS Tool for the structure prediction/elucidation of novel molecules using MS data.

MetaboAnalyst Online platform for the analysis and interpretation of metabolomics data and integration with other omics data.

MS-Dial Open source metabolomics analysis software.

MZmine 3 Open source metabolomics analysis software that supports multimodal data analysis for various instrumental setups, including LC and GC–MS, and ion mobility.

MS2LDA Decomposes molecular fragmentation data derived from large metabolomics experiments into annotated Mass2Motifs (substructures).

MassQL Query language for finding user-defined patterns in raw mass spectrometry data (queries can constrain e.g. retention time, scan number etc.).

matchms Python package that imports, processes, cleans and compares mass spectrometry data (MS/MS).

MolNetEnhancer Combines the outputs from molecular networking, MS2LDA, in silico annotation tools (such as Network Annotation Propagation or DEREPLICATOR), and automated chemical classification to provide a more comprehensive chemical overview of metabolomics data.

NP3 MS Workflow Open-source tool for analysing untargeted metabolomic data from LC–MS/MS, designed to help identify bioactive natural products from complex mixtures. It offers customizable steps and clear statistical outputs, with features like MS2 spectrum processing, [M + H]+ ion deconvolution, and structural annotation.

SNAP-MS Tool integrated into the NP Atlas for the prediction of natural product compound classes using mass spectrometry data.
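
Several of the tools above (GNPS molecular networking, matchms) rest on scoring how similar two MS/MS spectra are. The following is a minimal standard-library sketch of a greedy cosine score. It is a simplification rather than the scoring used by any of these tools: GNPS networking, for example, uses a modified cosine that also accounts for the precursor mass difference. The function name and the default tolerance are assumptions.

```python
from math import sqrt


def cosine_score(spec_a, spec_b, tolerance=0.01):
    """Greedy cosine similarity between two spectra.

    Each spectrum is a list of (m/z, intensity) tuples. Peaks are paired when
    their m/z values differ by at most `tolerance`; each peak is used once.
    """
    # All candidate peak pairs within tolerance, scored by intensity product.
    pairs = [
        (int_a * int_b, i, j)
        for i, (mz_a, int_a) in enumerate(spec_a)
        for j, (mz_b, int_b) in enumerate(spec_b)
        if abs(mz_a - mz_b) <= tolerance
    ]
    # Greedy assignment: take the strongest pairs first.
    pairs.sort(reverse=True)
    used_a, used_b = set(), set()
    dot = 0.0
    for product, i, j in pairs:
        if i not in used_a and j not in used_b:
            dot += product
            used_a.add(i)
            used_b.add(j)

    norm_a = sqrt(sum(intensity ** 2 for _, intensity in spec_a))
    norm_b = sqrt(sum(intensity ** 2 for _, intensity in spec_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Toy spectra, made up for the example.
    reference = [(100.0, 1.0), (150.0, 0.5), (200.0, 0.2)]
    query = [(100.005, 0.9), (150.0, 0.6), (250.0, 0.3)]
    print(f"cosine score: {cosine_score(reference, query):.3f}")
```

Scores close to 1 indicate near-identical fragmentation patterns. In a molecular network, pairs of spectra scoring above a chosen threshold are connected by an edge.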

Data visualisation

Jalview Java-based GUI tool for sequence analysis (alignments etc) and visualisation.

ChimeraX Open-access protein structure analysis and visualisation software.

Awesome genome visualisation Resource listing many different genome visualisation tools.

Anvi'o An advanced analysis and visualization platform for ‘omics data.

General bioinformatic sequence tools/resources

Galaxy User-friendly server with a lot of bioinformatic tools.

SeqKit Command-line toolkit for manipulating sequence files (FASTA/FASTQ).

Entrez Direct (EDirect) Provides access to the NCBI's suite of interconnected databases (publication, sequence, structure, gene, variation, expression, etc.) from a command line: useful to automatically retrieve information from genomes/metagenomes.

CIPRES Portal for online phylogenetic tools that runs jobs on NSF resources in the USA.

KBase Integrates a variety of data and analysis tools to perform sophisticated systems biology analyses.

Snakemake Workflow management system for building reproducible pipelines out of multiple tools.
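
Workflow managers such as Snakemake work out which steps to run by matching each rule's declared output files to the input files other rules need. The sketch below illustrates that idea in plain Python. It is not Snakemake syntax or its API: the `plan` function, the rule names and the file names are all hypothetical, and real workflow managers also consider file timestamps, wildcards and parallel execution.

```python
def plan(target, rules, existing):
    """Return the rule names, in execution order, needed to build `target`.

    rules:    dict mapping rule name -> {"input": [...], "output": [...]}
    existing: set of files already present (no rule needs to create them)
    """
    # Map every output file to the rule that produces it.
    producers = {
        out: name for name, rule in rules.items() for out in rule["output"]
    }
    order = []

    def visit(path, stack=()):
        if path in existing:
            return
        if path not in producers:
            raise ValueError(f"no rule produces {path!r}")
        name = producers[path]
        if name in order:
            return  # already scheduled
        if name in stack:
            raise ValueError(f"cyclic dependency involving rule {name!r}")
        # Schedule everything this rule depends on before the rule itself.
        for dependency in rules[name]["input"]:
            visit(dependency, stack + (name,))
        order.append(name)

    visit(target)
    return order


if __name__ == "__main__":
    rules = {
        "assemble": {"input": ["reads.fastq"], "output": ["assembly.fasta"]},
        "annotate": {"input": ["assembly.fasta"], "output": ["annotation.gbk"]},
        "find_bgcs": {"input": ["annotation.gbk"], "output": ["bgcs.json"]},
    }
    print(plan("bgcs.json", rules, existing={"reads.fastq"}))
    # → ['assemble', 'annotate', 'find_bgcs']
```

Because the plan is derived from the requested target, steps whose outputs already exist are skipped: with `assembly.fasta` present, only annotation and BGC detection would be scheduled.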

Other

FeGenie HMM-based identification and categorization of iron genes and iron gene operons in genomes and metagenomes.

RadicalSAM.org Online database of radical SAM enzymes categorised into sequence similarity networks. Integrates further information, including genomic neighbourhoods and associated conserved domains.

Blender 3D animation tool that can be used to create scientific animations. There are hundreds of Blender tutorials available, but these two are a great starting point for a science-focussed approach:

https://www.youtube.com/watch?v=CfkjBoOaw0g
https://www.youtube.com/watch?v=zXJKYvuCPYY