So many protein structure prediction tools – what is the difference?

Today, COSMIC2 released its fifth protein structure prediction tool – OmegaFold. What is the difference between all of these tools? Here is a quick rundown comparing and contrasting the structure prediction tools on COSMIC2:

AlphaFold2

Summary. Predicts the structure of single polypeptides using exhaustive multiple sequence alignments, running the full code base behind AlphaFold2’s performance at the CASP protein structure prediction competition.

Why you should use it. Allows you to try out the ‘full’ AlphaFold2 algorithm.

Limitations. Takes a long time due to the multiple sequence alignment step – roughly 3X-5X longer than ColabFold.

ColabFold

Summary. Predicts the structure of either single polypeptides or protein complexes. Leverages faster multiple sequence alignments to speed up prediction.

Why you should use it. Faster than AlphaFold but will provide slightly different results due to differences in the multiple sequence alignment step. Conveniently works on either single polypeptides or multiple sequences within an input FASTA file.

Limitations. Utilizes different multiple sequence alignment steps than AlphaFold.

AlphaFold Multimer

Summary. Predicts the structure of protein-protein complexes after training AlphaFold2 on protein complexes.

Why you should use it. Allows you to test how AlphaFold2, trained on protein complexes, predicts your protein-protein complex of interest.

Limitations. Like AlphaFold2, the multiple sequence alignment step takes longer than ColabFold.

OmegaFold

Summary. Predicts single polypeptide structures without multiple sequence alignments, operating directly on the input amino acid sequence.

Why you should use it. Much faster than ColabFold and AlphaFold and does not require extensive sequence coverage. This may help if you work with proteins that are divergent and fail with AlphaFold/ColabFold. Also allows for larger sequences than AlphaFold/ColabFold (up to 4096 amino acids).

Limitations. For proteins with deep sequence coverage, OmegaFold (which does not use multiple sequence alignments) appears to perform worse than ColabFold/AlphaFold.
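For reference, running OmegaFold outside of the gateway needs nothing beyond the sequence itself – no alignment databases are searched. A minimal sketch (file and directory names are placeholders; see the OmegaFold README for current options):

# predict directly from a FASTA file; note there is no multiple sequence alignment step
$ omegafold my_protein.fasta predictions/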

IgFold

Summary. Predicts the structure of Fab regions of antibodies.

Why you should use it. IgFold performs better than AlphaFold when predicting Fab structures.

Limitations. Works only on Fab structures.


Uploading cryoSPARC-extracted particle stacks

Have you wanted to upload cryoSPARC-extracted stacks into COSMIC2? You’re in luck – we just finished documentation on how to do this!

Short answer – use pyem to convert a refinement .cs file into a .star file, place that .star file into the extracted particle directory, and then upload with Globus. That’s it!

Plus, pyem is already available as a tool on COSMIC2, so you can also do your conversion on COSMIC2.
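For reference, here is a minimal sketch of that conversion using pyem’s csparc2star.py (file and directory names are placeholders; check the pyem documentation for options specific to your job type):

# convert the cryoSPARC refinement metadata (.cs) into a RELION-style .star file
$ csparc2star.py J123_particles.cs particles.star
# place the .star file in the extracted particle directory before uploading with Globus
$ mv particles.star /path/to/extracted_particle_directory/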

Have fun! Take a look here for a tutorial on how to upload a cryoSPARC-extracted particle stack. 

New tool added: AlphaFold2

We’re excited to add a new tool to COSMIC² – AlphaFold2! AlphaFold2 is a highly accurate protein structure prediction algorithm from the DeepMind team. At its core, AlphaFold2 relies on multiple sequence alignments, protein structure databases, and deep neural network architectures.

While AlphaFold2 is very powerful, it cannot be run on a laptop because of its large sequence databases (2.2 TB) and its need for substantial CPU/GPU resources during prediction. After the release of the code, community-led efforts to run AlphaFold2 in Google Colab Python notebooks with GPU access helped many scientists predict structures of interest. For example, by wrapping a number of commands and software packages together, ColabFold lets users without bioinformatics expertise utilize protein structure prediction algorithms.

Even though ColabFold enabled many scientists to run AlphaFold2 on their protein of interest, we wanted to offer the community access to the full AlphaFold2 software package and databases. Since we already run jobs on the San Diego Supercomputer Center’s Expanse supercomputer, we are happy to host and run this software for everyone.

Learn more on how to run AlphaFold2 on COSMIC2 here!

New tool added: crYOLO

We are excited to add crYOLO to our list of supported tools on the platform.

crYOLO is a deep-learning-based particle picking program that requires no parameter tuning or optimization. Instead, crYOLO utilizes a pre-trained general model that works for 95% of use cases. We routinely use this picker for all varieties of single particles – try it out!


New tool added: ISAC

We are excited to add a new tool to COSMIC²: ISAC (iterative stable alignment and classification). ISAC performs 2D classification and averaging of single particles, but it approaches this in a different manner than the algorithms in RELION or cryoSPARC. A fundamental premise of ISAC is to identify reproducible and stable class averages. This feature makes ISAC particularly well suited for 2D analysis of negative stain or cryoEM datasets with structurally heterogeneous samples.

For example, the power of ISAC facilitated averaging the flexible dynein motor protein complex (Chowdhury et al. 2015, Supplemental Figure 2):


To help users with this tool, we take RELION-extracted particle stacks as inputs and then convert the stack into the appropriate format for running ISAC (EMAN2/SPARX/SPHIRE database files).
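Under the hood, this conversion and the subsequent classification amount to something like the following sketch. The script names below are our recollection of the SPHIRE distribution and are illustrative only – exact names, flags, and MPI settings vary between SPHIRE versions, and the gateway runs these steps for you:

# illustrative: convert a RELION .star file + particle stack into an EMAN2/SPHIRE bdb database
$ sxrelion2sphire.py particles.star converted_stack/
# illustrative: run ISAC2 2D classification on the converted stack; --radius is the particle radius in pixels
$ mpirun -np 48 sxisac2.py bdb:converted_stack#data isac_output/ --radius=120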

Try it out! Please send questions or comments to cosmic2support@umich.edu.

New tools added: LocBfactor, LocSpiral, LocOccupancy

We have added new 3D volume enhancement tools to COSMIC²:

  • LocSpiral
  • LocBfactor
  • LocBfactor Estimation
  • LocOccupancy

Preprint:

S. Kaur, J. Gomez-Blanco, A. Khalifa, S. Adinarayanan, R. Sanchez-Garcia, D. Wrapp, J. S. McLellan, K. H. Bui, J. Vargas, Local computational methods to improve the interpretability and analysis of cryo-EM maps. bioRxiv

These tools provide local sharpening, which helps achieve consistent sharpening across a given 3D reconstruction. For example, a reconstruction typically contains flexible regions that require different filtering and B-factor sharpening in order to be visualized for model building. These tools allow users to apply different sharpening values to different areas of the map.
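For context, conventional global sharpening scales every Fourier amplitude by a single factor, F_sharpened(s) = F(s) · exp(-B·s²/4), where s is the spatial frequency and a negative B boosts high-resolution features uniformly across the whole map. The Loc* tools instead estimate and apply values that vary from region to region, so rigid, well-resolved areas and flexible, poorly-resolved areas each receive an appropriate amount of sharpening.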

As an example, here is a figure from the preprint highlighting the improvement of map density for the Spike protein:

Try it out! Please send questions or comments to cosmic2support@umich.edu.

New tool added: cryoDRGN

We are excited to include cryoDRGN in the suite of software available on the COSMIC² science gateway. cryoDRGN is a deep-learning-based heterogeneous reconstruction algorithm that utilizes variational autoencoders to embed single-particle images into a latent space, which can then be decoded into distinct conformations and their populations.

What does this mean for your single-particle cryoEM project? cryoDRGN allows you to assess the underlying heterogeneity that remains after a consensus 3D refinement into a single 3D reconstruction. An example of how this can be used was highlighted in the preprint, where the authors separated distinct conformations of assembling ribosomes based on the latent space embedding:

As explained by the authors on their GitHub repo page, this approach is experimental and requires empirically determining the best way for the variational autoencoder to encode and decode the data into latent space. Let us know if it works for you and whether additional features or options would help you use cryoDRGN.
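For reference, a typical cryoDRGN run on the command line looks roughly like the sketch below (the gateway’s web form fills in these parameters for you; file names, box size, and pixel size are placeholders, and exact flags may differ between cryoDRGN versions):

# extract particle poses and CTF parameters from the consensus refinement star file
$ cryodrgn parse_pose_star particles.star -D 256 -o pose.pkl
$ cryodrgn parse_ctf_star particles.star -D 256 --Apix 1.0 -o ctf.pkl
# train the variational autoencoder with an 8-dimensional latent space for 25 epochs
$ cryodrgn train_vae particles.mrcs --poses pose.pkl --ctf ctf.pkl --zdim 8 -n 25 -o cryodrgn_out/
# visualize the latent space and generate representative volumes from the final epoch (epochs are 0-indexed)
$ cryodrgn analyze cryodrgn_out/ 24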

To help users navigate outputs from cryoDRGN, be sure to check out our YouTube tutorial on downloading and visualizing the results.

New tool added: DeepEMhancer

We have added a new tool to the platform: DeepEMhancer. DeepEMhancer utilizes a deep-learning-based approach for non-linear processing of 3D reconstructions to produce a sharpening-like effect on the data.
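As a rough sketch of how it is run on the command line (the gateway wraps this step for you; file names are placeholders, and exact flags may differ between DeepEMhancer versions):

# post-process the unfiltered half maps with the pre-trained deep-learning model
$ deepemhancer -i half_map_1.mrc -i2 half_map_2.mrc -o map_deepemhancer.mrc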

We are seeing great results so far on datasets we’ve analyzed in the Cianfrocco lab, as well as those shared by Dr. Oliver Clarke on Twitter:

–> Preprint publication for DeepEMhancer on bioRxiv

–> Read about the tool on COSMIC² here

Step-by-step tutorial added

To help users learn about data formatting, job submission, and job retrieval, we have created a publicly shared Globus endpoint for a small dataset (~1000 particles) of the Thermoplasma acidophilum 20S proteasome (T20S) from Campbell et al. 2015, a dataset that is deposited in the EMPIAR cryoEM archive as EMPIAR-10025.

After successful completion, users will generate 2D class averages of T20S such as those shown here:

–> Link to Tutorial

Video tutorials added

To help engage new users of the gateway, we have started a series of video tutorials highlighting how to use the gateway. They are incorporated throughout the site, and you can also find them on our YouTube playlist!

Singularity on COSMIC2

We are really excited to start using Singularity containers on COSMIC². For those who don’t know, Singularity (like its predecessor, Docker) is a way to ‘containerize’ your software. This means you can design and build a custom operating system image with all of the correct software dependencies.

We had been wanting to do this for a little while, but we were finally pushed into it when we started to incorporate crYOLO into our software platform. In short, since crYOLO runs the deep learning software TensorFlow, we needed to be running a newer version of Linux (CentOS 7). However, SDSC Comet is still running CentOS 6, which meant that we were at an impasse for running this software.

Enter Singularity – these containers allowed us to install Ubuntu and crYOLO into their own ‘image’, a standalone environment capable of running crYOLO anywhere. With this new image, all we have to do in order to run crYOLO on any CPU machine is type:

$ singularity exec sdsc-comet-ubuntu-cryolo-cpu.simg cryolo_predict.py -c config.json -w gmodel_phosnet_20181221_loss0037.h5 -i micrographs/ -o micrographs/cryolo -t 0.2

This is a big step forward for anyone who has tried to set this up before!

Since the built images are ~7 GB, we can’t share them directly on GitHub, so instead we are sharing the definition files. Please take a look and try it out if you are so inclined!
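If you would like to build the image yourself from one of the shared definition files, the build is a single command (the definition file name below is a placeholder for whichever recipe you grab):

# build a standalone ~7 GB image from the shared definition file (requires root)
$ sudo singularity build sdsc-comet-ubuntu-cryolo-cpu.simg cryolo-cpu-ubuntu.def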

Benchmarking RELION2 GPU-accelerated jobs on Comet-GPU nodes

Below you will find the results of running the standard RELION2 benchmark on a number of different node configurations to optimize speed vs. ‘cost’ (in service units, SUs).

Optimal configuration for COSMIC² users: 8 x K80 GPUs (spread across 2 nodes) – nearly double the speed of 4 x K80, for only a fraction more SUs.

Fastest analysis: 12 x P100 GPUs (which also made it the most ‘expensive’)

RELION benchmarking test set (link)

  • Job type (all runs): RELION 3D Classification – v2.1.b1; 25 iterations
  • Data info (all runs): 105,247 particles; 360 x 360 pixels

  Compute type       Elapsed time    SUs
  GPU (4 x P100)     3 hr 14 min     19.5
  GPU (8 x P100)     1 hr 43 min     21
  GPU (12 x P100)    1 hr 25 min     23
  GPU (4 x K80)      3 hr 42 min     15
  GPU (8 x K80)      2 hr 2 min      16
  GPU (12 x K80)     1 hr 42 min     20.4