Bio-transformers

Table of contents

Description
Installation
Usage
Roadmap
Citations
License

Bio-transformers

bio-transformers is a Python wrapper on top of the ESM and ProtBert models, which are Transformer protein language models trained on millions of proteins and used to predict embeddings. This package provides additional functionality, such as computing the loglikelihood of a protein or computing embeddings on multiple GPUs.

You can find the original repositories here:

ESM
Protbert

Installation

It is recommended to work with conda environments to manage the specific dependencies of the package.

  conda create --name bio-transformers python=3.7 -y
  conda activate bio-transformers
  pip install bio-transformers
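
Once the commands above have run, a quick way to confirm that the package is visible from the active environment is to look up its import name. The snippet below is only a minimal sketch using the Python standard library; `is_installed` is a hypothetical helper written for this example, not part of bio-transformers, and `biotransformers` is the import name used in the examples of this README.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found in the current environment."""
    return importlib.util.find_spec(module_name) is not None


# "biotransformers" is the import name used throughout this README.
print("biotransformers installed:", is_installed("biotransformers"))
```

If this prints `False`, the most likely cause is that the conda environment created above is not the active one.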

Usage

Quick start

The main class, BioTransformers, lets the developer use the ProtBert and ESM backends.

>>> from biotransformers import BioTransformers
>>> BioTransformers.list_backend()
Use backend in this list :

  *   esm1_t34_670M_UR100
  *   esm1_t6_43M_UR50S
  *   esm1b_t33_650M_UR50S
  *   esm_msa1_t12_100M_UR50S
  *   protbert
  *   protbert_bfd

Embeddings

Choose a backend and pass a list of amino-acid sequences to compute the embeddings. By default, the compute_embeddings function returns the <CLS> token embedding. You can also pass a pooling_list, for example to compute the mean of the token embeddings.

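from biotransformers import BioTransformers

# Two example protein sequences, written as strings of amino-acid letters.
sequences = [
    "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG",
    "KALTARQQEVFDLIRDHISQTGMPPTRAEIAQRLGFRSPNAAEEHLKALARKGVIEIVSGASRGIRLLQEE",
]

# Load the ProtBert backend and request mean pooling in addition to the default.
bio_trans = BioTransformers(backend="protbert")
embeddings = bio_trans.compute_embeddings(sequences, pooling_list=['mean'])

cls_emb = embeddings['cls']    # <CLS> token embedding (returned by default)
mean_emb = embeddings['mean']  # mean of the token embeddings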
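
To make the difference between the two pooling modes concrete, the sketch below reproduces them on a toy matrix of token embeddings in plain Python. It assumes that the <CLS> token sits at position 0 and that mean pooling averages over every row of the matrix; both are assumptions made for illustration (the library may, for instance, exclude special tokens from the mean). The helper names `cls_pooling` and `mean_pooling` are hypothetical and are not part of bio-transformers.

```python
from typing import List

Matrix = List[List[float]]  # one row per token, one column per embedding dimension


def cls_pooling(token_embeddings: Matrix) -> List[float]:
    """Return the embedding of the first token (assumed here to be <CLS>)."""
    return list(token_embeddings[0])


def mean_pooling(token_embeddings: Matrix) -> List[float]:
    """Average the token embeddings dimension by dimension."""
    n_tokens = len(token_embeddings)
    return [sum(column) / n_tokens for column in zip(*token_embeddings)]


# Toy example: 3 tokens, embedding dimension 2 (values are made up).
toy = [
    [1.0, 0.0],  # assumed <CLS> position
    [3.0, 2.0],
    [5.0, 4.0],
]
cls_vec = cls_pooling(toy)    # [1.0, 0.0]
mean_vec = mean_pooling(toy)  # [3.0, 2.0]
```

The <CLS> vector depends on a single position, while the mean vector summarizes every token, which is why the two can behave differently on downstream tasks.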

Pseudo-Loglikelihood

The protein loglikelihood is a metric that estimates the joint probability of observing a given sequence of amino acids. The idea behind such an estimator is to approximate the probability that a mutated protein will be “natural” and can actually be produced by a cell.

These metrics rely on transformer language models. These models are trained to predict a “masked” amino acid in a sequence. As a consequence, they can provide an estimate of the probability of observing an amino acid given its “context” (the surrounding amino acids). By multiplying the individual probabilities computed for each amino acid given its context, we obtain a pseudo-likelihood, which is a candidate estimator of a sequence's stability.

from biotransformers import BioTransformers

sequences = [
    "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG",
    "KALTARQQEVFDLIRDHISQTGMPPTRAEIAQRLGFRSPNAAEEHLKALARKGVIEIVSGASRGIRLLQEE",
]

# Run the ProtBert backend on the first CUDA device.
bio_trans = BioTransformers(backend="protbert", device="cuda:0")
loglikelihood = bio_trans.compute_loglikelihood(sequences)
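
The computation described in this section can be sketched in plain Python. The example assumes that we already have, for each position, the probability the model assigns to the true amino acid when that position is masked. The probabilities below are made up, and `pseudo_loglikelihood` is a hypothetical helper written for illustration, not the library's implementation. Summing log-probabilities is equivalent to multiplying the probabilities, but it avoids numerical underflow on long sequences.

```python
import math
from typing import Sequence


def pseudo_loglikelihood(position_probs: Sequence[float]) -> float:
    """Sum of log p(residue_i | context), one term per masked position."""
    if any(p <= 0.0 or p > 1.0 for p in position_probs):
        raise ValueError("probabilities must lie in (0, 1]")
    return sum(math.log(p) for p in position_probs)


# Made-up probabilities for a 4-residue sequence.
probs = [0.5, 0.25, 0.5, 0.125]
pll = pseudo_loglikelihood(probs)

# Exponentiating the sum recovers the product of the individual probabilities.
product = 1.0
for p in probs:
    product *= p
```

A higher (less negative) value means the model finds each residue more plausible in its context, which is how the score can be used to compare a mutated sequence against a reference.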

Roadmap

Support multi-GPU forward pass
Support MSA transformers
Add compute_accuracy functionality
Support fine-tuning of models

Citations

License

This source code is licensed under the Apache 2 license found in the LICENSE file in the root directory.
