bio-transformers is a Python wrapper around the ESM and ProtBert models, which are Transformer-based protein language models trained on millions of proteins and used to compute embeddings. The package also provides additional functionality, such as computing the loglikelihood of a protein and computing embeddings on multiple GPUs.
You can find the original repo here :
It is recommended to work with conda environments in order to manage the specific dependencies of the package.
conda create --name bio-transformers python=3.7 -y
conda activate bio-transformers
pip install bio-transformers
The main class, BioTransformers, lets the developer use the ProtBert and ESM backends.
>>> from biotransformers import BioTransformers
>>> BioTransformers.list_backend()
Use backend in this list :
* esm1_t34_670M_UR100
* esm1_t6_43M_UR50S
* esm1b_t33_650M_UR50S
* esm_msa1_t12_100M_UR50S
* protbert
* protbert_bfd
Choose a backend and pass a list of amino-acid sequences to compute the embeddings.
By default, the compute_embeddings function returns the <CLS> token embedding. You can also pass a pooling_list to additionally compute the mean of the token embeddings.
from biotransformers import BioTransformers
sequences = [
"MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG",
"KALTARQQEVFDLIRDHISQTGMPPTRAEIAQRLGFRSPNAAEEHLKALARKGVIEIVSGASRGIRLLQEE",
]
bio_trans = BioTransformers(backend="protbert")
embeddings = bio_trans.compute_embeddings(sequences, pooling_list=['mean'])
cls_emb = embeddings['cls']
mean_emb = embeddings['mean']
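To make the 'mean' pooling option concrete, here is a minimal sketch of mean pooling over per-token vectors. It is an illustration only, not the package's implementation: the helper name mean_pool and the toy vectors are hypothetical, and whether bio-transformers includes special tokens in the average is not stated here.

```python
def mean_pool(token_embeddings):
    """Average a list of per-token vectors into one fixed-size vector.

    Hypothetical helper for illustration; it averages every vector it is
    given and makes no assumption about special tokens.
    """
    if not token_embeddings:
        raise ValueError("need at least one token embedding")
    dim = len(token_embeddings[0])
    count = len(token_embeddings)
    # Average each embedding dimension across all tokens.
    return [sum(vec[i] for vec in token_embeddings) / count for i in range(dim)]


# Toy example: three tokens, each with a 2-dimensional embedding.
toy_tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
pooled = mean_pool(toy_tokens)
```

The result has the same dimension as a single token embedding, regardless of the sequence length, which is what makes pooled embeddings convenient as fixed-size inputs to downstream models.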
The protein loglikelihood is a metric that estimates the joint probability of observing a given sequence of amino acids. The idea behind such an estimator is to approximate the probability that a mutated protein is “natural” and can effectively be produced by a cell.
These metrics rely on transformer language models. These models are trained to predict a “masked” amino acid in a sequence. As a consequence, they provide an estimate of the probability of observing an amino acid given its “context” (the surrounding amino acids). Multiplying the individual probabilities computed for each amino acid given its context yields a pseudo-likelihood, which is a candidate estimator of a sequence's stability.
from biotransformers import BioTransformers
sequences = [
"MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG",
"KALTARQQEVFDLIRDHISQTGMPPTRAEIAQRLGFRSPNAAEEHLKALARKGVIEIVSGASRGIRLLQEE",
]
bio_trans = BioTransformers(backend="protbert", device="cuda:0")
loglikelihood = bio_trans.compute_loglikelihood(sequences)
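The pseudo-likelihood described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the code behind compute_loglikelihood: the function name pseudo_loglikelihood and the toy probabilities are hypothetical, and the per-position probabilities are assumed to come from masking each position in turn. Summing log-probabilities is equivalent to multiplying the probabilities and avoids numerical underflow on long sequences.

```python
import math


def pseudo_loglikelihood(sequence, position_probs):
    """Sum the log-probability of the observed residue at each position.

    position_probs[i] is assumed to map each amino acid to the model's
    probability for position i when that position is masked.
    """
    if len(sequence) != len(position_probs):
        raise ValueError("one probability table is required per residue")
    total = 0.0
    for residue, probs in zip(sequence, position_probs):
        # Log of a product of probabilities is the sum of their logs.
        total += math.log(probs[residue])
    return total


# Toy two-residue "protein" with made-up probabilities over two amino acids.
toy_probs = [
    {"M": 0.5, "K": 0.5},
    {"M": 0.25, "K": 0.75},
]
score = pseudo_loglikelihood("MK", toy_probs)
```

Exponentiating the score recovers the product of the individual probabilities (0.5 × 0.75 for the toy input), so a higher, less negative score means the model finds the sequence more plausible.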
- Support multi-GPU forward
- Support MSA transformers
- Add compute_accuracy functionality
- Support fine-tuning of models
This source code is licensed under the Apache 2 license found in the LICENSE file in the root directory.