From 033282d022070b55140b3073f3d6e5d667965561 Mon Sep 17 00:00:00 2001 From: t-kimber Date: Tue, 10 Aug 2021 14:05:05 +0200 Subject: [PATCH 1/7] :tada: start branch From 965d57ec573c70a0fb1cf92375d6c2aedf34ce04 Mon Sep 17 00:00:00 2001 From: t-kimber Date: Mon, 23 Aug 2021 22:37:14 +0200 Subject: [PATCH 2/7] retrieve klifs pocket sequence. --- .../T024_kinase_similarity_sequence/README.md | 78 ++++ .../data/README.md | 6 + .../images/README.md | 5 + .../talktorial.ipynb | 345 ++++++++++++++++++ 4 files changed, 434 insertions(+) create mode 100644 teachopencadd/talktorials/T024_kinase_similarity_sequence/README.md create mode 100644 teachopencadd/talktorials/T024_kinase_similarity_sequence/data/README.md create mode 100644 teachopencadd/talktorials/T024_kinase_similarity_sequence/images/README.md create mode 100644 teachopencadd/talktorials/T024_kinase_similarity_sequence/talktorial.ipynb diff --git a/teachopencadd/talktorials/T024_kinase_similarity_sequence/README.md b/teachopencadd/talktorials/T024_kinase_similarity_sequence/README.md new file mode 100644 index 00000000..a38d14a4 --- /dev/null +++ b/teachopencadd/talktorials/T024_kinase_similarity_sequence/README.md @@ -0,0 +1,78 @@ +
+ +Thank you for contributing to TeachOpenCADD! + +
+ + +
+ +Set up your PR: Please check out our issue on how to set up a PR for new talktorials, including standard checks and TODOs. + +
+ + +# T000 · Talktorial topic title + +Authors: + +- First and last name, year(s) of contribution, lab, institution +- First and last name, year(s) of contribution, lab, institution + + +*The examples used in this talktorial template are taken from [__Talktorial T001__](https://github.com/volkamerlab/teachopencadd/blob/master/teachopencadd/talktorials/T001_query_chembl/talktorial.ipynb) and [__Talktorial T002__](https://github.com/volkamerlab/teachopencadd/blob/master/teachopencadd/talktorials/T002_compound_adme/talktorial.ipynb).* + + +
+ +Cross-referencing talktorials: If you want to cross-reference to existing talktorials in your notebook, please use the following formatting: Talktorial T000. + +
+ + +## Aim of this talktorial + +Add a short summary of this talktorial's content. + + +### Contents in *Theory* + +_Add Table of Contents (TOC) for Theory section._ + +* ChEMBL database +* Compound activity measures + + +
+ +Sync TOC with section titles: These points should refer to the headlines of your Theory section. + +
+ + +### Contents in *Practical* + +_Add Table of Contents (TOC) for Practical section._ + +* Connect to ChEMBL database +* Load and draw molecules + + +
+ +Sync TOC with section titles: These points should refer to the headlines of your Practical section. + +
+ + +### References + +* Paper +* Tutorial links +* Other useful resources + +*We suggest the following citation style:* +* Keyword describing resource: Journal (year), volume, pages (link to resource) + +*Example:* +* ChEMBL web services: [Nucleic Acids Res. (2015), 43, 612-620](https://academic.oup.com/nar/article/43/W1/W612/2467881) diff --git a/teachopencadd/talktorials/T024_kinase_similarity_sequence/data/README.md b/teachopencadd/talktorials/T024_kinase_similarity_sequence/data/README.md new file mode 100644 index 00000000..cc6d1e4c --- /dev/null +++ b/teachopencadd/talktorials/T024_kinase_similarity_sequence/data/README.md @@ -0,0 +1,6 @@ +# Data + +This folder stores input and output data for the Jupyter notebook. + +- `xxx.csv`: Describe data. +- `xxx.sdf`: Describe data. diff --git a/teachopencadd/talktorials/T024_kinase_similarity_sequence/images/README.md b/teachopencadd/talktorials/T024_kinase_similarity_sequence/images/README.md new file mode 100644 index 00000000..d4ebaa47 --- /dev/null +++ b/teachopencadd/talktorials/T024_kinase_similarity_sequence/images/README.md @@ -0,0 +1,5 @@ +# Talktorial title + +## Images + +This folder stores images used in the Jupyter notebook. diff --git a/teachopencadd/talktorials/T024_kinase_similarity_sequence/talktorial.ipynb b/teachopencadd/talktorials/T024_kinase_similarity_sequence/talktorial.ipynb new file mode 100644 index 00000000..d5f4557d --- /dev/null +++ b/teachopencadd/talktorials/T024_kinase_similarity_sequence/talktorial.ipynb @@ -0,0 +1,345 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# T024 · Kinase similarity: sequence\n", + "\n", + "Authors:\n", + "\n", + "- Talia B. 
Kimber, 2021, [Volkamer lab, Charité](https://volkamerlab.org/)\n", + "- Dominique Sydow, 2021, [Volkamer lab, Charité](https://volkamerlab.org/)\n", + "- Andrea Volkamer, 2021, [Volkamer lab, Charité](https://volkamerlab.org/)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Aim of this talktorial\n", + "\n", + "Add a short summary of this talktorial's content." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Contents in *Theory*\n", + "\n", + "* Kinase dataset\n", + "* Kinase similarity descriptor: XXX" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Contents in *Practical*\n", + "\n", + "* Retrieve and preprocess data\n", + "* Show kinase coverage\n", + "* Compare kinases\n", + "* Visualize similarity as kinase matrix\n", + "* Visualize similarity as phylogenetic tree" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### References\n", + "\n", + "* Kinase dataset: [Molecules (2021), 26(3), 629](https://www.mdpi.com/1420-3049/26/3/629) \n", + "* Kinase similarity descriptor: XXX" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Theory" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Kinase dataset" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We will use nine kinases from [Molecules (2021), 26(3), 629](https://www.mdpi.com/1420-3049/26/3/629) because:\n", + "\n", + "> We aggregated the investigated kinases in “profiles” (Table 2). Profile 1 combined EGFR and ErbB2 as targets (indicated by a ‘+’) and BRAF (from rapidly accelerated fibrosarcoma isoform B) as a (general) anti-target (designated by a ‘—’). Out of similar considerations, Profile 2 consisted of EGFR and PI3K as targets and BRAF as anti-target. This profile is expected to be more challenging as PI3K is an atypical kinase and thus less similar to EGFR than for example ErbB2 used in Profile 1. 
Profile 3, comprised of EGFR and VEGFR2 as targets and BRAF as anti-target, was contrasted with the hit rate that we found with a standard docking against the single target VEGFR2 (Profile 4).\n", + "> To broaden the comparison and obtain an estimate for the promiscuity of each compound, the kinases CDK2 (cyclic-dependent kinase 2), LCK (lymphocyte-specific protein tyrosine kinase), MET (mesenchymal-epithelial transition factor) and p38α (p38 mitogen activated protein kinase α) were included in the experimental assay panel and the structure-based bioinformatics comparison as commonly used anti-targets." + ] + }, + { + "attachments": { + "814048cb-e723-4b10-b9f7-2b56848688d9.png": { + "image/png": "" + } + }, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "![image.png](attachment:814048cb-e723-4b10-b9f7-2b56848688d9.png)\n", + "\n", + "*Figure 1:* \n", + "Kinases used in this notebook, taken from [Molecules (2021), 26(3), 629](https://www.mdpi.com/1420-3049/26/3/629) (Table 1)." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Kinase similarity descriptor: sequence\n", + "\n", + "Describe the dataset describing kinase similarity and how we use it.\n", + "\n", + "- XXX = KLIFS pocket sequence" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Practical" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "# !pip install flake8 pycodestyle_magic\n", + "%load_ext pycodestyle_magic\n", + "%pycodestyle_on" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "from pathlib import Path\n", + "\n", + "import pandas as pd\n", + "import requests" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "HERE = Path(_dh[-1])\n", + "DATA = HERE / \"data\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Retrieve and preprocess data" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [], + "source": [ + "query_kinases = ['EGFR',\n", + " 'ErbB2',\n", + " 'BRAF',\n", + " 'CDK2',\n", + " 'LCK',\n", + " 'MET',\n", + " 'p38a',\n", + " 'KDR',\n", + " 'p110a']" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [], + "source": [ + "def klifs_pocket_sequence(kinase_name):\n", + " \"\"\"\n", + " Retrieves the pocket sequence from KLIFS using the API.\n", + "\n", + " Parameters\n", + " ----------\n", + " kinase_name : str\n", + " The name of the kinase of interest.\n", + "\n", + " Returns\n", + " -------\n", + " str :\n", + " The 85 residues pocket sequence from KLIFS,\n", + " if the kinase name is valid, None otherwise.\n", + " \"\"\"\n", + " response = requests.get(f\"https://klifs.net/api/\"\n", + " f\"kinase_ID?kinase_name={kinase_name}\"\n", + " f\"&species=HUMAN\")\n", + "\n", + " if response.status_code == 200:\n", + " 
return response.json()[0]['pocket']\n", + " else:\n", + " print(f'KLIFS failed for kinase {kinase_name}')\n", + " return None" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "KVLGSGAFGTVYKVAIKELEILDEAYVMASVDPHVCRLLGIQLITQLMPFGCLLDYVREYLEDRRLVHRDLAARNVLVITDFGLA\n", + "KVLGSGAFGTVYKVAIKVLEILDEAYVMAGVGPYVSRLLGIQLVTQLMPYGCLLDHVREYLEDVRLVHRDLAARNVLVITDFGLA\n", + "QRIGSGSFGTVYKVAVKMLAFKNEVGVLRKTRVNILLFMGYAIVTQWCEGSSLYHHLHIYLHAKSIIHRDLKSNNIFLIGDFGLA\n", + "EKIGEGTYGVVYKVALKKITAIREISLLKELNPNIVKLLDVYLVFEFLH-QDLKKFMDAFCHSHRVLHRDLKPQNLLILADFGLA\n", + "ERLGAGQFGEVWMVAVKSLAFLAEANLMKQLQQRLVRLYAVYIITEYMENGSLVDFLKTFIEERNYIHRDLRAANILVIADFGLA\n", + "EVIGRGHFGCVYHCAVKSLQFLTEGIIMKDFSPNVLSLLGILVVLPYMKHGDLRNFIRNYLASKKFVHRDLAARNCMLVADFGLA\n", + "SPVGSGAYGSVCAVAVKKLRTYRELRLLKHMKENVIGLLDVYLVTHLMG-ADLNNIVKCYIHSADIIHRDLKPSNLAVILDFGLA\n", + "KPLGRGAFGQVIEVAVKMLALMSELKILIHIGLNVVNLLGAMVIVEFCKFGNLSTYLRSFLASRKCIHRDLAARNILLICDFGLA\n", + "CRIMSSAKRPLWLIIFKNGDLRQDMLTLQIIRLRMLPYGCLVGLIEVVRSHTIMQIQCKATFI--LGIGDRHNSNIMVHIDFGHF\n" + ] + } + ], + "source": [ + "for kinase in query_kinases:\n", + " print(klifs_pocket_sequence(kinase))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Show kinase coverage" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Compare kinases" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Visualize similarity as kinase matrix" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Visualize similarity as phylogenetic tree" + ] + }, + 
{ + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Discussion\n", + "\n", + "Wrap up the talktorial's content here and discuss pros/cons and open questions/challenges." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Quiz\n", + "\n", + "Ask three questions that the user should be able to answer after doing this talktorial. Choose important take-aways from this talktorial for your questions.\n", + "\n", + "1. Question\n", + "2. Question\n", + "3. Question" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.10" + }, + "toc-autonumbering": true, + "widgets": { + "application/vnd.jupyter.widget-state+json": { + "state": {}, + "version_major": 2, + "version_minor": 0 + } + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} From 16b0c73af61fac3471264f9147bc702c5b3543dc Mon Sep 17 00:00:00 2001 From: t-kimber Date: Tue, 24 Aug 2021 09:49:25 +0200 Subject: [PATCH 3/7] Add identity sequence comparison. 
--- .../talktorial.ipynb | 286 ++++++++++++++++-- 1 file changed, 263 insertions(+), 23 deletions(-) diff --git a/teachopencadd/talktorials/T024_kinase_similarity_sequence/talktorial.ipynb b/teachopencadd/talktorials/T024_kinase_similarity_sequence/talktorial.ipynb index d5f4557d..291f9eac 100644 --- a/teachopencadd/talktorials/T024_kinase_similarity_sequence/talktorial.ipynb +++ b/teachopencadd/talktorials/T024_kinase_similarity_sequence/talktorial.ipynb @@ -132,7 +132,9 @@ "from pathlib import Path\n", "\n", "import pandas as pd\n", - "import requests" + "import numpy as np\n", + "import requests\n", + "import biotite.sequence.align as align" ] }, { @@ -149,7 +151,14 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### Retrieve and preprocess data" + "### Retrieve sequences from KLIFS" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We start by listing the kinases of interest." ] }, { @@ -169,6 +178,13 @@ " 'p110a']" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We use KLIFS' API to retrieve the 85-long pocket sequence for each kinase." 
+ ] + }, { "cell_type": "code", "execution_count": 5, @@ -207,53 +223,131 @@ "metadata": {}, "outputs": [ { - "name": "stdout", - "output_type": "stream", - "text": [ - "KVLGSGAFGTVYKVAIKELEILDEAYVMASVDPHVCRLLGIQLITQLMPFGCLLDYVREYLEDRRLVHRDLAARNVLVITDFGLA\n", - "KVLGSGAFGTVYKVAIKVLEILDEAYVMAGVGPYVSRLLGIQLVTQLMPYGCLLDHVREYLEDVRLVHRDLAARNVLVITDFGLA\n", - "QRIGSGSFGTVYKVAVKMLAFKNEVGVLRKTRVNILLFMGYAIVTQWCEGSSLYHHLHIYLHAKSIIHRDLKSNNIFLIGDFGLA\n", - "EKIGEGTYGVVYKVALKKITAIREISLLKELNPNIVKLLDVYLVFEFLH-QDLKKFMDAFCHSHRVLHRDLKPQNLLILADFGLA\n", - "ERLGAGQFGEVWMVAVKSLAFLAEANLMKQLQQRLVRLYAVYIITEYMENGSLVDFLKTFIEERNYIHRDLRAANILVIADFGLA\n", - "EVIGRGHFGCVYHCAVKSLQFLTEGIIMKDFSPNVLSLLGILVVLPYMKHGDLRNFIRNYLASKKFVHRDLAARNCMLVADFGLA\n", - "SPVGSGAYGSVCAVAVKKLRTYRELRLLKHMKENVIGLLDVYLVTHLMG-ADLNNIVKCYIHSADIIHRDLKPSNLAVILDFGLA\n", - "KPLGRGAFGQVIEVAVKMLALMSELKILIHIGLNVVNLLGAMVIVEFCKFGNLSTYLRSFLASRKCIHRDLAARNILLICDFGLA\n", - "CRIMSSAKRPLWLIIFKNGDLRQDMLTLQIIRLRMLPYGCLVGLIEVVRSHTIMQIQCKATFI--LGIGDRHNSNIMVHIDFGHF\n" - ] + "data": { + "text/plain": [ + "{'EGFR': 'KVLGSGAFGTVYKVAIKELEILDEAYVMASVDPHVCRLLGIQLITQLMPFGCLLDYVREYLEDRRLVHRDLAARNVLVITDFGLA',\n", + " 'ErbB2': 'KVLGSGAFGTVYKVAIKVLEILDEAYVMAGVGPYVSRLLGIQLVTQLMPYGCLLDHVREYLEDVRLVHRDLAARNVLVITDFGLA',\n", + " 'BRAF': 'QRIGSGSFGTVYKVAVKMLAFKNEVGVLRKTRVNILLFMGYAIVTQWCEGSSLYHHLHIYLHAKSIIHRDLKSNNIFLIGDFGLA',\n", + " 'CDK2': 'EKIGEGTYGVVYKVALKKITAIREISLLKELNPNIVKLLDVYLVFEFLH-QDLKKFMDAFCHSHRVLHRDLKPQNLLILADFGLA',\n", + " 'LCK': 'ERLGAGQFGEVWMVAVKSLAFLAEANLMKQLQQRLVRLYAVYIITEYMENGSLVDFLKTFIEERNYIHRDLRAANILVIADFGLA',\n", + " 'MET': 'EVIGRGHFGCVYHCAVKSLQFLTEGIIMKDFSPNVLSLLGILVVLPYMKHGDLRNFIRNYLASKKFVHRDLAARNCMLVADFGLA',\n", + " 'p38a': 'SPVGSGAYGSVCAVAVKKLRTYRELRLLKHMKENVIGLLDVYLVTHLMG-ADLNNIVKCYIHSADIIHRDLKPSNLAVILDFGLA',\n", + " 'KDR': 'KPLGRGAFGQVIEVAVKMLALMSELKILIHIGLNVVNLLGAMVIVEFCKFGNLSTYLRSFLASRKCIHRDLAARNILLICDFGLA',\n", + " 'p110a': 'CRIMSSAKRPLWLIIFKNGDLRQDMLTLQIIRLRMLPYGCLVGLIEVVRSHTIMQIQCKATFI--LGIGDRHNSNIMVHIDFGHF'}" + ] + }, + 
"execution_count": 6, + "metadata": {}, + "output_type": "execute_result" } ], "source": [ + "kinase_sequences = {}\n", "for kinase in query_kinases:\n", - " print(klifs_pocket_sequence(kinase))" + " kinase_sequences[kinase] = klifs_pocket_sequence(kinase)\n", + "kinase_sequences" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "### Show kinase coverage" + "### Compare kinases" ] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 7, "metadata": {}, "outputs": [], - "source": [] + "source": [ + "def sequence_similarity(kinase_name1, kinase_name2, type_=\"identity\"):\n", + " \"\"\"\n", + " Compares two sequences using a given metric.\n", + "\n", + " kinase_name1, kinase_name2 : str\n", + " The two names of the kinases for comparison.\n", + "\n", + " type_ : str, default = \"identity\"\n", + "\n", + " Returns\n", + " -------\n", + " float :\n", + " The similarity between the pocket sequences of the two kinases.\n", + " \"\"\"\n", + " sequence_1 = klifs_pocket_sequence(kinase_name1)\n", + " sequence_2 = klifs_pocket_sequence(kinase_name2)\n", + "\n", + " if len(sequence_1) != len(sequence_2):\n", + " print(\"Mismatch in sequence lengths.\")\n", + " return None\n", + " else:\n", + " if type_ == \"identity\":\n", + " # True is the character is the same, False otherwise\n", + " is_match = np.compare_chararrays(np.array(list(sequence_1)),\n", + " np.array(list(sequence_2)),\n", + " cmp=\"==\",\n", + " rstrip=True)\n", + " similarity_normed = sum(is_match)/len(sequence_1)\n", + " return similarity_normed\n", + " else:\n", + " print(\"type not implemented yet.\")\n", + " return None" + ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "### Compare kinases" + "Let's look at the sequence similarity between EGFR and MET:" ] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 8, "metadata": {}, - "outputs": [], - "source": [] + "outputs": [ + { + "data": { + "text/plain": [ + "0.4588235294117647" + ] + }, + 
"execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "sequence_similarity(\"EGFR\", \"MET\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As expected, the similarity between a kinase and itself yields the highest possible score:" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "1.0" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "sequence_similarity(\"EGFR\", \"EGFR\")" + ] }, { "cell_type": "markdown", @@ -262,6 +356,152 @@ "### Visualize similarity as kinase matrix" ] }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([[1. , 0.89411765, 0.37647059, 0.31764706, 0.44705882,\n", + " 0.45882353, 0.38823529, 0.47058824, 0.11764706],\n", + " [0.89411765, 1. , 0.4 , 0.32941176, 0.42352941,\n", + " 0.47058824, 0.4 , 0.43529412, 0.11764706],\n", + " [0.37647059, 0.4 , 1. , 0.32941176, 0.38823529,\n", + " 0.37647059, 0.37647059, 0.4 , 0.15294118],\n", + " [0.31764706, 0.32941176, 0.32941176, 1. , 0.37647059,\n", + " 0.36470588, 0.47058824, 0.34117647, 0.10588235],\n", + " [0.44705882, 0.42352941, 0.38823529, 0.37647059, 1. ,\n", + " 0.4 , 0.38823529, 0.43529412, 0.14117647],\n", + " [0.45882353, 0.47058824, 0.37647059, 0.36470588, 0.4 ,\n", + " 1. , 0.36470588, 0.47058824, 0.10588235],\n", + " [0.38823529, 0.4 , 0.37647059, 0.47058824, 0.38823529,\n", + " 0.36470588, 1. , 0.38823529, 0.14117647],\n", + " [0.47058824, 0.43529412, 0.4 , 0.34117647, 0.43529412,\n", + " 0.47058824, 0.38823529, 1. , 0.15294118],\n", + " [0.11764706, 0.11764706, 0.15294118, 0.10588235, 0.14117647,\n", + " 0.10588235, 0.14117647, 0.15294118, 1. 
]])" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "table = np.zeros((len(query_kinases), len(query_kinases)))\n", + "for i, kinase_name1 in enumerate(query_kinases):\n", + " for j, kinase_name2 in enumerate(query_kinases):\n", + " table[i, j] = sequence_similarity(kinase_name1,\n", + " kinase_name2)\n", + "table" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " A C D E F G H I K L M N P Q R S T V W Y B Z X *\n", + "A 4 0 -2 -1 -2 0 -2 -1 -1 -1 -1 -2 -1 -1 -1 1 0 0 -3 -2 -2 -1 0 -4\n", + "C 0 9 -3 -4 -2 -3 -3 -1 -3 -1 -1 -3 -3 -3 -3 -1 -1 -1 -2 -2 -3 -3 -2 -4\n", + "D -2 -3 6 2 -3 -1 -1 -3 -1 -4 -3 1 -1 0 -2 0 -1 -3 -4 -3 4 1 -1 -4\n", + "E -1 -4 2 5 -3 -2 0 -3 1 -3 -2 0 -1 2 0 0 -1 -2 -3 -2 1 4 -1 -4\n", + "F -2 -2 -3 -3 6 -3 -1 0 -3 0 0 -3 -4 -3 -3 -2 -2 -1 1 3 -3 -3 -1 -4\n", + "G 0 -3 -1 -2 -3 6 -2 -4 -2 -4 -3 0 -2 -2 -2 0 -2 -3 -2 -3 -1 -2 -1 -4\n", + "H -2 -3 -1 0 -1 -2 8 -3 -1 -3 -2 1 -2 0 0 -1 -2 -3 -2 2 0 0 -1 -4\n", + "I -1 -1 -3 -3 0 -4 -3 4 -3 2 1 -3 -3 -3 -3 -2 -1 3 -3 -1 -3 -3 -1 -4\n", + "K -1 -3 -1 1 -3 -2 -1 -3 5 -2 -1 0 -1 1 2 0 -1 -2 -3 -2 0 1 -1 -4\n", + "L -1 -1 -4 -3 0 -4 -3 2 -2 4 2 -3 -3 -2 -2 -2 -1 1 -2 -1 -4 -3 -1 -4\n", + "M -1 -1 -3 -2 0 -3 -2 1 -1 2 5 -2 -2 0 -1 -1 -1 1 -1 -1 -3 -1 -1 -4\n", + "N -2 -3 1 0 -3 0 1 -3 0 -3 -2 6 -2 0 0 1 0 -3 -4 -2 3 0 -1 -4\n", + "P -1 -3 -1 -1 -4 -2 -2 -3 -1 -3 -2 -2 7 -1 -2 -1 -1 -2 -4 -3 -2 -1 -2 -4\n", + "Q -1 -3 0 2 -3 -2 0 -3 1 -2 0 0 -1 5 1 0 -1 -2 -2 -1 0 3 -1 -4\n", + "R -1 -3 -2 0 -3 -2 0 -3 2 -2 -1 0 -2 1 5 -1 -1 -3 -3 -2 -1 0 -1 -4\n", + "S 1 -1 0 0 -2 0 -1 -2 0 -2 -1 1 -1 0 -1 4 1 -2 -3 -2 0 0 0 -4\n", + "T 0 -1 -1 -1 -2 -2 -2 -1 -1 -1 -1 0 -1 -1 -1 1 5 0 -2 -2 -1 -1 0 -4\n", + "V 0 -1 -3 -2 -1 -3 -3 3 -2 1 1 -3 -2 -2 -3 -2 0 4 -3 -1 -3 -2 -1 -4\n", + "W -3 -2 -4 -3 1 -2 -2 -3 -3 -2 -1 -4 -4 -2 -3 -3 -2 -3 11 2 -4 -3 
-2 -4\n", + "Y -2 -2 -3 -2 3 -3 2 -1 -2 -1 -1 -2 -3 -1 -2 -2 -2 -1 2 7 -3 -2 -1 -4\n", + "B -2 -3 4 1 -3 -1 0 -3 0 -4 -3 3 -2 0 -1 0 -1 -3 -4 -3 4 1 -1 -4\n", + "Z -1 -3 1 4 -3 -2 0 -3 1 -3 -1 0 -1 3 0 0 -1 -2 -3 -2 1 4 -1 -4\n", + "X 0 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2 -1 -1 0 0 -1 -2 -1 -1 -1 -1 -4\n", + "* -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 1\n" + ] + } + ], + "source": [ + "# Obtain BLOSUM62\n", + "substitution_matrix = align.SubstitutionMatrix.std_protein_matrix()\n", + "print(substitution_matrix)" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "biotite.sequence.align.SubstitutionMatrix" + ] + }, + "execution_count": 14, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "type(substitution_matrix)" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "True" + ] + }, + "execution_count": 15, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "substitution_matrix.is_symmetric()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Show kinase coverage" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, { "cell_type": "code", "execution_count": null, From 23c0225c9f817122a570cee29292fca0c871ba0a Mon Sep 17 00:00:00 2001 From: t-kimber Date: Tue, 24 Aug 2021 12:37:11 +0200 Subject: [PATCH 4/7] add substitution metric. 
--- .../talktorial.ipynb | 427 ++++++++++++++++-- 1 file changed, 389 insertions(+), 38 deletions(-) diff --git a/teachopencadd/talktorials/T024_kinase_similarity_sequence/talktorial.ipynb b/teachopencadd/talktorials/T024_kinase_similarity_sequence/talktorial.ipynb index 291f9eac..fc3eab00 100644 --- a/teachopencadd/talktorials/T024_kinase_similarity_sequence/talktorial.ipynb +++ b/teachopencadd/talktorials/T024_kinase_similarity_sequence/talktorial.ipynb @@ -73,25 +73,31 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "We will use nine kinases from [Molecules (2021), 26(3), 629](https://www.mdpi.com/1420-3049/26/3/629) because:\n", + "We will use nine kinases from [Molecules (2021), 26(3), 629](https://www.mdpi.com/1420-3049/26/3/629), which aimed to understand kinase similarities within different combinations of kinase on- and off-targets (also called anti-targets):\n", "\n", - "> We aggregated the investigated kinases in “profiles” (Table 2). Profile 1 combined EGFR and ErbB2 as targets (indicated by a ‘+’) and BRAF (from rapidly accelerated fibrosarcoma isoform B) as a (general) anti-target (designated by a ‘—’). Out of similar considerations, Profile 2 consisted of EGFR and PI3K as targets and BRAF as anti-target. This profile is expected to be more challenging as PI3K is an atypical kinase and thus less similar to EGFR than for example ErbB2 used in Profile 1. 
Profile 3, comprised of EGFR and VEGFR2 as targets and BRAF as anti-target, was contrasted with the hit rate that we found with a standard docking against the single target VEGFR2 (Profile 4).\n", - "> To broaden the comparison and obtain an estimate for the promiscuity of each compound, the kinases CDK2 (cyclic-dependent kinase 2), LCK (lymphocyte-specific protein tyrosine kinase), MET (mesenchymal-epithelial transition factor) and p38α (p38 mitogen activated protein kinase α) were included in the experimental assay panel and the structure-based bioinformatics comparison as commonly used anti-targets." - ] - }, - { - "attachments": { - "814048cb-e723-4b10-b9f7-2b56848688d9.png": { - "image/png": "" - } - }, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "![image.png](attachment:814048cb-e723-4b10-b9f7-2b56848688d9.png)\n", + " \n", "\n", - "*Figure 1:* \n", - "Kinases used in this notebook, taken from [Molecules (2021), 26(3), 629](https://www.mdpi.com/1420-3049/26/3/629) (Table 1)." + "> We aggregated the investigated kinases in “profiles”. Profile 1 combined **EGFR** and **ErbB2** as targets and **BRAF** as a (general) anti-target. Out of similar considerations, Profile 2 consisted of EGFR and **PI3K** as targets and BRAF as anti-target. This profile is expected to be more challenging as PI3K is an atypical kinase and thus less similar to EGFR than for example ErbB2 used in Profile 1. 
Profile 3, comprised of EGFR and **VEGFR2** as targets and BRAF as anti-target, was contrasted with the hit rate that we found with a standard docking against the single target VEGFR2 (Profile 4).\n", + "> To broaden the comparison and obtain an estimate for the promiscuity of each compound, the kinases **CDK2**, **LCK**, **MET** and **p38α** were included in the experimental assay panel and the structure-based bioinformatics comparison as commonly used anti-targets.\n", + "\n", + " \n", + "\n", + "*Table 1:* \n", + "Kinases used in this notebook, taken from [Molecules (2021), 26(3), 629](https://www.mdpi.com/1420-3049/26/3/629), with their synonyms, UniProt IDs, and kinase groups.\n", + "\n", + " \n", + "\n", + "| Kinase | Synonyms | UniProt ID | Group | Full kinase name |\n", + "|----------------------------|------------------------|------------|----------|--------------------------------------------------|\n", + "| EGFR | ErbB1 | P00533 | TK | Epidermal growth factor receptor |\n", + "| ErbB2 | Her2 | P04626 | TK | Erythroblastic leukemia viral oncogene homolog 2 |\n", + "| PI3K | PIK3CA, p110a | P42336 | Atypical | Phosphatidylinositol-3-kinase |\n", + "| VEGFR2 | KDR | P35968 | TK | Vascular endothelial growth factor receptor 2 |\n", + "| BRAF | - | P15056 | TKL | Rapidly accelerated fibrosarcoma isoform B |\n", + "| CDK2 | - | P24941 | CMGC | Cyclin-dependent kinase 2 |\n", + "| LCK | - | P06239 | TK | Lymphocyte-specific protein tyrosine kinase |\n", + "| MET | - | P08581 | TK | Mesenchymal-epithelial transition factor |\n", + "| p38a | MAPK14 | Q16539 | CMGC | p38 mitogen activated protein kinase α |" ] }, { @@ -217,6 +223,13 @@ " return None" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's see what these pocket sequences look like." 
+ ] + }, { "cell_type": "code", "execution_count": 6, "metadata": {}, @@ -252,7 +265,22 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### Compare kinases" + "### Sequence similarity" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Given two kinases, we create a function which computes the sequence similarity between them using one of the two measures discussed in the theory, namely the identity or the substitution." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Identity score\n", + "We first define a function which compares two sequences element-wise, as described in the theory." ] }, { @@ -260,6 +288,98 @@ "execution_count": 7, "metadata": {}, "outputs": [], + "source": [ + "def identity_score(sequence1, sequence2):\n", + " \"\"\"\n", + " sequence1 :\n", + " sequence2 :\n", + " \"\"\"\n", + " # True if the characters are the same, False otherwise\n", + " return np.compare_chararrays(sequence1,\n", + " sequence2,\n", + " cmp=\"==\",\n", + " rstrip=True)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Substitution score\n", + "We now define a function which accounts for amino acid similarity: it uses the `biotite` library to retrieve the substitution matrix." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "3:80: E501 line too long (90 > 79 characters)\n", + "6:1: W293 blank line contains whitespace\n", + "11:25: W291 trailing whitespace\n" + ] + } + ], + "source": [ + "def substitution_score(sequence1,\n", + " sequence2,\n", + " substitution_matrix=align.SubstitutionMatrix.std_protein_matrix()):\n", + " \"\"\"\n", + " Scores two sequences position-wise using a substitution matrix.\n", + " \n", + " Parameters\n", + " ----------\n", + " sequence1 :\n", + " sequence2 :\n", + " substitution_matrix: \n", + " Default align.SubstitutionMatrix.std_protein_matrix() from biotite\n", + " Obtain BLOSUM62\n", + " Returns\n", + " -------\n", + " \"\"\"\n", + " # Retrieve np.array from substitution matrix\n", + " score_matrix = substitution_matrix.score_matrix()\n", + "\n", + " # Retrieve the letter alphabet\n", + " letter_alphabet = substitution_matrix.get_alphabet1()\n", + "\n", + " # Map letter to index\n", + " dict_letters = {}\n", + " for i, letter in enumerate(letter_alphabet.get_symbols()):\n", + " dict_letters[letter] = i\n", + "\n", + " match_score = np.zeros(len(sequence1))\n", + " for i, (character_seq1, character_seq2) in enumerate(zip(sequence1,\n", + " sequence2)):\n", + " ind1 = dict_letters[character_seq1]\n", + " ind2 = dict_letters[character_seq2]\n", + " match_score[i] = score_matrix[ind1, ind2]\n", + " return match_score" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Kinase comparison" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Given two kinases, we create a function which computes the sequence similarity between them using one of the two measures discussed in the theory, namely the identity or the substitution." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [], "source": [ "def sequence_similarity(kinase_name1, kinase_name2, type_=\"identity\"):\n", " \"\"\"\n", @@ -282,16 +402,18 @@ " print(\"Mismatch in sequence lengths.\")\n", " return None\n", " else:\n", + " seq_array1 = np.array(list(sequence_1))\n", + " seq_array2 = np.array(list(sequence_2))\n", " if type_ == \"identity\":\n", - " # True is the character is the same, False otherwise\n", - " is_match = np.compare_chararrays(np.array(list(sequence_1)),\n", - " np.array(list(sequence_2)),\n", - " cmp=\"==\",\n", - " rstrip=True)\n", + " is_match = identity_score(seq_array1, seq_array2)\n", " similarity_normed = sum(is_match)/len(sequence_1)\n", " return similarity_normed\n", + " elif type_ == \"substitution\":\n", + " match_score = substitution_score(seq_array1,\n", + " seq_array2)\n", + " return match_score\n", " else:\n", - " print(\"type not implemented yet.\")\n", + " print(\"Type not defined.\")\n", " return None" ] }, @@ -304,7 +426,34 @@ }, { "cell_type": "code", - "execution_count": 8, + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([ 1., 4., 2., 6., -1., 6., -2., 6., 6., -1., 4., 7., -1.,\n", + " -1., 4., 3., 5., 0., 4., 2., 0., 4., -1., 5., 0., -1.,\n", + " 3., 5., -1., 0., -1., 0., 7., 1., 4., -1., -1., 4., 4.,\n", + " 6., 4., -2., 1., 3., -1., -1., -1., 5., -1., -1., 6., -3.,\n", + " 4., -2., 1., 3., 3., 5., 0., 7., 4., -1., 0., 2., 2.,\n", + " 0., 4., 8., 5., 6., 4., 4., 4., 5., 6., -1., 2., 1.,\n", + " 3., 0., 6., 6., 6., 4., 4.])" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "a = sequence_similarity(\"EGFR\", \"MET\", \"substitution\")\n", + "a" + ] + }, + { + "cell_type": "code", + "execution_count": 11, "metadata": {}, "outputs": [ { @@ -313,13 +462,14 @@ "0.4588235294117647" ] }, - "execution_count": 8, + "execution_count": 11, "metadata": 
{}, "output_type": "execute_result" } ], "source": [ - "sequence_similarity(\"EGFR\", \"MET\")" + "a = sequence_similarity(\"EGFR\", \"MET\", \"identity\")\n", + "a" ] }, { @@ -331,7 +481,7 @@ }, { "cell_type": "code", - "execution_count": 9, + "execution_count": 12, "metadata": {}, "outputs": [ { @@ -340,7 +490,7 @@ "1.0" ] }, - "execution_count": 9, + "execution_count": 12, "metadata": {}, "output_type": "execute_result" } @@ -349,6 +499,30 @@ "sequence_similarity(\"EGFR\", \"EGFR\")" ] }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([5., 4., 4., 6., 4., 6., 4., 6., 6., 5., 4., 7., 5., 4., 4., 4., 5.,\n", + " 5., 4., 5., 4., 4., 6., 5., 4., 7., 4., 5., 4., 4., 4., 6., 7., 8.,\n", + " 4., 9., 5., 4., 4., 6., 4., 5., 4., 4., 5., 5., 4., 5., 7., 6., 6.,\n", + " 9., 4., 4., 6., 7., 4., 5., 5., 7., 4., 5., 6., 5., 5., 4., 4., 8.,\n", + " 5., 6., 4., 4., 4., 5., 6., 4., 4., 4., 4., 5., 6., 6., 6., 4., 4.])" + ] + }, + "execution_count": 13, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "sequence_similarity(\"EGFR\", \"EGFR\", type_=\"substitution\")" + ] + }, { "cell_type": "markdown", "metadata": {}, @@ -358,7 +532,7 @@ }, { "cell_type": "code", - "execution_count": 10, + "execution_count": 14, "metadata": {}, "outputs": [ { @@ -384,7 +558,7 @@ " 0.10588235, 0.14117647, 0.15294118, 1. 
]])" ] }, - "execution_count": 10, + "execution_count": 14, "metadata": {}, "output_type": "execute_result" } @@ -398,9 +572,70 @@ "table" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Substitution" + ] + }, { "cell_type": "code", - "execution_count": 13, + "execution_count": 15, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array(['K', 'V', 'L', 'G', 'S', 'G', 'A', 'F', 'G', 'T', 'V', 'Y', 'K',\n", + " 'V', 'A', 'I', 'K', 'E', 'L', 'E', 'I', 'L', 'D', 'E', 'A', 'Y',\n", + " 'V', 'M', 'A', 'S', 'V', 'D', 'P', 'H', 'V', 'C', 'R', 'L', 'L',\n", + " 'G', 'I', 'Q', 'L', 'I', 'T', 'Q', 'L', 'M', 'P', 'F', 'G', 'C',\n", + " 'L', 'L', 'D', 'Y', 'V', 'R', 'E', 'Y', 'L', 'E', 'D', 'R', 'R',\n", + " 'L', 'V', 'H', 'R', 'D', 'L', 'A', 'A', 'R', 'N', 'V', 'L', 'V',\n", + " 'I', 'T', 'D', 'F', 'G', 'L', 'A'], dtype=' Date: Tue, 24 Aug 2021 14:59:23 +0200 Subject: [PATCH 5/7] finalize similarity matrix. --- .../talktorial.ipynb | 891 +++++++++++------- 1 file changed, 525 insertions(+), 366 deletions(-) diff --git a/teachopencadd/talktorials/T024_kinase_similarity_sequence/talktorial.ipynb b/teachopencadd/talktorials/T024_kinase_similarity_sequence/talktorial.ipynb index fc3eab00..30ab25e0 100644 --- a/teachopencadd/talktorials/T024_kinase_similarity_sequence/talktorial.ipynb +++ b/teachopencadd/talktorials/T024_kinase_similarity_sequence/talktorial.ipynb @@ -139,6 +139,7 @@ "\n", "import pandas as pd\n", "import numpy as np\n", + "import seaborn as sns\n", "import requests\n", "import biotite.sequence.align as align" ] @@ -272,7 +273,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Given two kinases, we create a function which computes the sequence similarity between them using one of the two measures discussed in the theory, namely the identity or the substitution." + "Given two kinases, we create functions which account for identity or subsituition similarity, as described in the theory." 
] }, { @@ -280,7 +281,7 @@ "metadata": {}, "source": [ "#### Identity score\n", - "We first define a function which compares element wise elements as described in the theory." + "We first define a function which compares the characters of two sequences element-wise." ] }, { @@ -291,8 +292,21 @@ "source": [ "def identity_score(sequence1, sequence2):\n", " \"\"\"\n", - " sequence1 :\n", + " Computes the element-wise binary similarity between two sequences.\n", + "\n", + " Parameters\n", + " ----------\n", + " sequence1 : np.array\n", + " An array of characters describing the first sequence.\n", " sequence2 :\n", + " An array of characters describing the second sequence.\n", + "\n", + " Returns\n", + " -------\n", + " np.array :\n", + " The bool array for each character.\n", + " 1 if the elements are identical,\n", + " 0 otherwise.\n", " \"\"\"\n", " # True is the character is the same, False otherwise\n", " return np.compare_chararrays(sequence1,\n", @@ -313,31 +327,26 @@ "cell_type": "code", "execution_count": 8, "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "3:80: E501 line too long (90 > 79 characters)\n", - "6:1: W293 blank line contains whitespace\n", - "11:25: W291 trailing whitespace\n" - ] - } - ], + "outputs": [], "source": [ "def substitution_score(sequence1,\n", " sequence2,\n", - " substitution_matrix=align.SubstitutionMatrix.std_protein_matrix()):\n", + " substitution_matrix=align.\n", + " SubstitutionMatrix.std_protein_matrix()):\n", " \"\"\"\n", - " ADD\n", - " \n", + " Retrieve #TODO\n", + "\n", " Parameters\n", " ----------\n", - " sequence1 :\n", + " sequence1 : np.array\n", + " An array of characters describing the first sequence.\n", " sequence2 :\n", - " substitution_matrix: \n", - " Default align.SubstitutionMatrix.std_protein_matrix() from biotite\n", - " Obtain BLOSUM62\n", + " An array of characters describing the second sequence.\n", + " substitution_matrix:\n", + " A substitution matrix specific to
amino acids.\n", + " The default is align.SubstitutionMatrix.std_protein_matrix()\n", + " from biotite, which represents BLOSUM62.\n", + "\n", " Returns\n", " -------\n", " \"\"\"\n", @@ -358,6 +367,8 @@ " ind1 = dict_letters[character_seq1]\n", " ind2 = dict_letters[character_seq2]\n", " match_score[i] = score_matrix[ind1, ind2]\n", + " # TODO normalize?\n", + " # TODO check for * VS -\n", " return match_score" ] }, @@ -372,7 +383,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Given two kinases, we create a function which computes the sequence similarity between them using one of the two measures discussed in the theory, namely the identity or the substitution." + "Given two kinases, we create a function which computes the sequence similarity between them using one of the two measures, the identity or the substitution." ] }, { @@ -388,7 +399,9 @@ " kinase_name1, kinase_name2 : str\n", " The two names of the kinases for comparison.\n", "\n", - " type_ : str ? default = identity\n", + " type_ : str\n", + " The type of metric to compute the similarity.\n", + " The default is `identity`.\n", "\n", " Returns\n", " -------\n", @@ -411,7 +424,8 @@ " elif type_ == \"substitution\":\n", " match_score = substitution_score(seq_array1,\n", " seq_array2)\n", - " return match_score\n", + " similarity = sum(match_score)\n", + " return similarity\n", " else:\n", " print(\"Type not defined.\")\n", " return None" @@ -430,25 +444,19 @@ "metadata": {}, "outputs": [ { - "data": { - "text/plain": [ - "array([ 1., 4., 2., 6., -1., 6., -2., 6., 6., -1., 4., 7., -1.,\n", - " -1., 4., 3., 5., 0., 4., 2., 0., 4., -1., 5., 0., -1.,\n", - " 3., 5., -1., 0., -1., 0., 7., 1., 4., -1., -1., 4., 4.,\n", - " 6., 4., -2., 1., 3., -1., -1., -1., 5., -1., -1., 6., -3.,\n", - " 4., -2., 1., 3., 3., 5., 0., 7., 4., -1., 0., 2., 2.,\n", - " 0., 4., 8., 5., 6., 4., 4., 4., 5., 6., -1., 2., 1.,\n", - " 3., 0., 6., 6., 6., 4., 4.])" - ] - }, - "execution_count": 10, - "metadata": {}, - 
"output_type": "execute_result" + "name": "stdout", + "output_type": "stream", + "text": [ + "Pocket sequence similarity between EGFR and MET kinases: 205.0 using substitution.\n" + ] } ], "source": [ - "a = sequence_similarity(\"EGFR\", \"MET\", \"substitution\")\n", - "a" + "EGFR_MET_seq_similarity = sequence_similarity(\"EGFR\",\n", + " \"MET\",\n", + " \"substitution\")\n", + "print(f\"Pocket sequence similarity between EGFR and MET kinases: \"\n", + " f\"{EGFR_MET_seq_similarity} using substitution.\")" ] }, { @@ -457,19 +465,19 @@ "metadata": {}, "outputs": [ { - "data": { - "text/plain": [ - "0.4588235294117647" - ] - }, - "execution_count": 11, - "metadata": {}, - "output_type": "execute_result" + "name": "stdout", + "output_type": "stream", + "text": [ + "Pocket sequence similarity between EGFR and MET kinases: 0.46 using identity.\n" + ] } ], "source": [ - "a = sequence_similarity(\"EGFR\", \"MET\", \"identity\")\n", - "a" + "EGFR_MET_seq_similarity = sequence_similarity(\"EGFR\",\n", + " \"MET\",\n", + " \"identity\")\n", + "print(f\"Pocket sequence similarity between EGFR and MET kinases: \"\n", + " f\"{EGFR_MET_seq_similarity:.2f} using identity.\")" ] }, { @@ -485,18 +493,17 @@ "metadata": {}, "outputs": [ { - "data": { - "text/plain": [ - "1.0" - ] - }, - "execution_count": 12, - "metadata": {}, - "output_type": "execute_result" + "name": "stdout", + "output_type": "stream", + "text": [ + "Pocket sequence similarity between EGFR itself: 1.00 using identity.\n" + ] } ], "source": [ - "sequence_similarity(\"EGFR\", \"EGFR\")" + "EGFR_seq_similarity = sequence_similarity(\"EGFR\", \"EGFR\")\n", + "print(f\"Pocket sequence similarity between EGFR itself: \"\n", + " f\"{EGFR_seq_similarity:.2f} using identity.\")" ] }, { @@ -505,22 +512,18 @@ "metadata": {}, "outputs": [ { - "data": { - "text/plain": [ - "array([5., 4., 4., 6., 4., 6., 4., 6., 6., 5., 4., 7., 5., 4., 4., 4., 5.,\n", - " 5., 4., 5., 4., 4., 6., 5., 4., 7., 4., 5., 4., 4., 4., 6., 7., 
8.,\n", - " 4., 9., 5., 4., 4., 6., 4., 5., 4., 4., 5., 5., 4., 5., 7., 6., 6.,\n", - " 9., 4., 4., 6., 7., 4., 5., 5., 7., 4., 5., 6., 5., 5., 4., 4., 8.,\n", - " 5., 6., 4., 4., 4., 5., 6., 4., 4., 4., 4., 5., 6., 6., 6., 4., 4.])" - ] - }, - "execution_count": 13, - "metadata": {}, - "output_type": "execute_result" + "name": "stdout", + "output_type": "stream", + "text": [ + "Pocket sequence similarity between EGFR itself: 429.00 using substitution.\n" + ] } ], "source": [ - "sequence_similarity(\"EGFR\", \"EGFR\", type_=\"substitution\")" + "EGFR_seq_similarity = sequence_similarity(\"EGFR\", \"EGFR\", type_=\"substitution\")\n", + "print(f\"Pocket sequence similarity between EGFR itself: \"\n", + " f\"{EGFR_seq_similarity:.2f} using substitution.\")\n", + "# TODO: normalize" ] }, { @@ -534,49 +537,13 @@ "cell_type": "code", "execution_count": 14, "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "array([[1. , 0.89411765, 0.37647059, 0.31764706, 0.44705882,\n", - " 0.45882353, 0.38823529, 0.47058824, 0.11764706],\n", - " [0.89411765, 1. , 0.4 , 0.32941176, 0.42352941,\n", - " 0.47058824, 0.4 , 0.43529412, 0.11764706],\n", - " [0.37647059, 0.4 , 1. , 0.32941176, 0.38823529,\n", - " 0.37647059, 0.37647059, 0.4 , 0.15294118],\n", - " [0.31764706, 0.32941176, 0.32941176, 1. , 0.37647059,\n", - " 0.36470588, 0.47058824, 0.34117647, 0.10588235],\n", - " [0.44705882, 0.42352941, 0.38823529, 0.37647059, 1. ,\n", - " 0.4 , 0.38823529, 0.43529412, 0.14117647],\n", - " [0.45882353, 0.47058824, 0.37647059, 0.36470588, 0.4 ,\n", - " 1. , 0.36470588, 0.47058824, 0.10588235],\n", - " [0.38823529, 0.4 , 0.37647059, 0.47058824, 0.38823529,\n", - " 0.36470588, 1. , 0.38823529, 0.14117647],\n", - " [0.47058824, 0.43529412, 0.4 , 0.34117647, 0.43529412,\n", - " 0.47058824, 0.38823529, 1. , 0.15294118],\n", - " [0.11764706, 0.11764706, 0.15294118, 0.10588235, 0.14117647,\n", - " 0.10588235, 0.14117647, 0.15294118, 1. 
]])" - ] - }, - "execution_count": 14, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ - "table = np.zeros((len(query_kinases), len(query_kinases)))\n", + "kinase_similarity_matrix = np.zeros((len(query_kinases), len(query_kinases)))\n", "for i, kinase_name1 in enumerate(query_kinases):\n", " for j, kinase_name2 in enumerate(query_kinases):\n", - " table[i, j] = sequence_similarity(kinase_name1,\n", - " kinase_name2)\n", - "table" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Substitution" + " kinase_similarity_matrix[i, j] = sequence_similarity(kinase_name1,\n", + " kinase_name2)" ] }, { @@ -586,14 +553,171 @@ "outputs": [ { "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
       EGFR      ErbB2     BRAF      CDK2      LCK       MET       p38a      KDR       p110a
EGFR   1.000000  0.894118  0.376471  0.317647  0.447059  0.458824  0.388235  0.470588  0.117647
ErbB2  0.894118  1.000000  0.400000  0.329412  0.423529  0.470588  0.400000  0.435294  0.117647
BRAF   0.376471  0.400000  1.000000  0.329412  0.388235  0.376471  0.376471  0.400000  0.152941
CDK2   0.317647  0.329412  0.329412  1.000000  0.376471  0.364706  0.470588  0.341176  0.105882
LCK    0.447059  0.423529  0.388235  0.376471  1.000000  0.400000  0.388235  0.435294  0.141176
MET    0.458824  0.470588  0.376471  0.364706  0.400000  1.000000  0.364706  0.470588  0.105882
p38a   0.388235  0.400000  0.376471  0.470588  0.388235  0.364706  1.000000  0.388235  0.141176
KDR    0.470588  0.435294  0.400000  0.341176  0.435294  0.470588  0.388235  1.000000  0.152941
p110a  0.117647  0.117647  0.152941  0.105882  0.141176  0.105882  0.141176  0.152941  1.000000
\n", + "
" + ], "text/plain": [ - "array(['K', 'V', 'L', 'G', 'S', 'G', 'A', 'F', 'G', 'T', 'V', 'Y', 'K',\n", - " 'V', 'A', 'I', 'K', 'E', 'L', 'E', 'I', 'L', 'D', 'E', 'A', 'Y',\n", - " 'V', 'M', 'A', 'S', 'V', 'D', 'P', 'H', 'V', 'C', 'R', 'L', 'L',\n", - " 'G', 'I', 'Q', 'L', 'I', 'T', 'Q', 'L', 'M', 'P', 'F', 'G', 'C',\n", - " 'L', 'L', 'D', 'Y', 'V', 'R', 'E', 'Y', 'L', 'E', 'D', 'R', 'R',\n", - " 'L', 'V', 'H', 'R', 'D', 'L', 'A', 'A', 'R', 'N', 'V', 'L', 'V',\n", - " 'I', 'T', 'D', 'F', 'G', 'L', 'A'], dtype='\n", + "#T_9913a_row0_col0, #T_9913a_row1_col1, #T_9913a_row2_col2, #T_9913a_row3_col3, #T_9913a_row4_col4, #T_9913a_row5_col5, #T_9913a_row6_col6, #T_9913a_row7_col7, #T_9913a_row8_col8 {\n", + " background-color: #008000;\n", + " color: #f1f1f1;\n", + "}\n", + "#T_9913a_row0_col1, #T_9913a_row1_col0 {\n", + " background-color: #1c8e1c;\n", + " color: #f1f1f1;\n", + "}\n", + "#T_9913a_row0_col2, #T_9913a_row5_col2, #T_9913a_row6_col2, #T_9913a_row7_col3 {\n", + " background-color: #add5ad;\n", + " color: #000000;\n", + "}\n", + "#T_9913a_row0_col3 {\n", + " background-color: #b3d8b3;\n", + " color: #000000;\n", + "}\n", + "#T_9913a_row0_col4 {\n", + " background-color: #97ca97;\n", + " color: #000000;\n", + "}\n", + "#T_9913a_row0_col5 {\n", + " background-color: #8ec58e;\n", + " color: #000000;\n", + "}\n", + "#T_9913a_row0_col6, #T_9913a_row2_col4, #T_9913a_row4_col6, #T_9913a_row6_col4, #T_9913a_row7_col6 {\n", + " background-color: #a7d2a7;\n", + " color: #000000;\n", + "}\n", + "#T_9913a_row0_col7, #T_9913a_row5_col7 {\n", + " background-color: #92c892;\n", + " color: #000000;\n", + "}\n", + "#T_9913a_row0_col8, #T_9913a_row1_col8 {\n", + " background-color: #e8f2e8;\n", + " color: #000000;\n", + "}\n", + "#T_9913a_row1_col2, #T_9913a_row2_col7, #T_9913a_row3_col5, #T_9913a_row5_col3, #T_9913a_row6_col5, #T_9913a_row7_col2 {\n", + " background-color: #a6d2a6;\n", + " color: #000000;\n", + "}\n", + "#T_9913a_row1_col3, #T_9913a_row2_col3 {\n", + " 
background-color: #b1d7b1;\n", + " color: #000000;\n", + "}\n", + "#T_9913a_row1_col4, #T_9913a_row4_col5 {\n", + " background-color: #9dcd9d;\n", + " color: #000000;\n", + "}\n", + "#T_9913a_row1_col5, #T_9913a_row6_col3, #T_9913a_row7_col5 {\n", + " background-color: #8bc48b;\n", + " color: #000000;\n", + "}\n", + "#T_9913a_row1_col6, #T_9913a_row2_col5, #T_9913a_row4_col3, #T_9913a_row5_col4 {\n", + " background-color: #a4d0a4;\n", + " color: #000000;\n", + "}\n", + "#T_9913a_row1_col7, #T_9913a_row4_col7 {\n", + " background-color: #9ccd9c;\n", + " color: #000000;\n", + "}\n", + "#T_9913a_row2_col0 {\n", + " background-color: #a6d1a6;\n", + " color: #000000;\n", + "}\n", + "#T_9913a_row2_col1, #T_9913a_row6_col1 {\n", + " background-color: #a0cea0;\n", + " color: #000000;\n", + "}\n", + "#T_9913a_row2_col6, #T_9913a_row3_col4 {\n", + " background-color: #aad3aa;\n", + " color: #000000;\n", + "}\n", + "#T_9913a_row2_col8, #T_9913a_row7_col8 {\n", + " background-color: #dfeddf;\n", + " color: #000000;\n", + "}\n", + "#T_9913a_row3_col0 {\n", + " background-color: #b5d9b5;\n", + " color: #000000;\n", + "}\n", + "#T_9913a_row3_col1 {\n", + " background-color: #b2d7b2;\n", + " color: #000000;\n", + "}\n", + "#T_9913a_row3_col2 {\n", + " background-color: #badbba;\n", + " color: #000000;\n", + "}\n", + "#T_9913a_row3_col6, #T_9913a_row5_col0 {\n", + " background-color: #90c790;\n", + " color: #000000;\n", + "}\n", + "#T_9913a_row3_col7 {\n", + " background-color: #b7dab7;\n", + " color: #000000;\n", + "}\n", + "#T_9913a_row3_col8, #T_9913a_row5_col8, #T_9913a_row8_col0, #T_9913a_row8_col1, #T_9913a_row8_col2, #T_9913a_row8_col3, #T_9913a_row8_col4, #T_9913a_row8_col5, #T_9913a_row8_col6, #T_9913a_row8_col7 {\n", + " background-color: #ebf3eb;\n", + " color: #000000;\n", + "}\n", + "#T_9913a_row4_col0 {\n", + " background-color: #93c893;\n", + " color: #000000;\n", + "}\n", + "#T_9913a_row4_col1 {\n", + " background-color: #9acb9a;\n", + " color: #000000;\n", + "}\n", 
+ "#T_9913a_row4_col2, #T_9913a_row6_col7 {\n", + " background-color: #a9d3a9;\n", + " color: #000000;\n", + "}\n", + "#T_9913a_row4_col8, #T_9913a_row6_col8 {\n", + " background-color: #e1eee1;\n", + " color: #000000;\n", + "}\n", + "#T_9913a_row5_col1, #T_9913a_row7_col0 {\n", + " background-color: #8dc58d;\n", + " color: #000000;\n", + "}\n", + "#T_9913a_row5_col6 {\n", + " background-color: #aed5ae;\n", + " color: #000000;\n", + "}\n", + "#T_9913a_row6_col0 {\n", + " background-color: #a3d0a3;\n", + " color: #000000;\n", + "}\n", + "#T_9913a_row7_col1 {\n", + " background-color: #96c996;\n", + " color: #000000;\n", + "}\n", + "#T_9913a_row7_col4 {\n", + " background-color: #9bcc9b;\n", + " color: #000000;\n", + "}\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
       EGFR   ErbB2  BRAF   CDK2   LCK    MET    p38a   KDR    p110a
EGFR   1.000  0.894  0.376  0.318  0.447  0.459  0.388  0.471  0.118
ErbB2  0.894  1.000  0.400  0.329  0.424  0.471  0.400  0.435  0.118
BRAF   0.376  0.400  1.000  0.329  0.388  0.376  0.376  0.400  0.153
CDK2   0.318  0.329  0.329  1.000  0.376  0.365  0.471  0.341  0.106
LCK    0.447  0.424  0.388  0.376  1.000  0.400  0.388  0.435  0.141
MET    0.459  0.471  0.376  0.365  0.400  1.000  0.365  0.471  0.106
p38a   0.388  0.400  0.376  0.471  0.388  0.365  1.000  0.388  0.141
KDR    0.471  0.435  0.400  0.341  0.435  0.471  0.388  1.000  0.153
p110a  0.118  0.118  0.153  0.106  0.141  0.106  0.141  0.153  1.000
\n" + ], "text/plain": [ - "array(['E', 'V', 'I', 'G', 'R', 'G', 'H', 'F', 'G', 'C', 'V', 'Y', 'H',\n", - " 'C', 'A', 'V', 'K', 'S', 'L', 'Q', 'F', 'L', 'T', 'E', 'G', 'I',\n", - " 'I', 'M', 'K', 'D', 'F', 'S', 'P', 'N', 'V', 'L', 'S', 'L', 'L',\n", - " 'G', 'I', 'L', 'V', 'V', 'L', 'P', 'Y', 'M', 'K', 'H', 'G', 'D',\n", - " 'L', 'R', 'N', 'F', 'I', 'R', 'N', 'Y', 'L', 'A', 'S', 'K', 'K',\n", - " 'F', 'V', 'H', 'R', 'D', 'L', 'A', 'A', 'R', 'N', 'C', 'M', 'L',\n", - " 'V', 'A', 'D', 'F', 'G', 'L', 'A'], dtype='" ] }, "execution_count": 16, @@ -629,258 +1015,38 @@ } ], "source": [ - "b = np.array(list(kinase_sequences[\"MET\"]))\n", - "b" - ] - }, - { - "cell_type": "code", - "execution_count": 17, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - " A C D E F G H I K L M N P Q R S T V W Y B Z X *\n", - "A 4 0 -2 -1 -2 0 -2 -1 -1 -1 -1 -2 -1 -1 -1 1 0 0 -3 -2 -2 -1 0 -4\n", - "C 0 9 -3 -4 -2 -3 -3 -1 -3 -1 -1 -3 -3 -3 -3 -1 -1 -1 -2 -2 -3 -3 -2 -4\n", - "D -2 -3 6 2 -3 -1 -1 -3 -1 -4 -3 1 -1 0 -2 0 -1 -3 -4 -3 4 1 -1 -4\n", - "E -1 -4 2 5 -3 -2 0 -3 1 -3 -2 0 -1 2 0 0 -1 -2 -3 -2 1 4 -1 -4\n", - "F -2 -2 -3 -3 6 -3 -1 0 -3 0 0 -3 -4 -3 -3 -2 -2 -1 1 3 -3 -3 -1 -4\n", - "G 0 -3 -1 -2 -3 6 -2 -4 -2 -4 -3 0 -2 -2 -2 0 -2 -3 -2 -3 -1 -2 -1 -4\n", - "H -2 -3 -1 0 -1 -2 8 -3 -1 -3 -2 1 -2 0 0 -1 -2 -3 -2 2 0 0 -1 -4\n", - "I -1 -1 -3 -3 0 -4 -3 4 -3 2 1 -3 -3 -3 -3 -2 -1 3 -3 -1 -3 -3 -1 -4\n", - "K -1 -3 -1 1 -3 -2 -1 -3 5 -2 -1 0 -1 1 2 0 -1 -2 -3 -2 0 1 -1 -4\n", - "L -1 -1 -4 -3 0 -4 -3 2 -2 4 2 -3 -3 -2 -2 -2 -1 1 -2 -1 -4 -3 -1 -4\n", - "M -1 -1 -3 -2 0 -3 -2 1 -1 2 5 -2 -2 0 -1 -1 -1 1 -1 -1 -3 -1 -1 -4\n", - "N -2 -3 1 0 -3 0 1 -3 0 -3 -2 6 -2 0 0 1 0 -3 -4 -2 3 0 -1 -4\n", - "P -1 -3 -1 -1 -4 -2 -2 -3 -1 -3 -2 -2 7 -1 -2 -1 -1 -2 -4 -3 -2 -1 -2 -4\n", - "Q -1 -3 0 2 -3 -2 0 -3 1 -2 0 0 -1 5 1 0 -1 -2 -2 -1 0 3 -1 -4\n", - "R -1 -3 -2 0 -3 -2 0 -3 2 -2 -1 0 -2 1 5 -1 -1 -3 -3 -2 -1 0 -1 -4\n", - "S 1 -1 0 0 -2 0 -1 -2 
0 -2 -1 1 -1 0 -1 4 1 -2 -3 -2 0 0 0 -4\n", - "T 0 -1 -1 -1 -2 -2 -2 -1 -1 -1 -1 0 -1 -1 -1 1 5 0 -2 -2 -1 -1 0 -4\n", - "V 0 -1 -3 -2 -1 -3 -3 3 -2 1 1 -3 -2 -2 -3 -2 0 4 -3 -1 -3 -2 -1 -4\n", - "W -3 -2 -4 -3 1 -2 -2 -3 -3 -2 -1 -4 -4 -2 -3 -3 -2 -3 11 2 -4 -3 -2 -4\n", - "Y -2 -2 -3 -2 3 -3 2 -1 -2 -1 -1 -2 -3 -1 -2 -2 -2 -1 2 7 -3 -2 -1 -4\n", - "B -2 -3 4 1 -3 -1 0 -3 0 -4 -3 3 -2 0 -1 0 -1 -3 -4 -3 4 1 -1 -4\n", - "Z -1 -3 1 4 -3 -2 0 -3 1 -3 -1 0 -1 3 0 0 -1 -2 -3 -2 1 4 -1 -4\n", - "X 0 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2 -1 -1 0 0 -1 -2 -1 -1 -1 -1 -4\n", - "* -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 1\n" - ] - } - ], - "source": [ - "# Obtain BLOSUM62\n", - "substitution_matrix = align.SubstitutionMatrix.std_protein_matrix()\n", - "print(substitution_matrix)" - ] - }, - { - "cell_type": "code", - "execution_count": 18, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "(numpy.ndarray, (24, 24))" - ] - }, - "execution_count": 18, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "score_matrix = substitution_matrix.score_matrix()\n", - "type(score_matrix), score_matrix.shape" - ] - }, - { - "cell_type": "code", - "execution_count": 19, - "metadata": {}, - "outputs": [], - "source": [ - "letter_alphabet = substitution_matrix.get_alphabet1()" - ] - }, - { - "cell_type": "code", - "execution_count": 20, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "{'A': 0,\n", - " 'C': 1,\n", - " 'D': 2,\n", - " 'E': 3,\n", - " 'F': 4,\n", - " 'G': 5,\n", - " 'H': 6,\n", - " 'I': 7,\n", - " 'K': 8,\n", - " 'L': 9,\n", - " 'M': 10,\n", - " 'N': 11,\n", - " 'P': 12,\n", - " 'Q': 13,\n", - " 'R': 14,\n", - " 'S': 15,\n", - " 'T': 16,\n", - " 'V': 17,\n", - " 'W': 18,\n", - " 'Y': 19,\n", - " 'B': 20,\n", - " 'Z': 21,\n", - " 'X': 22,\n", - " '*': 23}" - ] - }, - "execution_count": 20, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "dict_letters = {}\n", 
- "for i, letter in enumerate(letter_alphabet.get_symbols()):\n", - " dict_letters[letter] = i\n", - "dict_letters" - ] - }, - { - "cell_type": "code", - "execution_count": 21, - "metadata": {}, - "outputs": [], - "source": [ - "match_score = np.zeros(len(a))\n", - "for i, (character_seq1, character_seq2) in enumerate(zip(a, b)):\n", - " # print(character_seq1, character_seq2)#, type(character_seq1))\n", - " ind1 = dict_letters[character_seq1]\n", - " ind2 = dict_letters[character_seq2]\n", - " match_score[i] = score_matrix[ind1, ind2]" - ] - }, - { - "cell_type": "code", - "execution_count": 22, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "array([ 1., 4., 2., 6., -1., 6., -2., 6., 6., -1., 4., 7., -1.,\n", - " -1., 4., 3., 5., 0., 4., 2., 0., 4., -1., 5., 0., -1.,\n", - " 3., 5., -1., 0., -1., 0., 7., 1., 4., -1., -1., 4., 4.,\n", - " 6., 4., -2., 1., 3., -1., -1., -1., 5., -1., -1., 6., -3.,\n", - " 4., -2., 1., 3., 3., 5., 0., 7., 4., -1., 0., 2., 2.,\n", - " 0., 4., 8., 5., 6., 4., 4., 4., 5., 6., -1., 2., 1.,\n", - " 3., 0., 6., 6., 6., 4., 4.])" - ] - }, - "execution_count": 22, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "match_score" - ] - }, - { - "cell_type": "code", - "execution_count": 23, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "(24, 24)" - ] - }, - "execution_count": 23, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "substitution_matrix.shape()" - ] - }, - { - "cell_type": "code", - "execution_count": 24, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "True" - ] - }, - "execution_count": 24, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "substitution_matrix.is_symmetric()" + "# Show matrix with background gradient\n", + "cm = sns.light_palette(\"green\", as_cmap=True)\n", + "kinase_similarity_matrix.style.\\\n", + " background_gradient(cmap=cm).\\\n", + " format(\"{:.3f}\")" ] }, { 
"cell_type": "markdown", "metadata": {}, "source": [ - "### Show kinase coverage" + "### Save kinase distance matrix" ] }, { "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [] - }, - { - "cell_type": "code", - "execution_count": null, + "execution_count": 17, "metadata": {}, "outputs": [], - "source": [] - }, - { - "cell_type": "markdown", - "metadata": {}, "source": [ - "### Visualize similarity as phylogenetic tree" + "kinase_similarity_matrix.to_csv(DATA / \"kinase_similarity_matrix_sequence.csv\")" ] }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [] - }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Discussion\n", "\n", - "Wrap up the talktorial's content here and discuss pros/cons and open questions/challenges." + "Wrap up the talktorial's content here and discuss pros/cons and open questions/challenges.\n", + "\n", + "The kinase similarity matrix above will be reloaded in **Talktorial T028**, where we compare kinase similarities from different perspectives, including the pocket sequence perspective we have talked about in this talktorial." ] }, { @@ -895,13 +1061,6 @@ "2. Question\n", "3. Question" ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [] } ], "metadata": { From 601309f328aa995263716ac508ae1256d72bfc40 Mon Sep 17 00:00:00 2001 From: t-kimber Date: Tue, 24 Aug 2021 14:59:39 +0200 Subject: [PATCH 6/7] add csv for similarity matrix. 
--- .../data/kinase_similarity_matrix_sequence.csv | 10 ++++++++++ 1 file changed, 10 insertions(+) create mode 100644 teachopencadd/talktorials/T024_kinase_similarity_sequence/data/kinase_similarity_matrix_sequence.csv diff --git a/teachopencadd/talktorials/T024_kinase_similarity_sequence/data/kinase_similarity_matrix_sequence.csv b/teachopencadd/talktorials/T024_kinase_similarity_sequence/data/kinase_similarity_matrix_sequence.csv new file mode 100644 index 00000000..ce6a35bf --- /dev/null +++ b/teachopencadd/talktorials/T024_kinase_similarity_sequence/data/kinase_similarity_matrix_sequence.csv @@ -0,0 +1,10 @@ +,EGFR,ErbB2,BRAF,CDK2,LCK,MET,p38a,KDR,p110a +EGFR,1.0,0.8941176470588236,0.3764705882352941,0.3176470588235294,0.4470588235294118,0.4588235294117647,0.38823529411764707,0.47058823529411764,0.11764705882352941 +ErbB2,0.8941176470588236,1.0,0.4,0.32941176470588235,0.4235294117647059,0.47058823529411764,0.4,0.43529411764705883,0.11764705882352941 +BRAF,0.3764705882352941,0.4,1.0,0.32941176470588235,0.38823529411764707,0.3764705882352941,0.3764705882352941,0.4,0.15294117647058825 +CDK2,0.3176470588235294,0.32941176470588235,0.32941176470588235,1.0,0.3764705882352941,0.36470588235294116,0.47058823529411764,0.3411764705882353,0.10588235294117647 +LCK,0.4470588235294118,0.4235294117647059,0.38823529411764707,0.3764705882352941,1.0,0.4,0.38823529411764707,0.43529411764705883,0.1411764705882353 +MET,0.4588235294117647,0.47058823529411764,0.3764705882352941,0.36470588235294116,0.4,1.0,0.36470588235294116,0.47058823529411764,0.10588235294117647 +p38a,0.38823529411764707,0.4,0.3764705882352941,0.47058823529411764,0.38823529411764707,0.36470588235294116,1.0,0.38823529411764707,0.1411764705882353 +KDR,0.47058823529411764,0.43529411764705883,0.4,0.3411764705882353,0.43529411764705883,0.47058823529411764,0.38823529411764707,1.0,0.15294117647058825 
+p110a,0.11764705882352941,0.11764705882352941,0.15294117647058825,0.10588235294117647,0.1411764705882353,0.10588235294117647,0.1411764705882353,0.15294117647058825,1.0 From 53619b3fd45147c9f6c50767b9366d463686439d Mon Sep 17 00:00:00 2001 From: t-kimber Date: Tue, 24 Aug 2021 18:18:58 +0200 Subject: [PATCH 7/7] Add discussion and quiz --- .../talktorial.ipynb | 345 ++++++++++-------- 1 file changed, 189 insertions(+), 156 deletions(-) diff --git a/teachopencadd/talktorials/T024_kinase_similarity_sequence/talktorial.ipynb b/teachopencadd/talktorials/T024_kinase_similarity_sequence/talktorial.ipynb index 30ab25e0..fdf4ed02 100644 --- a/teachopencadd/talktorials/T024_kinase_similarity_sequence/talktorial.ipynb +++ b/teachopencadd/talktorials/T024_kinase_similarity_sequence/talktorial.ipynb @@ -19,7 +19,7 @@ "source": [ "## Aim of this talktorial\n", "\n", - "Add a short summary of this talktorial's content." + "In this talktorial, we investigate sequence similarity for kinases of interest. The KLIFS API is used to retrieve the 85-residue pocket sequence for each kinase. Two similarity measures are implemented: (1) the identity similarity, which is based on a character-wise comparison, and (2) the substitution similarity, which is amino-acid specific."
] }, { @@ -29,7 +29,9 @@ "### Contents in *Theory*\n", "\n", "* Kinase dataset\n", - "* Kinase similarity descriptor: XXX" + "* Kinase similarity descriptor: sequence\n", + " * Identity score\n", + " * Substitution score" ] }, { @@ -38,11 +40,13 @@ "source": [ "### Contents in *Practical*\n", "\n", - "* Retrieve and preprocess data\n", - "* Show kinase coverage\n", - "* Compare kinases\n", + "* Retrieve sequences from KLIFS\n", + "* Sequence similarity\n", + " * Identity score\n", + " * Substitution score\n", + "* Kinase comparison\n", "* Visualize similarity as kinase matrix\n", - "* Visualize similarity as phylogenetic tree" + "* Save kinase distance matrix" ] }, { @@ -52,7 +56,10 @@ "### References\n", "\n", "* Kinase dataset: [Molecules (2021), 26(3), 629](https://www.mdpi.com/1420-3049/26/3/629) \n", - "* Kinase similarity descriptor: XXX" + "* KLIFS\n", + " * KLIFS URL: https://klifs.net/\n", + " * KLIFS database: [Nucleic Acids Res. (2020), 49(D1), D562-D569](https://doi.org/10.1093/nar/gkaa895)\n", + "* Substitution matrix: [PNAS (1992), 89(22), 10915-10919](https://doi.org/10.1073/pnas.89.22.10915)" ] }, { @@ -82,8 +89,8 @@ "\n", " \n", "\n", - "*Table 1:* \n", - "Kinases used in this notebook, taken from [Molecules (2021), 26(3), 629](https://www.mdpi.com/1420-3049/26/3/629), with their synonyms, UniProt IDs, and kinase groups.\n", + "*Table 1:*\n", + "Kinases used in this notebook, taken from [Molecules (2021), 26(3), 629](https://www.mdpi.com/1420-3049/26/3/629), with their synonyms, UniProt IDs, kinase groups, and full unabbreviated names.\n", "\n", " \n", "\n", @@ -106,16 +113,33 @@ "source": [ "### Kinase similarity descriptor: sequence\n", "\n", - "Describe the dataset describing kinase similarity and how we use it.\n", + "In this talktorial, the KLIFS pocket sequence is used for two main reasons:\n", + "1. The sequence is of fixed length (it contains 85 residues), which makes computing the pairwise similarity between two sequences easy.\n", + "2.
The binding pocket is where the action takes place. Why consider the full kinase sequence when an 85-residue sequence contains most of the relevant information?\n", "\n", - "- XXX = KLIFS pocket sequence" + "We now describe two ways to compare pocket sequences." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## Practical" + "#### Identity score\n", + "A simple way of assessing the similarity between two sequences is to use the so-called identity score.\n", + "First, a match vector is created: for each position, it records whether the characters from the two sequences are identical. If they are, the entry is set to $1$, and to $0$ otherwise.\n", + "\n", + "The identity score is computed by summing the elements of the match vector and normalizing the sum by the length of the sequence, which, in the case of the KLIFS pocket sequence, is $85$." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Substitution score\n", + "Although the identity score is an easy measure of similarity, it does not take into account the rate at which an amino acid may change into another and treats all residues uniformly.\n", + "\n", + "The substitution score takes the changes of the amino acids over evolutionary time into account. It makes use of a substitution matrix, where each entry gives a score for substituting one amino acid with another.\n", + "In this talktorial, we use the BLOSUM substitution matrix [PNAS (1992), 89(22), 10915-10919](https://doi.org/10.1073/pnas.89.22.10915), implemented in biotite." ] }, { @@ -124,9 +148,14 @@ "metadata": {}, "outputs": [], "source": [ - "# !pip install flake8 pycodestyle_magic\n", - "%load_ext pycodestyle_magic\n", - "%pycodestyle_on" + "# TODO: add aggregation and normalization" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Practical" ] }, { @@ -189,7 +218,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "We use KLIFS' API to retrieve the 85-long pocket sequence for each kinase." 
+ "We use KLIFS' API to retrieve the $85$-residue pocket sequence for each kinase." ] }, { @@ -228,7 +257,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Let's see how these pocket sequence look like." + "Let's look at these pocket sequences." ] }, { @@ -273,7 +302,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Given two kinases, we create functions which account for identity or subsituition similarity, as described in the theory." + "Given two kinases, we create functions which account for identity or substitution similarity, as described in the theory." ] }, { @@ -320,7 +349,7 @@ "metadata": {}, "source": [ "#### Substitution score\n", - "We now define the function which is more specific to amino acids grouping and use the `biotite` library and retrieve the substitution matrix." + "We now define a function that accounts for the similarity between amino acids, using the `biotite` library to retrieve the BLOSUM substitution matrix." ] }, { @@ -334,7 +363,7 @@ " substitution_matrix=align.\n", " SubstitutionMatrix.std_protein_matrix()):\n", " \"\"\"\n", - " Retrieve #TODO\n", + " Retrieve the match scores given the substitution matrix.\n", "\n", " Parameters\n", " ----------\n", @@ -349,6 +378,8 @@ "\n", " Returns\n", " -------\n", + " np.array :\n", + " The vector of match scores given the substitution matrix.\n", " \"\"\"\n", " # Retrieve np.array from substitution matrix\n", " score_matrix = substitution_matrix.score_matrix()\n", @@ -367,8 +398,6 @@ " ind1 = dict_letters[character_seq1]\n", " ind2 = dict_letters[character_seq2]\n", " match_score[i] = score_matrix[ind1, ind2]\n", - " # TODO normalize?\n", - " # TODO check for * VS -\n", " return match_score" ] }, @@ -376,7 +405,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "#### Kinase comparison" + "### Kinase comparison" ] }, { @@ -411,6 +440,12 @@ " sequence_1 = klifs_pocket_sequence(kinase_name1)\n", " sequence_2 = klifs_pocket_sequence(kinase_name2)\n", "\n", + " # Replace 
possible unavailable residue\n", + " # noted in KLIFS with \"-\"\n", + " # by the symbol \"*\" for biotite\n", + " sequence_1 = sequence_1.replace(\"-\", \"*\")\n", + " sequence_2 = sequence_2.replace(\"-\", \"*\")\n", + "\n", " if len(sequence_1) != len(sequence_2):\n", " print(\"Mismatch in sequence lengths.\")\n", " return None\n", @@ -484,7 +519,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "As expected, the similarity between a kinase and itself leads the highest possible score:" + "As expected, the similarity between a kinase and itself leads to the highest possible score:" ] }, { @@ -741,144 +776,144 @@ "data": { "text/html": [ "\n", - "\n", + "
Kinase similarity matrix (styled HTML table output; the similarity values are identical in the removed and added lines):

| | EGFR | ErbB2 | BRAF | CDK2 | LCK | MET | p38a | KDR | p110a |
|---|---|---|---|---|---|---|---|---|---|
| EGFR | 1.000 | 0.894 | 0.376 | 0.318 | 0.447 | 0.459 | 0.388 | 0.471 | 0.118 |
| ErbB2 | 0.894 | 1.000 | 0.400 | 0.329 | 0.424 | 0.471 | 0.400 | 0.435 | 0.118 |
| BRAF | 0.376 | 0.400 | 1.000 | 0.329 | 0.388 | 0.376 | 0.376 | 0.400 | 0.153 |
| CDK2 | 0.318 | 0.329 | 0.329 | 1.000 | 0.376 | 0.365 | 0.471 | 0.341 | 0.106 |
| LCK | 0.447 | 0.424 | 0.388 | 0.376 | 1.000 | 0.400 | 0.388 | 0.435 | 0.141 |
| MET | 0.459 | 0.471 | 0.376 | 0.365 | 0.400 | 1.000 | 0.365 | 0.471 | 0.106 |
| p38a | 0.388 | 0.400 | 0.376 | 0.471 | 0.388 | 0.365 | 1.000 | 0.388 | 0.141 |
| KDR | 0.471 | 0.435 | 0.400 | 0.341 | 0.435 | 0.471 | 0.388 | 1.000 | 0.153 |
| p110a | 0.118 | 0.118 | 0.153 | 0.106 | 0.141 | 0.106 | 0.141 | 0.153 | 1.000 |
\n" ], "text/plain": [ - "" + "" ] }, "execution_count": 16, @@ -1044,7 +1079,7 @@ "source": [ "## Discussion\n", "\n", - "Wrap up the talktorial's content here and discuss pros/cons and open questions/challenges.\n", + "In this talktorial, we investigate how sequences can be used to measure similarity between kinases. The focus is on the pocket sequence, which is retrieved from KLIFS. Sequence similarity can be assessed using two scores: (1) the identity score, which treats all amino acids uniformly, and (2) the substitution score, which takes into account the rate of change of residues over evolutionary time.\n", "\n", "The kinase similarity matrix above will be reloaded in **Talktorial T028**, where we compare kinase similarities from different perspectives, including the pocket sequence perspective we have talked about in this talktorial." ] @@ -1055,11 +1090,9 @@ "source": [ "## Quiz\n", "\n", - "Ask three questions that the user should be able to answer after doing this talktorial. Choose important take-aways from this talktorial for your questions.\n", - "\n", - "1. Question\n", - "2. Question\n", - "3. Question" + "1. Should the full kinase sequence be used instead of the pocket sequence?\n", + "2. How does the similarity using identity behave with respect to mutations?\n", + "3. How does similarity using identity compare to similarity using substitution?" ] } ],
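The identity and substitution scores described in the theory part can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the notebook's implementation: the names `identity_score`, `substitution_score`, `pair_score` and `TOY_SCORES`, the short example sequences, and the score values are all made up for the example (the notebook itself retrieves a BLOSUM matrix through `biotite`). Aggregating the substitution scores by a plain sum is likewise only one possible choice.

```python
# Illustrative pairwise scores for a three-letter alphabet.
# Assumption: toy values for demonstration, not read from biotite.
TOY_SCORES = {
    ("A", "A"): 4, ("V", "V"): 4, ("K", "K"): 5,
    ("A", "V"): 0, ("A", "K"): -1, ("K", "V"): -2,
}


def pair_score(a, b, scores=TOY_SCORES):
    """Look up the score of a residue pair, ignoring the order of the pair."""
    key = (a, b) if (a, b) in scores else (b, a)
    return scores[key]


def identity_score(seq1, seq2):
    """Fraction of positions at which both sequences carry the same character."""
    if len(seq1) != len(seq2):
        raise ValueError("Sequences must have the same length.")
    # Match vector: 1 where the characters are identical, 0 otherwise.
    match_vector = [int(a == b) for a, b in zip(seq1, seq2)]
    return sum(match_vector) / len(seq1)


def substitution_score(seq1, seq2):
    """Sum of the pairwise scores over all positions (no normalization)."""
    if len(seq1) != len(seq2):
        raise ValueError("Sequences must have the same length.")
    return sum(pair_score(a, b) for a, b in zip(seq1, seq2))


# Hypothetical 4-residue sequences; a KLIFS pocket has 85 residues.
print(identity_score("AVKA", "AVKV"))
print(substitution_score("AVKA", "AVKV"))
```

With these toy inputs, the two sequences differ at one of four positions, so the identity score penalizes the mismatch in the same way whatever the residues are, whereas the substitution score depends on which residues are exchanged.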