Using dbGaP2x, an R package to explore and sort phenotypic data from dbGaP

You can test this software:

Using binder, by clicking the "launch binder" badge above.
Using the dockerized version on your local device by running

docker run -p 80:8888 -v /var/run/docker.sock:/var/run/docker.sock -u root gversmee/dbgap2x

and then open your web browser at http://localhost, and use the password versmee

Using your local R by installing the package with

devtools::install_github("gversmee/dbGaP2x")

Introduction

Load the package

#devtools::install_github("gversmee/dbGaP2x", force = TRUE)
library(dbGaP2x)

List the functions provided by the package

lsf.str("package:dbGaP2x")

browse.dbgap : function (phs, jupyter = FALSE)  
browse.study : function (phs, jupyter = FALSE)  
consent.groups : function (phs)  
datatables.dict : function (phs)  
dbgap.data_dict : function (xml, dest)  
dbgap.decrypt : function (files, key = FALSE)  
dbgap.download : function (krt, key = FALSE)  
is.parent : function (phs)  
n.pop : function (phs, consentgroups = TRUE, gender = TRUE)  
n.tables : function (phs)  
n.variables : function (phs)  
parent.study : function (phs)  
phs.version : function (phs)  
search.dbgap : function (term, jupyter = FALSE)  
study.name : function (phs)  
sub.study : function (phs)  
variables.dict : function (phs)

1. Search for dbGaP studies

Let's explore the "Jackson Heart Study" cohort, which is available on dbGaP.

The dbGaP search engine can be tricky, which is why we created the function search.dbgap(). It opens your web browser on the list of studies related to your search term.

Note that if you run this function in a JupyterHub environment, it returns a URL instead, since JupyterHub doesn't have access to your local browser.

search.dbgap("Jackson", jupyter = TRUE)

'https://www.ncbi.nlm.nih.gov/gap/?term=Jackson%5BStudy+Name%5D'
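The URL above suggests how such a search link can be assembled. The following is only a sketch under that assumption, not the package's actual implementation: search_url() and open_or_return() are hypothetical helpers, and the URL pattern is inferred from the single example output shown here.

```r
# Hypothetical helpers, not part of dbGaP2x.
# Build a dbGaP search URL restricted to study names, mirroring the
# URL printed above (pattern inferred from that one example).
search_url <- function(term) {
  encoded <- utils::URLencode(term, reserved = TRUE)
  # "%5B" and "%5D" are the percent-encoded "[" and "]" of the field tag
  paste0("https://www.ncbi.nlm.nih.gov/gap/?term=", encoded, "%5BStudy+Name%5D")
}

# Return the URL when no local browser is reachable, otherwise open it.
open_or_return <- function(url, jupyter = FALSE) {
  if (jupyter || !interactive()) {
    return(url)
  }
  utils::browseURL(url)
  invisible(url)
}

url <- search_url("Jackson")
print(open_or_return(url, jupyter = TRUE))
```

Returning the URL rather than opening it keeps the helper usable in notebooks and other non-interactive sessions.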

dbGaP returns the list of the studies related to your term. As you can see, there are 6 studies associated with the "Jackson Heart Study" (JHS). One of these studies is the main one, also called the "parent study", whereas the others are substudies. In this case, phs000286.v5.p1 is the parent study. First, we can use the phs.version() function to make sure that this is the latest version of the study. We can abbreviate the phs name by giving just the digits, or we can use the full dbGaP ID.

phs.version("286")

'phs000286.v5.p1'
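The examples in this document pass study IDs in several forms ("286", "000286", "phs499", "phs000499"). As an illustration of how such abbreviations could be expanded to the "phs" plus six digits form, here is a minimal sketch. normalize_phs() is a hypothetical helper and does not describe how the package handles IDs internally. It does not look up versions, which phs.version() does by querying dbGaP.

```r
# Hypothetical helper, not part of dbGaP2x.
# Expand an abbreviated study ID to the "phs" + six digits form.
normalize_phs <- function(phs) {
  id <- tolower(as.character(phs))
  id <- sub("^phs", "", id)     # drop an optional "phs" prefix
  id <- sub("\\..*$", "", id)   # drop an optional ".vX.pY" suffix
  if (!grepl("^[0-9]+$", id)) {
    stop("Not a recognizable study ID: ", phs)
  }
  sprintf("phs%06d", as.integer(id))
}

ids <- c("286", "000286", "phs499", "phs000286.v5.p1")
print(vapply(ids, normalize_phs, character(1), USE.NAMES = FALSE))
```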

The is.parent() function is useful to test whether a study is a parent study or a substudy

is.parent("000286") # JHS main cohort
is.parent("phs499") # substudy "CARe" for JHS

TRUE

FALSE

If you don't know the parent study of a substudy, try parent.study()

parent.study("phs000499")

'phs000286.v5.p1'
'Jackson Heart Study (JHS) Cohort'

Conversely, use sub.study() to get the names and IDs of the substudies of a parent study

sub.study("286")

phs	name
phs000499.v3.p1	NHLBI Jackson Heart Study Candidate Gene Association Resource (CARe)
phs000498.v3.p1	Jackson Heart Study Allelic Spectrum Project
phs000402.v3.p1	NHLBI GO-ESP: Heart Cohorts Exome Sequencing Project (JHS)
phs001098.v1.p1	T2D-GENES Multi-Ethnic Exome Sequencing Study: Jackson Heart Study

If you want to get the name of a study from its dbGap id, use study.name()

study.name("286")

'Jackson Heart Study (JHS) Cohort'

Finally, you can view your study on dbGaP with browse.dbgap().

If a website exists for this study, you can open it with browse.study()

browse.dbgap("286", jupyter = TRUE)
browse.study("286", jupyter = TRUE)

'https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000286.v5.p1'

'https://www.jacksonheartstudy.org'

2. Explore the characteristics of your study

A dbGaP study can have multiple consent groups, each with its own specificities. Use consent.groups() to get the number and names of the consent groups in the study that you are exploring. Let's keep focusing on JHS.

JHS <- "phs000286"
consent.groups(JHS)

	shortName	longName
0	NRUP	Subjects did not participate in the study, did not complete a consent document and are included only for the pedigree structure and/or genotype controls, such as HapMap subjects
1	HMB-IRB-NPU	Health/Medical/Biomedical (IRB, NPU)
2	DS-FDO-IRB-NPU	Disease-Specific (Focused Disease Only, IRB, NPU)
3	HMB-IRB	Health/Medical/Biomedical (IRB)
4	DS-FDO-IRB	Disease-Specific (Focused Disease Only, IRB)

Use n.pop() to get the number of patients included in each group

n.pop(JHS)
n.pop(JHS, consentgroups = FALSE)

consent_group	male	female	total
HMB-IRB	1860	2504	4549
HMB-IRB-NPU	264	505	802
DS-FDO-IRB-NPU	63	107	180
HMB-IRB	784	1232	2131
DS-FDO-IRB	173	289	489
TOTAL	3144	4637	8151

8151
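The figures above can be cross-checked with a few lines of base R. This sketch re-enters the printed table by hand (labels copied exactly as printed) instead of calling n.pop(), so it runs without network access. The column name "not_reported" and the reading of the gap between male plus female and total as subjects without a recorded sex are assumptions.

```r
# Values copied by hand from the n.pop(JHS) output above.
npop <- data.frame(
  consent_group = c("HMB-IRB", "HMB-IRB-NPU", "DS-FDO-IRB-NPU",
                    "HMB-IRB", "DS-FDO-IRB"),
  male   = c(1860, 264, 63, 784, 173),
  female = c(2504, 505, 107, 1232, 289),
  total  = c(4549, 802, 180, 2131, 489),
  stringsAsFactors = FALSE
)

# Column sums should reproduce the TOTAL row: 3144, 4637, 8151.
print(colSums(npop[, c("male", "female", "total")]))

# Subjects counted in "total" but in neither sex column
# (assumed here to be subjects with no sex recorded).
npop$not_reported <- npop$total - npop$male - npop$female
print(npop)
```

The column sums match the TOTAL row, while male plus female falls short of the total in every group, which is why the table cannot be checked by adding its first two columns.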

Use n.tables() and n.variables() to get the number of datatables in your study and the total number of variables

(n.variables goes into the study files to count the actual number of variables)

n.tables(JHS)
n.variables(JHS)

66

4326

datatables.dict() will return a data frame with the datatables IDs (phtxxxxxx) and description of your study

tablesdict <- datatables.dict(JHS)
head(tablesdict)

pht	dt_study_name	dt_label
pht002539.v2	ESP_HeartGO_JHS_Subject_Phenotypes	Subject ID, ESP cohort, target capture used in sequencing, sequence center, race, sex, affection status, family medical history of stroke, participant medical history of asthma and COPD, ankle brachial index, artery disease status, atrioventricular block, blood pressure, body weight, height and BMI, coronary artery calcium, EKG, Framingham Risk Score, intimal-medial thickness, laboratory tests including basophils, eosinophils, neutrophils, lymphocytes, lymphocytes, blood fasting insulin and glucose, level of C-reactive protein, LDL, HDL, triglycerides, uric acid, urinary creatinine, serum creatinine, menopause, MI, FEV1, FVC, stroke status, type 2 diabetes, Wolff-Parkinson-White pattern, hormone replacement therapy, and smoking status of subjects participated in the "National Heart Lung and Blood Institute (NHLBI) GO-ESP: Heart Cohorts Component of the Exome Sequencing Project (JHS)" sub study of the "Jackson Heart Study (JHS) Cohort" project.
pht001948.v1	CSTA	Agatston score of all coronary section among participants of the Jackson Heart Study including adult 35-84 years old African Americans.
pht001947.v1	CSIA	Approach to life B. Life style among participants of the Jackson Heart Study including adult 35-84 years old African Americans.
pht001968.v1	PPAA	Post physical activity monitoring among participants of the Jackson Heart Study including adult 35-84 years old African Americans.
pht001955.v1	ECHA	Echocardiographic abnormalities among participants of the Jackson Heart Study including adult 35-84 years old African Americans.
pht001952.v1	DPASS_DIET1	Dietary data (DPASS) among participants of the Jackson Heart Study including adult 35-84 years old African Americans.
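Since the result is an ordinary data frame, it can be searched with base R. The sketch below uses a small hand-made stand-in for tablesdict (three rows taken from the output above, with labels shortened) so that it runs without network access. find_tables() is a hypothetical helper, not a package function.

```r
# Stand-in for the data frame returned by datatables.dict(JHS);
# rows come from the output above, labels are shortened here.
tablesdict <- data.frame(
  pht = c("pht001948.v1", "pht001955.v1", "pht001952.v1"),
  dt_study_name = c("CSTA", "ECHA", "DPASS_DIET1"),
  dt_label = c("Agatston score of all coronary section",
               "Echocardiographic abnormalities",
               "Dietary data (DPASS)"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: case-insensitive keyword search in the table labels.
find_tables <- function(dict, keyword) {
  hits <- grepl(keyword, dict$dt_label, ignore.case = TRUE)
  dict[hits, c("pht", "dt_study_name")]
}

print(find_tables(tablesdict, "echocardiograph"))
```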

variables.dict() will return a data frame with the variables IDs (phvxxxxxx), their name in the study, the datatable where they come from and their description

vardict <- variables.dict(JHS)
head(vardict)

dt_study_name	phv	var_name	var_desc
ESP_HeartGO_JHS_Subject_Phenotypes	phv00165323.v2	SUBJID	Subject ID
ESP_HeartGO_JHS_Subject_Phenotypes	phv00165322.v2	ESP_Cohort	Cohort name [JHS]
ESP_HeartGO_JHS_Subject_Phenotypes	phv00165324.v2	ESP_phenotype	ESP Phenotype group (phenotype that the sample was selected to be sequenced for) [EOMI_Control (Early MI control), LDL_Low, LDL_High, BP_Low (low blood pressure); BP_High (high blood pressure); DPR (Deeply Phenotyped Reference); BMI_High]
ESP_HeartGO_JHS_Subject_Phenotypes	phv00181282.v1	Sequence_center	Indicates where the sample was sequence at [Broad, UW]
ESP_HeartGO_JHS_Subject_Phenotypes	phv00181283.v1	Target	Indicates target capture used in sequencing
ESP_HeartGO_JHS_Subject_Phenotypes	phv00181284.v1	ESP_race_selfreport	Self report race [African American]
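Both dictionaries share the dt_study_name column, so a variable can be traced back to its datatable ID with a join. The sketch below is a minimal illustration on hand-made stand-ins built from rows of the two outputs above. It assumes the real data frames use these same column names, as the printed headers suggest.

```r
# Stand-ins for variables.dict(JHS) and datatables.dict(JHS),
# built from rows shown in the outputs above.
vardict <- data.frame(
  dt_study_name = rep("ESP_HeartGO_JHS_Subject_Phenotypes", 3),
  phv = c("phv00165323.v2", "phv00181283.v1", "phv00181284.v1"),
  var_name = c("SUBJID", "Target", "ESP_race_selfreport"),
  var_desc = c("Subject ID",
               "Indicates target capture used in sequencing",
               "Self report race [African American]"),
  stringsAsFactors = FALSE
)
tables <- data.frame(
  pht = "pht002539.v2",
  dt_study_name = "ESP_HeartGO_JHS_Subject_Phenotypes",
  stringsAsFactors = FALSE
)

# Attach the datatable ID to every variable, then search the descriptions.
joined <- merge(vardict, tables, by = "dt_study_name")
race_vars <- joined[grepl("race", joined$var_desc, ignore.case = TRUE),
                    c("pht", "phv", "var_name")]
print(race_vars)
```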

3. Extract your study

3.1. Get your dbGaP repository key

To download or decrypt your data from dbGaP, you need to request access and obtain a decryption key. Follow these steps to get your dbGaP repository key:

a. Go to https://www.ncbi.nlm.nih.gov/gap and click on "controlled access data"

b. Click on Log in to dbGaP

c. Identify yourself with your eRA Commons ID and password

d. Get a PI dbGaP repository key

The decryption key is available from a PI dbGaP account, under "Get no password dbGaP repository key".

3.2. Decrypt the .ncbi_enc files

On dbGaP, the phenotypic files are encrypted. We created a decryption function that uses a dockerized version of sratoolkit. To use that function, you need to have Docker installed on your device (www.docker.com). If you are using the dockerized version of this software (available at hub.docker.com/r/gversmee/dbgap2x), Docker is already installed, but you'll need to upload your key to the Jupyter working directory. To let you try the function, we put some pre-encrypted files in the repo

key <- "path/to/your/key.ngc"
files <- "path/to/the/directory/of/your/encrypted/files"
dbgap.decrypt(files, key)

You should now see a "decrypted_files" directory inside the directory where your encrypted files are located
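Before calling dbgap.decrypt(), it can help to confirm that both paths exist and that the directory actually contains .ncbi_enc files. The following is a sketch of such a pre-flight check. check_decrypt_inputs() is a hypothetical helper and not part of dbGaP2x, and the demo uses empty placeholder files in a temporary directory instead of real dbGaP data.

```r
# Hypothetical helper, not part of dbGaP2x.
# Validate the inputs and return the encrypted files that were found.
check_decrypt_inputs <- function(files, key) {
  if (!dir.exists(files)) stop("Directory not found: ", files)
  if (!file.exists(key)) stop("Key file not found: ", key)
  enc <- list.files(files, pattern = "\\.ncbi_enc$", full.names = TRUE)
  if (length(enc) == 0) warning("No .ncbi_enc files in ", files)
  enc
}

# Demo with empty placeholder files (no real key, no real data).
demo_dir <- file.path(tempdir(), "encrypted_demo")
dir.create(demo_dir, showWarnings = FALSE)
demo_key <- file.path(demo_dir, "key.ngc")
file.create(demo_key, file.path(demo_dir, "example.txt.ncbi_enc"))

print(basename(check_decrypt_inputs(demo_dir, demo_key)))
```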

3.3. Download dbGaP files

a. Click on "file selector"

This gives you access to the dbGaP file selector where you can find all the files available for the selected project.

b. Filter by study accession

Here, we want to get the phenotypic data for the study "Early onset COPD", so after checking Study accession, we select "phs000946".

c. Filter again

Since we are only interested in getting the phenotypic data, let's filter by Content type and select phenotype individual-auxiliary and phenotype individual-traits

d. Select the files

Click on "+" to select all the files

e. Click on "Cart file"

This will download a .krt file to your download folder

f. Download and decrypt the files with a simple command

key <- "path/to/your/key.ngc"
cart <- "path/to/your/cart/file.krt"
dbgap.download(cart, key)

You should now see a new directory named dbGaP-*** in your working directory. It contains your files
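To locate that directory from R, one option is to list the sub-directories whose name starts with "dbGaP-". This is a sketch only: find_dbgap_dirs() is a hypothetical helper, and "dbGaP-12345" in the demo is a made-up placeholder name, not a real project number.

```r
# Hypothetical helper, not part of dbGaP2x.
# List directories named "dbGaP-..." under `path`, newest first.
find_dbgap_dirs <- function(path = getwd()) {
  dirs <- list.dirs(path, full.names = TRUE, recursive = FALSE)
  dirs <- dirs[grepl("^dbGaP-", basename(dirs))]
  dirs[order(file.mtime(dirs), decreasing = TRUE)]
}

# Demo in a temporary directory; "dbGaP-12345" is a placeholder name.
demo_root <- file.path(tempdir(), "download_demo")
dir.create(file.path(demo_root, "dbGaP-12345"),
           recursive = TRUE, showWarnings = FALSE)
file.create(file.path(demo_root, "dbGaP-12345", "example.txt"))

found <- find_dbgap_dirs(demo_root)
print(basename(found))
print(list.files(found[1]))
```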
