Added VDJ concatenation modules and added MuData object by saraterzo · Pull Request #435 · nf-core/scrnaseq

Open · wants to merge 30 commits into base: dev from MuData_implementation

Commits (30)
- `5080be6` Added MuData object (SaraTerzol, Feb 27, 2025)
- `7852c46` Added config parameter to allow module structure (SaraTerzol, Feb 28, 2025)
- `4741d10` Merge branch 'dev' into MuData_implementation (matbonfanti, Feb 28, 2025)
- `9020e9f` Change script's name (SaraTerzol, Mar 3, 2025)
- `2a0331b` Modify module to handle empty files (SaraTerzol, Mar 5, 2025)
- `a19b49e` Merge remote-tracking branch 'refs/remotes/origin/MuData_implementati… (SaraTerzol, Mar 5, 2025)
- `56cd8cc` Modify the channel to handle the absence of the VDJ file (SaraTerzol, Mar 11, 2025)
- `7e2e38a` Merge branch 'dev' into MuData_implementation (saraterzo, Mar 11, 2025)
- `8e903af` Added version to modules (SaraTerzol, Mar 11, 2025)
- `2ceb3e9` Added version to modules (SaraTerzol, Mar 11, 2025)
- `654e65a` Update output documentation (SaraTerzol, Mar 11, 2025)
- `6a03158` Added MuData implementation description to output documentation (SaraTerzol, Mar 11, 2025)
- `755a69d` Added option for handling VDJ missing (SaraTerzol, Mar 12, 2025)
- `25519c4` Modify test to handle changes in output data (SaraTerzol, Mar 13, 2025)
- `71695ac` Added the option to create a MuData object only when Cell Ranger Mult… (SaraTerzol, Mar 13, 2025)
- `096e080` Change permission (SaraTerzol, Mar 13, 2025)
- `5c2ea4b` Trailing whitespace (SaraTerzol, Mar 20, 2025)
- `f53418d` Merge branch 'dev' into MuData_implementation (saraterzo, Mar 26, 2025)
- `2302b58` Merge branch 'dev' into MuData_implementation (saraterzo, Mar 27, 2025)
- `c783c0f` Adjust indentation (SaraTerzol, Mar 27, 2025)
- `a303817` Adjust indentation (SaraTerzol, Mar 27, 2025)
- `3b7d9a8` Remove test (SaraTerzol, Mar 27, 2025)
- `e311262` Trailing whitespace (SaraTerzol, Mar 27, 2025)
- `c0ad49c` Changed output channel (SaraTerzol, Mar 28, 2025)
- `1d773cc` Merge branch 'dev' into MuData_implementation (saraterzo, Mar 31, 2025)
- `3ccfae8` Remove checks on number of results and tasks executed (SaraTerzol, Mar 31, 2025)
- `f334e76` Added check on results and task (SaraTerzol, Mar 31, 2025)
- `5902d3d` Added nf-test.config (SaraTerzol, Apr 1, 2025)
- `af4576b` Remove check on tasks and results (SaraTerzol, Apr 1, 2025)
- `cb74297` Merge branch 'dev' into MuData_implementation (grst, Apr 25, 2025)
12 changes: 12 additions & 0 deletions docs/output.md
@@ -20,6 +20,8 @@ The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes d
- [Cellranger ARC](#cellranger-arc)
- [Cellranger multi](#cellranger-multi)
- [Cellbender remove background filter](#cellbender-remove-background-filter)
- [VDJ concatenation](#vdj-concatenation)
- [Multimodal Data implementation](#multimodal-data-implementation)
- [Other output data](#other-output-data)
- [MultiQC](#multiqc)

@@ -133,6 +135,16 @@ The pipeline also possess a subworkflow imported from scdownstream to perform fi

- Contains the cellbender filtered matrices results generated by the remove background functionality.

## VDJ concatenation
The per-sample `filtered_contig_annotations` tables are concatenated to generate a unified high-level annotation for each high-confidence cellular contig. This step takes one `filtered_contig_annotations.csv` file per sample and produces a single AnnData object saved as an `.h5ad` file.

**Output directory: `results/concatenate`**
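
For illustration, a minimal sketch of how this output could be inspected downstream; the file name `combined.vdj.h5ad` is assumed from the module's default output and may differ in practice:

```python
import scanpy as sc

# Path assumed from the output directory above plus the script's default file name.
adata_vdj = sc.read_h5ad("results/concatenate/combined.vdj.h5ad")
print(adata_vdj)
# The 'sample' column is only present when more than one file was concatenated.
print(adata_vdj.obs.get("sample"))
```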

## Multimodal Data implementation
This step handles multimodal data. It takes `.h5ad` AnnData objects from the Gene Expression (GEX), Cellular Indexing of Transcriptomes and Epitopes by sequencing (CITE-seq), and V(D)J modalities and constructs a MuData object, which is saved as an `.h5mu` file.

**Output directory: `results/convert`**
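
Similarly, a hedged sketch of loading the resulting MuData object; the directory comes from above, while the file name `matrix.mudata.h5mu` is assumed from the conversion script's default:

```python
import mudata as md

# Read the multimodal object and access its modalities.
mdata = md.read_h5mu("results/convert/matrix.mudata.h5mu")
print(mdata)          # lists the stored modalities, e.g. gex / pro / airr as assembled by convert_mudata.py
gex = mdata["gex"]    # the Gene Expression AnnData, if present
```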

## Other output data

**Output directory: `results/reference_genome`**
41 changes: 41 additions & 0 deletions modules/local/concatenate_vdj/main.nf
@@ -0,0 +1,41 @@
process CONCATENATE_VDJ {
tag "$meta.id"
label 'process_single'

container = 'quay.io/biocontainers/scirpy:0.20.1--pyhdfd78af_0'

input:
tuple val(meta), path(input_vdj, stageAs: '?/*')
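    // note: stageAs '?/*' stages each incoming CSV in its own numbered subdirectory (1/, 2/, ...), so same-named files from different samples do not collide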

output:
tuple val(meta), path("*.vdj.h5ad") , emit: h5ad, optional: true
path "versions.yml", emit: versions

when:
task.ext.when == null || task.ext.when

script:
"""
export NUMBA_CACHE_DIR=/tmp
export MPLCONFIGDIR=/tmp
export XDG_CONFIG_HOME=/tmp

concatenate_vdj.py -ai ${input_vdj.join(' ')} -id ${meta.collect{ it.id }.join(' ')}

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    concatenate_vdj.py: \$(concatenate_vdj.py --version)
END_VERSIONS

"""

stub:
"""
touch combined_vdj.h5ad

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    concatenate_vdj.py: \$(concatenate_vdj.py --version)
END_VERSIONS
"""
}
127 changes: 127 additions & 0 deletions modules/local/concatenate_vdj/resources/usr/bin/concatenate_vdj.py
@@ -0,0 +1,127 @@
#!/usr/bin/env python3
# ====================================================================================================================
# PRELIMINARIES
# ====================================================================================================================

# MODULE IMPORT
import warnings
import argparse # command line arguments parser
import pathlib # library for handling filesystem paths
import glob
import os
import scanpy as sc # single-cell data processing
import scirpy as ir # single-cell AIRR-data
import anndata as ad # store annotated matrix as anndata object


warnings.filterwarnings("ignore")
Review comment (Member):
It would be better to filter only specific (expected) warnings, e.g. by category or message.
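
As a sketch of the narrower filtering the reviewer suggests (the categories and message pattern below are assumptions, not taken from the PR):

```python
import warnings

# Silence only the warning types expected from the scanpy/scirpy imports,
# rather than suppressing everything.
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", message=".*Transforming to str index.*")
```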


# PARAMETERS
# set script version number
VERSION = "0.0.1"


# ====================================================================================================================
# MAIN FUNCTION
# ====================================================================================================================

def main():
"""
This function concatenates csv files from the vdj modality.
"""
# --------------------------------------------------------------------------------------------------------------------
# LIBRARY CONFIG
# --------------------------------------------------------------------------------------------------------------------

sc.settings.verbosity = 3 # verbosity: errors (0), warnings (1), info (2), hints (3)
sc.logging.print_header()

# --------------------------------------------------------------------------------------------------------------------
# INPUT FROM COMMAND LINE
# --------------------------------------------------------------------------------------------------------------------

# Define command line arguments with argparse

parser = argparse.ArgumentParser(prog='Concatenate_vdj', usage='%(prog)s [options]', description="VDJ data concatenation",
                                 epilog="This script concatenates vdj filtered contig annotation files into a single AnnData object.")
parser.add_argument('-ai', '--input-vdj-dir', metavar='VDJ_INPUT_FILES',nargs='+',type=pathlib.Path, dest='input_vdj_files',
help="paths of existing directory containing vdj matrix files in csv format (including file names)")
parser.add_argument('-id', '--input-run-id', metavar='INPUT_RUN_ID', nargs='+', dest='input_run_id',
help="names of the run-id corresponding to the input adata")
parser.add_argument('-o', '--out', metavar='H5AD_OUTPUT_FILE', type=pathlib.Path, default="combined.vdj.h5ad",
help="name of the h5ad object containing the concatenated vdj table")
parser.add_argument('-v', '--version', action='version', version=VERSION)
args = parser.parse_args()

# --------------------------------------------------------------------------------------------------------------------
# DEFINE SAMPLES AND MTX PATHS
# --------------------------------------------------------------------------------------------------------------------

print("\n===== VDJ FILES =====")
input_vdj_file = args.input_vdj_files
input_run_id = args.input_run_id
output = args.out

# print info on the available matrices
print("Reading vdj matrix from the following files:")
for run, mtx in zip(input_run_id, input_vdj_file):
print(f"Run: {run:15s} - File: {mtx}")


# --------------------------------------------------------------------------------------------------------------------
# READ VDJ FILES
# --------------------------------------------------------------------------------------------------------------------
vdj_files = []
for folder in glob.glob("*/filtered_contig_annotations.csv"):
vdj_files.append(folder)

adata_vdj_list = []
if vdj_files:

for run, vdj in zip(input_run_id,vdj_files):
# Read folders with the filtered contigue annotation and store datasets in a dictionary
print("\n===== READING CONTIGUE ANNOTATION MATRIX =====")
print("\nProcessing filtered contigue table in folder ... ", end ='')
Review comment (Member):
Suggested change:
-            print("\nProcessing filtered contigue table in folder ... ", end ='')
+            print("\nProcessing filtered contig table in folder ... ", end ='')

if os.path.getsize(vdj) == 0:
print(f"Warning: {vdj} is empty and will be skipped.")
else:
adata_vdj= ir.io.read_10x_vdj(vdj)
print("Done!")
adata_vdj_list.append(adata_vdj)
else:
print("No valid input file provided. Skipping reading of the vdj annotation.")

# --------------------------------------------------------------------------------------------------------------------
# VDJ TABLE CONCATENATION
# --------------------------------------------------------------------------------------------------------------------

print("\n===== CONCATENATING VDJ TABLES =====")

if len(adata_vdj_list) == 0:
print("No valid files were found. Nothing to save.")
else:
if len(adata_vdj_list) == 1:
adata_vdj_concatenated = adata_vdj_list[0]
print("Only one non-empty file found. Saving the file as is without concatenation.")
else:
Comment on lines +103 to +106
Review comment (Member):
is it necessary to special-case this? i.e. wouldn't ad.concat just work fine with a single file?
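
A quick, self-contained check of the reviewer's point on toy data (whether scirpy-specific slots such as `.obsm["airr"]` survive unchanged would still need verifying):

```python
import anndata as ad
import numpy as np

# anndata.concat also accepts a single-element list, so the one-file
# branch above may not need special-casing.
a = ad.AnnData(X=np.zeros((3, 2)))
combined = ad.concat([a], join="outer", merge="same", label="sample", keys=["run1"], index_unique="_")
print(combined.shape)                    # (3, 2)
print(combined.obs["sample"].unique())   # ['run1']
```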

adata_vdj_concatenated = ad.concat(adata_vdj_list, join= "outer", merge ="same", label="sample",
keys= input_run_id, index_unique="_")

print(f"Concatenated vdj table for {len(input_run_id)} batched has {adata_vdj_concatenated.shape[0]} cells")
print("Done!")
# --------------------------------------------------------------------------------------------------------------------
# SAVE OUTPUT FILE
# --------------------------------------------------------------------------------------------------------------------

print("\n===== SAVING OUTPUT FILE =====")

print(f"Saving vdj table data in {output}")
adata_vdj_concatenated.write(output)
print("Done!")


#####################################################################################################


if __name__ == '__main__':
main()
44 changes: 44 additions & 0 deletions modules/local/convert_mudata/main.nf
@@ -0,0 +1,44 @@
process CONVERT_MUDATA {
tag "$meta.id"
label 'process_single'

container = 'quay.io/biocontainers/scirpy:0.20.1--pyhdfd78af_0'

input:
tuple val(meta), path(input_h5ad)
tuple val(meta), path (input_vdj)

output:
tuple val(meta), path("*.mudata.h5mu") , emit: h5mu
path "versions.yml", emit: versions

when:
task.ext.when == null || task.ext.when

script:
def ai = input_vdj ? "-ai $input_vdj" : ''

"""
export NUMBA_CACHE_DIR=/tmp
export MPLCONFIGDIR=/tmp
export XDG_CONFIG_HOME=/tmp

convert_mudata.py -ad $input_h5ad $ai

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    convert_mudata.py: \$(convert_mudata.py --version)
END_VERSIONS

"""

stub:
"""
touch matrix.mudata.h5mu

cat <<-END_VERSIONS > versions.yml
"${task.process}":
    convert_mudata.py: \$(convert_mudata.py --version)
END_VERSIONS
"""
}
134 changes: 134 additions & 0 deletions modules/local/convert_mudata/resources/usr/bin/convert_mudata.py
@@ -0,0 +1,134 @@
#!/usr/bin/env python3
# ====================================================================================================================
# PRELIMINARIES
# ====================================================================================================================

# MODULE IMPORT
import warnings
import argparse # command line arguments parser
import pathlib # library for handling filesystem paths
import scanpy as sc # single-cell data processing
from mudata import MuData


warnings.filterwarnings("ignore")
Review comment (Member):
It would be better to filter only specific (expected) warnings, e.g. by category or message.


# PARAMETERS
# set script version number
VERSION = "0.0.1"


# ====================================================================================================================
# MAIN FUNCTION
# ====================================================================================================================

def main():
"""
This function creates a MuData object.
"""
# --------------------------------------------------------------------------------------------------------------------
# LIBRARY CONFIG
# --------------------------------------------------------------------------------------------------------------------

sc.settings.verbosity = 3 # verbosity: errors (0), warnings (1), info (2), hints (3)
sc.logging.print_header()

# --------------------------------------------------------------------------------------------------------------------
# INPUT FROM COMMAND LINE
# --------------------------------------------------------------------------------------------------------------------

# Define command line arguments with argparse

parser = argparse.ArgumentParser(prog='Create MuData object', usage='%(prog)s [options]', description="MuData object conversion",
                                 epilog="This script creates a MuData object for storing GEX, VDJ and CITE-seq data.")
parser.add_argument('-ad','--input-gex-file',metavar= 'GEX_INPUT_FILES', type=pathlib.Path, dest='input_gex_files',
help="paths of existing count matrix files in h5ad format (including file names)")
parser.add_argument('-ai', '--input-vdj-file', metavar='VDJ_INPUT_FILES',type=pathlib.Path, dest='input_vdj_files',
default=pathlib.Path(''),help="paths of existing vdj matrix files in h5ad format (including file names)")
parser.add_argument('-o', '--out', metavar='MUDATA_OUTPUT_FILE', type=pathlib.Path, default="matrix.mudata.h5mu",
help="name of the muData object")
parser.add_argument('-v', '--version', action='version', version=VERSION)
args = parser.parse_args()

# --------------------------------------------------------------------------------------------------------------------
# DEFINE SAMPLES AND MTX PATHS
# --------------------------------------------------------------------------------------------------------------------

print("\n===== INPUT GEX and VDJ FILES =====")
input_gex_file = args.input_gex_files
input_vdj_file = args.input_vdj_files
output = args.out

# print info on the available matrices
print("Reading combined gex count matrix from the following file:")
print(f"-File {input_gex_file}")

print("Reading filtered annotation table from the following file:")
print(f"-File {input_vdj_file}")

# --------------------------------------------------------------------------------------------------------------------
# READ GEX AND AB FILES
# --------------------------------------------------------------------------------------------------------------------
if input_gex_file:
# Read folders with the MTX combined count matrix and store datasets in a dictionary
print("\n===== READING COMBINED MATRIX =====")
# read the gex count matrix for the combined samples and print some initial info
print("\nProcessing count matrix in folder ... ", end ='')
adata= sc.read_h5ad(input_gex_file)
print("Done!")
print(f"Gex count matrix for combined samples has {adata.shape[0]} cells and {adata.shape[1]} genes")
else:
print("No valid input file provided. Skipping reading of the count matrix.")

# --------------------------------------------------------------------------------------------------------------------
# READ VDJ FILES
# --------------------------------------------------------------------------------------------------------------------
if input_vdj_file and input_vdj_file != pathlib.Path(''):
# Read folders with the filtered contig annotation and store datasets in a dictionary
print("\n===== READING CONTIG ANNOTATION MATRIX =====")
# read the filtered contig annotation file for the combined samples and print some initial info
print("\nProcessing filtered contig table in folder ... ", end ='')
adata_vdj= sc.read_h5ad(input_vdj_file)
print("Done!")
else:
print("No valid input file provided. Skipping reading of the vdj annotation.")

# --------------------------------------------------------------------------------------------------------------------
# CREATE MUDATA OBJECT
# --------------------------------------------------------------------------------------------------------------------
#Creates dictionary to store all modalities
modalities = {}
try:
# Add 'gex' modality if defined
if adata[:, adata.var["feature_types"] == "Gene Expression"].shape[1] > 0:
Review comment (Member):
Suggested change:
-    if adata[:, adata.var["feature_types"] == "Gene Expression"].shape[1] > 0:
+    if any(adata.var["feature_types"] == "Gene Expression"):
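
A small sanity check of that equivalence on toy data (sketch only; the AnnData below is made up):

```python
import anndata as ad
import numpy as np
import pandas as pd

var = pd.DataFrame(
    {"feature_types": ["Gene Expression", "Antibody Capture", "Gene Expression"]},
    index=["GENE1", "CD3_TotalSeq", "GENE2"],
)
adata = ad.AnnData(X=np.zeros((2, 3)), var=var)

for feature in ["Gene Expression", "Antibody Capture", "CRISPR Guide Capture"]:
    mask = adata.var["feature_types"] == feature
    # subsetting-then-checking the shape and any() over the mask agree
    assert (adata[:, mask].shape[1] > 0) == bool(mask.any())
```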

modalities["gex"] = adata[:, adata.var["feature_types"] == "Gene Expression"]
# Add 'pro' modality if defined
if adata[:, adata.var["feature_types"] == "Antibody Capture"].shape[1] > 0:
modalities["pro"] = adata[:, adata.var["feature_types"] == "Antibody Capture"]
Review comment (Member):
Suggested change:
-        modalities["pro"] = adata[:, adata.var["feature_types"] == "Antibody Capture"]
+        modalities["protein"] = adata[:, adata.var["feature_types"] == "Antibody Capture"]

Review comment (Member):
Maybe more explicit?

except NameError:
pass

try:
# Add 'airr' modality if defined
if adata_vdj is not None:
modalities["airr"] = adata_vdj
except NameError:
pass

# Creates MuData object
mdata = MuData(modalities)

# --------------------------------------------------------------------------------------------------------------------
# SAVE OUTPUT FILE
# --------------------------------------------------------------------------------------------------------------------

print("\n===== SAVING OUTPUT FILE =====")

print(f"Saving MuData object to file {output}")
mdata.write(output)
print("Done!")

#####################################################################################################

if __name__ == '__main__':
main()
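
For reference, a toy construction mirroring what the script assembles (all names and data below are made up, not taken from the PR):

```python
import anndata as ad
import numpy as np
from mudata import MuData

# Toy modalities; in the pipeline these come from the GEX/CITE and V(D)J steps.
gex = ad.AnnData(X=np.random.rand(4, 6))
gex.var_names = [f"gene{i}" for i in range(6)]
protein = ad.AnnData(X=np.random.rand(4, 2))
protein.var_names = ["CD3", "CD19"]

mdata = MuData({"gex": gex, "pro": protein})
print(mdata)                       # MuData object with modalities 'gex' and 'pro'
mdata.write("toy.mudata.h5mu")     # same write call the script uses for its output
```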
Empty file modified: modules/local/templates/mtx_to_h5ad_simpleaf.py (file mode changed 100755 → 100644)
2 changes: 1 addition & 1 deletion nextflow.config
@@ -5,7 +5,7 @@
Default config options for all compute environments
----------------------------------------------------------------------------------------
*/

nextflow.enable.moduleBinaries = true
Review comment (Member):
I think this is not supposed to be used in nf-core pipelines, because it's not compatible with all platforms.
Therefore, either template or a global script in bin is still the way to go.

But I don't recall the details, maybe also ask again on slack if this is still a current requirement and what's the reasoning.

// Global default params, used in configs
params {
