Restructure output TSVs #53

etal · 2024-08-09T15:27:38Z

Addresses v3.0 items related to the output TSV schema accommodating the new nomenclature: #29, #32, #34

Report changes directly following from the new schema and nomenclature: #36, #37, #38, #39, #40, #41

Also UI/UX changes in the Form Bio workflow.

Take the stem of each filename in the selected folder as the sample unique ID and sample display name for the input metadata TSV. Update help text for metadata input fields: sample_unique_id must be a substring of the input BAM/FASTQ filename; for folder input, these must be unique substrings so that IDs map to files 1:1.
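A minimal sketch of the folder-input behavior described above, not the workflow's actual code. It assumes the "stem" is the filename with a recognized sequence-file suffix removed; the suffix list and the function names `file_stem`, `sample_rows_from_filenames`, and `map_ids_to_files` are hypothetical.

```python
from pathlib import Path

# Assumed set of sequence-file suffixes; the real workflow may accept others.
SEQ_SUFFIXES = (".bam", ".fastq", ".fastq.gz", ".fq", ".fq.gz")


def file_stem(filename: str) -> str:
    """Strip a recognized suffix, handling double extensions like .fastq.gz."""
    for suffix in sorted(SEQ_SUFFIXES, key=len, reverse=True):
        if filename.endswith(suffix):
            return filename[: -len(suffix)]
    return Path(filename).stem


def sample_rows_from_filenames(filenames):
    """Return (sample_unique_id, sample_display_name) pairs, one per sequence file."""
    rows = []
    for name in sorted(filenames):
        if name.endswith(SEQ_SUFFIXES):
            stem = file_stem(name)
            rows.append((stem, stem))
    return rows


def map_ids_to_files(sample_ids, filenames):
    """Return {sample_id: filename}, requiring a 1:1 substring match."""
    mapping = {}
    claimed = set()
    for sid in sample_ids:
        hits = [name for name in filenames if sid in name]
        if len(hits) != 1:
            raise ValueError(f"{sid!r} matches {len(hits)} files, expected 1")
        if hits[0] in claimed:
            raise ValueError(f"{hits[0]!r} is matched by more than one ID")
        claimed.add(hits[0])
        mapping[sid] = hits[0]
    return mapping
```

The second function shows why the help text asks for unique substrings: an ID that occurs in two filenames, or two IDs that resolve to the same file, cannot be mapped unambiguously.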


alignments.tsv changes:
- Rename column 'map_type' to 'map_ref_label'
- Rename column 'map_subtype' to 'map_target_overlap'

per_read.tsv changes:
- Add columns 'read_ref_label' and 'read_target_overlap'
- Repurpose columns 'assigned_type' and 'assigned_subtype' to use the more detailed new nomenclature, which depends on the given `vector_type`

Update calculate_rdata.R and report.Rmd to handle the new TSV columns somewhat. Further report changes will be needed to display the new classification scheme more effectively.
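For downstream scripts that still read files written with the old alignments.tsv header, the renames above can be applied mechanically. The following is an illustrative sketch only, not code from this PR; the helper name `rename_tsv_columns` is hypothetical, and only the two column renames come from the commit message.

```python
import csv
import io

# Old -> new column names for alignments.tsv, as listed in the commit message.
ALIGNMENTS_RENAMES = {
    "map_type": "map_ref_label",
    "map_subtype": "map_target_overlap",
}


def rename_tsv_columns(text: str, renames: dict) -> str:
    """Return TSV text with header fields renamed; data rows pass through unchanged."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    header = next(reader)
    writer.writerow([renames.get(col, col) for col in header])
    writer.writerows(reader)
    return out.getvalue()
```

Columns not named in the mapping are left untouched, so the same helper can be reused with a different mapping for other tables.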


alphabdiallo · 2024-10-14T12:11:00Z

params-local-folder.json

+  "helper_name": "pHelper",
+  "lambda_name": "Lambda",
+  "target_gap_threshold": 200,
+  "max_allowed_outside_vector": 10,


Do you compare the results against expected results with max_allowed_outside_vector set to 10?

No, it's just to exercise the code path and surface any potential crashes. The GHA-automated CI suite under test/ runs the script directly to check specific values and doesn't use the Nextflow engine. I'm aware nf-core has a Nextflow-native framework for testing Nextflow pipelines, but it looks complicated.
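To illustrate the distinction drawn here, a value-level check of the kind a CI suite might run against a script's output could look roughly like this, in contrast to a crash-only Nextflow run. This is a sketch under assumptions: the helper names `count_tsv_rows` and `check_row_count` are hypothetical and not taken from the repository's test/ directory.

```python
import csv
import gzip
from pathlib import Path


def count_tsv_rows(path) -> int:
    """Count data rows (excluding the header) in a plain or gzipped TSV."""
    path = Path(path)
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        next(reader, None)  # skip the header line, if there is one
        return sum(1 for _ in reader)


def check_row_count(path, expected: int) -> None:
    """Fail loudly when an output table has an unexpected number of rows."""
    actual = count_tsv_rows(path)
    assert actual == expected, f"{path}: expected {expected} rows, got {actual}"
```

A check like this pins a specific number, so it has to be updated whenever classification logic changes the expected row counts.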

alphabdiallo · 2024-10-14T12:11:30Z

params-local-folder.json

+  "lambda_name": "Lambda",
+  "target_gap_threshold": 200,
+  "max_allowed_outside_vector": 10,
+  "max_allowed_missing_flanking": 70,


Do you compare the results against expected results with max_allowed_missing_flanking set to 70?

No, it's just to exercise the code path and surface any potential crashes.

alphabdiallo · 2024-10-14T12:27:26Z

src/guess_vector_type.py

+
+import pysam
+
+SC_MAX_THRESHOLD = 0.1


If we have an ss vector with a lot of snapbacks, we are going to guess an scAAV vector.
We can simplify this and use the vector length instead (i.e., shorter than 2.4 kb means sc).
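The length-based alternative suggested in this comment could look roughly like the sketch below. Only the 2.4 kb figure comes from the comment; the function name, the half-open coordinate convention, and the strict "less than" boundary are assumptions, and the heuristic actually merged may differ.

```python
# Cutoff in bases, taken from the review comment; treat it as tunable.
SC_MAX_LENGTH = 2400


def guess_vector_type_by_length(vector_start: int, vector_end: int) -> str:
    """Guess "sc" or "ss" from the length of the annotated vector region.

    Coordinates are assumed to be half-open (0-based start, exclusive end).
    """
    length = vector_end - vector_start
    if length <= 0:
        raise ValueError("vector_end must be greater than vector_start")
    # Assumption: regions shorter than the cutoff are self-complementary.
    return "sc" if length < SC_MAX_LENGTH else "ss"
```

Unlike a read-mapping heuristic, this depends only on the reference annotation, so a single-stranded sample with many snapback reads would not change the guess.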

alphabdiallo

Let's discuss how we are guessing the vector type and align before merging.

alphabdiallo

LGTM

etal force-pushed the schema3-tsv branch 3 times, most recently from 2c59d31 to ca9964a · August 14, 2024 00:26

etal added 4 commits August 15, 2024 20:38

Fix handling of seq_folder_input

86b10a7

Add an input metadata TSV for testing folder workflow input

test: Check diff against *.readsummary.tsv and *.alignments.tsv

c6e914a

calculate_rdata: group code by output TSVs

ee882c0

Remove summary.tsv output and write alignments.tsv from Python

e767fc0

TSV schema changes from the original alignments.tsv:
- Rename "SampleID" to "sample_id"
- Don't add columns "start" and "end"; they're identical to the existing "start0" and "end1"

etal force-pushed the schema3-tsv branch from ca9964a to e767fc0 · August 16, 2024 04:46

etal added 9 commits August 16, 2024 13:20

test: also check TSV line counts with 'make test'

1c2006f

Rename summarize_AAV_alignment.py -> summarize_alignment.py

e64ed5b

Remove sequence-error.tsv and some unused R code

7cb1b79

Add script guess_vector_type.py

f8c56dd

Not yet included in the workflow.

Makefile: don't clean test/build, that's another Makefile's job

60d19d4

Makefile: add deployment command "make formbio"

e8835ce

test/Makefile: fix

aa1a8bb

report: fix type/subtype tables and plots

4e7f179

- Sort Table 1 by reference label, then frequency (so all vectors are together)
- Show Figure 1 as a stacked bar chart, stacking by assigned type
- Show all subtypes in a single line plot
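The Table 1 ordering above amounts to a two-key sort. The report itself is written in R; the Python sketch below only illustrates the ordering. The `count` field and the choice of descending frequency are assumptions, and `sort_type_table` is a hypothetical name.

```python
def sort_type_table(rows):
    """Sort rows by reference label (A-Z), then by count (highest first).

    Each row is a dict with at least "reference_label" and "count" keys.
    """
    return sorted(rows, key=lambda r: (r["reference_label"], -r["count"]))
```

Sorting on the label first keeps all rows for one reference adjacent, which is the stated goal of having all vectors together.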

etal force-pushed the schema3-tsv branch from 0d9bdbb to 4e7f179 · August 23, 2024 21:57

etal added 8 commits August 27, 2024 15:31

nextflow.config: Respect the given container_version locally

de9af1d

Set effective count to 2 for single-stranded nonvector reads, too

344f9a1

summary.tsv: Rename column "read_ref_label" to "reference_label"

039f753

Pass through unlabeled reference names to the report

f369155

Previously they were left out of annotation.txt and therefore dropped by the scripts that relied on it for reference names/labels.

Report improvements w/ joined alignments-per_read dataframe

9bba348

- Use assigned_subtype instead of map_target_overlap
- Show unresolved-dimer in read-length violin plot
- Drop redundant column "read_len" from per_read.tsv
- Improve some plot aesthetics
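As a rough picture of the joined alignments/per_read table this commit refers to, the sketch below attaches per-read fields to every alignment row of the same read. It is written in Python for illustration, whereas the report is in R. The join key `read_id` is an assumption, since the actual key column is not named here, and `join_alignments_per_read` is a hypothetical helper.

```python
def join_alignments_per_read(alignments, per_read, key="read_id"):
    """Left-join alignment rows with their per-read annotations.

    Both inputs are lists of dicts (for example from csv.DictReader).
    Per-read columns are added to each alignment row; on a column-name
    clash, the alignment value is kept.
    """
    by_read = {row[key]: row for row in per_read}
    joined = []
    for aln in alignments:
        read_info = by_read.get(aln[key], {})
        joined.append({**read_info, **aln})
    return joined
```

With such a join, a per-read field like assigned_subtype is available on each alignment row, which is what lets the plots use it in place of the per-alignment map_target_overlap.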

summarize_alignment: emit subset BAMs based on given vector_type

05a7918

i.e. don't emit "ssAAV" subset BAMs for a scAAV vector; they'll always be empty.

Add "strand" column to alignments.tsv (closes #32)

9062673

Compress alignments.tsv and per_read.tsv with gzip (closes #56)

a6c6b05

Rename nonmatch_stat.tsv.gz to nonmatch.tsv.gz

etal force-pushed the schema3-tsv branch from 9b2b7c0 to a6c6b05 · September 6, 2024 17:41

Classify left/right-snapback under ssAAV

0e8f1cf

Also in ssAAV context:
- Classify partial in + and - strands as other-vector:snapback (for now)
- Classify other +/- orientation not otherwise accounted for as "unclassified"

etal marked this pull request as ready for review September 11, 2024 17:10

etal force-pushed the schema3-tsv branch from 9dcdd53 to c22c78f · September 18, 2024 22:57

Rename flipflop output files; make assigned_subtype consistent

fbf906c

- test: update expected flipflop.tsv row count
- compress and rename flipflop(_assignments).tsv -> tsv.gz
- update Nextflow scripts accordingly

etal force-pushed the schema3-tsv branch 2 times, most recently from a3de2ae to cfad346 · September 19, 2024 22:36

etal added 6 commits September 30, 2024 11:30

Remove .Rdata intermediate file (closes #29)

7163a5c

- Roll calculate_rdata.R into create_report.R and report.Rmd
- Update Nextflow and local integration test accordingly

Update workflow .md documentation

938f11e

report: Improve plot aesthetics

281c670

Enable "new" chimeric logic considering all supp alignments

8f2be91

- test: Update expected flipflop row count; 20 reads are now chimeric

If vector_type "unspecified", guess within the pipeline (closes #44)

cd4fae8

Move bin/*.sh to src/* to ensure PATH consistency

07fcb8e

etal force-pushed the schema3-tsv branch from 0cdf8a9 to 07fcb8e · September 30, 2024 18:31

etal requested a review from mcrocker-bioborg October 3, 2024 16:11

alphabdiallo reviewed Oct 14, 2024

etal added 2 commits October 14, 2024 11:12

Add guess_vector_type_length.py, another heuristic for sc/ss

16c2b24

- Update laava.dockerfile and add __init__ to make src/*.py importable
- Rename the first heuristic to guess_vector_type_mapping.py
- Update Nextflow and make_report.sh to use the new heuristic

README: explain local CI tests vs. Nextflow examples

c764a2c

alphabdiallo approved these changes Oct 14, 2024

report: Show sample metadata on page 1 (#12, #59)

5512fcf

etal merged commit 479d80e into main Oct 15, 2024
3 checks passed

etal deleted the schema3-tsv branch October 15, 2024 15:05
