Restructure output TSVs by etal · Pull Request #53 · formbio/laava · GitHub

Restructure output TSVs #53

Merged
merged 43 commits into from
Oct 15, 2024

Contributor
@etal etal commented Aug 9, 2024

Addresses v3.0 items related to the output TSV schema accommodating the new nomenclature: #29, #32, #34

Report changes directly following from the new schema and nomenclature: #36, #37, #38, #39, #40, #41

Also UI/UX changes in the Form Bio workflow.

Take the stem of each filename in the selected folder as the sample
unique ID and sample display name for the input metadata TSV.

Update help text for metadata input fields: sample_unique_id must be a
substring of the input BAM/FASTQ filename; for folder input, these must
be unique substrings so that IDs map to files 1:1.
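
As a rough illustration of the two rules above (filename stems as sample IDs, and IDs as unique substrings mapping 1:1 to files), here is a minimal Python sketch. The function names are hypothetical and not taken from this PR, and the sketch uses `Path.stem`, which strips only the final suffix; how the workflow treats double extensions such as `.fastq.gz` is not stated here.

```python
from pathlib import Path


def stems_as_sample_ids(filenames):
    """Use each file's stem (name minus its final suffix) as the sample ID."""
    ids = {}
    for name in filenames:
        stem = Path(name).stem
        if stem in ids:
            raise ValueError(f"duplicate sample ID {stem!r}")
        ids[stem] = name
    return ids


def match_ids_to_files(sample_ids, filenames):
    """Require every ID to be a substring of exactly one filename (1:1)."""
    mapping = {}
    for sid in sample_ids:
        hits = [f for f in filenames if sid in Path(f).name]
        if len(hits) != 1:
            raise ValueError(f"{sid!r} matches {len(hits)} files; expected 1")
        mapping[sid] = hits[0]
    # Two different IDs must not resolve to the same file.
    if len(set(mapping.values())) != len(mapping):
        raise ValueError("two sample IDs map to the same file")
    return mapping
```

An ID like `s` would be rejected for a folder holding `s1.bam` and `s2.bam`, since it matches both files.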
@etal etal force-pushed the schema3-tsv branch 3 times, most recently from 2c59d31 to ca9964a on August 14, 2024 00:26
etal added 4 commits August 15, 2024 20:38
Add an input metadata TSV for testing folder workflow input
TSV schema changes from the original alignments.tsv:
- Rename "SampleID" to "sample_id"
- Don't add columns "start" and "end"; they're identical to the existing
  "start0" and "end1"
etal added 9 commits August 16, 2024 13:20
alignments.tsv changes:
- Rename column 'map_type' to 'map_ref_label'
- Rename column 'map_subtype' to 'map_target_overlap'

per_read.tsv changes:
- Add columns 'read_ref_label' and 'read_target_overlap'
- Repurpose columns 'assigned_type' and 'assigned_subtype' to use the
  more detailed new nomenclature, which depends on the given
  `vector_type`
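
For anyone with older alignments.tsv files, the column renames listed above could be applied with something like the following Python sketch. It is not code from this PR: the helper name is hypothetical, and it only rewrites the header row, leaving data rows untouched.

```python
import csv
import io

# Old -> new column names, as listed in the commit messages above.
RENAMES = {
    "SampleID": "sample_id",
    "map_type": "map_ref_label",
    "map_subtype": "map_target_overlap",
}


def rename_tsv_columns(text, renames=RENAMES):
    """Return TSV text with header names replaced; unknown columns pass through."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    header = next(reader)
    writer.writerow([renames.get(col, col) for col in header])
    writer.writerows(reader)  # data rows are copied unchanged
    return out.getvalue()
```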

Update calculate_rdata.R and report.Rmd to handle the new TSV columns
somewhat. Further report changes will be needed to display the new
classification scheme more effectively.
Not yet included in the workflow.
- Sort Table 1 by reference label, then frequency (so all vectors are
  together)
- Show Figure 1 as a stacked bar chart, stacking by assigned type
- Show all subtypes in a single line plot
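
The Table 1 ordering described above (reference label first, then frequency) amounts to a two-level sort. The report itself is built in R; the Python sketch below only illustrates the ordering. The field names `ref_label` and `count` are hypothetical, and sorting frequency in descending order is an assumption.

```python
def sort_table_rows(rows):
    """Group rows by reference label, most frequent first within each label."""
    # Key: label ascending, then count descending (assumed direction).
    return sorted(rows, key=lambda r: (r["ref_label"], -r["count"]))
```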
etal added 8 commits August 27, 2024 15:31
Previously they were left out of annotation.txt and therefore dropped by
the scripts that relied on it for reference names/labels.
- Use assigned_subtype instead of map_target_overlap
- Show unresolved-dimer in read-length violin plot
- Drop redundant column "read_len" from per_read.tsv
- Improve some plot aesthetics
i.e. don't emit "ssAAV" subset BAMs for a scAAV vector; they'll always
be empty.
Rename nonmatch_stat.tsv.gz to nonmatch.tsv.gz
Also in ssAAV context:
- Classify partial in + and - strands as other-vector:snapback (for now)
- Classify other +/- orientation not otherwise accounted for as
  "unclassified"
@etal etal marked this pull request as ready for review September 11, 2024 17:10
- test: update expected flipflop.tsv row count
- compress and rename flipflop(_assignments).tsv -> tsv.gz
- update Nextflow scripts accordingly
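
A test that checks a row count against a compressed TSV might look roughly like the sketch below. This is an assumption about the general approach, not the code under test/; both helper names are hypothetical, and only the standard-library `gzip` and `shutil` modules are used.

```python
import gzip
import shutil


def compress_tsv(src, dest):
    """Write a gzip-compressed copy of src to dest (e.g. x.tsv -> x.tsv.gz)."""
    with open(src, "rb") as fin, gzip.open(dest, "wb") as fout:
        shutil.copyfileobj(fin, fout)


def count_data_rows(path):
    """Count rows in a gzipped TSV, excluding the header line."""
    with gzip.open(path, "rt") as handle:
        next(handle, None)  # skip the header, if any
        return sum(1 for _ in handle)
```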
@etal etal force-pushed the schema3-tsv branch 2 times, most recently from a3de2ae to cfad346 on September 19, 2024 22:36
"helper_name": "pHelper",
"lambda_name": "Lambda",
"target_gap_threshold": 200,
"max_allowed_outside_vector": 10,
Contributor

Do you compare the output against expected results with max_allowed_outside_vector set to 10?

Contributor Author

No, it's just to exercise the code path and surface any potential crashes. The GHA-automated CI suite under test/ runs the script directly to check specific values and doesn't use the Nextflow engine. I'm aware nf-core has a Nextflow-native framework for testing Nextflow pipelines, but it looks complicated.

"lambda_name": "Lambda",
"target_gap_threshold": 200,
"max_allowed_outside_vector": 10,
"max_allowed_missing_flanking": 70,
Contributor

Do you compare the output against expected results with max_allowed_missing_flanking set to 70?

Contributor Author

No, it's just to exercise the code path and surface any potential crashes.


import pysam

SC_MAX_THRESHOLD = 0.1
Contributor

If we have an ss vector with a lot of snapbacks, we are going to guess an scAAV vector.
We can simplify this and use the vector length (i.e., 2.4KB < sc).

Contributor
@alphabdiallo alphabdiallo left a comment

Let's discuss how we are guessing the vector type and align before merging.

etal added 2 commits October 14, 2024 11:12
- Update laava.dockerfile and add __init__ to make src/*.py importable
- Rename the first heuristic to guess_vector_type_mapping.py
- Update Nextflow and make_report.sh to use the new heuristic
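
For context, a length-based guess along the lines suggested in the review comment above could be as small as the sketch below. This is not the heuristic added in these commits: the function name, the 2400 bp cutoff, and the direction of the comparison are all assumptions, since the review comment mentions 2.4KB without spelling out the exact rule.

```python
# Hypothetical cutoff in base pairs, based on the 2.4KB figure in the review.
SC_MAX_LENGTH = 2400


def guess_vector_type_by_length(vector_length, cutoff=SC_MAX_LENGTH):
    """Guess 'sc' for vectors at or below the cutoff, otherwise 'ss'."""
    # Assumption: shorter constructs are treated as self-complementary.
    return "sc" if vector_length <= cutoff else "ss"
```

Unlike a mapping-based guess, this depends only on the annotated vector length, so a ssAAV sample with many snapback reads would not change the result.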
Contributor
@alphabdiallo alphabdiallo left a comment

LGTM
