Restructure output TSVs #53
Take the stem of each filename in the selected folder as the sample unique ID and sample display name for the input metadata TSV. Update help text for metadata input fields: sample_unique_id must be a substring of the input BAM/FASTQ filename; for folder input, these must be unique substrings so that IDs map to files 1:1.
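The substring rule in the help text above can be sketched as follows. This is only an illustration: the helper names `default_sample_id` and `map_ids_to_files` are hypothetical and not taken from the repository, and the real workflow may resolve IDs differently.

```python
from pathlib import Path


def default_sample_id(filename: str) -> str:
    """Use the filename stem as the default sample ID.

    Note: Path.stem strips only the last suffix, so "x.fastq.gz" -> "x.fastq".
    """
    return Path(filename).stem


def map_ids_to_files(sample_ids, filenames):
    """Map each sample ID to the one filename that contains it as a substring.

    Raises ValueError when an ID matches zero or several files, or when two
    IDs resolve to the same file, i.e. when the mapping is not 1:1.
    """
    mapping = {}
    for sample_id in sample_ids:
        hits = [f for f in filenames if sample_id in Path(f).name]
        if len(hits) != 1:
            raise ValueError(f"{sample_id!r} matches {len(hits)} files, expected 1")
        mapping[sample_id] = hits[0]
    if len(set(mapping.values())) != len(mapping):
        raise ValueError("two sample IDs map to the same file")
    return mapping
```

An ID such as `sample` would be rejected for a folder holding `sample1.bam` and `sample2.bam`, because it is a substring of both names.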
2c59d31 to ca9964a
Add an input metadata TSV for testing folder workflow input
TSV schema changes from the original alignments.tsv:
- Rename "SampleID" to "sample_id"
- Don't add columns "start" and "end"; they're identical to the existing "start0" and "end1"
alignments.tsv changes:
- Rename column 'map_type' to 'map_ref_label'
- Rename column 'map_subtype' to 'map_target_overlap'
per_read.tsv changes:
- Add columns 'read_ref_label' and 'read_target_overlap'
- Repurpose columns 'assigned_type' and 'assigned_subtype' to use the more detailed new nomenclature, which depends on the given `vector_type`
Update calculate_rdata.R and report.Rmd to handle the new TSV columns somewhat. Further report changes will be needed to display the new classification scheme more effectively.
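For anyone migrating an older alignments.tsv to the renamed columns, a minimal sketch using only the standard library could look like this. The function name `rename_tsv_columns` is hypothetical; only the two old-to-new column pairs come from the commit message.

```python
import csv
import io

# Old -> new column names, as listed in the commit message for alignments.tsv.
ALIGNMENT_RENAMES = {
    "map_type": "map_ref_label",
    "map_subtype": "map_target_overlap",
}


def rename_tsv_columns(text: str, renames: dict) -> str:
    """Return TSV text with header names replaced per `renames`; rows are unchanged."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    header = next(reader)
    # Columns not named in `renames` keep their original header.
    writer.writerow([renames.get(col, col) for col in header])
    writer.writerows(reader)
    return out.getvalue()
```

Columns that are not listed in the mapping pass through untouched, so the same helper would work for a file that has already been migrated.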
Not yet included in the workflow.
- Sort Table 1 by reference label, then frequency (so all vectors are together)
- Show Figure 1 as a stacked bar chart, stacking by assigned type
- Show all subtypes in a single line plot
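The Table 1 ordering can be illustrated with a two-level sort key. This sketch assumes rows are dicts with hypothetical keys `ref_label` and `count`, and that "frequency" means most frequent first within each label; the report itself is built in R, so this only shows the ordering logic.

```python
def sort_table_rows(rows):
    """Sort by reference label (ascending), then by count (descending)."""
    return sorted(rows, key=lambda row: (row["ref_label"], -row["count"]))
```

Sorting on the label first keeps every row for the same reference adjacent, which is what groups all vector rows together.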
Previously they were left out of annotation.txt and therefore dropped by the scripts that relied on it for reference names/labels.
- Use assigned_subtype instead of map_target_overlap
- Show unresolved-dimer in read-length violin plot
- Drop redundant column "read_len" from per_read.tsv
- Improve some plot aesthetics
i.e. don't emit "ssAAV" subset BAMs for a scAAV vector; they'll always be empty.
Rename nonmatch_stat.tsv.gz to nonmatch.tsv.gz
Also in ssAAV context:
- Classify partial in + and - strands as other-vector:snapback (for now)
- Classify other +/- orientation not otherwise accounted for as "unclassified"
- test: update expected flipflop.tsv row count
- compress and rename flipflop(_assignments).tsv -> tsv.gz
- update Nextflow scripts accordingly
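A rough sketch of the compression step, assuming the output is a plain TSV on disk that should be replaced by a gzipped copy with `.gz` appended. The function name `compress_tsv` is hypothetical, and the actual scripts may instead write gzip directly.

```python
import gzip
import shutil
from pathlib import Path


def compress_tsv(src: Path) -> Path:
    """Gzip `src` to `<name>.gz` alongside it, remove the original, return the new path."""
    dest = src.with_name(src.name + ".gz")
    with open(src, "rb") as fin, gzip.open(dest, "wb") as fout:
        shutil.copyfileobj(fin, fout)  # stream in chunks; never loads the whole file
    src.unlink()
    return dest
```

Downstream readers can open the result with `gzip.open(path, "rt")` and parse it as ordinary tab-separated text.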
a3de2ae to cfad346
- Roll calculate_rdata.R into create_report.R and report.Rmd
- Update Nextflow and local integration test accordingly
- test: Update expected flipflop row count; 20 reads are now chimeric
"helper_name": "pHelper",
"lambda_name": "Lambda",
"target_gap_threshold": 200,
"max_allowed_outside_vector": 10,
Do you compare the results against expected results with max_allowed_outside_vector set to 10?
No, it's just to exercise the code path and surface any potential crashes. The GHA-automated CI suite under test/ runs the script directly to check specific values and doesn't use the Nextflow engine. I'm aware nf-core has a Nextflow-native framework for testing Nextflow pipelines, but it looks complicated.
"lambda_name": "Lambda",
"target_gap_threshold": 200,
"max_allowed_outside_vector": 10,
"max_allowed_missing_flanking": 70,
Do you compare the results against expected results with max_allowed_missing_flanking set to 70?
No, it's just to exercise the code path and surface any potential crashes.
src/guess_vector_type.py
Outdated
import pysam

SC_MAX_THRESHOLD = 0.1
If we have an ss vector with a lot of snapbacks, we are going to guess an sc AAV vector.
We can simplify this and use the vector length (i.e., 2.4 kb < sc).
Let's discuss how we are guessing the vector type and align before merging.
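As a starting point for that discussion, the length-based idea could be sketched as below. Everything here is an assumption rather than the project's implementation: the review comment is read as "a vector of at most about 2.4 kb is guessed to be self-complementary", the cutoff `SC_MAX_LENGTH_BP`, the function name, and the `"sc"`/`"ss"` return labels are all hypothetical.

```python
# Assumed cutoff in base pairs, taken from the "2.4KB" in the review comment.
SC_MAX_LENGTH_BP = 2400


def guess_vector_type_by_length(vector_length_bp: int,
                                sc_max_length_bp: int = SC_MAX_LENGTH_BP) -> str:
    """Guess "sc" for vectors at or below the cutoff, otherwise "ss".

    Both the direction of the comparison and the cutoff value are assumptions.
    """
    if vector_length_bp <= 0:
        raise ValueError("vector length must be a positive number of base pairs")
    return "sc" if vector_length_bp <= sc_max_length_bp else "ss"
```

Unlike a read-mapping heuristic, a length rule does not depend on how many snapback reads a sample contains, which is the concern raised in the comment above.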
- Update laava.dockerfile and add __init__ to make src/*.py importable
- Rename the first heuristic to guess_vector_type_mapping.py
- Update Nextflow and make_report.sh to use the new heuristic
LGTM
Addresses v3.0 items related to the output TSV schema accommodating the new nomenclature: #29, #32, #34
Report changes directly following from the new schema and nomenclature: #36, #37, #38, #39, #40, #41
Also UI/UX changes in the Form Bio workflow.