Add comprehensive local testing framework with golden master verifica… #106
Merged
"""End-to-end pipeline validation for duplicate handling in LAAVA. | ||
|
||
IMPORTANT: This is NOT a direct code test - it validates pipeline output files. | ||
|
||
This test suite validates the complete LAAVA pipeline behavior by: | ||
1. Examining output files generated by the pipeline (via Makefile) | ||
2. Validating that raw data correctly contains duplicates (expected behavior) | ||
3. Confirming that duplicate removal works when applied to the data | ||
4. Testing data structure integrity and consistency | ||
|
||
NOTE: The actual LAAVA code execution happens through the Makefile targets | ||
(sc-local, ss-local) which run the shell scripts and R code. These tests | ||
examine the results of that execution, but do not directly import or call | ||
LAAVA Python/R modules. | ||
|
||
For direct code testing, see unit tests that import LAAVA modules directly. | ||
""" | ||
|
||
import pandas as pd | ||
import pytest | ||
import subprocess | ||
import gzip | ||
from pathlib import Path | ||
import tempfile | ||
import shutil | ||
|
||
|
||
class TestIntegrationDuplicateRemoval: | ||
"""Integration tests that validate duplicate removal in actual LAAVA output.""" | ||
|
||
@pytest.fixture(scope="class") | ||
def build_dir(self): | ||
"""Get the build directory path.""" | ||
return Path("build") | ||
|
||
def read_nonmatch_file(self, file_path): | ||
"""Read a nonmatch.tsv.gz file and return as DataFrame.""" | ||
with gzip.open(file_path, 'rt') as f: | ||
# The file has no header separator, so we need to parse it carefully | ||
df = pd.read_csv(f, sep='\t') | ||
return df | ||
|
||
def test_sc_nonmatch_raw_data_integrity(self, build_dir): | ||
"""Test that sc.nonmatch.tsv.gz contains expected raw data (including duplicates).""" | ||
nonmatch_file = build_dir / "sc.nonmatch.tsv.gz" | ||
assert nonmatch_file.exists(), f"Expected nonmatch file {nonmatch_file} to exist" | ||
|
||
df = self.read_nonmatch_file(nonmatch_file) | ||
|
||
# Check for duplicates - raw data should contain duplicates | ||
duplicate_count = df.duplicated().sum() | ||
total_rows = len(df) | ||
|
||
print(f"SC nonmatch file: {total_rows} total rows, {duplicate_count} duplicates") | ||
|
||
# The raw data should contain duplicates (this is expected) | ||
assert duplicate_count > 0, f"Expected raw data to contain duplicates, but found {duplicate_count}" | ||
|
||
# Additional validation: ensure we have reasonable data | ||
assert total_rows > 0, "Nonmatch file should contain some data" | ||
|
||
# Validate expected columns exist | ||
expected_columns = ['read_id', 'pos0', 'type', 'type_len'] | ||
for col in expected_columns: | ||
assert col in df.columns, f"Expected column '{col}' not found in nonmatch data" | ||
|
||
def test_ss_nonmatch_raw_data_integrity(self, build_dir): | ||
"""Test that ss.nonmatch.tsv.gz contains expected raw data (including duplicates).""" | ||
nonmatch_file = build_dir / "ss.nonmatch.tsv.gz" | ||
assert nonmatch_file.exists(), f"Expected nonmatch file {nonmatch_file} to exist" | ||
|
||
df = self.read_nonmatch_file(nonmatch_file) | ||
|
||
# Check for duplicates - raw data should contain duplicates | ||
duplicate_count = df.duplicated().sum() | ||
total_rows = len(df) | ||
|
||
print(f"SS nonmatch file: {total_rows} total rows, {duplicate_count} duplicates") | ||
|
||
# The raw data should contain duplicates (this is expected) | ||
assert duplicate_count > 0, f"Expected raw data to contain duplicates, but found {duplicate_count}" | ||
|
||
# Additional validation | ||
assert total_rows > 0, "Nonmatch file should contain some data" | ||
|
||
# Validate expected columns exist | ||
expected_columns = ['read_id', 'pos0', 'type', 'type_len'] | ||
for col in expected_columns: | ||
assert col in df.columns, f"Expected column '{col}' not found in nonmatch data" | ||
|
||
def test_duplicate_removal_functionality(self, build_dir): | ||
"""Test that duplicate removal works correctly when applied to raw data.""" | ||
for sample_type in ['sc', 'ss']: | ||
nonmatch_file = build_dir / f"{sample_type}.nonmatch.tsv.gz" | ||
assert nonmatch_file.exists(), f"Expected nonmatch file {nonmatch_file} to exist" | ||
|
||
df = self.read_nonmatch_file(nonmatch_file) | ||
|
||
# Test that duplicate removal works (simulating R's distinct() function) | ||
original_count = len(df) | ||
duplicate_count = df.duplicated().sum() | ||
deduplicated_df = df.drop_duplicates() | ||
final_count = len(deduplicated_df) | ||
|
||
print(f"{sample_type.upper()}: {original_count} -> {final_count} rows ({duplicate_count} duplicates removed)") | ||
|
||
# Verify that deduplication works as expected | ||
assert final_count == original_count - duplicate_count, f"Deduplication math doesn't add up for {sample_type}" | ||
assert not deduplicated_df.duplicated().any(), f"Deduplicated data should have no duplicates for {sample_type}" | ||
|
||
def test_nonmatch_data_structure_integrity(self, build_dir): | ||
"""Test that the nonmatch data has the expected structure and reasonable values.""" | ||
for sample_type in ['sc', 'ss']: | ||
nonmatch_file = build_dir / f"{sample_type}.nonmatch.tsv.gz" | ||
df = self.read_nonmatch_file(nonmatch_file) | ||
|
||
# Test data types and ranges | ||
assert df['pos0'].dtype in ['int64', 'int32'], f"pos0 should be integer type in {sample_type}" | ||
assert df['type_len'].dtype in ['int64', 'int32'], f"type_len should be integer type in {sample_type}" | ||
|
||
# Test that positions are reasonable (positive) | ||
assert (df['pos0'] >= 0).all(), f"All positions should be non-negative in {sample_type}" | ||
assert (df['type_len'] > 0).all(), f"All type_len should be positive in {sample_type}" | ||
|
||
# Test that mutation types are expected values | ||
valid_types = {'D', 'I', 'X', 'N'} # deletion, insertion, mismatch, gaps | ||
actual_types = set(df['type'].unique()) | ||
unexpected_types = actual_types - valid_types | ||
assert len(unexpected_types) == 0, f"Unexpected mutation types in {sample_type}: {unexpected_types}" | ||
|
||
print(f"{sample_type.upper()} data structure validation passed: {len(df)} rows, types: {actual_types}") | ||
|
||
|
||
def test_nonmatch_consistency_across_runs(self, build_dir): | ||
"""Test that nonmatch files are deterministic (same input produces same output).""" | ||
# This test verifies that running the pipeline multiple times | ||
# produces identical nonmatch files (including the same duplicates) | ||
|
||
for sample_type in ['sc', 'ss']: | ||
nonmatch_file = build_dir / f"{sample_type}.nonmatch.tsv.gz" | ||
df = self.read_nonmatch_file(nonmatch_file) | ||
|
||
# Sort the dataframe to ensure consistent ordering | ||
df_sorted = df.sort_values(['read_id', 'pos0', 'type', 'type_len']).reset_index(drop=True) | ||
|
||
# The raw data should be deterministic (same duplicates each time) | ||
# We're not testing for no duplicates here - that's tested elsewhere | ||
duplicate_count = df_sorted.duplicated().sum() | ||
total_count = len(df_sorted) | ||
|
||
print(f"{sample_type.upper()} deterministic data: {total_count} rows, {duplicate_count} duplicates") | ||
|
||
# Store a hash of the data for potential future comparison | ||
data_hash = pd.util.hash_pandas_object(df_sorted).sum() | ||
print(f"{sample_type.upper()} nonmatch data hash: {data_hash}") | ||
|
||
# Basic sanity check - should have some data | ||
assert total_count > 0, f"Should have some nonmatch data for {sample_type}" | ||
|
||
def test_vector_read_filtering_integration(self, build_dir): | ||
"""Test that most nonmatch data comes from vector reads (integration with per_read data).""" | ||
for sample_type in ['sc', 'ss']: | ||
# Read both nonmatch and per_read files | ||
nonmatch_file = build_dir / f"{sample_type}.nonmatch.tsv.gz" | ||
per_read_file = build_dir / f"{sample_type}.per_read.tsv.gz" | ||
|
||
nonmatch_df = self.read_nonmatch_file(nonmatch_file) | ||
|
||
with gzip.open(per_read_file, 'rt') as f: | ||
per_read_df = pd.read_csv(f, sep='\t') | ||
|
||
# Get vector read IDs | ||
vector_reads = set(per_read_df[per_read_df['reference_label'] == 'vector']['read_id']) | ||
|
||
# Check that most nonmatch reads are vector reads | ||
nonmatch_reads = set(nonmatch_df['read_id'].unique()) | ||
non_vector_reads = nonmatch_reads - vector_reads | ||
vector_overlap = len(nonmatch_reads & vector_reads) | ||
|
||
print(f"{sample_type.upper()}: {len(nonmatch_reads)} unique reads in nonmatch, {len(vector_reads)} vector reads") | ||
print(f"{sample_type.upper()}: {vector_overlap} vector reads in nonmatch, {len(non_vector_reads)} non-vector reads") | ||
|
||
# Most reads should be vector reads, but allow for some edge cases (chimeric reads, etc.) | ||
vector_percentage = vector_overlap / len(nonmatch_reads) * 100 | ||
assert vector_percentage > 90, f"Expected >90% vector reads in {sample_type} nonmatch data, got {vector_percentage:.1f}%" | ||
|
||
|
||
if __name__ == "__main__": | ||
pytest.main([__file__, "-v"]) |
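The consistency test above prints a data hash "for potential future comparison" but does not yet compare it against a stored reference. As a minimal sketch of how that golden-master step could work, the helpers below record the hash on the first run and check against it on later runs; the function names and the golden-file location are hypothetical, not part of this PR.

# Hypothetical sketch of a golden-master hash check building on the test above.
# The helper names and golden-file location are illustrative, not from this PR.
import gzip
from pathlib import Path

import pandas as pd


def nonmatch_data_hash(nonmatch_path: Path) -> int:
    """Compute an order-independent hash of a nonmatch table."""
    with gzip.open(nonmatch_path, "rt") as f:
        df = pd.read_csv(f, sep="\t")
    # Sort so that row order does not affect the hash
    df = df.sort_values(["read_id", "pos0", "type", "type_len"]).reset_index(drop=True)
    return int(pd.util.hash_pandas_object(df).sum())


def check_golden_master(nonmatch_path: Path, golden_file: Path) -> bool:
    """Record the hash on the first run; compare against it on later runs."""
    current = nonmatch_data_hash(nonmatch_path)
    if not golden_file.exists():
        golden_file.write_text(str(current))  # first run records the golden value
        return True
    return current == int(golden_file.read_text().strip())

If such hashes were committed alongside the test data, a mismatch would flag nondeterminism or an unintended change in pipeline output.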
@mcrocker-bioborg Can you make sure these tests all get run in the GitHub Actions CI/CD?
laava/.github/workflows/tests.yaml, line 67 in f2c4cab