vcf_consensus is a CLI application written in Python that generates consensus DNA sequences for individual samples based on variations in a VCF file and a reference genome in FASTA format.
Supports both uncompressed and compressed files (.vcf.gz, .fasta.gz).
pip install uv
(in case you don't have it installed)
git clone https://github.com/yourusername/vcf_consensus.git
cd vcf_consensus
python -m venv .venv
source .venv/bin/activate # Linux/Mac
# .venv\Scripts\activate # Windows
uv pip install -e .
python -m vcf_consensus.cli \
--vcf input.vcf \
--fasta reference.fasta \
--length 500 \
--count 10000 \
--threshold 0.5 \
--output output_consensus.fasta
--vcf Path to the VCF file (supports .vcf.gz).
--fasta Path to the FASTA reference genome (supports .fasta.gz).
--length Length of each consensus sequence.
--count Number of consensus sequences to generate.
--threshold Probability of using an alternative allele when present in the VCF (default: 0.0).
A value of 0.5 means that each heterozygous position uses an alternative allele with probability 0.5, so about 50% of them on average.
--output Output FASTA file for consensus sequences.
--seed Random seed for reproducibility (default: None).
--chrom-map (Optional) Manual chromosome name mapping if VCF and FASTA names differ (e.g., "1=chr1,2=chr2").
Use this only if chromosomes in the VCF do not match those in the FASTA.
--mode Method for selecting consensus start positions (default: random).
Options:
- "random": Randomly selects start positions.
- "sequential": Selects positions sequentially with 50% overlap.
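To make --mode and --threshold more concrete, here is a minimal sketch of how start positions and allele choice could work. It is not the code used by vcf_consensus: the function names start_positions and use_alt_allele are hypothetical, and it assumes 0-based positions, a step of half the window length for "sequential" mode, and a fresh random draw per variant.

```python
import random


def start_positions(seq_len, length, count, mode="random", rng=None):
    """Return up to `count` 0-based window starts that fit inside the sequence.

    Hypothetical helper, written only to illustrate the two --mode options.
    """
    if length <= 0 or length > seq_len:
        raise ValueError("length must be between 1 and the sequence length")
    rng = rng or random.Random()
    last_start = seq_len - length  # last start that still fits a full window
    if mode == "random":
        return [rng.randint(0, last_start) for _ in range(count)]
    if mode == "sequential":
        # Assumption: 50% overlap means each window starts half a window later.
        step = max(1, length // 2)
        return list(range(0, last_start + 1, step))[:count]
    raise ValueError(f"unknown mode: {mode}")


def use_alt_allele(threshold, rng):
    """Return True with probability `threshold` (0.0 never, 1.0 always)."""
    return rng.random() < threshold


rng = random.Random(42)  # plays the role of --seed
print(start_positions(2000, 500, 10, mode="sequential"))
print(use_alt_allele(0.5, rng))
```

With a fixed seed the random draws are repeatable, which is what --seed is for.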
python -m vcf_consensus.cli \
--vcf example.vcf \
--fasta example.fasta \
--length 500 \
--count 10000 \
--threshold 0.5 \
--output consensus.fasta
python -m vcf_consensus.cli \
--vcf example.vcf.gz \
--fasta example.fasta.gz \
--length 500 \
--count 10000 \
--threshold 0.5 \
--output consensus.fasta
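Plain and gzip-compressed inputs can be read through the same code path. The sketch below shows one common way to do this with the standard library. The helper name open_text is hypothetical, and choosing by the .gz suffix is an assumption, not necessarily how vcf_consensus detects compression.

```python
import gzip
from pathlib import Path


def open_text(path):
    """Open a plain or gzip-compressed file in text mode, based on its suffix."""
    path = Path(path)
    if path.suffix == ".gz":
        # "rt" makes gzip decode bytes to str, like a regular text file.
        return gzip.open(path, "rt")
    return open(path, "r")
```

Callers can then iterate over lines the same way for example.vcf and example.vcf.gz.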
python -m vcf_consensus.cli \
--vcf example.vcf \
--fasta example.fasta \
--length 500 \
--count 10000 \
--threshold 0.5 \
--output consensus.fasta \
--chrom-map "1=chr1,2=chr2"
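The --chrom-map value is a comma-separated list of name pairs. A minimal sketch of parsing it into a dictionary is shown below. The function name parse_chrom_map is hypothetical, and the direction (VCF name on the left, FASTA name on the right) is an assumption based on the example above.

```python
def parse_chrom_map(spec):
    """Parse a string like "1=chr1,2=chr2" into {"1": "chr1", "2": "chr2"}.

    Assumption: the left side is the VCF name, the right side the FASTA name.
    """
    mapping = {}
    if not spec:
        return mapping
    for pair in spec.split(","):
        pair = pair.strip()
        if not pair:
            continue  # tolerate a trailing comma
        vcf_name, sep, fasta_name = pair.partition("=")
        vcf_name, fasta_name = vcf_name.strip(), fasta_name.strip()
        if not sep or not vcf_name or not fasta_name:
            raise ValueError(f"invalid mapping entry: {pair!r}")
        mapping[vcf_name] = fasta_name
    return mapping


print(parse_chrom_map("1=chr1,2=chr2"))
```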
- FASTA files are loaded fully into memory, which may cause problems if the FASTA file is larger than 3 GB.
- Large VCF files (>10 million variants) may cause high memory usage as well.
- If you experience slow performance, reduce --count or increase --length to reduce the number of consensus sequences.
- The problem with mismatched CHROM names (VCF) and > header names (FASTA) still remains. (Wait for the upcoming fixes.)
License
This project is distributed under the MIT License.