This pipeline downloads genomes 📥, annotates them with Prokka 📝, builds a pan-genome with Panaroo 🐧, and runs GWAS with Pyseer 🚀 to identify genetic variants linked to your phenotypes! 🎉

Xinming9606/GWAS

🧬 Genome Annotation for GWAS: Simple Guide

This little guide will walk you through how to download your genomes 🗂️ and annotate them 📝 to prepare for a GWAS study, step by step! I wrote this so that even if you're not a hardcore bioinformatician, you can follow along easily. 🚀


1️⃣ Download your genomes from NCBI 📥

👉 First, prepare a list of the accession numbers of your genomes (e.g. GCF_XXX...).

You can use this handy tool: ncbi-genome-download

💻 If you are using the IBL server, you don't need to install it; just run micromamba activate ncbi

To download your genomes:

ncbi-genome-download -F 'cds-fasta' -A <your_accession_list.txt> --flat-output -o ./your_output_folder -p 4 bacteria

📌 Key options:

-F 'cds-fasta': download only the coding sequences (CDS)
-A <your list>: file listing the genome accession numbers
--flat-output: all files go into one folder (easier!)
-o: where to save the downloaded files
-p 4: run 4 downloads in parallel (a higher number = faster)
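
A malformed line in the accession list is easy to miss until the download comes back short. The sketch below is one way to check the list first. It assumes one accession per line in the usual GCF_/GCA_ format (prefix, nine digits, a dot and a version number). check_accessions is a made-up helper name, not part of ncbi-genome-download.

```shell
# Sketch: report lines in an accession list that do not look like
# assembly accessions (GCF_/GCA_ + 9 digits + .version).
# check_accessions is a hypothetical helper, not an ncbi-genome-download feature.
check_accessions() {
    # -v: print non-matching lines, -n: prefix them with line numbers,
    # -E: extended regular expressions. Empty lines are reported too.
    grep -vnE '^GC[AF]_[0-9]{9}\.[0-9]+$' "$1"
}

# Example use (the file name is a placeholder):
if [ -f your_accession_list.txt ]; then
    check_accessions your_accession_list.txt || echo "all accessions look fine"
fi
```

Any line that is printed (with its line number) should be fixed before you start the download. Windows line endings will also show up here, because the trailing carriage return breaks the match.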

2️⃣ Annotate the genomes using Prokka ✏️

If you are using the IBL server, you do not need to install Prokka: run micromamba activate prokka to activate the environment. Otherwise, install Prokka in your own environment.

Before annotation: prepare the files 📂

gzip -d *.gz
# decompress all cds.fasta files in your folder
mkdir -p /path/to/home-dir/Bacillus/backup
cp /path/to/home-dir/Bacillus/*.fna /path/to/home-dir/Bacillus/backup/
# before changing the sequence names, make a copy of the original files in case you need them later
rename -v 's/_cds_from_genomic//' *.fna
rename -v 's/ASM//' *.fna
rename -v 's/GCF_0/GCF_/' *.fna
# rename all the .fna files to avoid errors in Prokka
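
The rename commands above use Perl-style expressions. On some systems the rename command is the util-linux version, which takes different arguments, so here is a hedged alternative that applies the same three substitutions with sed and prints a dry run first. clean_name is a hypothetical helper, and the file name in the note below only follows the usual NCBI naming pattern.

```shell
# Sketch: apply the same three substitutions as the rename commands, using sed.
# clean_name is a hypothetical helper name.
clean_name() {
    printf '%s\n' "$1" | sed -e 's/_cds_from_genomic//' -e 's/ASM//' -e 's/GCF_0/GCF_/'
}

# Dry run: print old and new names without touching any file.
for f in *.fna; do
    [ -e "$f" ] || continue                 # nothing to do if the glob matched no file
    echo "$f -> $(clean_name "$f")"
    # mv -v -- "$f" "$(clean_name "$f")"    # uncomment to rename for real
done
```

For example, clean_name GCF_001723585.1_ASM172358v1_cds_from_genomic.fna gives GCF_01723585.1_172358v1.fna. Note that the third substitution removes one leading zero after GCF_, so check the dry run against the names you expect.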

Run Prokka 🚀

👉 To annotate one genome:

prokka --kingdom Bacteria --outdir /path/to/home-dir/Bacillus/trial/GCF_001723585 \
    --genus Bacillus --locustag GCF_001723585 \
    /path/to/home-dir/Bacillus/GCF_001723585.1.fna \
    --compliant --centre XXX --force

👉 To annotate all genomes in a folder (bulk annotation):

for file in *.fna; do
    tag=${file%.fna}
    prokka --kingdom Bacteria --outdir /path/to/home-dir/Bacillus/"$tag" \
        --genus Bacillus --locustag "$tag" \
        --compliant --force --centre XXX /path/to/home-dir/Bacillus/"$file"
done
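
Because the bulk loop uses --force, rerunning it annotates every genome again, including the ones that already finished. If a long run gets interrupted, a small check like the one below can tell you which genomes still need work. This is only a sketch: has_gff is a hypothetical helper, and it assumes that a folder containing any .gff file counts as done.

```shell
# Sketch: decide whether a genome still needs annotating.
# has_gff is a hypothetical helper; it succeeds if the folder holds a .gff file.
has_gff() {
    for g in "$1"/*.gff; do
        [ -e "$g" ] && return 0
    done
    return 1
}

out_root=/path/to/home-dir/Bacillus      # same placeholder path as in the guide
for file in *.fna; do
    [ -e "$file" ] || continue           # nothing to do if the glob matched no file
    tag=${file%.fna}
    if has_gff "$out_root/$tag"; then
        echo "skip $tag (already annotated)"
    else
        echo "todo $tag"                 # run the prokka command for this genome here
    fi
done
```

The loop only prints what it would do, so it is safe to run at any time.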

⚠️ You might encounter some issues with folders and file names... 🗂️😅

Sometimes when you download genomes or run tools, the files might be placed inside folders or named in ways that can confuse Prokka or later steps.

To "tidy up" and rename the files nicely, you can use the following little script 🧹:

# If your files are inside a folder and you want to rename them:
for obj in /path/to/home-dir/GCF_000008005.1_800v1/*; do
    mv -v "$obj" "/path/to/home-dir/GCF_000008005.1_800v1/GCF_000008005.1_800v1.${obj##*.}"
done

👉 This renames the files inside the folder to the folder name plus their original extension, which is sometimes necessary for tools like Prokka!

If you have many genome folders, you can run this loop to clean them all automatically 🪄:

# For all folders starting with GCF*
for folder in /path/to/home-dir/Bacillus/GCF*; do
    [ -d "$folder" ] || continue   # skip plain files such as GCF_*.fna
    echo "${folder##*/}"           # just prints the folder name, for your info 😊
    for obj in "$folder"/*; do
        mv -v "$obj" "$folder/${folder##*/}.${obj##*.}"
    done
done
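
If the ${...##...} parts of the loop look cryptic, this tiny worked example shows what they do. The Prokka output file name here is made up for illustration.

```shell
# Worked example of the two parameter expansions used in the renaming loop.
folder=/path/to/home-dir/Bacillus/GCF_000008005.1_800v1
obj=PROKKA_01012024.gff          # hypothetical Prokka output name

base=${folder##*/}               # remove everything up to the last "/"
ext=${obj##*.}                   # remove everything up to the last "."
echo "$base.$ext"
```

This prints GCF_000008005.1_800v1.gff, which is the new name the loop gives to that file.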

πŸ—‚οΈ Prepare GFF and FFN files for Panaroo After Prokka annotation, you will need to collect all the .gff files (and optionally .ffn files) in one place to run Panaroo.

You can do this with a few simple commands:

# Make folders to store GFF and FFN files:
mkdir gff_files
mkdir ffn_files

# Copy all .gff files into gff_files/ 📂
cp ./*/*.gff gff_files

# Copy all .ffn files into ffn_files/ 📂
cp ./*/*.ffn ffn_files
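
Panaroo reads Prokka-style GFF3 files, which carry the genome sequence after a ##FASTA line. Before moving on, a quick check like this can catch files that were cut short or came from another source. It is a sketch only, and missing_fasta is a hypothetical helper name.

```shell
# Sketch: list collected GFF files that have no embedded "##FASTA" section.
# missing_fasta is a hypothetical helper name.
missing_fasta() {
    for g in "$1"/*.gff; do
        [ -e "$g" ] || continue              # folder empty or missing
        grep -q '^##FASTA' "$g" || echo "$g"
    done
}

missing_fasta gff_files
```

No output means every file in gff_files has a sequence section.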

🎉 That's it! Now you have nicely annotated genomes 🎊


3️⃣ 🐧 Run Panaroo to build your pan-genome!

After you've collected your .gff files, you can now build your pan-genome and align core genes using Panaroo 🚀:

Again, if you are using the IBL server, simply run micromamba activate panaroo; otherwise, install it in your environment.

# Create output folder
mkdir panaroo_output
# Run Panaroo 🐧
panaroo -i ./gff_files/*.gff \
     -o panaroo_output \
     -t 8 \
     --verbose \
     -a core

πŸ“ Notes:

-i β€” input GFF files
-o β€” output folder
-t 8 β€” number of threads (adjust based on your server!)
-a core β€” align core genes using MAFFT
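
Once Panaroo finishes, it is worth confirming that the files needed later are really there. The sketch below checks for gene_presence_absence.Rtab (used by Pyseer in the next step) and core_gene_alignment.aln. The second name is my assumption about what Panaroo calls the core alignment, so adjust it if your version differs. check_panaroo_outputs is a hypothetical helper.

```shell
# Sketch: confirm that the Panaroo outputs used downstream exist and are not empty.
# check_panaroo_outputs is a hypothetical helper; the alignment file name is an assumption.
check_panaroo_outputs() {
    rc=0
    for f in gene_presence_absence.Rtab core_gene_alignment.aln; do
        if [ ! -s "$1/$f" ]; then
            echo "missing or empty: $1/$f"
            rc=1
        fi
    done
    return "$rc"
}

check_panaroo_outputs panaroo_output || echo "check the Panaroo log before continuing"
```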

4️⃣ 🔍 Run Pyseer for GWAS!

🗂️ Prepare your phenotype table ./metadata/profile.tab

In this table, you tell which sample has which phenotype. For example:

Sample AZM CRO CFM CIP PEN SMX TET
Sample1 1 0 1 1 0 1 0
Sample2 0 1 1 0 1 0 1
...
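
The sample names in the phenotype table need to match the sample names in the Panaroo presence/absence file, otherwise those samples cannot be paired up in the GWAS. Here is a minimal sketch to compare the two lists. It assumes both files are tab-separated, with sample names in the first column of the phenotype table and in the header row of the .Rtab file. compare_samples is a hypothetical helper.

```shell
# Sketch: compare sample names between the phenotype table and the Rtab header.
# compare_samples is a hypothetical helper; both files are assumed tab-separated.
compare_samples() {
    tmpdir=$(mktemp -d)
    # First column of the phenotype table, skipping its header row
    tail -n +2 "$1" | cut -f1 | sort > "$tmpdir/pheno.txt"
    # Column names of the Rtab header, skipping the leading "Gene" field
    head -n 1 "$2" | tr '\t' '\n' | tail -n +2 | sort > "$tmpdir/rtab.txt"
    # comm -3 hides names found in both lists:
    # unindented = only in the phenotype table, tab-indented = only in the Rtab
    comm -3 "$tmpdir/pheno.txt" "$tmpdir/rtab.txt"
    rm -rf "$tmpdir"
}

# Example use with the paths from this guide:
if [ -f ./metadata/profile.tab ] && [ -f ./panaroo_output/gene_presence_absence.Rtab ]; then
    compare_samples ./metadata/profile.tab ./panaroo_output/gene_presence_absence.Rtab
fi
```

No output means the two lists agree.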

πŸƒ Run Pyseer

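Step 1: Build the similarity matrix from your tree. Panaroo does not build a tree itself; core_tree.treefile is assumed to have been inferred beforehand from the core gene alignment (for example with IQ-TREE, which writes a .treefile output).

mkdir -p pyseer_out
python ~/pyseer/scripts/phylogeny_distance.py --lmm ./panaroo_output/core_tree.treefile > pyseer_out/phylogeny_K.tsv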

Step 2: Run GWAS for each phenotype:

for phenotype in AZM CRO CFM CIP PEN SMX TET
do
    python ~/pyseer/pyseer-runner.py --lmm --phenotypes ./metadata/profile.tab \
        --pres ./panaroo_output/gene_presence_absence.Rtab \
        --similarity ./pyseer_out/phylogeny_K.tsv --phenotype-column "$phenotype" \
        --output-patterns "./pyseer_out/gene_patterns_${phenotype}.txt" \
        > "./pyseer_out/${phenotype}_gwas.txt"
done
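
Each result file lists one gene per row with its p-values. To pull out the strongest hits, you could filter on the lrt-pvalue column, as in the sketch below. filter_hits is a hypothetical helper, and the 1e-5 threshold is only a placeholder: Pyseer ships a scripts/count_patterns.py script that can estimate a threshold from the gene_patterns file, which is likely a better choice than a fixed number.

```shell
# Sketch: keep the header plus all rows whose lrt-pvalue is below a threshold.
# filter_hits and the threshold value are illustrative assumptions.
filter_hits() {
    awk -F '\t' -v thr="$2" '
        NR == 1 {
            # Find the lrt-pvalue column by name instead of by position
            for (i = 1; i <= NF; i++) if ($i == "lrt-pvalue") col = i
            if (!col) { print "no lrt-pvalue column found" > "/dev/stderr"; exit 1 }
            print; next
        }
        ($col + 0) < (thr + 0)
    ' "$1"
}

# Example use with one of the result files from the loop above:
if [ -f ./pyseer_out/AZM_gwas.txt ]; then
    filter_hits ./pyseer_out/AZM_gwas.txt 1e-5
fi
```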
