This little guide will walk you through how to download your genomes ποΈ and annotate them π to prepare for a GWAS study β step by step! I wrote this so that even if you're not a hardcore bioinformatician, you can follow along easily. π
π First, prepare a list of the accession numbers of your genomes (e.g. GCF_XXX...).
You can use this handy tool: ncbi-genome-download
π» If you are using IBL server, you donβt need to install it β just run micromamba activate ncbi
To download your genomes:
ncbi-genome-download -F 'cds-fasta' -A <your_accession
843A
_list.txt> --flat-output -o ./your_output_folder -p 4 bacteria
π Key options:
-F 'cds-fasta' β download only the coding sequences (CDS)
-A <your list> β list of genome accession numbers
--flat-output β all files go into one folder (easier!)
-o β where to save the downloaded files
-p 4 β download in parallel (use more CPUs = faster)
if you are using IBL server, you do not need to install prokka, use micromamba activate prokka to activate the environment, otherwise, install prokka in your own environment
Before annotation β prepare the files π
gzip -d *.gz
#to decompress all cds.fasta files in your folder
cp -r /path/to/home-dir/Bacillus/* /path/to/home-dir/Bacillus/backup
#before change all sequence names, make a copy of originial files, in case you need it in the future
rename -v 's/_cds_from_genomic//' *.fna
rename -v 's/ASM//' *.fnamicr
rename -v 's/GCF_0/GCF_/' *.fna
#rename all the fna files to avoid error in prokka
Run Prokka π π To annotate one genome:
prokka --kingdom Bacteria --outdir /path/to/home-dir/Bacillus/trial/GCF_001723585 \
--genus Bacillus --locustag GCF_001723585 \
/path/to/home-dir/Bacillus/GCF_001723585.1.fna \
--compliant --centre XXX --force
π To annotate all genomes in a folder (bulk annotation):
for file in *.fna; do
tag=${file%.fna};
prokka --kingdom Bacteria --outdir /path/to/home-dir/Bacillus/"$tag" \
--genus Bacillus --locustag "$tag" \
--compliant --force --centre XXX /path/to/home-dir/Bacillus/"$file";
To "tidy up" and rename the files nicely, you can use the following little script π§Ή:
# If your files are inside a folder and you want to rename them:
for obj in $(ls /path/to/home-dir/GCF_000008005.1_800v1); do
mv -v /path/to/home-dir/GCF_000008005.1_800v1/$obj /path/to/home-dir/GCF_000008005.1_800v1/GCF_000008005.1_800v1.${obj##*.};
done
π This will rename the files inside the folder to a cleaner format β sometimes necessary for tools like Prokka!
If you have many genome folders, you can run this loop to clean them all automatically πͺ:
# For all folders starting with GCF*
for folder in /path/to/home-dir/Bacillus/GCF*; do
for obj in $(ls $folder); do
echo ${folder##*/} # just prints folder name, for your info π
mv -v $folder/$obj $folder/${folder##*/}.${obj##*.}
done
done
ποΈ Prepare GFF and FFN files for Panaroo After Prokka annotation, you will need to collect all the .gff files (and optionally .ffn files) in one place to run Panaroo.
You can do this with a few simple commands:
# Make folders to store GFF and FFN files:
mkdir gff_files
mkdir ffn_files
# Copy all .gff files into gff_files/ π
cp ./*/*.gff gff_files
# Copy all .ffn files into ffn_files/ π
cp ./*/*.ffn ffn_files
π Thatβs it! Now you have nicely annotated genomes π
After youβve collected your .gff files, you can now build your pan-genome and align core genes using Panaroo π:
Again if you are using IBL server, simply run micromamba activate panaroo, otherwise, install it in your environment.
# Create output folder
mkdir panaroo_output
# Run Panaroo π§
panaroo -i ./prokka_output/*/*.gff \
-o panaroo_output \
-t 8 \
--verbose \
-a core
π Notes:
-i β input GFF files
-o β output folder
-t 8 β number of threads (adjust based on your server!)
-a core β align core genes using MAFFT
ποΈ Prepare your phenotype table ./metadata/profile.tab
In this table, you tell which sample has which phenotype β for example:
Sample | AZM | CRO | CFM | CIP | PEN | SMX | TET |
---|---|---|---|---|---|---|---|
Sample1 | 1 | 0 | 1 | 1 | 0 | 1 | 0 |
Sample2 | 0 | 1 | 1 | 0 | 1 | 0 | 1 |
... |
π Run Pyseer
Step 1 β Build the similarity matrix from your tree:
python ~/pyseer/scripts/phylogeny_distance.py --lmm ./panaroo_output/core_tree.treefile > pyseer_out/phylogeny_K.tsv
Step 2 β Run GWAS for each phenotype:
for phenotype in AZM CRO CFM CIP PEN SMX TET
do
python ~/pyseer/pyseer-runner.py --lmm --phenotypes ./metadata/profile.tab --pres ./panaroo_output/gene_presence_absence.Rtab --similarity ./pyseer_out/phylogeny_K.tsv --phenotype-column $anti --output-patterns ./pyseer_out/gene_patterns_${anti}.txt > ./pyseer_out/${anti}_gwas.txt
done