I have some technical questions that may become irrelevant if my entire approach is a mistake.
Project: Model ancestry of H. pylori in a population
Data: 100 samples of H. pylori, Illumina whole genome sequencing paired-end reads
Process:
- Data cleaning
- bowtie2 alignment to .bam
- GATK more cleaning, validation, merging lanes
- GATK HaplotypeCaller variant calling to .VCF
- GATK CombineGVCFs and GenotypeGVCFs to merge all samples into 1 .VCF
- Convert to plink .BED
- Input into ADMIXTURE to generate something like
So far, this is working as I intended.
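To make the steps above concrete, here is a rough sketch of the commands as I understand them, written as Python functions that only build the argument lists (nothing is executed). The file names, the `ref_index` bowtie2 index name, and K=3 for ADMIXTURE are placeholders I made up for illustration, and the cleaning, read-group, and indexing steps are left out.

```python
from typing import List


def per_sample_commands(sample: str, ref: str = "ref.fasta") -> List[List[str]]:
    """Build the per-sample commands (hypothetical file names, not executed)."""
    return [
        # Align paired-end reads; bowtie2 writes SAM to the file given by -S
        ["bowtie2", "-x", "ref_index",
         "-1", f"{sample}_R1.fastq.gz", "-2", f"{sample}_R2.fastq.gz",
         "-S", f"{sample}.sam"],
        # Coordinate-sort and convert to BAM
        ["samtools", "sort", "-o", f"{sample}.bam", f"{sample}.sam"],
        # Call variants per sample in GVCF mode
        ["gatk", "HaplotypeCaller", "-R", ref, "-I", f"{sample}.bam",
         "-O", f"{sample}.g.vcf.gz", "-ERC", "GVCF"],
    ]


def joint_commands(samples: List[str], ref: str = "ref.fasta") -> List[List[str]]:
    """Build the cohort-level commands that merge all samples into one VCF."""
    combine = ["gatk", "CombineGVCFs", "-R", ref]
    for s in samples:
        combine += ["-V", f"{s}.g.vcf.gz"]
    combine += ["-O", "cohort.g.vcf.gz"]
    return [
        combine,
        ["gatk", "GenotypeGVCFs", "-R", ref,
         "-V", "cohort.g.vcf.gz", "-O", "cohort.vcf.gz"],
        # --allow-extra-chr lets plink 1.9 accept a non-human contig name
        ["plink", "--vcf", "cohort.vcf.gz", "--allow-extra-chr",
         "--make-bed", "--out", "cohort"],
        # K=3 is an arbitrary placeholder
        ["admixture", "cohort.bed", "3"],
    ]


if __name__ == "__main__":
    for cmd in per_sample_commands("Sample01") + joint_commands(["Sample01", "Sample71"]):
        print(" ".join(cmd))
```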
Now I want to incorporate geographic reference samples from PubMLST to figure out where the ancestries in my samples actually come from.
Data: 3000 samples of H. pylori from individuals around the world.
This is a single multi-sample .fasta file, with each sequence being a concatenation of 7 housekeeping genes in the H. pylori genome.
e.g.,
>1|New_Zealand
AATGAGTTTAGCCTA......
>517|South_Africa
AATGAGTTCAGTCTC......
>2215|Colombia
ATGAGTTCAGTCTC......
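Since the population label lives in the header, here is a minimal sketch of how I read this file and split each `id|Country` header into an isolate ID and a country. The function names are my own, and the toy sequences in the example are placeholders, not real PubMLST data.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_header(header: str) -> Tuple[str, str]:
    """Split 'id|Country_Name' into ('id', 'Country Name')."""
    isolate_id, _, country = header.partition("|")
    return isolate_id, country.replace("_", " ")


if __name__ == "__main__":
    toy = [">1|New_Zealand", "ACGT", "ACGT", ">517|South_Africa", "ACGA"]
    records = list(read_fasta(toy))
    # Count how many isolates come from each country
    print(Counter(split_header(h)[1] for h, _ in records))
```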
The GOAL here is to have it looking like this:
I've tried a couple of different things.
I can get the intervals for these genes from the RefSeq genome, concatenate them, and add them to this .fasta.
>Helicobacter pylori
CTACTCGCTATAAGT......
>1|New_Zealand
AATGAGTTTAGCCTA......
>517|South_Africa
AATGAGTTCAGTCTC......
Then I align them with MAFFT.
>Helicobacter pylori
gtattttgcttccaagaaagggtgcagttgctcttcaaaatccacgacttttttcacgct....
>2777|Spain
gcattatggacagaaaatcgg-------------------tgcatgagcctttgcaaaca.....
>2982|China
gcattatggacagaaaatctg-------------------tgcatgagcctttgcaaact.....
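For reference, the step of pulling the gene intervals out of the RefSeq genome and concatenating them looks roughly like the sketch below. The 1-based inclusive coordinates and the `+`/`-` strand field are assumptions about how the intervals are stored. I include strand handling because a gene annotated on the reverse strand would need to be reverse-complemented to match the PubMLST orientation, though I have not confirmed that this explains the gaps in the alignment above.

```python
from typing import List, Tuple

# Translation table for complementing DNA, preserving case
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def revcomp(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def concat_intervals(genome: str, intervals: List[Tuple[int, int, str]]) -> str:
    """Concatenate gene intervals given as (start, end, strand).

    Coordinates are assumed to be 1-based and inclusive; strand is '+' or '-'.
    Genes are joined in the order given, which should match the order used
    in the reference multi-FASTA.
    """
    parts = []
    for start, end, strand in intervals:
        piece = genome[start - 1:end]
        parts.append(revcomp(piece) if strand == "-" else piece)
    return "".join(parts)


if __name__ == "__main__":
    toy_genome = "AACCGGTTAC"  # placeholder, not real sequence
    print(concat_intervals(toy_genome, [(1, 4, "+"), (7, 10, "-")]))
```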
I can also extract a consensus sequence for each of my samples from its VCF, add these to the multi-fasta, and align them as well.
>Helicobacter pylori
cacgctatctaaaaagcccttagccccagcataaataatgaccacttgcttttcaatggg
>Sample01
cacgctatctaaaaagcccttagccccagcataaataatgaccacttgcttttcaatggg
>Sample71
cacgctatctaaaaagcccttagccccagcataaataatgaccacttgcttttcaatggg
>1904|Japan
ggtatt------aaagccattgatgc------------------gttggtgcctattggg
>1797|Sudan
ggcatt------aaagccattgatgc------------------gttggtgcctattggg
The main question here is: how do I convert these multiple-alignment FASTA files into a BAM, a VCF, a BED, or some STRUCTURE-compatible file?
I've tried snp-sites (https://sanger-pathogens.github.io/snp-sites/) but it does not convert correctly.
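To be concrete about the conversion I am after, here is a minimal sketch of what I imagine: walk the alignment column by column, treat the reference row as REF, and write one haploid VCF line per variable site. This is not how snp-sites works internally; it is only an illustration. The contig name `concat` is made up, positions are counted along the concatenated reference (not genome coordinates), columns where the reference has a gap are skipped, and gaps or ambiguous bases in a sample become missing genotypes.

```python
from typing import Dict


def alignment_to_vcf(records: Dict[str, str], ref_name: str, chrom: str = "concat") -> str:
    """Turn an aligned multi-FASTA (name -> aligned sequence) into minimal VCF text."""
    if len({len(seq) for seq in records.values()}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    ref = records[ref_name].upper()
    samples = [name for name in records if name != ref_name]
    lines = [
        "##fileformat=VCFv4.2",
        f"##contig=<ID={chrom}>",
        '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">',
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\t" + "\t".join(samples),
    ]
    ref_pos = 0
    for col, ref_base in enumerate(ref):
        if ref_base == "-":
            continue  # insertion relative to the reference: skipped in this sketch
        ref_pos += 1
        if ref_base not in "ACGT":
            continue  # ambiguous reference base
        calls = [records[s][col].upper() for s in samples]
        alts = []
        for base in calls:
            if base in "ACGT" and base != ref_base and base not in alts:
                alts.append(base)
        if not alts:
            continue  # invariant site
        genotypes = []
        for base in calls:
            if base == ref_base:
                genotypes.append("0")
            elif base in alts:
                genotypes.append(str(alts.index(base) + 1))
            else:
                genotypes.append(".")  # gap or ambiguous base: missing
        lines.append("\t".join(
            [chrom, str(ref_pos), ".", ref_base, ",".join(alts), ".", ".", ".", "GT"]
            + genotypes))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    toy = {"ref": "ACG-T", "s1": "ACGAT", "s2": "AT--T"}  # placeholder alignment
    print(alignment_to_vcf(toy, "ref"), end="")
```

To merge the result with my whole-genome VCF, the positions would still have to be mapped back to genome coordinates on the same reference, which this sketch does not do.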