[Population Genetics] Admixture modeling in STRUCTURE/ADMIXTURE with Multiple Alignment Fasta file, or how to convert multi-fasta to .bam/.vcf/plink .bed

19 months ago

YC Yang • 0

I have some technical questions that may become irrelevant if my entire approach is a mistake.

Project: Model ancestry of H. pylori in a population

Data: 100 samples of H. pylori, Illumina whole genome sequencing paired-end reads

Process:

Data cleaning
bowtie2 alignment to .bam
GATK more cleaning, validation, merging lanes
GATK HaplotypeCaller variant calling to .VCF
GATK CombineGVCFs and GenotypeGVCFs to merge all samples into 1 .VCF
Convert to plink .BED
Input into ADMIXTURE to generate something like

[Figure: ADMIXTURE output, K=3]

So far, this is working as I intended.

Now I want to incorporate geographic reference samples from PubMLST to figure out where the ancestries here actually come from.

Data: 3000 samples of H. pylori from individuals around the world.

This is a single multi-sample .fasta file, with each sequence concatenated from 7 housekeeping genes in the H. pylori genome.

e.g.,

>1|New_Zealand
AATGAGTTTAGCCTA......
>517|South_Africa
AATGAGTTCAGTCTC......
>2215|Colombia
ATGAGTTCAGTCTC......

The GOAL here is to have it looking like this:

[Figure: georefs]

I've tried a couple of different things.

I can get the intervals for these genes from the RefSeq reference, concatenate them, and add them to this .fasta.

>Helicobacter pylori
CTACTCGCTATAAGT......
>1|New_Zealand
AATGAGTTTAGCCTA......
>517|South_Africa
AATGAGTTCAGTCTC......

Then I align them with MAFFT.

>Helicobacter pylori 
gtattttgcttccaagaaagggtgcagttgctcttcaaaatccacgacttttttcacgct....
>2777|Spain
gcattatggacagaaaatcgg-------------------tgcatgagcctttgcaaaca.....
>2982|China
gcattatggacagaaaatctg-------------------tgcatgagcctttgcaaact.....

I can also extract the consensus sequence from my sample VCFs, add them to this multi-fasta, and align them as well.

>Helicobacter pylori 
cacgctatctaaaaagcccttagccccagcataaataatgaccacttgcttttcaatggg
>Sample01
cacgctatctaaaaagcccttagccccagcataaataatgaccacttgcttttcaatggg
>Sample71
cacgctatctaaaaagcccttagccccagcataaataatgaccacttgcttttcaatggg
>1904|Japan
ggtatt------aaagccattgatgc------------------gttggtgcctattggg
>1797|Sudan
ggcatt------aaagccattgatgc------------------gttggtgcctattggg

The main question here is: how do I convert these multiple-alignment FASTA files into a BAM, a VCF, a PLINK .bed, or some STRUCTURE-compatible file?
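
To make the target concrete, here is a minimal sketch of the kind of conversion in question: walking the alignment column by column and writing a VCF record wherever a sample differs from the reference row. It rests on several assumptions that are not part of the original workflow: the function name `alignment_to_vcf` and the contig name `concat` are hypothetical, genotypes are written as haploid, positions are counted on the ungapped reference, columns where the reference has a gap are skipped, and gaps or ambiguous bases in a sample become missing calls. The toy alignment is invented for illustration.

```python
def alignment_to_vcf(records, ref_name, chrom="concat"):
    """Return VCF text with haploid genotypes from aligned (name, seq) pairs.

    POS is 1-based on the ungapped reference. Columns where the reference
    has a gap are skipped; gaps or non-ACGT bases in a sample are written
    as missing ('.'). Only substitutions are reported, not indels.
    """
    seqs = {name: seq.upper() for name, seq in records}
    ref = seqs[ref_name]
    samples = [name for name, _ in records if name != ref_name]
    if any(len(seqs[s]) != len(ref) for s in samples):
        raise ValueError("all sequences must have the same aligned length")

    lines = [
        "##fileformat=VCFv4.2",
        f"##contig=<ID={chrom}>",
        '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">',
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\t"
        + "\t".join(samples),
    ]

    ref_pos = 0
    for col, ref_base in enumerate(ref):
        if ref_base == "-":
            continue  # insertion relative to the reference: no coordinate
        ref_pos += 1
        if ref_base not in "ACGT":
            continue  # ambiguous reference base

        # Collect alternate alleles in order of first appearance.
        alts = []
        for s in samples:
            base = seqs[s][col]
            if base in "ACGT" and base != ref_base and base not in alts:
                alts.append(base)
        if not alts:
            continue  # invariant (or only missing) column

        genotypes = []
        for s in samples:
            base = seqs[s][col]
            if base == ref_base:
                genotypes.append("0")
            elif base in alts:
                genotypes.append(str(alts.index(base) + 1))
            else:
                genotypes.append(".")
        fields = [chrom, str(ref_pos), ".", ref_base, ",".join(alts),
                  ".", ".", ".", "GT"]
        lines.append("\t".join(fields + genotypes))

    return "\n".join(lines) + "\n"


# Invented toy alignment; 'ref' stands in for the reference row.
toy = [
    ("ref", "ACG-TACGT"),
    ("s1",  "ACG-TACGT"),
    ("s2",  "ACCATA-GT"),
    ("s3",  "ATG-TACGA"),
]
vcf_text = alignment_to_vcf(toy, ref_name="ref")
print(vcf_text)
```

Sample names are copied verbatim from the FASTA headers, so characters such as `|`, `_`, or spaces may need sanitizing before the file is passed to downstream tools. The coordinates are also relative to the concatenated housekeeping genes, not to the whole-genome positions in the VCF produced by the GATK pipeline, so the two call sets would not share sites without a coordinate mapping.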

I've tried snp-sites (https://sanger-pathogens.github.io/snp-sites/) but it does not convert correctly.

[Figure: snp-sites output]

vcf bam snp alignment • 506 views

updated 18 months ago by Ram 43k • written 19 months ago by YC Yang • 0