[Population Genetics] Admixture modeling in STRUCTURE/ADMIXTURE with Multiple Alignment Fasta file, or how to convert multi-fasta to .bam/.vcf/plink .bed
19 months ago
YC Yang • 0

I have some technical questions, that may become irrelevant if my entire approach is a mistake.

Project: Model ancestry of H. pylori in a population

Data: 100 samples of H. pylori, Illumina whole genome sequencing paired-end reads

Process:

  1. Data cleaning
  2. bowtie2 alignment to .bam
  3. GATK more cleaning, validation, merging lanes
  4. GATK HaplotypeCaller variant calling to .VCF
  5. GATK CombineGVCFs and GenotypeGVCFs to merge all samples into 1 .VCF
  6. Convert to plink .BED
  7. Input into ADMIXTURE to generate something like

[Figure: ADMIXTURE output, K=3]

So far, this is working as I intended.

Now I want to incorporate geographic reference samples from PubMLST to figure out where the ancestries in my samples actually come from.

Data: 3000 samples of H. pylori from individuals around the world.

This is a single multi-sample .fasta file, with each sequence being the concatenation of 7 housekeeping genes from the H. pylori genome.

e.g.,

>1|New_Zealand
AATGAGTTTAGCCTA......
>517|South_Africa
AATGAGTTCAGTCTC......
>2215|Colombia
ATGAGTTCAGTCTC......
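
The country after the `|` in each header is the only population label these references carry, so it has to survive whatever conversion is used. A minimal Python sketch of pulling the labels out, assuming every reference header looks like `>id|Country` (`read_labels` is a name made up for illustration):

```python
def read_labels(fasta_text):
    """Map each sequence id to the country named in its '>id|Country' header."""
    labels = {}
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            seq_id, _, country = line[1:].strip().partition("|")
            # Headers without a '|' (e.g. my own samples) get no label.
            labels[seq_id] = country or None
    return labels


example = """>1|New_Zealand
AATGAGTTTAGCCTA
>517|South_Africa
AATGAGTTCAGTCTC
"""
labels = read_labels(example)
print(labels)
```

Headers without a `|` come back with `None`, which leaves room for unlabeled samples in the same file.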

The GOAL here is to have it looking like this:

[Figure: ADMIXTURE plot with geographic reference samples]

I've tried a couple of different things.

I can get the intervals for these genes from the RefSeq reference genome, concatenate them, and add the result to this .fasta.

>Helicobacter pylori
CTACTCGCTATAAGT......
>1|New_Zealand
AATGAGTTTAGCCTA......
>517|South_Africa
AATGAGTTCAGTCTC......
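
The extract-and-concatenate step can be sketched in Python as below. This assumes the gene coordinates are 1-based and inclusive, that genes on the minus strand need to be reverse-complemented, and that the genes are listed in the same order PubMLST uses for its concatenation. The function names and the toy coordinates are illustrative, not real RefSeq intervals.

```python
def reverse_complement(seq):
    """Reverse-complement a DNA string (A/C/G/T only, case preserved)."""
    table = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(table)[::-1]


def concat_intervals(genome, intervals):
    """Join gene intervals given as (start, end, strand), 1-based inclusive."""
    parts = []
    for start, end, strand in intervals:
        piece = genome[start - 1:end]
        if strand == "-":
            piece = reverse_complement(piece)
        parts.append(piece)
    return "".join(parts)


# Toy genome and coordinates, for illustration only.
genome = "ATGCCGTA"
toy_intervals = [(1, 3, "+"), (5, 8, "-")]
ref_concat = concat_intervals(genome, toy_intervals)
print(ref_concat)
```

If the strand or the gene order does not match the PubMLST concatenation, the reference will not align cleanly to the MLST sequences.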

Then I align them with MAFFT.

>Helicobacter pylori 
gtattttgcttccaagaaagggtgcagttgctcttcaaaatccacgacttttttcacgct....
>2777|Spain
gcattatggacagaaaatcgg-------------------tgcatgagcctttgcaaaca.....
>2982|China
gcattatggacagaaaatctg-------------------tgcatgagcctttgcaaact.....

I can also extract the consensus sequences from my sample VCFs, add them to this multi-fasta, and align them as well.

>Helicobacter pylori 
cacgctatctaaaaagcccttagccccagcataaataatgaccacttgcttttcaatggg
>Sample01
cacgctatctaaaaagcccttagccccagcataaataatgaccacttgcttttcaatggg
>Sample71
cacgctatctaaaaagcccttagccccagcataaataatgaccacttgcttttcaatggg
>1904|Japan
ggtatt------aaagccattgatgc------------------gttggtgcctattggg
>1797|Sudan
ggcatt------aaagccattgatgc------------------gttggtgcctattggg

The main question: how do I turn these multiple-alignment FASTA files into a BAM, a VCF, a PLINK .bed, or some other STRUCTURE-compliant file?
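
To make the target concrete, here is a rough Python sketch of the kind of conversion in question: walk the alignment column by column, treat the first sequence as the reference, and emit one VCF record per biallelic SNP. The assumptions are mine: the first record is the reference, all sequences have the same aligned length, genotypes are written as haploid (`0`, `1`, or `.` for gaps and ambiguous bases), and `concat` is a made-up chromosome name. Whether PLINK or ADMIXTURE accept haploid genotypes as-is, or need them recoded as homozygous diploid, is something I have not verified.

```python
def read_fasta(text):
    """Return (names, sequences) from FASTA text, sequences upper-cased."""
    names, chunks = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            names.append(line[1:])
            chunks.append([])
        else:
            chunks[-1].append(line.upper())
    return names, ["".join(c) for c in chunks]


def alignment_to_vcf(names, seqs, chrom="concat"):
    """Write biallelic SNP columns of an alignment as haploid VCF text."""
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("sequences must all have the same aligned length")
    ref, samples = seqs[0], seqs[1:]
    lines = [
        "##fileformat=VCFv4.2",
        '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">',
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\t"
        + "\t".join(names[1:]),
    ]
    pos = 0
    for col, ref_base in enumerate(ref):
        if ref_base == "-":
            continue  # insertion relative to the reference: no coordinate
        pos += 1  # position counts ungapped reference bases, 1-based
        if ref_base not in "ACGT":
            continue
        calls = [s[col] for s in samples]
        alts = sorted({b for b in calls if b in "ACGT" and b != ref_base})
        if len(alts) != 1:
            continue  # skip invariant and multiallelic columns
        gts = ["0" if b == ref_base else "1" if b == alts[0] else "."
               for b in calls]
        lines.append("\t".join(
            [chrom, str(pos), ".", ref_base, alts[0], ".", ".", ".", "GT"] + gts))
    return "\n".join(lines) + "\n"


toy = """>ref
ACG-TA
>s1
ACGGTA
>s2
ATG---
>s3
ATG-CA
"""
names, seqs = read_fasta(toy)
vcf_text = alignment_to_vcf(names, seqs)
print(vcf_text)
```

Indels are dropped entirely in this sketch, and sample names containing spaces would need cleaning before the header line is usable.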

I've tried snp-sites (https://sanger-pathogens.github.io/snp-sites/) but it does not convert correctly.

[Figure: snp-sites output]
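
If going straight to STRUCTURE is easier than going through VCF, the same alignment could in principle be written as one row per individual. The sketch below is only my understanding of the layout: a label column, an integer population column, then one integer per variable site, with `-9` for missing data. The `A=1, C=2, G=3, T=4` coding, the use of `0` for "population unknown", and the function name are assumptions, and the matching `mainparams` settings (e.g. `PLOIDY`, `LABEL`, `POPDATA`, `MISSING`) would still need to be set by hand.

```python
CODES = {"A": 1, "C": 2, "G": 3, "T": 4}
MISSING = -9


def structure_rows(aligned, pop_ids):
    """Build haploid STRUCTURE-style rows from {name: aligned sequence}.

    pop_ids maps names to integer populations; names not listed get 0.
    """
    names = list(aligned)
    length = len(aligned[names[0]])
    # Keep only columns where at least two different bases are observed.
    variable = [
        i for i in range(length)
        if len({aligned[n][i].upper() for n in names} & set(CODES)) > 1
    ]
    rows = []
    for n in names:
        alleles = [CODES.get(aligned[n][i].upper(), MISSING) for i in variable]
        fields = [n.replace(" ", "_"), str(pop_ids.get(n, 0))]
        rows.append(" ".join(fields + [str(a) for a in alleles]))
    return rows


# Toy aligned sequences, for illustration only.
aligned = {
    "Sample01": "ACGT",
    "1904|Japan": "AC--",
    "1797|Sudan": "ATGA",
}
pops = {"1904|Japan": 1, "1797|Sudan": 2}
rows = structure_rows(aligned, pops)
print("\n".join(rows))
```

Gaps and ambiguous bases become `-9`, so sites covered only by the 7 housekeeping genes would show up as missing for no one, while the rest of the genome would be missing for every reference sample.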
