Question: Extract Allele At Particular Positions From Bam Files
2
gravatar for guns.guru
5.0 years ago by
guns.guru50
guns.guru50 wrote:

Is there an easy way to extract the allele (monomorphic or polymorphic wrt the reference) for a sample across all reads at specific locations, given its BAM file?

Example: The base at chr1 pos 45800 for that sample from its BAM file?

I do not want the allele from each read, but only the consensus. It is similar to an old post here: Get all alleles of 1 column in bam file

Thanks in advance


Update-1

I figure that samtools mpileup.. is the way to go, but I've been fiddling around for the past 2 days, but still haven't figured out the right set of flags.

This is what I got so far... (SNP_loci.txt is the file with the positions (chr pos) I want to extract the allele/genotype at)

$samtools mpileup -I -l SNP_loci.txt -Bf ref.fa sample1.bam
chr1        1000379 T       22      cCCcCcCCCCcCCCCcCccccc  IIIIIIIIIIIIIIIIIIIIII
**chr1        1000400 A       19      .,...,....,.,,,,,,.     IIIIIIIIIIIIIIIIIII**
chr1        100017386       G       6       .$aAAAa IIIIII
chr1        100019654       T       19      CcCCccCccCC$cCcccccc    IIIIIIIIIIIIIIIIIII
chr1        100019657       G       18      TtTTttTttTtTtttttt      IIIIIIIIIIIIIIIIII
chr1        100019756       A       12      gggGgggGGGGg    IIIIIIIIIIII
chr1        100020740       A       5       cCccc   IIIII
chr1        100022994       G       7       aAaaaaa IIIIIII
chr1        100023027       G       7       aAaaaaa IIIIIII
chr1        100030383       G       11      AAAAaAAaaAA     IIIIIIIIIII

Now, do I need to manually parse this based on the pileup format?

Wait, I suppose it could be piped with bcftools this way.. and then use vcftools

samtools mpileup -I -l SNP_loci.txt -Bf ref.fa sample1.bam | bcftools view -bvcg - > sample1.raw.bcf

But it ONLY reports variants/SNPs, however I need the match with reference too... basically the allele/genotype, whatever it is! I tried without the -vcg options, but the output is too filthy.

Basically I wish I could have something report this for the above data, with some quality attribute...

chr1        1000379 T       22      C
**chr1        1000400 A       19      A**
chr1        100017386       G       6       A
chr1        100019654       T       19      C
chr1        100019657       G       18      T
chr1        100019756       A       12      G
chr1        100020740       A       5       C
chr1        100022994       G       7       A
chr1        100023027       G       7       A
chr1        100030383       G       11      A

I know this is totally possible, but just can't figure how using samtools? Lack of knowledge & hence trying to reach out the experts out thr! Any suggestions please?


Update-2

Here are my attempts in more detail:

(1) With -vcg option for bcf tools (-v option reporting only variants/SNPs) - This format is perfect, however I'll miss the monomorphic alleles, which I NEED to keep (the second line above i.e., chr1:1000400)...

$samtools mpileup -I -l SNP_loci.txt -uBf ref.fa sample1.bam | bcftools view -bvcg - > sample1.bcf
$bcftools view sample1.bcf | vcfutils.pl varFilter -D100 > sample1.vcf
chr1        1000379 .       T       C       99      .       DP=22;AF1=1;CI95=1,1;DP4=0,0,12,10;MQ=60        PL:GT:GQ        255,66,0:1/1:99
chr1        100017386       .       G       A       99      .       DP=6;AF1=0.5016;CI95=0.5,0.5;DP4=1,0,3,2;MQ=60;PV4=1,1,1,1      PL:GT:GQ        156,0,22:0/1:25
chr1        100019654       .       T       C       99      .       DP=19;AF1=1;CI95=1,1;DP4=0,0,7,12;MQ=60 PL:GT:GQ        255,57,0:1/1:99
chr1        100019657       .       G       T       99      .       DP=18;AF1=1;CI95=1,1;DP4=0,0,6,12;MQ=60 PL:GT:GQ        255,54,0:1/1:99
chr1        100019756       .       A       G       99      .       DP=12;AF1=1;CI95=1,1;DP4=0,0,5,7;MQ=60  PL:GT:GQ        255,36,0:1/1:99
chr1        100020740       .       A       C       99      .       DP=5;AF1=1;CI95=0.5,1;DP4=0,0,1,4;MQ=60 PL:GT:GQ        164,15,0:1/1:75
chr1        100022994       .       G       A       99      .       DP=7;AF1=1;CI95=1,1;DP4=0,0,1,6;MQ=60   PL:GT:GQ        198,21,0:1/1:84
chr1        100023027       .       G       A       99      .       DP=7;AF1=1;CI95=1,1;DP4=0,0,1,6;MQ=60   PL:GT:GQ        198,21,0:1/1:84
chr1        100030383       .       G       A       99      .       DP=11;AF1=1;CI95=1,1;DP4=0,0,8,3;MQ=60  PL:GT:GQ        255,33,0:1/1:99

(2) Without the -vcg option, reports all alleles. Yes, the monomorphic one too!! However, what does this X mean? There is no quality scores attached to the calls. How do I parse this into something similar to in (1)? Can I trust those other SNP calls??

$samtools mpileup -I -l SNP_loci.txt -uBf ref.fa sample1.bam | bcftools view -b - > sample1.bcf
$bcftools view sample1.bcf | vcfutils.pl varFilter -D100 > sample1.vcf

chr1        1000379 .       T       C,X     0       .       DP=22;I16=0,0,12,10,0,0,880,35200,0,0,1320,79200,0,0,390,8052   PL      255,66,255,0,66,255
**chr1        1000400 .       A       X       0       .       DP=19;I16=10,9,0,0,760,30400,0,0,1140,68400,0,0,344,7816,0,0    PL      0,57,255**
chr1        100017386       .       G       A,X     0       .       DP=6;I16=1,0,3,2,40,1600,200,8000,60,3600,300,18000,0,0,117,2759        PL      156,0,159,22,37,188
chr1        100019654       .       T       C,X     0       .       DP=19;I16=0,0,7,12,0,0,760,30400,0,0,1140,68400,0,0,367,8105    PL      255,57,255,0,57,255
chr1        100019657       .       G       T,X     0       .       DP=18;I16=0,0,6,12,0,0,720,28800,0,0,1080,64800,0,0,370,8338    PL      255,54,255,0,54,255
chr1        100019756       .       A       G,X     0       .       DP=12;I16=0,0,5,7,0,0,480,19200,0,0,720,43200,0,0,208,4430      PL      255,36,255,0,36,255
chr1       100020740       .       A       C,X     0       .       DP=5;I16=0,0,1,4,0,0,200,8000,0,0,300,18000,0,0,125,3125        PL      164,15,164,0,15,164
chr1        100022994       .       G       A,X     0       .       DP=7;I16=0,0,1,6,0,0,280,11200,0,0,420,25200,0,0,160,3696       PL      198,21,198,0,21,198
chr1        100023027       .       G       A,X     0       .       DP=7;I16=0,0,1,6,0,0,280,11200,0,0,420,25200,0,0,145,3325       PL      198,21,198,0,21,198
chr1        100030383       .       G       A,X     0       .       DP=11;I16=0,0,8,3,0,0,440,17600,0,0,660,39600,0,0,239,5415      PL      255,33,255,0,33,255

Can someone please throw their experience at this? I'm really struggling with this... when the answer doesn't seem too far... It might just take a few mins for you, but will help me a lot!!

Thanks a bunch in advance!!!

consensus allele bam • 9.0k views
ADD COMMENTlink modified 5.0 years ago • written 5.0 years ago by guns.guru50
1

This would seem to be the same as normal SNP calling then, yes?

ADD REPLYlink written 5.0 years ago by Devon Ryan88k

Ryan, thanks again... yes and no... I explain better below! I want to call the genotype - polymorphic or monomorphic (same as reference)

ADD REPLYlink written 5.0 years ago by guns.guru50

I moved your answer to an update on your question.

ADD REPLYlink written 5.0 years ago by Devon Ryan88k
1

You used the -v option, which explicitly tells bcftools to only report variant sites...

ADD REPLYlink written 5.0 years ago by Devon Ryan88k

Ryan, I understand. But without using -v (meaning leaving out -cg too)... I'm unsure of the output format like reported above (in update-2). Thanks for correcting my post btw.

ADD REPLYlink written 5.0 years ago by guns.guru50
1

-v forces -c to be on, but you can use -cg without using -v.

ADD REPLYlink written 5.0 years ago by Devon Ryan88k

Yes, I used just -cg too. It doesn't report the monomorphic alleles. Output same as ONLY SNPs/variants

ADD REPLYlink written 5.0 years ago by guns.guru50
3
gravatar for guns.guru
5.0 years ago by
guns.guru50
guns.guru50 wrote:

I think I figured the problem... Woohoo!! The vcfutils.pl is filtering out these monomorphic reference alleles, while bcftools reports them OK! Was only using vcfutils for varFilter -D max depth purposes, but who knew it was also the culprit here...

So, like Ryan suggested, do not use the -v option & I'm rerunning my SNP calling routine as follows. No vcfutils!!

samtools mpileup -I -uBf ref.fa sample1.bam | bcftools view -bcg - > sample1.bcf
bcftools view sample1.bcf > sample1.vcf

Thanks so much Ryan!!

ADD COMMENTlink written 5.0 years ago by guns.guru50

Glad you got it working and thanks for posting the update!

ADD REPLYlink written 5.0 years ago by Devon Ryan88k
2
gravatar for hliang
5.0 years ago by
hliang100
United States
hliang100 wrote:

The new samtools mpileup can generate VCF output, although without genotype calling.

For those who might be interested in using a newer version (0.2.0-rc6) of samtools and bcftools, you can try this way: samtools mpileup -uvDV -l sites.bed -f ref.fa input.bam | bcftools call -c - > output.vcf.

ADD COMMENTlink written 5.0 years ago by hliang100

Note that this command generates some warnings with the latest version of samtools

[warning] samtools mpileup option `-D` is functional, but deprecated. Please switch to `-t DP` in future.
[warning] samtools mpileup option `-V` is functional, but deprecated. Please switch to `-t DV` in future.
ADD REPLYlink written 2.8 years ago by africys0

This solution works, the previous answers no longer work with the new version of samtools.

ADD REPLYlink written 2.1 years ago by radlinsky0
0
gravatar for guns.guru
5.0 years ago by
guns.guru50
guns.guru50 wrote:

Thanks for the tip hliang! That's good to know. Actually, with my existing Version: 0.1.12a (r862) a single line command seems to work too..

samtools mpileup -uBI -l SNP_sites.bed -f ref.fa input.bam | bcftools view -cg - > output.vcf
ADD COMMENTlink written 5.0 years ago by guns.guru50

I've recently suggested Calculate The Frequency Of Nucleotides At Each Position In An Mpileup File on a similar post. Then you could check easily check the genotype.

ADD REPLYlink modified 5.0 years ago • written 5.0 years ago by favero.francesco120
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 882 users visited in the last hour