Mpileup BCF/VCF output without non-ref base 'X'?
2
1
Entering edit mode
6.1 years ago
Michi ▴ 980

Hi - I am obtaining from a RNA-seq bam file the BCF output from samtools mpileup in order to corroborate variants seen in the exome.

Thus, for corroborating one single SNP, my command looks like this:

samtools mpileup -ug -t DP -t DV -t DP4 --min-MQ 40 --min-BQ 35 -f GRCh37.fa --region "22:29907105-29907105" RNAsample.bam  | bcftools view --min-alleles 3

#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT    RNAsample
22    29907105    .    G    A,<X>    0    .    DP=47;I16=14,12,6,7,974,36674,484,18098,520,10400,260,5200,488,11274,261,5927;QS=0.666667,0.333333,0;VDB=0.280567;SGB=-0.683931;RPB=0.814533;MQB=1;MQSB=1;BQB=0.956592;MQ0F=0    PL:DP:DV:DP4    82,0,143,160,182,243:39:13:14,12,6,7

Because, samtools is always outputting "an non-ref base 'X' represents an base has not been seen from the alignment data" I need to filter for minimum 3 alleles (one ref, one alt and the X). I find this all counter-intuitive and would like to omit the non-ref base 'X' in the output from samtools mpileup as in this case it will not be needed for anything.

Does anybody know how to do this properly (excluding sed, awk and the like ;)  )

EDIT:

The pileup itself for this position looks as follows:

$ samtools mpileup  -t DP -t DV -t DP4 --min-MQ 40 --min-BQ 35 -f /home/mpschr/bin/bcbionextgen/data/genomes/Hsapiens/GRCh37/seq/GRCh37.fa --region "22:29907105-29907105" RNAsample.bam
22    29907105    G    39    .,...AAA..A,,a,a..a,a,,A..aA,,a,.,..a,.    DDDDDDDDDDDDJFJJEGHIIJJIJJEJDEDDJDJHDDF

 

sequencing samtools mpileup vcf bcf • 4.7k views
ADD COMMENT
0
Entering edit mode

where does this 'X' come from ? can you show us a few reads overlaping that position "22:29907105" ?

ADD REPLY
0
Entering edit mode

and what is the version of samtools/bcftools please

ADD REPLY
0
Entering edit mode

HI, I put the pileup in the original questions. The versions I am using are the following:

  • bcftools 1.2
    Using htslib 1.2.1
  • samtools 1.2
    Using htslib 1.2.1
ADD REPLY
0
Entering edit mode

I want to see some reads (samtools view RNAsample.bam 22:29907105-29907105 ) not the ouput of mpileup

ADD REPLY
0
Entering edit mode

aha ok, this is quite a lot in order to paste it as is, I think - are we I looking for something specific? In any case, a working solution has been proposed. Thanks for your time!

ADD REPLY
2
Entering edit mode
6.1 years ago
Michi ▴ 980

Ok, after a lot of experimenting I have found a solution with which I am satisfied - I only use the bcftools view, since I already did the variant calling, and then pipe it to bcftools norm, where norm shortens the representation of the indels and with the option '-m-both' splits up multiple alleles to a line each.

Thus the command looks like this

samtools mpileup -ug -t DP -t DV -t DP4 --min-MQ 40 --min-BQ 35 -f GRCh37.fa --region "22:29907105-29907105" sample.bam | bcftools view | bcftools norm -m-both -f GRCh37.fa

and gives this output:

#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT    Sample.bam
22    29907105    .    G    A    0    .    DP=95;I16=35,6,18,2,2110,121546,1123,68521,2460,147600,1200,72000,869,19919,428,10158;QS=0.651934,0.348066;VDB=0.12166;SGB=-0.692067;RPB=0.998114;MQB=1;MQSB=1;BQB=0.426273;MQ0F=0    PL:DP:DV:DP4    255,0,255:61:20:35,6,18,2
22    29907105    .    G    <X>    0    .    DP=95;I16=35,6,18,2,2110,121546,1123,68521,2460,147600,1200,72000,869,19919,428,10158;QS=0.651934,0;VDB=0.12166;SGB=-0.692067;RPB=0.998114;MQB=1;MQSB=1;BQB=0.426273;MQ0F=0    PL:DP:DV:DP4    255,255,255:61:20:35,6,18,2
Lines total/modified/skipped:    1/0/0

By piping it through grep -v '<X>' the non reference bases are removed from the output.

In any case, thanks for all the help Devon

ADD COMMENT
2
Entering edit mode
6.1 years ago

The <X> is due to you having a gVCF (this is standard in the last couple samtools/bcftools releases). You need to use bcftools call to get a final plain VCF.

ADD COMMENT
1
Entering edit mode

Hi,

This is the solution, as it seems! Now I am able to do the following:

samtools mpileup [..] | bcftools call -m |bcftools view --min-alleles 2
ADD REPLY
0
Entering edit mode

Hi, I have a follow-up question. There are some variants that get removed with the bcftools call command because they have very few reads. But since I am sure that this is a variant from my exon data, I'd like to keep it anyway. Is there an alternative to call? the command bcftools convert --gvcf2vcf did not remove the <X> alternatives..

ADD REPLY
0
Entering edit mode

You should perform joint calling, which will likely resolve most of these issues.

ADD REPLY
0
Entering edit mode

Ok that seems to be a good point. I am supplying the two bam-files. The problem I am running now into, is the reads from both .bam files are piled up together for both have the same sample name.. Is there a way around that without generating a new .bam file?

ADD REPLY
0
Entering edit mode

Maybe reheadering? I would hope that modifying the read group in the header would suffice.

ADD REPLY
0
Entering edit mode

Well, I reheadered.

samtools view -H file.bam | sed "s/Solid5500XL/Solid/" | samtools reheader - file.bam > changedheader.bam

as proposed here: Bam Header Edit. It's a pity that for changing three characters, one needs to create a new .bam file ;) In the end it worked out like this, with the joint calling. Thanks for all your help

ADD REPLY
0
Entering edit mode

By the way - for some of the calls, it is removing the alternative reads as in the sample below. It's a rather weird behaviour. Would there be any way of keeping the alternative base, which is seen with the samtools view command?

bcftools view
1    13219094    .    G    T,<X>    0    .    DP=24;I16=10,7,5,1,631,23463,266,12996,326,9490,88,1942,339,7833,141,3353;QS=0.771429,0.228571,0;VDB=0.155969;SGB=2.63044;RPB=0.990244;MQB=0.660858;MQSB=0.995323;BQB=0.339296;MQ0F=0.25    PL:DP:DV:DP4    18,0,149,69,167,186:23:6:10,7,5,1

bcftools call
1    13219094    .    G    .    13.5613    .    DP=24;VDB=0.155969;SGB=2.63044;RPB=0.990244;MQB=0.660858;MQSB=0.995323;BQB=0.339296;MQ0F=0.25;AN=2;DP4=10,7,5,1;MQ=18    GT:DP:DV:DP4    0/0:23:6:10,7,5,1
ADD REPLY
0
Entering edit mode

I haven't a clue what it's trying to do with a "." variant call. You might post this as a new question and see if someone else can shed some light on this oddity (I'm curious about this as well). If you don't get a reply in a day or two then try posting this to the samtools-help email list. The developers might reply there (the wait time for a reply is shorter here, so try it first).

ADD REPLY

Login before adding your answer.

Traffic: 2819 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6