Question: Mpileup BCF/VCF output without non-ref base 'X'?
1
gravatar for Michi
4.1 years ago by
Michi950
Barcelona
Michi950 wrote:

Hi - I am obtaining from a RNA-seq bam file the BCF output from samtools mpileup in order to corroborate variants seen in the exome.

Thus, for corroborating one single SNP, my command looks like this:

samtools mpileup -ug -t DP -t DV -t DP4 --min-MQ 40 --min-BQ 35 -f GRCh37.fa --region "22:29907105-29907105" RNAsample.bam  | bcftools view --min-alleles 3

#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT    RNAsample
22    29907105    .    G    A,<X>    0    .    DP=47;I16=14,12,6,7,974,36674,484,18098,520,10400,260,5200,488,11274,261,5927;QS=0.666667,0.333333,0;VDB=0.280567;SGB=-0.683931;RPB=0.814533;MQB=1;MQSB=1;BQB=0.956592;MQ0F=0    PL:DP:DV:DP4    82,0,143,160,182,243:39:13:14,12,6,7

Because, samtools is always outputting "an non-ref base 'X' represents an base has not been seen from the alignment data" I need to filter for minimum 3 alleles (one ref, one alt and the X). I find this all counter-intuitive and would like to omit the non-ref base 'X' in the output from samtools mpileup as in this case it will not be needed for anything.

Does anybody know how to do this properly (excluding sed, awk and the like ;)  )

EDIT:

The pileup itself for this position looks as follows:

$ samtools mpileup  -t DP -t DV -t DP4 --min-MQ 40 --min-BQ 35 -f /home/mpschr/bin/bcbionextgen/data/genomes/Hsapiens/GRCh37/seq/GRCh37.fa --region "22:29907105-29907105" RNAsample.bam
22    29907105    G    39    .,...AAA..A,,a,a..a,a,,A..aA,,a,.,..a,.    DDDDDDDDDDDDJFJJEGHIIJJIJJEJDEDDJDJHDDF

 

sequencing mpileup bcf samtools vcf • 3.8k views
ADD COMMENTlink modified 4.1 years ago • written 4.1 years ago by Michi950

where does this 'X' come from ? can you show us a few reads overlaping that position "22:29907105" ?

ADD REPLYlink written 4.1 years ago by Pierre Lindenbaum124k

and what is the version of samtools/bcftools please

ADD REPLYlink written 4.1 years ago by Pierre Lindenbaum124k

HI, I put the pileup in the original questions. The versions I am using are the following:

  • bcftools 1.2
    Using htslib 1.2.1
  • samtools 1.2
    Using htslib 1.2.1
ADD REPLYlink written 4.1 years ago by Michi950

I want to see some reads (samtools view RNAsample.bam 22:29907105-29907105 ) not the ouput of mpileup

ADD REPLYlink written 4.1 years ago by Pierre Lindenbaum124k

aha ok, this is quite a lot in order to paste it as is, I think - are we I looking for something specific? In any case, a working solution has been proposed. Thanks for your time!

ADD REPLYlink modified 2 days ago by RamRS24k • written 4.1 years ago by Michi950
2
gravatar for Michi
4.1 years ago by
Michi950
Barcelona
Michi950 wrote:

Ok, after a lot of experimenting I have found a solution with which I am satisfied - I only use the bcftools view, since I already did the variant calling, and then pipe it to bcftools norm, where norm shortens the representation of the indels and with the option '-m-both' splits up multiple alleles to a line each.

Thus the command looks like this

samtools mpileup -ug -t DP -t DV -t DP4 --min-MQ 40 --min-BQ 35 -f GRCh37.fa --region "22:29907105-29907105" sample.bam | bcftools view | bcftools norm -m-both -f GRCh37.fa

and gives this output:

#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT    Sample.bam
22    29907105    .    G    A    0    .    DP=95;I16=35,6,18,2,2110,121546,1123,68521,2460,147600,1200,72000,869,19919,428,10158;QS=0.651934,0.348066;VDB=0.12166;SGB=-0.692067;RPB=0.998114;MQB=1;MQSB=1;BQB=0.426273;MQ0F=0    PL:DP:DV:DP4    255,0,255:61:20:35,6,18,2
22    29907105    .    G    <X>    0    .    DP=95;I16=35,6,18,2,2110,121546,1123,68521,2460,147600,1200,72000,869,19919,428,10158;QS=0.651934,0;VDB=0.12166;SGB=-0.692067;RPB=0.998114;MQB=1;MQSB=1;BQB=0.426273;MQ0F=0    PL:DP:DV:DP4    255,255,255:61:20:35,6,18,2
Lines total/modified/skipped:    1/0/0

By piping it through grep -v '<X>' the non reference bases are removed from the output.

In any case, thanks for all the help Devon

ADD COMMENTlink modified 1 day ago by RamRS24k • written 4.1 years ago by Michi950
2
gravatar for Devon Ryan
4.1 years ago by
Devon Ryan92k
Freiburg, Germany
Devon Ryan92k wrote:

The <X> is due to you having a gVCF (this is standard in the last couple samtools/bcftools releases). You need to use bcftools call to get a final plain VCF.

ADD COMMENTlink modified 4.1 years ago • written 4.1 years ago by Devon Ryan92k
1

Hi,

This is the solution, as it seems! Now I am able to do the following:

samtools mpileup [..] | bcftools call -m |bcftools view --min-alleles 2
ADD REPLYlink modified 1 day ago by RamRS24k • written 4.1 years ago by Michi950

Hi, I have a follow-up question. There are some variants that get removed with the bcftools call command because they have very few reads. But since I am sure that this is a variant from my exon data, I'd like to keep it anyway. Is there an alternative to call? the command bcftools convert --gvcf2vcf did not remove the <X> alternatives..

ADD REPLYlink written 4.1 years ago by Michi950

You should perform joint calling, which will likely resolve most of these issues.

ADD REPLYlink written 4.1 years ago by Devon Ryan92k

Ok that seems to be a good point. I am supplying the two bam-files. The problem I am running now into, is the reads from both .bam files are piled up together for both have the same sample name.. Is there a way around that without generating a new .bam file?

ADD REPLYlink written 4.1 years ago by Michi950

Maybe reheadering? I would hope that modifying the read group in the header would suffice.

ADD REPLYlink written 4.1 years ago by Devon Ryan92k

Well, I reheadered.

samtools view -H file.bam | sed "s/Solid5500XL/Solid/" | samtools reheader - file.bam > changedheader.bam

as proposed here: Bam Header Edit. It's a pity that for changing three characters, one needs to create a new .bam file ;) In the end it worked out like this, with the joint calling. Thanks for all your help

ADD REPLYlink modified 2 days ago by RamRS24k • written 4.1 years ago by Michi950

By the way - for some of the calls, it is removing the alternative reads as in the sample below. It's a rather weird behaviour. Would there be any way of keeping the alternative base, which is seen with the samtools view command?

bcftools view
1    13219094    .    G    T,<X>    0    .    DP=24;I16=10,7,5,1,631,23463,266,12996,326,9490,88,1942,339,7833,141,3353;QS=0.771429,0.228571,0;VDB=0.155969;SGB=2.63044;RPB=0.990244;MQB=0.660858;MQSB=0.995323;BQB=0.339296;MQ0F=0.25    PL:DP:DV:DP4    18,0,149,69,167,186:23:6:10,7,5,1

bcftools call
1    13219094    .    G    .    13.5613    .    DP=24;VDB=0.155969;SGB=2.63044;RPB=0.990244;MQB=0.660858;MQSB=0.995323;BQB=0.339296;MQ0F=0.25;AN=2;DP4=10,7,5,1;MQ=18    GT:DP:DV:DP4    0/0:23:6:10,7,5,1
ADD REPLYlink modified 1 day ago by RamRS24k • written 4.1 years ago by Michi950

I haven't a clue what it's trying to do with a "." variant call. You might post this as a new question and see if someone else can shed some light on this oddity (I'm curious about this as well). If you don't get a reply in a day or two then try posting this to the samtools-help email list. The developers might reply there (the wait time for a reply is shorter here, so try it first).

ADD REPLYlink written 4.1 years ago by Devon Ryan92k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1917 users visited in the last hour