Allele Depth (~ Dp4) Values For All (Including Multi-Allelic Alt) Possible Alleles Reported?
Entering edit mode
9.0 years ago
Rm 8.2k

Is there a quick way to get allele read depth for all possible alleles say from bam or mpileup?

10      101023  .       A       G,T     220     .       DP=35;VDB=0.0407;AF1=1;AC1=2;DP4=0,0,14,11;MQ=43;FQ=-99 GT:PL:GQ        1/1:253,72,0,247,52,244:99
2       89159986        .       T       A,C     193     .       DP=52;VDB=6.418245e-02;AF1=1;AC1=2;DP4=0,0,29,23;MQ=23;FQ=-181  GT:PL:DP:SP:GQ  1/1:226,154,0,211,122,208:52:0:99

one possiblity is to parse the mpileup output?.

allele counts • 5.0k views
Entering edit mode
9.0 years ago
Rm 8.2k

Here is the perl script i had to write to get the all possible strand specific allele counts (DP4) from the mpileup output:

Note: It skips the indels

use strict;
#use warnings;
&usage if (@ARGV < 1);

my $command = shift(@ARGV);
my %func = (mpileup2allele=>\&mpileup2allele);

die("Unknown command \"$command\".\n") if (!defined($func{$command}));

# mpileup2allele
sub mpileup2allele {
Usage: mpileup2allele <input.chr.raw-mpileup> 
/) if (@ARGV == 0 && -t STDIN);

my @qual;
my @line;
my $phred=10; # Phred quality threshold filter
my $sanger=33; # Quality Encoding: Sanger/Illumina 1.8+ =33 # Solexa/Illumina (1.3+) =64 

print "#CHR\tPOS\tREF\tA\ta\tC\tc\tG\tg\tT\tt\n" ;

# Reading input mpileup 
  while (<>) {
    @line = split;
      if ($line[2] eq "*"){next;} # skipping indel row #
      $line[4] =~ s/[+-][0-9]+[atgcrykmswbdhvnATGCRYKMSWBDHVNF]+//g; # Removes indels notation from read base lines
      $line[4] =~ s/\^.//g; # Removes unwanted symbols
      $line[4] =~ s/[0-9\@\'\&\!\?\_\$\*\:\;\F\"\+\#\-\=\%\)\(\/\~\}\[\>\<]+//g; # Removes additional symbols  
      $line[4] =~ s/\./$line[2]/g; # Substituting . with reference base 
      $line[4] =~ s/\,/lc($line[2])/ge; # Substituting, with lower case reference base 

      # Store quality scores as an array
      @qual = split(//, $line[5]);
      # Transform quality values from ASCII into Sanger format
      for( my $i = 0; $i < scalar @qual; $i++ ){
       $qual[$i] = ord($qual[$i]) - $sanger ;

    my $char;
    my %chars=();
    my $chars;
    my @letters = split(//, $line[4]);
    my @case = qw/A a C c G g T t/;
    my $top=0;
    print "$line[0]\t$line[1]\t$line[2]\t"; # print nucleotide position

    # Quality comparison 
    foreach $char(@letters) { if ($qual[$top] >= $phred){ $chars{ $char }++; } $top++; }
    my @keys = sort { uc $a cmp uc $b } keys %chars; # case insensitive sort

    #printing output in the required order: A a C c G g T t #
        foreach my $case(@case){
         if ( ! exists( $chars{$case})) {
                print 0, "\t" ;
         else {
                print $chars{$case},"\t";
                #print $case, "\t", $chars{$case},"\t"; 

    print "\n";

  } # End of while

 } # End of mpileup2allele 


sub usage {
my $str ='Sample Input Pileup Data
10      56256   A      16      <>>.,+4gcgc,,,,,,..,...,,  1313?5??>5<9=71570';
Usage: mpileup2allele <input.chr.raw-mpileup>  \n
$str \n/);


#CHR    POS    REF    A    a    C    c    G    g    T    t
1    970567    C    0    0    0    0    0    0    0    0    
1    1581958    T    0    0    0    2    0    0    5    2    
1    1644908    C    0    0    4    4    0    0    5    4    
1    1646113    G    0    0    1    0    2    5    0    0    
1    1654067    C    0    0    11    12    0    0    0    0    
1    2241376    C    0    0    3    1    0    0    0    0
Entering edit mode

Sorry, are capital (A,T,G,C) and lower case (a,t,g,c) mean maternal and paternal? Could you give a example have to use this script?

Entering edit mode

@juncheng: Capital letters represent forward strand and lower case represents reverse strand.

Usage: mpileup2allele <input.chr.raw-mpileup>
Entering edit mode

@Rm. Hi. Sometimes your script does not match the depth count in the vcf. For example,

NZ_CP024834.1   5388732 .   G   A   222 .   DP=55;VDB=0.0130;AF1=1;AC1=2;DP4=0,0,21,32;MQ=60;FQ=-187    GT:PL:DP:SP:GQ  1/1:255,160,0:53:0:99

using above script:

NZ_CP024834.1   5388732 G   21  33  0   0   0   0   0   0

21+ 33 sums to 54 reads from your script, whereas in vcf it shows DP=55. May I know if it is a bug or I am missing something?


I re-read your post, your script is to parse DP4 field. But even in that cases, sometimes the sum does not match DP4 count. For example,

NZ_CP024834.1   5368982 .   G   A   222 .   DP=42;VDB=0.0148;AF1=1;AC1=2;DP4=0,0,20,21;MQ=60;FQ=-150    GT:PL:DP:SP:GQ  1/1:255,123,0:41:0:99

from your script, It matches DP but not DP4:

NZ_CP024834.1   5368982 G   21  21  0   0   0   0   0   0
Entering edit mode
9.0 years ago

You can use the older pileup function and get some output shown below:

seq1 272 T 24 ,.$.....,,.,.,...,,,.,..^+. <<<+;<<<<<<<<<<<=<;<;7<&

seq1 273 T 23 ,.....,,.,.,...,,,.,..A <<<;<<<<<<<<<3<=<<<;<<+

seq1 274 T 23 ,.$....,,.,.,...,,,.,... 7<7;<;<<<<<<<<<=<;<;<<6

seq1 275 A 23 ,$....,,.,.,...,,,.,...^l. <+;9*<<<<<<<<<=<<:;<<<<

Now you can use fifth column to get the allele depth of all the alleles. I am sure you must be aware of this.

Entering edit mode

@ashutoshmits thanks yes iam

Entering edit mode

I do have a Python code similar to what you wrote in Perl. But it was kind of integrated between other codes. So i couldn't share it. But your Perl script is much advanced than mine. Just to let you know that ">" or "<" characters in 5th column (not shown in the example above) represent read that span between two exons which is typical for RNA-seq. It shouldn't be counted and ignored and you have already taken care of it in your code.

Entering edit mode

yes I have ignored ">" or "<"

Entering edit mode
9.0 years ago
Drei ▴ 50

Just parse it. In whatever language you're using - Python, Bash, R - simply split the row (as in, python split()) by 'DP4=' to get it as an array. Split the 2nd element (as in, array[1]) by semicolon. Your first element in your final array is your a,b,c,d array depth.

You could do it the long way if for some reason you expect the string 'DP4=' to be present in the file somewhere else apart from the info field, for for some crazy reason (I can't imagine how though):

  1. For each row: split row by tab (as in, python split() for example) and get it as an array.

  2. Grab 8th element in that array, split it by semicolon

  3. I recall (although I might be wrong) that the DP4 column might not necessarily always be in the same position for mpileup output, so you might have to iterate over the elements of the array from step 2, checking for the array element that begins with DP4/build a hash table/whatever. Or, if I'm mistaken and it is indeed in the same place always, just grab the 5th element.

  4. Split the DP4 element by '=' sign, grab your depth.

And - put in all in a function or a shell script so you can call it once and be done with it, making it 'quick'. I have a feeling this is a short and beautiful awk script waiting to be written.

Entering edit mode

your solution won't work for the multi-allelic ALT.


Login before adding your answer.

Traffic: 2099 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6