Question: Allele Depth (~ Dp4) Values For All (Including Multi-Allelic Alt) Possible Alleles Reported?
0
gravatar for Rm
6.0 years ago by
Rm7.9k
Danville, PA
Rm7.9k wrote:

Is there a quick way to get allele read depth for all possible alleles say from bam or mpileup?

10      101023  .       A       G,T     220     .       DP=35;VDB=0.0407;AF1=1;AC1=2;DP4=0,0,14,11;MQ=43;FQ=-99 GT:PL:GQ        1/1:253,72,0,247,52,244:99
2       89159986        .       T       A,C     193     .       DP=52;VDB=6.418245e-02;AF1=1;AC1=2;DP4=0,0,29,23;MQ=23;FQ=-181  GT:PL:DP:SP:GQ  1/1:226,154,0,211,122,208:52:0:99

one possiblity is to parse the mpileup output?.

allele counts • 3.5k views
ADD COMMENTlink modified 6.0 years ago • written 6.0 years ago by Rm7.9k
3
gravatar for Rm
6.0 years ago by
Rm7.9k
Danville, PA
Rm7.9k wrote:

Here is the perl script i had to write to get the all possible strand specific allele counts (DP4) from the mpileup output:

Note: It skips the indels

#!/usr/bin/perl 
use strict;
#use warnings;
&usage if (@ARGV < 1);

my $command = shift(@ARGV);
my %func = (mpileup2allele=>\&mpileup2allele);

die("Unknown command \"$command\".\n") if (!defined($func{$command}));
&{$func{$command}};
exit(0);

# mpileup2allele
sub mpileup2allele {
  die(qq/
Usage: extract.allele.counts4m.mpileup.pl mpileup2allele <input.chr.raw-mpileup> 
/) if (@ARGV == 0 && -t STDIN);

my @qual;
my @line;
my $phred=10; # Phred quality threshold filter
my $sanger=33; # Quality Encoding: Sanger/Illumina 1.8+ =33 # Solexa/Illumina (1.3+) =64 

print "#CHR\tPOS\tREF\tA\ta\tC\tc\tG\tg\tT\tt\n" ;

# Reading input mpileup 
  while (<>) {
    @line = split;
      if ($line[2] eq "*"){next;} # skipping indel row #
      $line[4] =~ s/[+-][0-9]+[atgcrykmswbdhvnATGCRYKMSWBDHVNF]+//g; # Removes indels notation from read base lines
      $line[4] =~ s/\^.//g; # Removes unwanted symbols
      $line[4] =~ s/[0-9\@\'\&\!\?\_\$\*\:\;\F\"\+\#\-\=\%\)\(\/\~\}\[\>\<]+//g; # Removes additional symbols  
      $line[4] =~ s/\./$line[2]/g; # Substituting . with reference base 
      $line[4] =~ s/\,/lc($line[2])/ge; # Substituting, with lower case reference base 

      # Store quality scores as an array
      @qual = split(//, $line[5]);
      # Transform quality values from ASCII into Sanger format
      for( my $i = 0; $i < scalar @qual; $i++ ){
       $qual[$i] = ord($qual[$i]) - $sanger ;
           }

    my $char;
    my %chars=();
    my $chars;
    my @letters = split(//, $line[4]);
    my @case = qw/A a C c G g T t/;
    my $top=0;
    print "$line[0]\t$line[1]\t$line[2]\t"; # print nucleotide position

    # Quality comparison 
    foreach $char(@letters) { if ($qual[$top] >= $phred){ $chars{ $char }++; } $top++; }
    my @keys = sort { uc $a cmp uc $b } keys %chars; # case insensitive sort

    #printing output in the required order: A a C c G g T t #
        foreach my $case(@case){
         if ( ! exists( $chars{$case})) {
                print 0, "\t" ;
              }
         else {
                print $chars{$case},"\t";
                #print $case, "\t", $chars{$case},"\t"; 
                }
         }

    print "\n";

  } # End of while

 } # End of mpileup2allele 

##################

sub usage {
my $str ='Sample Input Pileup Data
10      56256   A      16      <>>.,+4gcgc,,,,,,..,...,,  1313?5??>5<9=71570';
  die(qq/
Usage: extract.allele.counts4m.mpileup.pl mpileup2allele <input.chr.raw-mpileup>  \n
$str \n/);
}

Output:

#CHR    POS    REF    A    a    C    c    G    g    T    t
1    970567    C    0    0    0    0    0    0    0    0    
1    1581958    T    0    0    0    2    0    0    5    2    
1    1644908    C    0    0    4    4    0    0    5    4    
1    1646113    G    0    0    1    0    2    5    0    0    
1    1654067    C    0    0    11    12    0    0    0    0    
1    2241376    C    0    0    3    1    0    0    0    0
ADD COMMENTlink modified 6.0 years ago • written 6.0 years ago by Rm7.9k

Sorry, are capital (A,T,G,C) and lower case (a,t,g,c) mean maternal and paternal? Could you give a example have to use this stript?

ADD REPLYlink written 5.2 years ago by juncheng190

@juncheng : Capital letters represent forward strand and lower case represents reverse strand.

Usage: extract.allele.counts4m.mpileup.pl mpileup2allele <input.chr.raw-mpileup>
ADD REPLYlink written 5.1 years ago by Rm7.9k

@Rm. Hi. Sometimes your script does not match the depth count in the vcf. For example,

NZ_CP024834.1   5388732 .   G   A   222 .   DP=55;VDB=0.0130;AF1=1;AC1=2;DP4=0,0,21,32;MQ=60;FQ=-187    GT:PL:DP:SP:GQ  1/1:255,160,0:53:0:99

using above script:

NZ_CP024834.1   5388732 G   21  33  0   0   0   0   0   0

21+ 33 sums to 54 reads from your script, whereas in vcf it shows DP=55. May I know if it is a bug or I am missing something?

edit:

I re-read your post, your script is to parse DP4 field. But even in that cases, sometimes the sum does not match DP4 count. For example,

NZ_CP024834.1   5368982 .   G   A   222 .   DP=42;VDB=0.0148;AF1=1;AC1=2;DP4=0,0,20,21;MQ=60;FQ=-150    GT:PL:DP:SP:GQ  1/1:255,123,0:41:0:99

from your script, It matches DP but not DP4:

NZ_CP024834.1   5368982 G   21  21  0   0   0   0   0   0
ADD REPLYlink modified 19 months ago • written 19 months ago by Prakki Rama2.3k
0
gravatar for Ashutosh Pandey
6.0 years ago by
Philadelphia
Ashutosh Pandey11k wrote:

You can use the older pileup function and get some output shown below:

seq1 272 T 24 ,.$.....,,.,.,...,,,.,..^+. <<<+;<<<<<<<<<<<=<;<;7<&

seq1 273 T 23 ,.....,,.,.,...,,,.,..A <<<;<<<<<<<<<3<=<<<;<<+

seq1 274 T 23 ,.$....,,.,.,...,,,.,... 7<7;<;<<<<<<<<<=<;<;<<6

seq1 275 A 23 ,$....,,.,.,...,,,.,...^l. <+;9*<<<<<<<<<=<<:;<<<<

Now you can use fifth column to get the allele depth of all the alleles. I am sure you must be aware of this.

ADD COMMENTlink written 6.0 years ago by Ashutosh Pandey11k

@ashutoshmits thanks yes iam

ADD REPLYlink written 6.0 years ago by Rm7.9k

I do have a Python code similar to what you wrote in Perl. But it was kind of integrated between other codes. So i couldn't share it. But your Perl script is much advanced than mine. Just to let you know that ">" or "<" characters in 5th column (not shown in the example above) represent read that span between two exons which is typical for RNA-seq. It shouldn't be counted and ignored and you have already taken care of it in your code.

ADD REPLYlink modified 6.0 years ago • written 6.0 years ago by Ashutosh Pandey11k

yes I have ignored ">" or "<"

ADD REPLYlink written 6.0 years ago by Rm7.9k
0
gravatar for Drei
6.0 years ago by
Drei50
Melbourne, Australia
Drei50 wrote:

Just parse it. In whatever language you're using - Python, Bash, R - simply split the row (as in, python split()) by 'DP4=' to get it as an array. Split the 2nd element (as in, array[1]) by semicolon. Your first element in your final array is your a,b,c,d array depth.

You could do it the long way if for some reason you expect the string 'DP4=' to be present in the file somewhere else apart from the info field, for for some crazy reason (I can't imagine how though):

  1. For each row: split row by tab (as in, python split() for example) and get it as an array.

  2. Grab 8th element in that array, split it by semicolon

  3. I recall (although I might be wrong) that the DP4 column might not necessarily always be in the same position for mpileup output, so you might have to iterate over the elements of the array from step 2, checking for the array element that begins with DP4/build a hash table/whatever. Or, if I'm mistaken and it is indeed in the same place always, just grab the 5th element.

  4. Split the DP4 element by '=' sign, grab your depth.

And - put in all in a function or a shell script so you can call it once and be done with it, making it 'quick'. I have a feeling this is a short and beautiful awk script waiting to be written.

ADD COMMENTlink modified 6.0 years ago • written 6.0 years ago by Drei50
1

your solution won't work for the multi-allelic ALT.

ADD REPLYlink written 6.0 years ago by Pierre Lindenbaum122k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1500 users visited in the last hour