Question

How To Process Raw Pileup

2

Entering edit mode

13.6 years ago

Rm 8.3k

Raw pileup file some times before indel (*) have multiple insert (+4caaa) or deletion (-13GGCGCGCGTGCGC) strings in the read colum (column9)

how to get rid of them? Any quick awk or perl suggestions.

2L  650 t   T   48  0   34  7   ..,.,,+4caaa,+4caaa C?CCA=?
2L  650 *   */* 9   0   34  7   *   +CAAA   5   2   0   0   0
2L  654 A   A   48  0   34  7   .$.$,.,+1g,,    DBC?CCA
2L  654 *   */* 19  0   34  7   *   +G  6   1   0   0   0
2L  2332    g   G   60  0   14  33  .,...,,.-13GGCGCGCGTGCGC.,,A,,A,,A,..........aa..   DCBBBBDBCCCCBCCCDCBDCBCCCACCBABCC
2L  2332    *   */* 61  0   14  33  *   -ggcgcgcgtgcgc  32  1   0   0   0
2L  3334    a   A   163 0   15  49  ..$,..,,,t,.T,,,..,,,,T,-7attattt,,-7attattt,,,,,,....,,......,.,,..    BBCA>BCCCC:CCCC>ACCCCBCCCCBCCCCCCDCCCCCCCCBCBCDCC
2L  3334    *   */-attattt  27  27  15  49  *   -attattt    47  2   0   0   0
2L  3928    c   C   32  0   0   11  ,,-4tctt,,.-4TCTT...-4TCTT.^!.^!,   CCC8CCCCBCA
2L  3928    *   */-tctt 157 157 0   11  *   -tctt   8   3   0   0   0

pileup format awk perl • 4.1k views

ADD COMMENT • link 13.2 years ago by Rm 8.3k

Ram · Answer 1 · 2010-09-29

4

Entering edit mode

13.6 years ago

Pierre Lindenbaum 161k

Those +4caaa and -13GGCGCGCGTGCGC are the base qualities 'read bases at a SNP line' of the short reads under the mutation. If you only want to keep the simple substitutions you could use the following simple awk script:

{
$4=toupper($4);
if($4=='A' || $4=='T' || $4=='G' || $4=='C') print $0;
}

ADD COMMENT • link updated 4.6 years ago by Ram 43k • written 13.6 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

Column 9 is representative nucleotides of read bases. where as these extra (+4caaa and -13GGCGCGCGTGCGC) are insert or delete (given in next line with indel *) i need to remove them alone not other nucleotides in that column.

ADD REPLY • link updated 4.6 years ago by Ram 43k • written 13.6 years ago by Rm 8.3k

0

Entering edit mode

for example

2L  650 t   T   48  0   34  7   ..,.,,+4caaa,+4caaa C?CCA=?

Here read depth is 7 (column8) so there should be 7 letters in column 9 but has more because of these extra (+4caaa twice) which only I need to remove.

ADD REPLY • link updated 4.6 years ago by Ram 43k • written 13.6 years ago by Rm 8.3k

0

Entering edit mode

base qualities are represented in column10

ADD REPLY • link 13.6 years ago by Rm 8.3k

0

Entering edit mode

if there is an potential SNP then coloumn 9 also will have a t g c A T G C. apart from reference read (.,)

ADD REPLY • link 13.6 years ago by Rm 8.3k

0

Entering edit mode

Thanks for pointing my error. However, as far as I understand a few short read were poorly aligned vs the reference genome. But at the end, the pileup algorithm decided that the mutation was a simple substitution. You can use pileup/'tview' to visualize the alignment at this position.

ADD REPLY • link 13.6 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

thanks, I am trying to get a concensus sequence from the raw pileup based on nucleotide distribution at each position (not based on quality) and these were cause errors in my counts.

ADD REPLY • link 13.6 years ago by Rm 8.3k

Ram · Answer 2 · 2010-09-30

1

Entering edit mode

13.6 years ago

Rm 8.3k

got simple awk solution: (included entire IUB nucleotide codes)

awk '/$9/gsub("[+-][0-9]+[atgcrykmswbdhvnATGCRYKMSWBDHVN]+", "", $9)' raw.pileup >processed.pileup

ADD COMMENT • link updated 4.6 years ago by Ram 43k • written 13.6 years ago by Rm 8.3k

0

Entering edit mode

it is a brute force way of looking for string. But it will be error prone if the nucleotide immediately after the above string is a SNP.

Any suggestion to improve are welcomed: May be with using index, Pos, substr functions?

ADD REPLY • link 13.6 years ago by Rm 8.3k