How To Process Raw Pileup
2
2
Entering edit mode
12.0 years ago
Rm 8.2k

Raw pileup file some times before indel (*) have multiple insert (+4caaa) or deletion (-13GGCGCGCGTGCGC) strings in the read colum (column9)

how to get rid of them? Any quick awk or perl suggestions.

2L  650 t   T   48  0   34  7   ..,.,,+4caaa,+4caaa C?CCA=?
2L  650 *   */* 9   0   34  7   *   +CAAA   5   2   0   0   0
2L  654 A   A   48  0   34  7   .$.$,.,+1g,,    DBC?CCA
2L  654 *   */* 19  0   34  7   *   +G  6   1   0   0   0
2L  2332    g   G   60  0   14  33  .,...,,.-13GGCGCGCGTGCGC.,,A,,A,,A,..........aa..   DCBBBBDBCCCCBCCCDCBDCBCCCACCBABCC
2L  2332    *   */* 61  0   14  33  *   -ggcgcgcgtgcgc  32  1   0   0   0
2L  3334    a   A   163 0   15  49  ..$,..,,,t,.T,,,..,,,,T,-7attattt,,-7attattt,,,,,,....,,......,.,,..    BBCA>BCCCC:CCCC>ACCCCBCCCCBCCCCCCDCCCCCCCCBCBCDCC
2L  3334    *   */-attattt  27  27  15  49  *   -attattt    47  2   0   0   0
2L  3928    c   C   32  0   0   11  ,,-4tctt,,.-4TCTT...-4TCTT.^!.^!,   CCC8CCCCBCA
2L  3928    *   */-tctt 157 157 0   11  *   -tctt   8   3   0   0   0
pileup format awk perl • 3.5k views
ADD COMMENT
4
Entering edit mode
12.0 years ago

Those +4caaa and -13GGCGCGCGTGCGC are the base qualities 'read bases at a SNP line' of the short reads under the mutation. If you only want to keep the simple substitutions you could use the following simple awk script:

{
$4=toupper($4);
if($4=='A' || $4=='T' || $4=='G' || $4=='C') print $0;
}
ADD COMMENT
0
Entering edit mode

Column 9 is representative nucleotides of read bases. where as these extra (+4caaa and -13GGCGCGCGTGCGC) are insert or delete (given in next line with indel *) i need to remove them alone not other nucleotides in that column.

ADD REPLY
0
Entering edit mode

for example

2L  650 t   T   48  0   34  7   ..,.,,+4caaa,+4caaa C?CCA=?

Here read depth is 7 (column8) so there should be 7 letters in column 9 but has more because of these extra (+4caaa twice) which only I need to remove.

ADD REPLY
0
Entering edit mode

base qualities are represented in column10

ADD REPLY
0
Entering edit mode

if there is an potential SNP then coloumn 9 also will have a t g c A T G C. apart from reference read (.,)

ADD REPLY
0
Entering edit mode

Thanks for pointing my error. However, as far as I understand a few short read were poorly aligned vs the reference genome. But at the end, the pileup algorithm decided that the mutation was a simple substitution. You can use pileup/'tview' to visualize the alignment at this position.

ADD REPLY
0
Entering edit mode

thanks, I am trying to get a concensus sequence from the raw pileup based on nucleotide distribution at each position (not based on quality) and these were cause errors in my counts.

ADD REPLY
1
Entering edit mode
12.0 years ago
Rm 8.2k

got simple awk solution: (included entire IUB nucleotide codes)

awk '/$9/gsub("[+-][0-9]+[atgcrykmswbdhvnATGCRYKMSWBDHVN]+", "", $9)' raw.pileup >processed.pileup
ADD COMMENT
0
Entering edit mode

it is a brute force way of looking for string. But it will be error prone if the nucleotide immediately after the above string is a SNP.

Any suggestion to improve are welcomed: May be with using index, Pos, substr functions?

ADD REPLY

Login before adding your answer.

Traffic: 2213 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6