How To Process Raw Pileup
12.0 years ago
Rm 8.2k

In a raw pileup file, the line just before an indel line (*) sometimes has multiple insertion (+4caaa) or deletion (-13GGCGCGCGTGCGC) strings in the read-bases column (column 9).

How can I get rid of them? Any quick awk or perl suggestions?

2L  650 t   T   48  0   34  7   ..,.,,+4caaa,+4caaa C?CCA=?
2L  650 *   */* 9   0   34  7   *   +CAAA   5   2   0   0   0
2L  654 A   A   48  0   34  7   .$.$,.,+1g,,    DBC?CCA
2L  654 *   */* 19  0   34  7   *   +G  6   1   0   0   0
2L  2332    g   G   60  0   14  33  .,...,,.-13GGCGCGCGTGCGC.,,A,,A,,A,..........aa..   DCBBBBDBCCCCBCCCDCBDCBCCCACCBABCC
2L  2332    *   */* 61  0   14  33  *   -ggcgcgcgtgcgc  32  1   0   0   0
2L  3334    a   A   163 0   15  49  ..$,..,,,t,.T,,,..,,,,T,-7attattt,,-7attattt,,,,,,....,,......,.,,.. BBCA>BCCCC:CCCC>ACCCCBCCCCBCCCCCCDCCCCCCCCBCBCDCC
2L  3334    *   */-attattt  27  27  15  49  *   -attattt    47  2   0   0   0
2L  3928    c   C   32  0   0   11  ,,-4tctt,,.-4TCTT...-4TCTT.^!.^!,   CCC8CCCCBCA
2L  3928    *   */-tctt 157 157 0   11  *   -tctt   8   3   0   0   0

12.0 years ago

Those +4caaa and -13GGCGCGCGTGCGC strings are the 'read bases at a SNP line' (not the base qualities, as first written) of the short reads under the mutation. If you only want to keep the simple substitutions, you could use the following simple awk script:

{
    $4 = toupper($4)   # normalise the consensus base (column 4) to upper case
    if ($4 == "A" || $4 == "T" || $4 == "G" || $4 == "C") print $0
}


Column 9 holds the read bases, whereas these extra strings (+4caaa and -13GGCGCGCGTGCGC) are insertions or deletions (reported on the next line, the indel line marked with *). I need to remove only these strings, not the other nucleotides in that column.


For example:

2L  650 t   T   48  0   34  7   ..,.,,+4caaa,+4caaa C?CCA=?

Here the read depth is 7 (column 8), so there should be 7 symbols in column 9, but there are more because of the extra strings (+4caaa, twice). Those are the only things I need to remove.


Base qualities are given in column 10.


If there is a potential SNP, then column 9 will also contain a, t, g, c, A, T, G or C, in addition to the reference matches (. and ,).


Thanks for pointing out my error. However, as far as I understand, a few short reads were poorly aligned to the reference genome, but in the end the pileup algorithm decided that the mutation was a simple substitution. You can use pileup/'tview' to visualize the alignment at this position.


Thanks. I am trying to get a consensus sequence from the raw pileup based on the nucleotide distribution at each position (not on quality), and these strings were causing errors in my counts.
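The per-position counting described above could be sketched roughly as follows. This is only an illustration, not the poster's code: the function name `base_counts` is hypothetical, and it assumes the usual pileup conventions for column 9 (`.` and `,` are reference matches, `^` is followed by one mapping-quality character, `$` marks a read end, and `+N`/`-N` is followed by exactly N inserted or deleted bases).

```python
import re
from collections import Counter

# A sign followed by the indel length, e.g. "+4" or "-13".
INDEL_HEAD = re.compile(r"[+-](\d+)")

def base_counts(ref_base: str, read_bases: str) -> Counter:
    """Count base calls in a pileup read-bases column, ignoring indel strings.

    Hypothetical helper: '.' and ',' are counted as the reference base,
    mismatches are counted case-insensitively.
    """
    counts = Counter()
    ref = ref_base.upper()
    i = 0
    while i < len(read_bases):
        ch = read_bases[i]
        if ch == "^":
            i += 2                      # skip '^' and its mapping-quality character
        elif ch == "$":
            i += 1                      # read-end marker carries no base
        elif ch in "+-":
            m = INDEL_HEAD.match(read_bases, i)
            # skip the sign, the digits and exactly that many indel bases
            i = m.end() + int(m.group(1)) if m else i + 1
        elif ch in ".,":
            counts[ref] += 1            # match to the reference base
            i += 1
        else:
            counts[ch.upper()] += 1     # mismatch, or '*' for a deleted base
            i += 1
    return counts

counts = base_counts("t", "..,.,,+4caaa,+4caaa")
print(counts.most_common(1))  # [('T', 7)]
```

The total of the counts should equal the read depth in column 8, which gives a quick sanity check; taking the most common base as the consensus is just one possible rule.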

12.0 years ago
Rm 8.2k

Got a simple awk solution (covering the entire set of IUB nucleotide codes):

awk '{ gsub(/[+-][0-9]+[atgcrykmswbdhvnATGCRYKMSWBDHVN]+/, "", $9); print }' raw.pileup > processed.pileup


It is a brute-force way of matching the string, and it will be error-prone if the nucleotide immediately after the indel string is a SNP, because that base gets removed along with it.

Any suggestions for improvement are welcome, maybe using the index, pos or substr functions?
