Question: How To Process Raw Pileup
2
8.6 years ago by
Rm7.8k
Danville, PA
Rm7.8k wrote:

In a raw pileup file, the line just before an indel line (*) sometimes has one or more insertion (+4caaa) or deletion (-13GGCGCGCGTGCGC) strings in the read bases column (column 9).

How can I get rid of them? Any quick awk or Perl suggestions?

2L  650 t   T   48  0   34  7   ..,.,,+4caaa,+4caaa C?CCA=?
2L  650 *   */* 9   0   34  7   *   +CAAA   5   2   0   0   0
2L  654 A   A   48  0   34  7   .$.$,.,+1g,,    DBC?CCA
2L  654 *   */* 19  0   34  7   *   +G  6   1   0   0   0
2L  2332    g   G   60  0   14  33  .,...,,.-13GGCGCGCGTGCGC.,,A,,A,,A,..........aa..   DCBBBBDBCCCCBCCCDCBDCBCCCACCBABCC
2L  2332    *   */* 61  0   14  33  *   -ggcgcgcgtgcgc  32  1   0   0   0
2L  3334    a   A   163 0   15  49  ..$,..,,,t,.T,,,..,,,,T,-7attattt,,-7attattt,,,,,,....,,......,.,,..    BBCA>BCCCC:CCCC>ACCCCBCCCCBCCCCCCDCCCCCCCCBCBCDCC
2L  3334    *   */-attattt  27  27  15  49  *   -attattt    47  2   0   0   0
2L  3928    c   C   32  0   0   11  ,,-4tctt,,.-4TCTT...-4TCTT.^!.^!,   CCC8CCCCBCA
2L  3928    *   */-tctt 157 157 0   11  *   -tctt   8   3   0   0   0
perl format pileup awk • 2.5k views
modified 8.3 years ago • written 8.6 years ago by Rm7.8k
4
8.6 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum120k wrote:

Those strings (+4caaa and -13GGCGCGCGTGCGC) are the 'read bases at a SNP line' (corrected from 'base qualities') of the short reads under the mutation. If you only want to keep the simple substitutions, you could use the following simple awk script:

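  # Keep only lines whose consensus base (column 4) is a plain A, C, G or T.
  {
  base = toupper($4);   # compare on a copy so the line is printed unchanged
  if (base == "A" || base == "T" || base == "G" || base == "C") print $0;
  }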
modified 8.6 years ago • written 8.6 years ago by Pierre Lindenbaum120k

Column 9 holds the read bases. The extra strings (+4caaa and -13GGCGCGCGTGCGC) are insertions or deletions (reported on the next line, the indel line marked with *). I need to remove only those strings, not the other nucleotides in that column.

written 8.6 years ago by Rm7.8k

For example: "2L 650 t T 48 0 34 7 ..,.,,+4caaa,+4caaa C?CCA=?". Here the read depth is 7 (column 8), so there should be 7 characters in column 9, but there are more because of the extra strings (+4caaa twice). Those are the only things I need to remove.

written 8.6 years ago by Rm7.8k
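
The depth relationship described in the comment above can be checked mechanically. The following is only a sketch, assuming a tab-delimited pileup where column 8 is the depth and column 9 the read bases; the function name `check_depth` is made up for illustration. It drops the `^X` read-start markers and the indel strings from a copy of column 9, counts what is left, and reports positions where that count differs from column 8.

```sh
# check_depth (hypothetical helper): list positions where the number of
# read bases in column 9 does not match the depth in column 8.
check_depth() {
  awk 'BEGIN { FS = OFS = "\t" }
  $3 != "*" && NF >= 9 {             # skip the indel lines themselves
    s = $9
    gsub(/\^./, "", s)               # read-start marker plus its quality character
    kept = ""
    while (match(s, /[+-][0-9]+/)) { # sign followed by the indel length
      len  = substr(s, RSTART + 1, RLENGTH - 1) + 0
      kept = kept substr(s, 1, RSTART - 1)
      s    = substr(s, RSTART + RLENGTH + len)   # skip exactly len bases
    }
    s = kept s
    n_bases = gsub(/[.,*ACGTNacgtn]/, "", s)     # gsub returns the match count
    if (n_bases != $8 + 0) print $1, $2, $8, n_bases
  }' "$@"
}
```

Called as `check_depth raw.pileup`, it prints chromosome, position, reported depth, and counted bases for each mismatch, and prints nothing when every line is consistent.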

Base qualities are in column 10.

written 8.6 years ago by Rm7.8k

If there is a potential SNP, then column 9 will also contain a, t, g, c, A, T, G or C, in addition to the reference matches (. and ,).

written 8.6 years ago by Rm7.8k

Thanks for pointing out my error. However, as far as I understand, a few short reads were poorly aligned to the reference genome, but in the end the pileup algorithm decided that the mutation was a simple substitution. You can use pileup/'tview' to visualize the alignment at this position.

written 8.6 years ago by Pierre Lindenbaum120k

Thanks. I am trying to get a consensus sequence from the raw pileup based on the nucleotide distribution at each position (not based on quality), and these strings were causing errors in my counts.

written 8.6 years ago by Rm7.8k
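
One way the per-position nucleotide counts could be computed is sketched below. It is an illustration, not the poster's actual script: it assumes a tab-delimited pileup with the reference base in column 3 and the read bases in column 9, and the name `base_counts` is invented here. Indel strings, `^X` read-start markers and `$` read-end markers are stripped first, `.` and `,` are replaced by the reference base, and then A, C, G and T are counted without looking at qualities.

```sh
# base_counts (hypothetical helper): print chrom, pos, counts of A C G T,
# and the most frequent base for every non-indel pileup line.
base_counts() {
  awk 'BEGIN { FS = OFS = "\t"; split("A C G T", order, " ") }
  $3 != "*" && NF >= 9 {
    s = $9
    gsub(/\^./, "", s)               # read-start marker plus its quality character
    gsub(/\$/, "", s)                # read-end marker
    kept = ""
    while (match(s, /[+-][0-9]+/)) { # remove each indel using its stated length
      len  = substr(s, RSTART + 1, RLENGTH - 1) + 0
      kept = kept substr(s, 1, RSTART - 1)
      s    = substr(s, RSTART + RLENGTH + len)
    }
    s = toupper(kept s)
    gsub(/[.,]/, toupper($3), s)     # matches to the reference become the reference base
    best = "N"; max = 0
    for (i = 1; i <= 4; i++) {
      b = order[i]
      cnt[b] = gsub(b, b, s)         # count occurrences of this base
      if (cnt[b] > max) { max = cnt[b]; best = b }   # ties go to the earlier base
    }
    print $1, $2, cnt["A"], cnt["C"], cnt["G"], cnt["T"], best
  }' "$@"
}
```

Usage would be `base_counts raw.pileup > counts.tsv`. Ties are resolved in A, C, G, T order, which is an arbitrary choice made for this sketch.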
1
8.6 years ago by
Rm7.8k
Danville, PA
Rm7.8k wrote:

Got a simple awk solution (it includes the entire set of IUB nucleotide codes):

awk 'BEGIN { FS = OFS = "\t" } { gsub(/[+-][0-9]+[atgcrykmswbdhvnATGCRYKMSWBDHVN]+/, "", $9); print }' raw.pileup > processed.pileup
modified 8.6 years ago • written 8.6 years ago by Rm7.8k

It is a brute-force way of matching the strings. It is error-prone when the nucleotide immediately after an indel string is a SNP base, because the letter class keeps matching and removes that base too.

Any suggestions for improvement are welcome. Maybe using the index, pos or substr functions?

written 8.6 years ago by Rm7.8k
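
A possible improvement along the lines asked for above is to read the number after the sign and skip exactly that many characters with `match` and `substr`, instead of letting a letter class run on. The sketch below assumes a tab-delimited pileup with the read bases in column 9; the function name `strip_indels` is made up for illustration.

```sh
# strip_indels (hypothetical helper): remove +N<seq> and -N<seq> from
# column 9, consuming exactly N characters after the count.
strip_indels() {
  awk 'BEGIN { FS = OFS = "\t" }
  NF >= 9 {
    s = $9; kept = ""
    while (match(s, /[+-][0-9]+/)) {                 # sign followed by the length
      len  = substr(s, RSTART + 1, RLENGTH - 1) + 0  # declared indel length
      kept = kept substr(s, 1, RSTART - 1)           # keep everything before the indel
      s    = substr(s, RSTART + RLENGTH + len)       # skip the count and len bases
    }
    $9 = kept s
  }
  { print }' "$@"
}
```

Used as `strip_indels raw.pileup > processed.pileup`. With a read column such as `.,+2agT.`, this leaves `.,T.`, so the SNP base `T` after the insertion survives, whereas a greedy letter class would remove it as well.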