7.2 years ago by
San Francisco, CA, Cancer Therapeutics Innovation Group
Given your nucleotide distribution, I do not see how the beginnings of these reads could be genomic. Perhaps your samples were multiplexed, and that is the barcode you are seeing? That would explain why the sequences differ between your samples. At the very least, I highly doubt that sequence is genomic, unless the reads all start at a very specific N-mer in the genome that is different for each sample (which seems very improbable). Although I have never worked with Ion Torrent data before, I would definitely recommend getting rid of that part of those reads. It is just too weird.
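If you want to check the per-position nucleotide distribution yourself, here is a minimal sketch in Python. The function name `per_position_composition` and its arguments are my own invention for illustration, and it assumes you have already pulled the read sequences out of your FASTQ as plain strings.

```python
from collections import Counter


def per_position_composition(reads, n_positions):
    """Fraction of A/C/G/T calls at each of the first n_positions read positions.

    Hypothetical helper: `reads` is any iterable of sequence strings.
    Other symbols (e.g. N) count toward the total but are not reported.
    """
    counts = [Counter() for _ in range(n_positions)]
    for seq in reads:
        for i, base in enumerate(seq[:n_positions]):
            counts[i][base.upper()] += 1

    fractions = []
    for c in counts:
        total = sum(c.values())
        # Positions that no read reaches give an empty dict.
        fractions.append({b: c[b] / total for b in "ACGT"} if total else {})
    return fractions
```

A barcode or adapter would show up as positions where a single base has a fraction near 1.0, whereas genomic sequence should sit reasonably close to the genome-wide base frequencies at every position.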
Even the first 22 or 23 bases look fishy in terms of biases away from certain nucleotide calls. Quite a few programs out there work under the assumption that the beginnings of the reads are of the highest quality. Perhaps the Ion Torrent software is built knowing that these kinds of oddities can happen? I would probably just strip off the first 23 bases, and then use a read mapper that can handle indels, like bwa. I might lean toward bwa bwasw (rather than bwa aln followed by bwa sampe) for these since they are on the longer side. I really don't know anything about TMAP. Is there literature stating that it is better for Ion Torrent reads than something like bowtie2, and showing a performance comparison?
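For the stripping step, something like the following rough Python sketch would do. The names `trim_fastq_record` and `trim_fastq` are hypothetical, the default of 23 bases is just the number I suggested above, and the code assumes a plain, uncompressed FASTQ with exactly four lines per record. Existing read trimmers will do the same job, so treat this only as an illustration of the idea.

```python
def trim_fastq_record(header, seq, plus, qual, n_trim=23):
    """Drop the first n_trim bases and their quality characters.

    Returns None when the read would be empty after trimming.
    """
    if len(seq) <= n_trim:
        return None
    return header, seq[n_trim:], plus, qual[n_trim:]


def trim_fastq(in_path, out_path, n_trim=23):
    """Trim every record in a four-line-per-record FASTQ; return reads kept."""
    kept = 0
    with open(in_path) as fin, open(out_path, "w") as fout:
        while True:
            lines = [fin.readline().rstrip("\n") for _ in range(4)]
            if not lines[0]:  # end of file
                break
            trimmed = trim_fastq_record(*lines, n_trim=n_trim)
            if trimmed is not None:
                fout.write("\n".join(trimmed) + "\n")
                kept += 1
    return kept
```

Reads no longer than the trim length are dropped entirely rather than written out as empty records, since most mappers handle zero-length reads poorly.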
Also, what do you want to do with this data? If you are doing variant calling, then you want a really clean dataset, so err on the side of caution. Strong position-specific read biases like this can bias your variant calls, which is always embarrassing if you think you found something exciting when it is just data noise. After mapping your reads, I would feed the alignment through a pipeline like the raw data processing step that comes before the UnifiedGenotyper in Broad's variant calling pipeline (in the Genome Analysis Toolkit). This alignment processing pipeline has stages that attempt to identify these kinds of position-specific biases in reads and then readjust quality scores accordingly.
Anyways, good luck with this dataset!