I am new to Ion Torrent mapping, but have come to the conclusion that TMAP is the mapper of choice at the moment. Would anyone disagree with this statement?
I have been looking at my Ion Torrent reads with FastQC and have noticed an odd nucleotide distribution in the first nine bases.
It almost looks like primer/linker, but is different for each sample. Has anyone else experienced this? Should the first N bases be removed from Ion Torrent reads?
A suggestion was made to use the --nogroup flag so that FastQC does not group individual positions together when reads are longer than 50 bp. However, this did not change the "odd" profile I see. I have now included a snapshot (truncated by me at 54 bp).
Given your nucleotide distribution, I do not see how the beginnings of these reads could be genomic. Perhaps your samples were multiplexed, and that is the barcode you are seeing? That would explain why the sequences differ between your samples. At the very least, I highly doubt that sequence is genomic, unless the reads all start at a very specific N-mer in the genome that is different for each sample (which seems like a very improbable explanation). Although I have never worked with Ion Torrent data before, I would definitely recommend getting rid of that part of those reads. It is just too weird.
Even the first 22 or 23 bases look fishy in terms of biases away from certain nucleotide calls. Quite a few programs out there work under the assumption that the beginnings of the reads are of the highest quality. Perhaps the Ion Torrent software is built knowing that these kinds of oddities can happen? I would probably just strip off the first 23 bases, and then use a read mapper that can handle indels, like bowtie2 (not bowtie) or bwa. I might lean toward bowtie2 or bwa bwasw (rather than bwa aln followed by bwa sampe) for these, since the reads are on the longer side. I really don't know anything about TMAP. Is there literature stating that it is better for Ion Torrent reads than something like bwa or bowtie2, and showing a performance comparison?
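For what it's worth, stripping a fixed number of leading bases is easy to do before mapping. Below is a minimal Python sketch, not a tested pipeline step. The function name trim_fastq_prefix and its n_trim parameter are made up for illustration, the default of 23 just mirrors the number mentioned above, and it assumes plain, uncompressed 4-line FASTQ records. Reads that end up empty after trimming are dropped.

```python
import io


def trim_fastq_prefix(in_handle, out_handle, n_trim=23):
    """Remove the first n_trim bases and qualities from each 4-line FASTQ record.

    Returns the number of records written. Records that would be empty
    after trimming are skipped. (Hypothetical helper, for illustration only.)
    """
    kept = 0
    while True:
        header = in_handle.readline()
        if not header:
            break  # end of file
        if not header.strip():
            continue  # tolerate stray blank lines between records
        seq = in_handle.readline().rstrip("\n")
        in_handle.readline()  # the '+' separator line, rewritten below
        qual = in_handle.readline().rstrip("\n")

        trimmed_seq = seq[n_trim:]
        trimmed_qual = qual[n_trim:]
        if not trimmed_seq:
            continue  # nothing left of this read

        out_handle.write(header.rstrip("\n") + "\n")
        out_handle.write(trimmed_seq + "\n")
        out_handle.write("+\n")
        out_handle.write(trimmed_qual + "\n")
        kept += 1
    return kept


if __name__ == "__main__":
    # Tiny in-memory demonstration with two invented reads.
    demo_in = io.StringIO("@r1\nACGTACGT\n+\nIIIIHHHH\n@r2\nAC\n+\nII\n")
    demo_out = io.StringIO()
    n = trim_fastq_prefix(demo_in, demo_out, n_trim=4)
    print(n, "read(s) kept")
    print(demo_out.getvalue(), end="")
```

With real data you would pass open file handles instead of the StringIO objects. It is worth rerunning FastQC on the trimmed output to check that the odd profile is actually gone before mapping.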
Also, what do you want to do with this data? If you are doing variant calling, then you want a really clean dataset, so err on the side of caution. Strong position-specific read biases like this can bias your variant calls, which is always embarrassing if you think you found something exciting when it is just data noise. After mapping your reads, I would feed the alignment through a pipeline like the raw data processing step that comes before the UnifiedGenotyper in Broad's variant calling pipeline (in the Genome Analysis Toolkit). This alignment processing pipeline has stages that attempt to identify these kinds of position-specific biases in reads and then readjust the quality scores accordingly.
Sorry to come back to this discussion, but looking at the per-base sequence content plot, I read it as showing that the first 13 bases are almost exactly the same in every read. At each of these first 13 positions, a single base reaches almost 100% occurrence.
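One way to check this reading without eyeballing the plot is to count, per position, how often the most common base occurs and see how far the near-constant run extends. The sketch below is only an illustration: the function names, the 0.95 threshold and the 54-position window are my own assumptions, and the reads are passed in as plain strings.

```python
from collections import Counter


def per_position_top_base(seqs, max_pos=54):
    """Return [(most_common_base, fraction), ...] for each read position.

    Only the first max_pos positions are examined; positions that no
    read reaches are omitted from the result.
    """
    counts = [Counter() for _ in range(max_pos)]
    for seq in seqs:
        for i, base in enumerate(seq[:max_pos]):
            counts[i][base.upper()] += 1

    result = []
    for counter in counts:
        total = sum(counter.values())
        if total == 0:
            break  # no read is long enough to cover this position
        base, n = counter.most_common(1)[0]
        result.append((base, n / total))
    return result


def constant_prefix_length(seqs, threshold=0.95, max_pos=54):
    """Length of the leading run of positions dominated by a single base."""
    length = 0
    for _base, fraction in per_position_top_base(seqs, max_pos):
        if fraction < threshold:
            break
        length += 1
    return length


if __name__ == "__main__":
    # Invented reads sharing a 4-base prefix, for demonstration only.
    reads = ["ACGTAAAA", "ACGTCCCC", "ACGTGGGG", "ACGTTTTT"]
    print(constant_prefix_length(reads))
```

If the length reported for each sample matches the stretch that looks constant in the plot, and the dominant bases spell out a different sequence per sample, that would fit the barcode or primer explanation.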
This is not specific to Ion Torrent; I have seen the same pattern before, caused by primers, in 454 data. I would say it is the primers that we are seeing, and removing them would be the solution.