8.7 years ago
John
Hello :)
Is there a tool that just looks at over-represented sequences? The only one I know of is FastQC, but it does a lot of other things and the output is a little unwieldy. Ideally I just want to supply one or more BAM files and get out an over-represented sequence profile (bonus points for annotating them as XYZ adaptors) for each sample.
Additionally, if you have paired reads, BBMerge can print out the adapter sequence for you based on the overlaps, even if you have adapters not in the list of adapter sequences provided with the BBMap package (in adapters.fa). For example:
I think this is exactly what I am looking for, Brian. Thank you both so much! :)
Eek, wait, I got a weird result. Maybe I'm not doing it right?
Params:
Could it be because my reads don't overlap? I have 50nt P.E. sequencing of 700bp fragments.
Short answer, probably. If your reads don't fully overlap, you won't have adapter contamination.
Looking at the insert size distribution, you should have maybe 1% of the overlapped reads shorter than 50bp, which would mean
1% * 6.5% * 17591731 ≈ 11000
reads with insert size shorter than 50bp, which thus should contain adapter sequence. But a 0.00065 fraction is close enough to BBMerge's false-positive rate that the signal might be too noisy to get a confident consensus (in which case it just gives you a 1bp truncated adapter sequence, as in this case, so that the fasta file is valid). You could retry adding the flags vstrict mininsert=20, which would reduce the noise (it will still probably fail, but that's only a wasted 12 seconds); but your 5' adapter rate in this case is so low you don't need to worry about it anyway. If you have odd FastQC results for this data, something else is the culprit.
I really appreciate your quick replies (particularly over the weekend!) Brian - I really think I should be using BBMap more if this is the sort of end-user help people can expect :]
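For anyone wanting to redo the back-of-the-envelope estimate from earlier in the thread (1% * 6.5% * 17591731) on their own numbers, here is a minimal Python sketch. It assumes that 6.5% is the fraction of pairs BBMerge managed to overlap and that 17591731 is the total number of pairs; the thread does not spell either out, and the variable and function names are just illustrative.

```python
# Numbers quoted in the thread; names are hypothetical, chosen for illustration.
total_reads = 17_591_731        # assumed: total read pairs in the sample
overlap_fraction = 0.065        # assumed: fraction of pairs that overlapped (6.5%)
short_insert_fraction = 0.01    # ~1% of overlapped pairs with insert < 50bp


def expected_adapter_reads(total, overlap, short):
    """Expected number of pairs whose insert is shorter than the read length."""
    return total * overlap * short


reads = expected_adapter_reads(total_reads, overlap_fraction, short_insert_fraction)
fraction = overlap_fraction * short_insert_fraction

print(f"expected adapter-containing pairs: {reads:.0f}")
print(f"fraction of all pairs: {fraction:.5f}")
```

The point of the second number is the one made above: a fraction this small sits near the false-positive rate, so a confident consensus is unlikely.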
Well, FastQC's position-dependent k-mer analysis detects stuff, and correctly identifies the kmers contributing to an Illumina TruSeq Adapter, even though it reports that under over-represented sequences and not under Adapter contamination (???)
I just find it super weird that there isn't a tool specifically for beginning-of-read over-represented-sequence analysis. Something like FastQC's kmer analysis tool, but one that pieces together kmers to form the full string - perhaps even going back to the data with this hint and testing the full kmer more thoroughly.
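To make the idea concrete, here is a rough Python sketch of what such a tool might do: count the k-length prefixes of reads, then go back to the data and greedily extend a frequent prefix while most matching reads agree on the next base. This is not how FastQC or BBMerge work; the function names, the default `k`, and the thresholds are all hypothetical, and the reads are taken as plain strings (pulling them out of a BAM file would need a separate library and is not shown).

```python
from collections import Counter


def prefix_profile(reads, k=12, min_fraction=0.01):
    """Return (prefix, fraction) pairs for read starts seen in >= min_fraction of reads."""
    counts = Counter(r[:k] for r in reads if len(r) >= k)
    total = sum(counts.values())
    if total == 0:
        return []
    return [(p, c / total) for p, c in counts.most_common() if c / total >= min_fraction]


def extend_prefix(reads, prefix, min_support=0.8):
    """Extend prefix one base at a time while reads starting with it agree on the next base."""
    seq = prefix
    while True:
        # Tally the base that follows the current sequence in every matching read.
        nxt = Counter(r[len(seq)] for r in reads if r.startswith(seq) and len(r) > len(seq))
        if not nxt:
            return seq
        base, n = nxt.most_common(1)[0]
        if n / sum(nxt.values()) < min_support:
            return seq
        seq += base


if __name__ == "__main__":
    # Toy data: three reads share a start, one does not.
    toy = ["ACGTACGTTT", "ACGTACGTGA", "ACGTACGTCC", "TTTTGGGGCC"]
    for prefix, frac in prefix_profile(toy, k=4, min_fraction=0.5):
        print(prefix, frac, extend_prefix(toy, prefix))
```

Annotating the extended sequences as known adaptors would then be a lookup against a list such as the adapters.fa mentioned above.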
I tried with vstrict and mininsert but alas I got the same result. I will now proceed to open 155 fastqc reports whilst scratching my fingernails across a chalkboard to lighten the mood.