1
0
Entering edit mode
5.8 years ago
John 13k

Hello :)

Is there a tool that just looks at over-represented sequences? The only one i know of is FastQC, but this does a lot of other things and the output is a little unwieldy. Ideally I just want to supply 1 or more BAM files and get out a over-represented sequence profile (bonus points for annotating them as XYZ adaptors) for each sample.

FastQC Illumina sequencing • 1.3k views
2
Entering edit mode
5.8 years ago

If you're only interested in adapter sequences, then try BBDuk (part of BBMap) with the stats flag.

1
Entering edit mode

Additionally, if you have paired reads, BBMerge can print out the adapter sequence for you based on the overlaps, even if you have adapters not in the list of adapter sequences provided with the BBMap package (in adapters.fa). For example:

bbmerge.sh in1=r1.fq in2=r2.fq outa=adapters.fa

0
Entering edit mode

I think this is exactly what I am looking for Brian. Thank you both so much! :)

0
Entering edit mode

Eek, wait, i got a weird result. Maybe im not doing it right?

Params:

java -Djava.library.path=/ex/bbmap/jni/ -ea -Xmx1000m -cp /ex/bbmap/current/ jgi.BBMerge in1=/mnt/Alpha/RAW_DATA/ChIP/44_Mm01_WEAd_C2/replicate1-H3K27ac/paired/run130812_SN7001180_0069_AD265MACXX/sequence/44_Mm01_WEAd_C2_H3K27ac_F_1_CAGATC_L003_R1_001.fastq.gz in2=/mnt/Alpha/RAW_DATA/ChIP/44_Mm01_WEAd_C2/replicate1-H3K27ac/paired/run130812_SN7001180_0069_AD265MACXX/sequence/44_Mm01_WEAd_C2_H3K27ac_F_1_CAGATC_L003_R2_001.fastq.gz outa=./adapters.fa

0
Entering edit mode

Could it be because my reads don't overlap? I have 50nt P.E. sequencing of 700bp fragments.

1
Entering edit mode

Looking at the insert size distribution, you should have maybe 1% of the overlapped reads shorter than 50bp, which would mean 1%*6.5%*17591731=11000 reads with insert size shorter than 50bp, which thus should contain adapter sequence. But, a 0.00065 fraction is close enough to BBMerge's false-positive rate that the signal might be too noisy to get a confident consensus (in which case it just gives you a 1bp truncated adapter sequence, as in this case, so that the fasta file is valid). You could retry adding the flags vstrict mininsert=20, which would reduce the noise (it will still probably fail, but that's only a wasted 12 seconds); but, your 5' adapter rate in this case is so low you don't need to worry about it anyway. If you have odd FastQC results for this data, something else is the culprit.

0
Entering edit mode

I really apprechiate your quick replies (particularly over the weekend!) Brian - I really think I should be using BBMap more if this is the sort of end-user help people can expect :]

Well, FastQC's position-dependant-kmer-analysis detects stuff, and correctly identifies the kmers contributing to an Illumina TruSeq Adapter, even though it detects that in over-represented sequences and not in Adapter contamination (???)

I just find it super weird that there isnt a tool specifically for beginning-of-read over-represented-sequence analysis. Something like FastQC's kmer analysis tool, but peices together kmers to form the full string - perhaps even going back to the data with this hint and testing the full kmer more throughly.

I tried with vstrict and mininsert but alas I got the same result. I will now proceed to open 155 fastqc reports whilst scratching my fingernails across a chalkboard to lighten the mood.

Traffic: 2950 users visited in the last hour
FAQ
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.