Question: remove over-represented sequences
4.0 years ago by
wuzongze00110
Germany
wuzongze00110 wrote:

I downloaded SRR1029258.sra (GSM1263454) from GEO and ran fastq-dump to convert it to FASTQ. I then ran FastQC for quality control. The sequence quality was not high enough, and the most over-represented sequence accounted for about 29.9% of reads. So I ran:

fastq_quality_trimmer -t 10 -l 40 -i SRR1029258.fastq -o trim_SRR1029258.fastq

The per-base sequence quality is better, but the most over-represented sequence now accounts for 33.1% of reads.

From what I found on Google, over-represented sequences may be due to contamination and may lead to wrong conclusions. Is there any way I can remove these reads?

 

4.0 years ago by
Devon Ryan89k
Freiburg, Germany
Devon Ryan89k wrote:

Given that this is a ChIPseq experiment, over-represented sequences are expected. If you were to completely remove these, then you'd end up removing at least some of the real peaks.


Thank you very much! I hadn't realized that.

Reply written 4.0 years ago by wuzongze00110
4.0 years ago by
Adrian Pelin2.2k
Canada
Adrian Pelin2.2k wrote:

rRNA is a common over-represented sequence; in some of the RNA-Seq reads I'm working with, contamination ranges from 15% to 22%. The easiest thing to do is to BLAST your over-represented sequence. Does FastQC say "No Hit" next to it?


I BLASTed it, and the top two hits are both rRNA. The top five over-represented sequences in my FastQC result account for 33%, 4.5%, 3.6%, 1.9%, and 0.6% of reads. I do not really understand "Does FastQC say No Hit next to it"; could you explain this?

Reply written 4.0 years ago by wuzongze00110

FastQC checks against a limited database of sequences, so it won't be as thorough as BLAST.

Reply written 4.0 years ago by Devon Ryan89k

Devon has a point: bundling the entire rRNA database with the package would make the software large and difficult to download. That said, I haven't figured out how to easily copy and paste the over-represented sequences; I usually have to save the report and then open the HTML to copy the sequence.

Reply written 4.0 years ago by Adrian Pelin2.2k
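
One possible way around the copy-and-paste step, as a minimal sketch: FastQC also writes a plain-text fastqc_data.txt next to the HTML report, and the over-represented sequences can be pulled out of it as FASTA for BLAST. The function name overrepresented_to_fasta and the FASTA header layout are made up for illustration, and the sketch assumes the tab-separated module layout (a ">>Overrepresented sequences" line, a "#" header row, data rows, then ">>END_MODULE").

```python
def overrepresented_to_fasta(fastqc_data_text):
    """Return FASTA text built from the 'Overrepresented sequences'
    module of a fastqc_data.txt file (passed in as one string).

    Hypothetical helper: the name and header format are illustrative.
    """
    records = []
    in_module = False
    for line in fastqc_data_text.splitlines():
        if line.startswith(">>Overrepresented sequences"):
            in_module = True          # start of the module we want
            continue
        if in_module and line.startswith(">>END_MODULE"):
            break                     # end of that module
        if not in_module or line.startswith("#") or not line.strip():
            continue                  # other modules, header row, blanks
        fields = line.split("\t")
        if len(fields) < 3:
            continue                  # skip rows that do not look like data
        sequence, count, percentage = fields[0], fields[1], fields[2]
        source = fields[3] if len(fields) > 3 else "unknown"
        header = (
            f">overrep_{len(records) + 1} "
            f"count={count} percent={percentage} source={source}"
        )
        records.append(f"{header}\n{sequence}")
    return "\n".join(records) + ("\n" if records else "")
```

To use it, read the file into a string first, for example open(path).read(), where path points at the fastqc_data.txt inside the unzipped FastQC output folder (the exact location depends on how FastQC was run), then paste or upload the returned FASTA to BLAST.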

Hi! I ran FastQC and got a bunch of over-represented sequences with "No Hit" next to them. I'm not sure whether these are adaptors or important biological information that should stay in. I guess what I'm really asking is: what does "No Hit" mean in this context? I'm going to try removing them (using cutadapt) and see whether the per-base quality improves (I was getting really low quality at the ends of reads). Any advice on this would be really appreciated. Cheers!

Reply written 3.8 years ago by lawrence.mckechnie0

"The easiest thing to do is to blast your overrepresented sequence"

Reply written 3.7 years ago by Adrian Pelin2.2k
4.0 years ago by
Gary450
Taiwan/Taichung/China Medical University Hospital
Gary450 wrote:

If the percentage of unique reads is too low, it could be caused by PCR bias during ChIP-Seq library preparation. 
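
A rough way to look at this before alignment, as a sketch only: count distinct read sequences in the FASTQ file and divide by the total number of reads. The function name unique_read_fraction is hypothetical, the code assumes plain four-line FASTQ records, and it compares reads by exact sequence rather than by mapped position, so it is not the same measure that FastQC or alignment-based duplicate marking reports.

```python
import io


def unique_read_fraction(fastq_handle):
    """Return distinct sequences / total reads for a 4-line-per-record
    FASTQ stream. Hypothetical helper; keeps every distinct sequence in
    memory, so it is only practical for modest file sizes.
    """
    seen = set()
    total = 0
    for i, line in enumerate(fastq_handle):
        if i % 4 == 1:                # second line of each record = sequence
            total += 1
            seen.add(line.strip())
    return len(seen) / total if total else 0.0


# Tiny made-up example: four reads, two of them identical.
example = io.StringIO(
    "@r1\nACGT\n+\nIIII\n"
    "@r2\nACGT\n+\nIIII\n"
    "@r3\nTTTT\n+\nIIII\n"
    "@r4\nGGGG\n+\nIIII\n"
)
fraction = unique_read_fraction(example)
```

On a real file, pass an open file handle instead of the io.StringIO object, for example open("trim_SRR1029258.fastq").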
