Question: FASTQ file: duplicated sequences and overall poor quality
6 weeks ago, piyushjo50 wrote:

Hi,

I am running a FastQC quality check on a bunch of FASTQ files I downloaded from SRA. The FastQC report shows overall poor quality, with per base sequence content and sequence duplication levels flagged red. There is no adapter content, but a lot of sequences are listed in the overrepresented sequences category (less than 1%). So I ran trim_galore with default parameters and the paired option. The post-processing report looks worse than before, with no improvement in sequence duplication levels or overrepresented sequences.

Since no adapter content is flagged, I can't run Trimmomatic with an adapter sequence. Could you tell me what processing I need to do to improve sequence quality? For the particular example I posted, the alignment percentage to the reference genome is 88% (paired sequences). I also have some single-cell sequences from the same experiments, which have 60-70% alignment.

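Original

[image: original]

Overrepresented sequences

[image: original, overrepresented sequences]

Post processing

[image: post-processing]

Overrepresented sequences

[image: post-processing, overrepresented sequences]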

modified 6 weeks ago • written 6 weeks ago by piyushjo50

Please use these instructions to add images properly: How to add images to a Biostars post

written 6 weeks ago by genomax57k

I am trying to upload pictures, but they don't get posted.

written 6 weeks ago by piyushjo50

I showed you one example above.

written 6 weeks ago by genomax57k

Thanks! I was copying the wrong link.

written 6 weeks ago by piyushjo50

Seeing an "X" does not automatically mean the data is bad. You have to take the results in the context of the experiment you are looking at. Please take some time to read the informative blog posts that the FastQC team has on this site.

BTW: Your data looks ok (at least the bit you posted).

written 6 weeks ago by genomax57k

I added the overrepresented sequences part. I also want to mention that trim_galore detected the Nextera transposase sequence, which FastQC doesn't show, and then did the clipping. I think that resulted in variable sequence length.

written 6 weeks ago by piyushjo50
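
To illustrate why clipping leaves reads of different lengths, here is a minimal sketch of 3' adapter clipping. It assumes the Nextera transposase sequence is CTGTCTCTTATA and uses exact matching only; trim_galore (via Cutadapt) also tolerates mismatches and applies quality trimming, so this is not its actual algorithm. The function name `clip_adapter` and the `min_overlap` parameter are hypothetical.

```python
# Assumption: Nextera transposase adapter sequence used for auto-detection.
NEXTERA = "CTGTCTCTTATA"

def clip_adapter(seq, qual, adapter=NEXTERA, min_overlap=3):
    """Clip an exactly matching 3' adapter; return the trimmed (seq, qual)."""
    # Case 1: the full adapter occurs inside the read; cut at its first occurrence.
    pos = seq.find(adapter)
    if pos == -1:
        # Case 2: only the start of the adapter is present at the 3' end of the read.
        for k in range(min(len(adapter) - 1, len(seq)), min_overlap - 1, -1):
            if seq.endswith(adapter[:k]):
                pos = len(seq) - k
                break
    if pos == -1:
        return seq, qual  # no adapter found, read is left untouched
    return seq[:pos], qual[:pos]

# Three toy reads: no adapter, full adapter, partial adapter at the 3' end.
reads = [
    ("ACGTACGTACGTACGTACGT", "I" * 20),
    ("ACGTACGTCTGTCTCTTATAGG", "I" * 22),
    ("ACGTACGTACGTACGTCTGTC", "I" * 21),
]
trimmed = [clip_adapter(s, q) for s, q in reads]
lengths = [len(s) for s, _ in trimmed]
print(lengths)
```

Each read is cut at a different position depending on where the adapter starts, so the lengths after trimming are no longer uniform.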

I think that resulted in variable sequence length.

That is expected. When extraneous sequences are trimmed that will happen.

Over-represented sequences could represent sequences that were enriched as a part of the experiment (e.g. a binding site). So even if FastQC flagged them they may represent a result you want. I suggest that you go along to the next step (as long as all extraneous sequence has been trimmed from the data).

modified 6 weeks ago • written 6 weeks ago by genomax57k
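
Before deciding what to do with over-represented sequences, it can help to count them directly. The following is a rough sketch, not FastQC's implementation: it counts exact full-length sequences over the whole file, whereas FastQC works on a sample of the reads. The default `min_fraction` of 0.1% is meant to mirror the warning threshold described in the FastQC documentation; treat it as an assumption. The function names are hypothetical.

```python
from collections import Counter
import io

def read_fastq(handle):
    """Yield (header, sequence, quality) from a 4-lines-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            return
        seq = handle.readline().rstrip()
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip()
        yield header, seq, qual

def overrepresented(handle, min_fraction=0.001):
    """Return (sequence, count, fraction) for sequences above min_fraction of all reads."""
    counts = Counter(seq for _, seq, _ in read_fastq(handle))
    total = sum(counts.values())
    return [(seq, n, n / total)
            for seq, n in counts.most_common()
            if n / total > min_fraction]

# Toy input: "ACGT" makes up 3 of 5 reads.
demo = "".join(f"@r{i}\n{s}\n+\n{'I' * len(s)}\n"
               for i, s in enumerate(["ACGT"] * 3 + ["TTTT", "GGGG"]))
hits = overrepresented(io.StringIO(demo), min_fraction=0.5)
print(hits)
```

For a real file, open it with `open("reads.fastq")` (or `gzip.open(..., "rt")` for compressed input) and pass the handle in. The reported sequences can then be checked against the reference or BLASTed to see whether they are biological signal or contamination.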

Hi Genomax,

I talked to another bioinformatics professor, and he recommended removing the overrepresented sequences, as the source is tissue and not amplified RNA. Could you suggest a tool that can find and remove overrepresented sequences?

Thanks!

written 5 weeks ago by piyushjo50

You could use bbduk.sh from the BBMap suite with the literal=sequence1,sequence2 (etc.) option. That said, I don't think that is a good idea, since you could be skewing your data in some way by selectively removing sequences from it.

written 5 weeks ago by genomax57k
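
As a sketch of what filtering against a list of literal sequences means, the snippet below drops every read that contains one of the given sequences on either strand. It uses plain exact substring matching, which is an assumption made for illustration; bbduk.sh uses k-mer matching with its own parameters and will behave differently. The names `filter_reads` and `revcomp` are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def revcomp(seq):
    """Reverse complement of a DNA sequence (A, C, G, T, N only)."""
    return seq.translate(_COMPLEMENT)[::-1]

def filter_reads(records, unwanted):
    """Split (header, seq, qual) records into (kept, removed).

    A read is removed if it contains any unwanted sequence or its
    reverse complement as an exact substring.
    """
    patterns = set(unwanted) | {revcomp(s) for s in unwanted}
    kept, removed = [], []
    for rec in records:
        if any(p in rec[1] for p in patterns):
            removed.append(rec)
        else:
            kept.append(rec)
    return kept, removed

# Toy reads: r1 contains ACCCG, r3 contains its reverse complement CGGGT.
records = [
    ("@r1", "AAACCCGGG", "I" * 9),
    ("@r2", "TTTTTTTTT", "I" * 9),
    ("@r3", "CCCGGGTTT", "I" * 9),
]
kept, removed = filter_reads(records, ["ACCCG"])
print([r[0] for r in kept], [r[0] for r in removed])
```

With paired-end data, a read and its mate have to be removed together, otherwise the two files fall out of sync. Keeping the removed reads in a separate list makes it possible to check how much data the filter discards.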