Question: Maximum Over-Represented Sequence in Bacterial Transcriptome Data
3.0 years ago by
india
debashis.bioinfo0 wrote:

Hi,

I have transcriptome data from two samples, sequenced with paired-end chemistry. I used Trimmomatic and NGS QC Toolkit for quality trimming. After trimming, however, many overrepresented sequences (50 bp long) are still present in both reads. Can anyone suggest a procedure for removing these overrepresented sequences? I have also checked that they are not adapters.

rna-seq • 994 views
modified 3.0 years ago by Govardhan Anande130 • written 3.0 years ago by debashis.bioinfo0

Have you tried BLASTing a few representative sequences at NCBI?
As long as they come from your species of interest, you should be able to proceed with the analysis. If they appear to be contaminants, you would want to investigate the extent of the contamination and then decide whether the experiment needs to be repeated.

written 3.0 years ago by genomax67k

I used BLAST. Some sequences show similarity to chloroplast genome sequences, but my sample is from bacteria.

Some sequences also show similarity to distantly related bacterial species.

Should I remove these reads from the FASTQ file before moving on to assembly?

written 3.0 years ago by debashis.bioinfo0

Chloroplasts are thought to have originated from cyanobacteria (that were engulfed by a eukaryotic cell), so that result in itself may not be surprising. Do you know what fraction of your reads come from the bacterium you are working with and what fraction falls into the "other" (chloroplast etc.) bin?
I hesitate to recommend that you throw any reads away without a full understanding of what bacterial species you are working with and what this experiment is about.
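One rough way to get at those fractions, sketched here under several assumptions: the reads (or a subsample) were searched with BLAST using tabular output (`-outfmt 6`, default 12 columns with bitscore last), and you have your own table assigning each subject ID to a bin. The function names, the `subject_to_bin` mapping and the bin labels are all hypothetical, not part of any tool.

```python
from collections import Counter

def best_hits(blast_lines):
    """Return {query_id: subject_id}, keeping the highest-bitscore hit per query.

    Assumes default BLAST tabular columns (-outfmt 6):
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
    """
    best = {}
    for line in blast_lines:
        if not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        query, subject, bitscore = fields[0], fields[1], float(fields[11])
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}

def bin_fractions(hits, subject_to_bin, total_reads):
    """Fraction of all searched reads per bin.

    subject_to_bin is a user-supplied (hypothetical) mapping such as
    {"subjA": "target", "subjB": "chloroplast"}; unmapped subjects go to "other",
    and reads without any hit go to "no_hit".
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    counts = Counter(subject_to_bin.get(s, "other") for s in hits.values())
    counts["no_hit"] = total_reads - len(hits)
    return {label: n / total_reads for label, n in counts.items()}
```

Typical use would be `bin_fractions(best_hits(open("hits.tsv")), subject_to_bin, total_reads)`, where `hits.tsv` is a placeholder file name and `total_reads` is the number of reads you actually searched.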

modified 3.0 years ago • written 3.0 years ago by genomax67k

Are those reads, by any chance, from rRNA sequences?

written 3.0 years ago by WouterDeCoster38k
3.0 years ago by
Australia
Govardhan Anande130 wrote:

Hi,

If your data is RNA-seq, then you don't have to worry about it, because these may simply be highly expressed genes.

For example, say your read length is 100 bp and a 100 bp gene is expressed 10 times: there will then be 10 identical reads covering that gene, which may get it flagged as an overrepresented sequence.
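To see this effect in your own data, here is a minimal sketch that counts exact duplicate read sequences and lists those above a chosen fraction of all reads. It assumes plain 4-line FASTQ records; the function name and the `min_fraction` cutoff are illustrative choices, not what any QC tool uses internally.

```python
from collections import Counter

def overrepresented(fastq_handle, min_fraction=0.001):
    """Return [(sequence, count, fraction)] for sequences at or above min_fraction.

    Assumes each record is exactly 4 lines (header, sequence, '+', qualities).
    """
    counts = Counter()
    total = 0
    for i, line in enumerate(fastq_handle):
        if i % 4 == 1:  # second line of every record is the sequence
            counts[line.strip()] += 1
            total += 1
    if total == 0:
        return []
    return [
        (seq, n, n / total)
        for seq, n in counts.most_common()
        if n / total >= min_fraction
    ]
```

For example, `overrepresented(open("sample_R1.fastq"))` (placeholder file name) gives the most frequent sequences first, which you can then BLAST to check whether they are highly expressed genes, rRNA, or something unexpected.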

Correct me if I am wrong.

Cheers

modified 3.0 years ago • written 3.0 years ago by Govardhan Anande130