Question

What do Over-represented Sequences mean in FastQC using RNA-Seq?

19 months ago

Saran ▴ 50

Hello,

I have two samples: a control and an infected sample. For each sample, I have two FASTQ files: R1 (forward) and R2 (reverse).

I initially ran FastQC and saw overrepresented sequences in my files. The R2 results for the non-infected sample are shown below (columns: sequence, count, percentage, possible source):

AAGCAGTGGTATCAACGCAGAGTACTTTTTTTTTTTTTTTTTTTTTTTTT  10012948    15.265471635087929  Clontech SMART CDS Primer II A (100% over 26bp)
AAGCAGTGGTATCAACGCAGAGTACTTTTTTTTTTTTTTTTTTTTTTTTG  192028  0.29276073211831966 Clontech SMART CDS Primer II A (100% over 26bp)
AAGCAGTGGTATCAACGCAGAGTACTTTTTTTTTTTTTTTTTTTTTTTTA  102871  0.15683436412264704 Clontech SMART CDS Primer II A (100% over 26bp)
GCAGTGGTATCAACGCAGAGTACTTTTTTTTTTTTTTTTTTTTTTTTTTT  75230   0.11469363778855786 Clontech SMART CDS Primer II A (100% over 24bp)
AAGCAGTGGTATAAACGCAGAGTACTTTTTTTTTTTTTTTTTTTTTTTTT  75166   0.11459606510720113 Clontech SMART CDS Primer II A (96% over 26bp)
AAGCAGTGGTATCAACGCAGAGTACATGGGAAGCAGTGGTATCAACGCAG  70585   0.10761199552446307 Clontech SMARTer II A Oligonucleotide (100% over 25bp)

I also ran FastQC on my R1 reads, which were flagged for overrepresented sequences as well, but some hits were TruSeq and some were Clontech:

GATCGGAAGAGCACACGTCTGAACTCCAGTCACCCACCACCTAATCTCGT  772364  1.1775254134909172  TruSeq Adapter, Index 23 (97% over 37bp)
GATCGGAAGAGCACACGTCTGAACTCCAGTCACCCACCACCTAATCTCGG  632984  0.9650304083736875  TruSeq Adapter, Index 23 (97% over 37bp)
GATCGGAAGAGCACACGTCTGAACTCCAGTCACCCACCACCTAATCGCGG  220259  0.3358009566086663  TruSeq Adapter, Index 23 (97% over 37bp)
AAGCAGTGGTATCAACGCAGAGTACTTTTTTTTTTTTTTTTTTTTTTTTT  195017  0.29731768125230873 Clontech SMART CDS Primer II A (100% over 26bp)
GATCGGAAGAGCACACGTCTGAACTCCAGTCACCCACCACCTAATCGCGT  185486  0.2827869745958852  TruSeq Adapter, Index 23 (97% over 37bp)
GCAGTGGTATCAACGCAGAGTACTTTTTTTTTTTTTTTTTTTTTTTTTTT  95866   0.14615472923352238 Clontech SMART CDS Primer II A (100% over 24bp)

First, a sample wouldn't have two different types of adapters, so this confuses me.

I ran Trimmomatic anyway with the ILLUMINACLIP parameter:

java -jar /mnt/Active/Trimmomatic-0.39/trimmomatic-0.39.jar PE \
  /mnt/Active/rna_seq/rsv_mock.CCACCACCTA-ATCGAATCCG.HKW73DSX3_CCACCACCTA-ATCGAATCCG_L004_R1.fastq.gz \
  /mnt/Active/rna_seq/rsv_mock.CCACCACCTA-ATCGAATCCG.HKW73DSX3_CCACCACCTA-ATCGAATCCG_L004_R2.fastq.gz \
  rsv_mock_R1_paired.fq.gz rsv_mock_R1_unpaired.fq.gz \
  rsv_mock_R2_paired.fq.gz rsv_mock_R2_unpaired.fq.gz \
  ILLUMINACLIP:TruSeq3-PE.fa:2:30:10:2:True SLIDINGWINDOW:4:30 MINLEN:50

This removed the overrepresented sequences flag for the R1 file, yet the R2 file now has even more sequences flagged:

AAGCAGTGGTATCAACGCAGAGTACTTTTTTTTTTTTTTTTTTTTTTTTTTTTT  1128037 2.38682200926412    Clontech SMART CDS Primer II A (100% over 26bp)
AAGCAGTGGTATCAACGCAGAGTACTTTTTTTTTTTTTTTTTTTTTTTTTTTT   1073014 2.270398427931469   Clontech SMART CDS Primer II A (100% over 26bp)
AAGCAGTGGTATCAACGCAGAGTACTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT 965710  2.0433530837786824  Clontech SMART CDS Primer II A (100% over 26bp)
AAGCAGTGGTATCAACGCAGAGTACTTTTTTTTTTTTTTTTTTTTTTTTTTT    866803  1.834075015355141   Clontech SMART CDS Primer II A (100% over 26bp)
AAGCAGTGGTATCAACGCAGAGTACTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT    671489  1.4208086473925545  Clontech SMART CDS Primer II A (100% over 26bp)
AAGCAGTGGTATCAACGCAGAGTACTTTTTTTTTTTTTTTTTTTTTTTTTT 650235  1.3758371482441225  Clontech SMART CDS Primer II A (100% over 26bp)
AAGCAGTGGTATCAACGCAGAGTACTTTTTTTTTTTTTTTTTTTTTTTTT  444462  0.9404405031763581  Clontech SMART CDS Primer II A (100% over 26bp)
AAGCAGTGGTATCAACGCAGAGTACTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT   404197  0.8552434855226643  Clontech SMART CDS Primer II A (100% over 26bp)
AAGCAGTGGTATCAACGCAGAGTACTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT  210997  0.44645014612880746 Clontech SMART CDS Primer II A (100% over 26bp)
AAGCAGTGGTATCAACGCAGAGTACTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT 106459  0.22525740226982716 Clontech SMART CDS Primer II A (100% over 26bp)
AAGCAGTGGTATCAACGCAGAGTACTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT    52962   0.11206269586427249 Clontech SMART CDS Primer II A (100% over 26bp)
AAGCAGTGGTATCAACGCAGAGTACATGGGGAGGCATTGAGGCAGCCAGC  48149   0.10187883280784063 Clontech SMARTer II A Oligonucleotide (100% over 25bp)
AAGCAGTGGTATCAACGCAGAGTACATGGGAAGCAGTGGTATCAACGCAG  47423   0.10034268392378298 Clontech SMARTer II A Oligonucleotide (100% over 25bp)

What exactly are these sequences? Could they be highly expressed genes rather than the primers that the FastQC report suggests? Why would there be more of them after running Trimmomatic?

Also, the "Per sequence GC content" plot no longer has the nice bell curve that it had before running Trimmomatic.

Your help will be greatly appreciated, thank you!

fastqc RNAseq trimmomatic adapters • 1.4k views

updated 19 months ago by Istvan Albert 100k • written 19 months ago by Saran ▴ 50

> A sample wouldn't have two different types of adapters, so this confuses me.

One is a primer and the other is an adapter. Those are two different entities. Do you expect to see primer sequences in your data? It looks like the Clontech kit uses some kind of poly-A capture technology, which is probably what the poly-T stretches in the results above represent.
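To get a feel for how many reads carry the primer followed by a poly-T stretch, a minimal sketch like the following could be run on the read sequences. It assumes the primer core is the 25-base prefix shared by the Clontech hits in the tables above; the function names and the minimum poly-T length of 10 are illustrative choices, not part of FastQC or any other tool.

```python
import re

# Assumption: the 25-base prefix shared by the Clontech SMART primer hits above.
SMART_PRIMER = "AAGCAGTGGTATCAACGCAGAGTAC"

# Hypothetical threshold: require at least this many T's right after the primer.
MIN_POLY_T = 10

# Matches the primer at the very start of the read, then captures the T run.
_PATTERN = re.compile(rf"^{SMART_PRIMER}(T{{{MIN_POLY_T},}})")


def poly_t_length(read: str) -> int:
    """Return the length of the poly-T run directly after the primer, or 0."""
    match = _PATTERN.match(read)
    return len(match.group(1)) if match else 0


def primer_fraction(reads) -> float:
    """Return the fraction of reads that start with primer + poly-T."""
    reads = list(reads)
    if not reads:
        return 0.0
    hits = sum(1 for read in reads if poly_t_length(read) > 0)
    return hits / len(reads)


if __name__ == "__main__":
    # Made-up reads, for illustration only.
    example = [SMART_PRIMER + "T" * 25, "ACGTACGTACGTACGTACGT"]
    print(primer_fraction(example))
```

If a large share of R2 starts this way, that would fit the idea that these are library-prep primers rather than highly expressed genes.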

You have not given us information about other parts of the FastQC report. How long are these reads? Did other parameters in FastQC look reasonable?

> What exactly are these sequences?

More than likely they are things that should get removed once you properly scan and trim your data. If you are willing, I suggest you give bbduk.sh a try; a guide is available: https://jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/bbduk-guide/
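Many of the entries in the post-trimming R2 table appear to be the same primer followed by poly-T runs of different lengths. One way to check that is to collapse the FastQC rows by their leading bases and sum the counts. This is only a sketch: `collapse_by_prefix` is a hypothetical helper, the 25-base prefix length is an assumption based on the shared start of the sequences above, and the rows in the usage example are made up.

```python
from collections import defaultdict


def collapse_by_prefix(rows, prefix_len=25):
    """Sum counts and percentages of rows that share the same leading bases.

    rows: iterable of (sequence, count, percentage) tuples, e.g. copied
    from the FastQC overrepresented sequences table.
    Returns a dict mapping each prefix to (total_count, total_percentage).
    """
    totals = defaultdict(lambda: [0, 0.0])
    for sequence, count, percentage in rows:
        key = sequence[:prefix_len]
        totals[key][0] += count
        totals[key][1] += percentage
    return {key: (count, pct) for key, (count, pct) in totals.items()}


if __name__ == "__main__":
    primer = "AAGCAGTGGTATCAACGCAGAGTAC"
    # Made-up rows: same primer, different poly-T lengths.
    rows = [
        (primer + "T" * 25, 10, 1.0),
        (primer + "T" * 30, 5, 0.5),
    ]
    for prefix, (count, pct) in collapse_by_prefix(rows).items():
        print(prefix, count, pct)
```

If most rows collapse into a single prefix, the longer list after trimming is more likely one artifact seen at several read lengths than a set of new contaminants.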

19 months ago by GenoMax 141k

score 1 · Answer 1 · 2022-09-28

See what the mapping says. Align your data to a reference.

To be honest, your data looks a little suspect: you have too many full adapters in there.

In my opinion, there is no need to remove entire reads; the report is there to help you understand the data better. If something is a fully artificial sequence, it won't map to the genome.

You can use a tool like fastp to remove known adapter/primer sequences at the start/end of reads: https://github.com/OpenGene/fastp
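To illustrate what removing a known sequence from the start of a read means, here is a rough Python sketch. It is not how fastp or any other trimmer is implemented. The primer string is the 25-base prefix shared by the Clontech hits in the question, and the function names, the one-mismatch tolerance, and the minimum poly-T run of 10 are assumptions made for the example.

```python
# Assumption: the 25-base prefix shared by the Clontech SMART primer hits.
SMART_PRIMER = "AAGCAGTGGTATCAACGCAGAGTAC"


def trim_known_prefix(seq, qual, primer=SMART_PRIMER, max_mismatches=1):
    """Remove `primer` from the 5' end if it matches within `max_mismatches`.

    The read is kept; only the primer bases and their qualities are dropped.
    """
    if len(seq) < len(primer):
        return seq, qual
    mismatches = sum(1 for a, b in zip(seq, primer) if a != b)
    if mismatches > max_mismatches:
        return seq, qual
    return seq[len(primer):], qual[len(primer):]


def strip_leading_poly_t(seq, qual, min_run=10):
    """Remove a leading run of T's if it is at least `min_run` bases long."""
    run = len(seq) - len(seq.lstrip("T"))
    if run < min_run:
        return seq, qual
    return seq[run:], qual[run:]


if __name__ == "__main__":
    # Made-up read: primer, poly-T stretch, then a short insert.
    read = SMART_PRIMER + "T" * 12 + "ACGTACGT"
    quality = "I" * len(read)
    seq, qual = trim_known_prefix(read, quality)
    seq, qual = strip_leading_poly_t(seq, qual)
    print(seq, qual)
```

A real trimmer also handles partial matches, 3' adapters, and paired-end synchronisation, so this is only meant to show the idea of clipping the artificial part while keeping the rest of the read.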