Question

3' Tag-Seq / DE Analysis - FASTQC detects overrepresented sequences as well as fails per base & GC sequence content

1

Entering edit mode

4.5 years ago

gmchaput ▴ 10

I am analyzing differential gene expression for a plant based on 3'cDNA tag libraries generated from total RNA. I used SortMeRNA to remove rRNA followed by using trimmomatic that removed the multiplex adapter as well as quality filtered the reads. After the filtering and trimming, I checked my read quality with FASTQC (see at bottom of post). I am getting multiple overrepresented sequences that all contain the same 46bp:

GGGGGGAGAGCAATACAAGCGTTGTGCTGCTAGGCGAAGCGGTTGA

I ran a BLASTn of this sequence on both NCBI nr/nt database and the refseq_rna database. When not specifying my plant, the results are found in the chloroplast genome of other plants. I ran against my specific plant genome and the only hit is for 26bp that is for the 10B subunit of the mediator of RNA polymerase (92% ident, E = 0.13).

My assumption so far is that this is a contaminant and due being present in ~25% of my reads it is causing the skew in the GC content as well as per base sequence content results of the FASTQC report.

So my questions are:

1) Should I remove this sequence and if so, how?

2) What other reasons could explain the per base sequence content issues since the adapters were removed?

3) Since this data is RNA seq looking at differential gene expression and I will be mapping against the transcriptome via Salmon - does it matter about the FASTQC flags I am seeing in this report?

*Added note (10.24.19) - These are unsterilized plant leaf samples that were used for this experiment so other euk RNA may be present - such as fungi - though very low to null in comparison. Additionally the read depth was not high enough to really capture the fungal RNA. Also, since this is a RNAseq analysis, we expect to have high duplication levels for some transcripts.

Per base sequence content

Per seq GC content

Overrepresented sequences

enter image description here

Per sequence quality scores

Per base sequence quality

Sequence duplication levels - expect to see duplicates since abundance of reads may be high from RNAseq

fastqc RNA-Seq Tag-seq trimmomatic sortmerna • 2.8k views

ADD COMMENT • link 4.5 years ago by gmchaput ▴ 10

1

Entering edit mode

I suggest you embed the relevant fastqc images into the post using the image buttom (the one right of 10101). It would be best to upload them to a public image hoster such as imgur and then paste the full link (including the suffix e.g. .png) into the image field. Right now the link you provide redirects to a html file and I would never download anything from a random Dropbox.

ADD REPLY • link 4.5 years ago by ATpoint 82k

0

Entering edit mode

Thanks for the heads up and my apologies for coming off sketchy with a link to dropbox - I fixed it :)