Should I remove all the overrepresented sequences to improve GC content in fastqc?
3.6 years ago
Lila M ★ 1.1k

Hi everybody, I've just received some single-end RNA-seq data that I have to analyze. As always, I first ran QC with FastQC, but I was very surprised because the per base sequence content and per sequence GC content modules fail. I also have a warning for sequence duplication levels (65 overrepresented sequences):

NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN 46114   0.378499669139054   No Hit
CGAGAGACCACGGGCGAGGCCGGGGCGACGGGGAAGGCGCGAGAAAGGCG  43691   0.3586118975659108  No Hit
GTCGATTTGGCGAGGGCGCTCCCGACGACGCACCGGGAGGAGGCCCTTCC  34300   0.2815313928843638  No Hit
GTCGGGGGGACGGGTCCGAGGACGCGGCGGCGGAGCCGCCCCGCCCCGAC  31342   0.2572523882152108  No Hit
GGGGCCTCGGAGGAGGGGCGGCGGGGAGGAGGAGGGGCGCGGGAGCGGCG  30073   0.24683654746972222 No Hit
GGGACGGGTCCGAGGACGCGGCGGCGGAGCCGCCCCGCCCCGACGCGGAA  28075   0.23043713863639984 No Hit
GGGCGGGCTCCCGGCCCCGGCCGACGCGCCGCGAGGCGAGCCGGGCGGGCGGGCGCGCGCGCGTACGCGCGGGG  27746   0.22773673548016204 No Hit
CCCCCACCGAGAACCGCCTCGCGAGCCCCGGGGCCCCGCCACCGGGGGCC  27333   0.2243468676882891  No Hit
GCCGGGGAGAGCGAGCGGGGCCGTGCCCGGCGGCGCGGAGCGGCGCGGCG  27197   0.22323059161154643 No Hit
GGGAAGGCCGGGGAGAGCGAGCGGGGCCGTGCCCGGCGGCGCGGAGCGGC  25003   0.20522243196174195 No Hit
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG  24424   0.2004700507232566  No Hit
GATCGGAAGAGCACACGTCTGAACTCCAGTCACCGATGTATCTCGTATGC  23847   0.1957340853094293  TruSeq Adapter, Index 2 (100% over 50bp)
CGGGAGAAGACGAGAGACCACGGGCGAGGCCGGGGCGACGGGGAAGGCGC  23551   0.19330454326004817 No Hit
GGCGGGCTCCCGGCCCCGGCCGACGCGCCGCGAGGCGAGCCGGGCGGGCG  23251   0.1908421695613511  No Hit
CCTCGGAGGAGGGGCGGCGGGGAGGAGGAGGGGCGCGGGAGCGGCGGTCG  22969   0.18852753828457586 No Hit
GTTCGGAAGAGCGGGCCGGGAGAAGACGAGAGACCACGGGCGAGGCCGGG  22640   0.1858271351283381  No Hit
GGGGAAGGCGCGAGAAAGGCGGCCGGCGGGGAAGGGGACGCCACGGGGAC  22538   0.1849899280707811  No Hit
GCCGGGGCGACGGGGAAGGCGCGAGAAAGGCGGCCGGCGGGGAAGGGGAC  20570   0.1688367566073284  No Hit
GGGGCGGGCTCCCGGCCCCGGCCGACGCGCCGCGAGGCGAGCCGGGCGGG  20185   0.16567671036066717 No Hit
GGCCCGGCGAGGGGGAGAGGCGACGGGAGAGAGAGCGCGCGGCCGACGGC  19917   0.1634769898564978  No Hit
CCCACCGAGAACCGCCTCGCGAGCCCCGGGGCCCCGCCACCGGGGGCCCC  19905   0.16337849490854991 No Hit
GCGGGAAGGCCGGGGAGAGCGAGCGGGGCCGTGCCCGGCGGCGCGGAGCG  19537   0.1603579831714815  No Hit
GGGCGGGCTCCCGGCCCCGGCCGACGCGCCGCGAGGCGAGCCGGGCGGGC  19466   0.1597752213961232  No Hit
CCGACACGCCACCACCACCGTCGCTCGTGATTCTCGTCCATCCTCCGACC  19406   0.15928274665638378 No Hit
CCGGGAGAAGACGAGAGACCACGGGCGAGGCCGGGGCGACGGGGAAGGCG  19210   0.15767399583990171 No Hit
GCCGGGAGAAGACGAGAGACCACGGGCGAGGCCGGGGCGACGGGGAAGGC  18647   0.15305294119868024 No Hit
GCGACGGGGAAGGCGCGAGAAAGGCGGCCGGCGGGGAAGGGGACGCCACG  18494   0.15179713061234473 No Hit
GACGGGTCCGAGGACGCGGCGGCGGAGCCGCCCCGCCCCGACGCGGAAGC  18409   0.15109945806438058 No Hit
CGTCGCTCGTGATTCTCGTCCATCCTCCGACCCGGTCCCGCTCCGGGAGA  17813   0.14620754231630242 No Hit
GGGGTCTTTAAACCTCCGCGCCGGAACGCGCTAGGTACCTGGACGGCGGG  17374   0.1426042688038757  No Hit
CCACACACGACCGGTCGGAGGCAGAACGGCAGCCCCTCGGCGGCCGGCCG  17071   0.1401172713681917  No Hit
GTCCGGCCCCCGACCCTCGAGACGCCCTAGCGGGAAGGCCGGGGAGAGCGAGCGGGGCCGTGCCCGGCGGCGCGG 17056   0.13999415268325682 No Hit
CCGAGAACCGCCTCGCGAGCCCCGGGGCCCCGCCACCGGGGGCCCCGGAG  16957   0.1391815693626868  No Hit
CCGGGGAGAGCGAGCGGGGCCGTGCCCGGCGGCGCGGAGCGGCGCGGCGG  16841   0.13822945153252394 No Hit
GGGGCGGGCTCCCGGCCCCGGCCGACGCGCCGCGAGGCGAGCCGGGCGGGCGGGCGCGCGCGCGTACGCGCGGGG 16730   0.13731837326400603 No Hit
GGGAGAGCGAGCGGGGCCGTGCCCGGCGGCGCGGAGCGGCGCGGCGGAGGCGACGGGAATCCGGCCGGCCCCGA  16593   0.13619388927493437 No Hit
GCCGGGCGGGCGGGCGCGCGCGCGTACGCGCGGGGAGGGCGAGGAGGACG  16590   0.1361692655379474  No Hit
CCACGGGCGAGGCCGGGGCGACGGGGAAGGCGCGAGAAAGGCGGCCGGCG  16295   0.13374793140089528 No Hit
CGCTAGAGAAGGCTTTTCTCACCGAGGGTGGGTCACACTCCCCCCACCCGCCAGCCGCTCCTCCTCGGGCCCGC  16237   0.13327187248581385 No Hit
GTCCATCCTCCGACCCGGTCCCGCTCCGGGAGACCGGCGCGCCCCCACCG  16006   0.1313758447378171  No Hit
GACGAGAGACCACGGGCGAGGCCGGGGCGACGGGGAAGGCGCGAGAAAGG  15993   0.13126914187754024 No Hit
GGAAGGCCGGGGAGAGCGAGCGGGGCCGTGCCCGGCGGCGCGGAGCGGCG  15510   0.12730472022263797 No Hit
GGGAGGGCGAGGAGGACGGGCGGGGCCTCGGAGGAGGGGCGGCGGGGAGG  15451   0.12682045339522757 No Hit
GCGAGGAGGACGGGCGGGGCCTCGGAGGAGGGGCGGCGGGGAGGAGGAGG  15247   0.12514603928011356 No Hit
CTCGGAGGAGGGGCGGCGGGGAGGAGGAGGGGCGCGGGAGCGGCGGTCGG  15180   0.12459610915407121 No Hit
GAGAAGACGAGAGACCACGGGCGAGGCCGGGGCGACGGGGAAGGCGCGAG  15015   0.12324180361978783 No Hit
GGACGGGTCCGAGGACGCGGCGGCGGAGCCGCCCCGCCCCGACGCGGAAG  14892   0.12223223040332204 No Hit
CACCGCTAAGAGTCGTACGAGGTCGATTTGGCGAGGGCGCTCCCGACGAC  14860   0.12196957720879435 No Hit
GGGGAGAGCGAGCGGGGCCGTGCCCGGCGGCGCGGAGCGGCGCGGCGGAGGCGACGGGAATCCGGCCGGCCCCGA 14839   0.12179721104988556 No Hit
GGCACGGGCCGGGGGCGGGACGGGCGCCGCACGCCCCGACCCGTCTCCCCCGCGGAGGTCGGGGGGACGGGTCCG 14524   0.11921171866625364 No Hit
CCCGACACGCCACCACCACCGTCGCTCGTGATTCTCGTCCATCCTCCGAC  14416   0.11832526413472269 No Hit
CCGGCGAGGGGGAGAGGCGACGGGAGAGAGAGCGCGCGGCCGACGGCGCC  14402   0.11821035336211684 No Hit
GCCACCACCACCGTCGCTCGTGATTCTCGTCCATCCTCCGACCCGGTCCC  14357   0.11784099730731228 No Hit
CGTGATTCTCGTCCATCCTCCGACCCGGTCCCGCTCCGGGAGACCGGCGCGCCCCCACCGTGGGACGCTTTCCC  14175   0.11634715726343606 No Hit
GGCGAGCCGGGCGGGCGGGCGCGCGCGCGTACGCGCGGGGAGGGCGAGGA  14045   0.11528012866066735 No Hit
CACCGAGAACCGCCTCGCGAGCCCCGGGGCCCCGCCACCGGGGGCCCCGG  13604   0.11166043932358267 No Hit
GGCAGAGACAGAGGCGGCGGCCCGGGGGATCCGGTACCCCCAAGGCACGC  13589   0.1115373206386478  No Hit
CGGGGAAGGCGCGAGAAAGGCGGCCGGCGGGGAAGGGGACGCCACGGGGA  13537   0.11111050919754033 No Hit
CCCGGCGCCGCGGCCACGGGCGCGGCCGGGCGGGCCGCGGGGCGGGCTCC  13513   0.11091351930164456 No Hit
CGGCGGGCGGCGGGCGGGGAAGAGGGCACAGACGGGCGAGGGCCGGGGAC  13502   0.11082323226602565 No Hit
GGCCGGGGAGAGCGAGCGGGGCCGTGCCCGGCGGCGCGGAGCGGCGCGGC  13440   0.11031434170162827 No Hit
GGGCGGGCCGCGGGGCGGGCTCCCGGCCCCGGCCGACGCGCCGCGAGGCG  13134   0.10780272052895727 No Hit
GGGGACGGGTCCGAGGACGCGGCGGCGGAGCCGCCCCGCCCCGACGCGGA  13111   0.10761393854539049 No Hit
GGCGAGGCCGGGGCGACGGGGAAGGCGCGAGAAAGGCGGCCGGCGGGGAA  13017   0.10684239478646541 No Hit
CGGGAATCCGGCCGGCCCCGAAGACGGGGAGCCGGCGCGGCGGGGCCGGA  12869   0.10562762376177487 No Hit
GCTAGAGAAGGCTTTTCTCACCGAGGGTGGGTCACACTCCCCCCACCCGCCAGCCGCTCCTCCTCGGGCCCGC   12865   0.10559479211245891 No Hit
CCGGCGAGGGGGAGAGGCGACGGGAGAGAGAGCGCGCGGCCGACGGCACC  12712   0.10433898152612339 No Hit
GCGGGCTCCCGGCCCCGGCCGACGCGCCGCGAGGCGAGCCGGGCGGGCGG  12229   0.10037455987122114 No Hit
GGGCCTCGGAGGAGGGGCGGCGGGGAGGAGGAGGGGCGCGGGAGCGGCGG  12192   0.10007086711504849 No Hit


I think that if I remove all the overrepresented sequences I will improve the per sequence GC content, but I'm not sure whether that is the best option... any suggestions or advice?

Thanks!

It is generally a good idea to figure out why you see the overrepresented sequences before removing them.

Apparently the run was overclustered because of the library concentration... so in that case, what is the best choice?

Is this NextSeq data? "Failing" a test does not automatically make the data bad. If those are bad basecalls due to run overclustering then you would get a lower fraction of reads aligning.

My suggestion is don't mess with the data beyond scanning for and getting rid of adapters/extraneous sequences. If the downstream analysis demonstrates a problem then you can backtrack to diagnose other issues.

Yes, it is NextSeq data. Only one index has been reported in the overrepresented sequences, but adapter content passes (flat). However, a lot of k-mers have been reported. So I will do the alignment, run MultiQC and see what's going on. Thanks!

Don't depend on FastQC to judge adapter contamination. It does not look at the entire data when reporting various stats (see below). Use a proper scan/trim program like bbduk.sh from BBMap.

• The duplication and overrepresented sequences modules only track sequences that first appear within the first 100,000 reads of the file (but then keep counting those sequences to the end of the file); the amount of data this represents will vary per file.
• Per tile plot only tracks 10% of the data (1 in 10 sequences)
• k-mer module only tracks 2% of the data (1 in 50 sequences)
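
Because the modules above sample the data, one way to cross-check an adapter hit is to count it over the whole file. The following is a minimal sketch, not part of FastQC or BBMap: the function name count_reads_with_seq is hypothetical, and the probe used in the usage line is simply the first 20 bases of the TruSeq hit reported in the question. It only finds exact, full-length matches, so partial adapters at read ends are missed; a proper trimmer is still the better tool.

```shell
# Count reads that contain a given sequence across an ENTIRE FASTQ file.
# Hypothetical helper for illustration; exact substring matches only.
# Usage: count_reads_with_seq <file.fq or file.fq.gz> <sequence>
# Prints: <matching reads> TAB <total reads> TAB <percent matching>
count_reads_with_seq() {
    fastq="$1"
    motif="$2"
    # gzip -cdf decompresses gzipped input and passes plain text through unchanged.
    gzip -cdf "$fastq" | awk -v motif="$motif" '
        NR % 4 == 2 {                 # sequence line of each 4-line record
            total++
            if (index($0, motif) > 0) hits++
        }
        END {
            pct = (total > 0) ? 100 * hits / total : 0
            printf "%d\t%d\t%.2f\n", hits, total, pct
        }'
}
```

For example, count_reads_with_seq file.fq.gz GATCGGAAGAGCACACGTCT reports how many reads in the full file carry the start of that adapter.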

I can't figure out how to scan and trim my fastq file with bbduk.sh, since I would need to tell it which sequences or k-mers to remove, and those only appear in the FastQC report... maybe I'm missing something? Thanks

The idea is that you are not going to remove anything other than Illumina adapters, and only if they are present. If your data has really bad quality (Q10 or less), then and only then you may want to do quality-based trimming/filtering (e.g. qtrim=rl trimq=10). Otherwise something like this should suffice.

If you have PE data

bbduk.sh in1=file_R1.fq.gz in2=file_R2.fq.gz out1=clean_R1.fq.gz out2=clean_R2.fq.gz ktrim=r k=21 ref=/path_to/bbmap/resources/adapters.fa tbo tpe

If you have SE data

bbduk.sh in1=file.fq.gz out=clean.fq.gz ktrim=r k=21 ref=/path_to/bbmap/resources/adapters.fa


Once this is done, don't worry about fastqc. Just proceed with your analysis.

bbduk.sh should produce nice stats at the end of the run. Post them here if you want a second opinion.
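
To judge whether the "Q10 or less" condition mentioned in this answer applies, the share of low-quality base calls can be estimated over the whole file. This is only a rough sketch under stated assumptions: the function name low_quality_fraction is hypothetical, the file is assumed to use Phred+33 quality encoding, and each record is assumed to span exactly four lines.

```shell
# Report how many base calls fall below a Phred threshold in a FASTQ file.
# Hypothetical helper for illustration; assumes Phred+33 and 4-line records.
# Usage: low_quality_fraction <file.fq or file.fq.gz> [threshold, default 10]
# Prints: <low-quality bases> TAB <total bases> TAB <percent low-quality>
low_quality_fraction() {
    fastq="$1"
    threshold="${2:-10}"
    gzip -cdf "$fastq" | awk -v thr="$threshold" '
        BEGIN {
            # Build a character -> Phred score lookup for ASCII 33..126.
            for (i = 33; i <= 126; i++) phred[sprintf("%c", i)] = i - 33
        }
        NR % 4 == 0 {                 # quality line of each 4-line record
            n = length($0)
            for (j = 1; j <= n; j++) {
                bases++
                if (phred[substr($0, j, 1)] < thr) low++
            }
        }
        END {
            pct = (bases > 0) ? 100 * low / bases : 0
            printf "%d\t%d\t%.2f\n", low, bases, pct
        }'
}
```

If only a small percentage of bases is reported as low quality, adapter trimming alone, as in the commands above, is probably enough.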

Thank you very much for the info! Very useful!

Another naive question... should I also use splitnextera.sh? Thanks!

Not unless you have Nextera long mate pair libraries. Do you?

Not for that pool of sequences, but I will have to analyze some paired-end DNA data from Nextera long mate pair libraries.

Then yes. See the guide in bbmap/docs/guides/SplitNexteraGuide.txt for that.