Question

Cutting overrepresented sequences (recognized by FastQC)

4.1 years ago

Researcher ▴ 20

When should we cut the overrepresented sequences recognized by FastQC, and when shouldn't we?

sequencing genome-sequence fastqc genome sequence • 3.9k views

ADD COMMENT • link 4.1 years ago by Researcher ▴ 20

If it is RNAseq, we mustn't. That is unless the overrepresented sequences are adapter sequences, in which case it would depend on your analysis. For transcript quantification on a known reference, the aligner/pseudoaligner will probably handle it for you. For transcript assembly, you might want to remove them (but e.g. Trinity does this in its workflow anyway unless asked not to).

For DNAseq, it depends. It would be intuitive to remove adapters (as overrepresented sequences) prior to assembly, but I know a bunch of god-level assembly specialists who let the assembler handle it and remove contamination afterwards. Personally, I prefer the former, but there is no single right answer. For variant calling, you would likely want to remove the duplicates (which also show up as overrepresented).

ADD REPLY • link 4.1 years ago by cschu181 ★ 2.8k

My data are paired-end DNA reads from E. coli, and the adaptors are already cut, but in the R2 file of 1/4 of my samples, I have an overrepresented sequence of 50 sequential Gs (GGGGGGGGGG...). In some other cases (only a few samples), some other overrepresented sequences are shared between R1 and R2. What would you suggest in my case?

ADD REPLY • link 4.1 years ago by Researcher ▴ 20

Let me guess, the G-polymers are at the 3' end? G is the signal from certain Illumina machines if there's not enough template to sequence, so yea, get rid of those. You can try this with bbduk, I forgot which settings, though.
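To make the idea of removing 3' G-polymers concrete, here is a minimal Python sketch. It is not how bbduk does it; the function name `trim_polyg_tail` and the `min_run=10` threshold are assumptions chosen for illustration only.

```python
def trim_polyg_tail(seq, qual, min_run=10):
    """Remove a trailing run of G's from the 3' end of a read.

    The tail is only trimmed if it is at least `min_run` bases long
    (an arbitrary, illustrative threshold). The quality string is
    shortened to the same length so the record stays consistent.
    """
    stripped = seq.rstrip("G")
    run_length = len(seq) - len(stripped)
    if run_length >= min_run:
        return stripped, qual[:len(stripped)]
    return seq, qual


# Hypothetical read: 8 real bases followed by 12 no-signal G's.
read = "ACGTTGCA" + "G" * 12
quality = "F" * len(read)
trimmed_seq, trimmed_qual = trim_polyg_tail(read, quality)
print(trimmed_seq, len(trimmed_qual))
```

Note that a genuine G sitting directly before the no-signal tail would be trimmed as well; for real data a dedicated trimming tool is the better choice.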

As for the shared overrepresented sequences: if they're not already characterised by fastqc, can you maybe blast them and see what comes up?

ADD REPLY • link 4.1 years ago by cschu181 ★ 2.8k

Thanks for your detailed answer, then I have a couple of questions: 1. Should I cut the overrepresented sequences from both reads in my paired-reads, even if they only appear in one of them? 2. Should I cut these 50 sequential Gs from all samples or only the ones which have an overrepresentation of them? I mean some samples contain these 50 sequential Gs, but they don't have an overrepresentation of that!

ADD REPLY • link 4.1 years ago by Researcher ▴ 20

1. Again, it depends on what they are. Have you blasted them? If it is oversequenced, as genomax posts, maybe you can downsample the data.
2. Yea, I'd get rid of them. Check bbduk, section Kmer masking.

ADD REPLY • link 4.1 years ago by cschu181 ★ 2.8k

I blasted them, and all of them except one had hits among E. coli genomes. So, I may keep all of them, even the one that doesn't have hits, to make it consistent. What do you think? And what do you mean by downsampling, exactly?

ADD REPLY • link 4.1 years ago by Researcher ▴ 20

Yea, don't do anything with those overrepresented sequences.

Downsampling would be to generate random subsets of your read data to decrease coverage. As genomax states:

"If you have a small genome that is over-sequenced then you will see over-representation"

So, calculate your coverage (2 x number_reads x readlength / e_coli_genome_size (~4.8Mb?), where number_reads is the number of read pairs and the factor 2 accounts for R1 and R2) and use a tool like seqtk to generate subsets with less coverage (30x, 60x, ...) and see how those perform in terms of overrepresentation.
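The coverage arithmetic can be sketched in a few lines of Python. The read count, read length, and the 4.8 Mb genome size below are placeholder values, not numbers from this dataset, and the helper names are made up for illustration.

```python
def estimate_coverage(read_pairs, read_length, genome_size):
    """Coverage = 2 x read pairs x read length / genome size."""
    return 2 * read_pairs * read_length / genome_size


def pairs_for_target(target_coverage, read_length, genome_size):
    """Number of read pairs needed to reach a target coverage."""
    return int(target_coverage * genome_size / (2 * read_length))


genome_size = 4_800_000  # assumed E. coli genome size in bp

# Hypothetical run: 1,000,000 read pairs of 2 x 150 bp.
print(estimate_coverage(1_000_000, 150, genome_size))  # 62.5
print(pairs_for_target(30, 150, genome_size))          # 480000
```

The second value is the kind of number you would pass to the subsampling tool as the subset size.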

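To generate proper subsamples for both reads, R1 and R2, use the same seed (-s) for both files so the pairs stay in sync, and redirect the output, since seqtk writes to stdout:

seqtk sample -s100 read_file_r1 size_of_subset > subset_r1
seqtk sample -s100 read_file_r2 size_of_subset > subset_r2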
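Why the seed matters for paired data can be illustrated with a small in-memory Python sketch. This is a toy under stated assumptions, not how seqtk itself is implemented; `subsample_pairs` and the read names are hypothetical.

```python
import random


def subsample_pairs(r1_records, r2_records, subset_size, seed=100):
    """Pick the same random positions from R1 and R2 so mates stay paired."""
    if len(r1_records) != len(r2_records):
        raise ValueError("R1 and R2 must contain the same number of reads")
    rng = random.Random(seed)  # fixed seed -> reproducible selection
    keep = sorted(rng.sample(range(len(r1_records)), subset_size))
    return [r1_records[i] for i in keep], [r2_records[i] for i in keep]


# Hypothetical read names standing in for full FASTQ records.
r1 = [f"read{i}/1" for i in range(10)]
r2 = [f"read{i}/2" for i in range(10)]
sub1, sub2 = subsample_pairs(r1, r2, 4)
print(list(zip(sub1, sub2)))
```

Because both files are indexed with the same random selection, every R1 read in the subset still has its mate in the R2 subset.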

ADD REPLY • link 4.1 years ago by cschu181 ★ 2.8k

One could also use bbnorm.sh for normalization instead of just downsampling the data.

That said, none of this may be needed. @negin, can you tell us what kind of DNAseq analysis you are doing? If you are just going to align to a reference then none of this is needed.

ADD REPLY • link 4.1 years ago by GenoMax 141k

"One could also use bbnorm.sh for normalization"

Ever since Brian Bushnell has stated (tweet is now unavailable) that this should only be done for single cell, not for isolates, I have dropped this from my bacterial assemblies. But of course this would then only affect assemblies and not alignment-based analyses.

ADD REPLY • link 4.1 years ago by cschu181 ★ 2.8k

People now tailor sequencing accordingly to get just enough coverage for bacterial isolates so normalization is probably not needed.

"should only be done for single cell, not for isolates"

Do you mean pure bacterial strains, not metagenomic data? That is logical.

ADD REPLY • link 4.1 years ago by GenoMax 141k

I am not sure what he meant, but he said isolates vs single cells (see below). Maybe he meant metagenomic data.

In my experience, BBDuk for adapter removal, then Tadpole error correction, then BBMerge (with rsem flag), then Spades with no error correction, enables the fastest, lowest-memory, and most accurate assemblies. For single cells BBNorm is suggested also, but not for isolates. (BB: 13th June 2019, tweet now unavailable)

ADD REPLY • link 4.1 years ago by cschu181 ★ 2.8k

I am doing a Genome-wide Association Study on genomic data (using Nextera...) of E. coli.

ADD REPLY • link 4.1 years ago by Researcher ▴ 20

With two-color chemistry, no signal is read as a G base, so poly-Gs are essentially clusters producing no usable sequence.

You should read this blog post (and some others) from the authors of FastQC. If you have a small genome that is over-sequenced then you will see over-representation (assuming you have removed adapters and contaminating sequences).

ADD REPLY • link 4.1 years ago by GenoMax 141k

If no-signal means G, then what is the difference between N and G? Because as far as I understood, N bases in sequencing data also happen when the software is unable to identify the base.

ADD REPLY • link 4.1 years ago by Researcher ▴ 20

If you see your read ending with a long G-polymer on a two colour machine (NovaSeq, not sure which else), these are essentially no-signal: as I said, the template is not long enough to sequence and the machine will report stretches of no signal as G. A G somewhere inside the read will be an actual G (check the quality!) and there will be Ns present for ambiguous calls as usual.

E.g. from a recent NovaSeq 6000 run (see the N at position 2):

@some_header
TNCGCCGTCCAGAACAAATTAGCACGTACGCTCAGCTGTCGTAGGTGCGGCACTCGTCGGTCTCGGGGTTCTCCTTGCAGTAGTTCTCGAGCGGGTCGGAGTCCTTGAGCTTGTCCCGGGCGTGGCTGGCGGCGGCGCTGAGCTCCTCCA
+
F#FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
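To "check the quality" of a base, the quality characters can be decoded into Phred scores. The sketch below assumes the usual Phred+33 encoding and, to stay short, uses only the first ten bases of the record above; `phred_scores` is a made-up helper name.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into Phred scores (assumes Phred+33)."""
    return [ord(ch) - offset for ch in quality_string]


# First ten bases and quality characters of the record above.
seq = "TNCGCCGTCC"
qual = "F#FFFFFFFF"

scores = phred_scores(qual)
# 1-based positions of ambiguous (N) calls.
n_positions = [i + 1 for i, base in enumerate(seq) if base == "N"]
print(n_positions, [scores[p - 1] for p in n_positions])  # [2] [2]
print(scores[0])  # 37
```

The N at position 2 carries `#`, which decodes to Q2, while the surrounding `F` characters decode to Q37.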

ADD REPLY • link 4.1 years ago by cschu181 ★ 2.8k

"If no-signal means G"

Only for sequencers that use 1-color or 2-color chemistry.

ADD REPLY • link 4.1 years ago by GenoMax 141k