1
0
19 months ago
mskr_ ▴ 10

Hello, newbie here. I am reanalyzing the data from an article (GSE83931) for training purposes. I have two questions.

1- I ran FASTQC on the sequences, followed by multiqc. When I look at the individual FASTQC reports, they don't show any adapter sequence (please see pic1). (The authors reported that they used Trimmomatic to remove adapters.) However, I can see adapter content in the multiqc report (pic2). Both pictures belong to the same run.

How can we explain the discrepancy here?

2- They reported that the TruSeq3-SE.fa adapter sequences were removed with Trimmomatic. I used cutadapt instead. The adapter sequence I found online (based on the adapter named in the FASTQC report) is: AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA

I used the following command line:

cutadapt -a AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA -m 50 -j 4 -o SRR3734812_trim50.fastq.gz --length-tag 'length=' SRR3734812.fastq.gz


Output:

This is cutadapt 1.18 with Python 3.7.6
Command line parameters: -a AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA -m 50 -j 4 -o SRR3734812_trim50.fastq.gz --length-tag length= SRR3734812.fastq.gz
Processing reads on 4 cores in single-end mode ...
Finished in 709.18

=== Summary ===

Reads with adapters:                   783,598 (3.1%)
Reads that were too short:                   0 (0.0%)
Reads written (passing filters):    25,562,072 (100.0%)

Total basepairs processed: 2,556,207,200 bp
Total written (filtered):  2,553,044,075 bp (99.9%)

Sequence: AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA; Type: regular 3'; Length: 34; Trimmed: 783598 times.

No. of allowed errors:
0-9 bp: 0; 10-19 bp: 1; 20-29 bp: 2; 30-34 bp: 3

Bases preceding removed adapters:
  A: 24.0%
  C: 31.0%
  G: 29.6%
  T: 15.5%
  none/other: 0.0%

Overview of removed sequences
length  count   expect    max.err  error counts
3       529182  399407.4  0        529182
4       116588  99851.8   0        116588
5       39583   24963.0   0        39583
6       16724   6240.7    0        16724
7       14190   1560.2    0        14190
8       12594   390.0     0        12594
9       11809   97.5      0        11202 607
10      10917   24.4      1        10045 872
11      9490    6.1       1        9007 483
12      8432    1.5       1        8112 320
13      7396    0.4       1        7214 182
14      6684    0.1       1        2 6682
15      8       0.0       1        0 8
17      1       0.0       1        0 1


After trimming, I ran FASTQC again on the same sample. Apparently the trimming did something, as the sequence length is now 83-100 (pic3). But when I compare the first 3-4 reads before and after trimming, they look the same. How can I validate the trimming step?

A naïve question: should all reads have an adapter, or only some of them? (I ask because the report says about 3% of the reads have adapters.) Also, although it is not mentioned in the article, could the authors have uploaded already-trimmed sequences to GEO?

0

After comparing my FASTQC report with images of FASTQC reports with higher adapter content (found via Google image search), I decided that my <1% "adapter content" is probably not adapter but something else. If anyone disagrees, please let me know!
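One way to weigh this is the "expect" column in the cutadapt output above. The sketch below assumes that column is simply the number of reads multiplied by 0.25^length, i.e. the count of matches expected by chance with uniform base composition. That assumption reproduces the values printed above (for example 399407.4 for length 3), but it is a rough model, not a statement about cutadapt internals. The names `expected_by_chance` and `excess_fraction` are made up for this illustration.

```python
# Rough comparison of observed vs. chance-expected adapter matches,
# using the numbers from the cutadapt report in this thread.
TOTAL_READS = 25_562_072  # "Reads written" in the summary (no reads were dropped)

# removed length -> observed count, copied from "Overview of removed sequences"
OBSERVED = {
    3: 529182, 4: 116588, 5: 39583, 6: 16724, 7: 14190, 8: 12594, 9: 11809,
    10: 10917, 11: 9490, 12: 8432, 13: 7396, 14: 6684, 15: 8, 17: 1,
}


def expected_by_chance(n_reads: int, length: int) -> float:
    """Chance matches of a fixed `length`-mer, assuming uniform base frequencies."""
    return n_reads * 0.25 ** length


def excess_fraction(observed: dict, n_reads: int) -> float:
    """Fraction of all reads trimmed beyond what chance alone would explain."""
    expected_total = sum(expected_by_chance(n_reads, length) for length in observed)
    return (sum(observed.values()) - expected_total) / n_reads


for length, count in sorted(OBSERVED.items()):
    expected = expected_by_chance(TOTAL_READS, length)
    print(f"{length:>2}  observed={count:>7}  expected={expected:>10.1f}")

print(f"excess over chance: {excess_fraction(OBSERVED, TOTAL_READS):.2%} of reads")
```

Under this model, most of the 3- and 4-base removals are about what random sequence would give, while the longer matches occur far more often than chance predicts. That is consistent with a small amount of genuine adapter read-through, not with zero adapter.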

1

There may still be a bit of adapter left. A program like bbduk.sh (GUIDE) can get rid of even the last adapter base by using a method that overlaps R1/R2 reads (look at the tbo and tpe options). That said, you generally need not worry about this, since most aligners will soft-clip any bases that do not match. You only need to worry about residual adapter if you are planning de novo assemblies.

0

Thank you, I appreciate it!

3
19 months ago
GenoMax 127k

After trimming I performed FASTQC again on the same sequence. Apparently, it did something as the sequence length is now 83-100 (pic3).

That is perfectly normal. After trimming, some reads will be shorter, so you now have a distribution of read lengths, in your case from 83 bp to 100 bp.
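To check this on your own files, here is a minimal sketch that tabulates read lengths and counts how many reads actually got shorter. It assumes plain 4-line FASTQ records (optionally gzipped) and that both files contain the same reads in the same order, which should hold for the run above because no reads were discarded as too short. The function names are made up for this example.

```python
import gzip
from collections import Counter
from itertools import islice


def open_maybe_gzip(path):
    """Open a FASTQ file as text, transparently handling .gz files."""
    path = str(path)
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "rt")


def read_sequences(path):
    """Yield the sequence line (2nd of every 4 lines) of each FASTQ record."""
    with open_maybe_gzip(path) as handle:
        for line in islice(handle, 1, None, 4):
            yield line.rstrip("\n")


def length_histogram(path):
    """Map read length -> number of reads with that length."""
    return Counter(len(seq) for seq in read_sequences(path))


def count_shortened(before_path, after_path):
    """Count reads that are shorter after trimming (records paired by position)."""
    pairs = zip(read_sequences(before_path), read_sequences(after_path))
    return sum(len(after) < len(before) for before, after in pairs)
```

Calling `length_histogram("SRR3734812_trim50.fastq.gz")` should show the 83-100 spread, and `count_shortened("SRR3734812.fastq.gz", "SRR3734812_trim50.fastq.gz")` should be close to the "Trimmed" count that cutadapt reported. Since only about 3% of reads were trimmed, it is not surprising that the first few reads look identical before and after.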

All library fragments have adapters; without them the fragments would not be sequenced. However, not every read (originating from these library fragments) will contain adapter sequence. In Illumina sequencing, if an insert is shorter than the number of sequencing cycles, then once the insert ends the sequencer reads through into the adapter on the 3'-end. So only those reads will show adapter, and only at the 3'-end.

AAAAAAAAAAAAAAAAAAASequenceeeeeeeeeeeeeeAAAAAAAAAAAA     <------ Library Fragment

Seq starts here--->Sequenceeeeeeeeeee               <-----This is what you will normally see
Seq starts here--->SequenceeeeeeeeeeeeeAAAAA        <-----If your inserts are shorter than sequencing cycles
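The same picture as a toy simulation, in case it helps. This is only a sketch with exact matching and made-up inserts; the function names are hypothetical, and real trimmers such as cutadapt also tolerate mismatches. The adapter is the one from the question.

```python
ADAPTER = "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA"


def sequence_read(insert: str, adapter: str, cycles: int) -> str:
    """What the sequencer reports: the first `cycles` bases of insert + 3' adapter."""
    return (insert + adapter)[:cycles]


def adapter_start(read: str, adapter: str, min_overlap: int = 3) -> int:
    """Index where the adapter (or an adapter prefix running to the read end) begins.

    Exact matches only; returns len(read) when no adapter is found.
    """
    for start in range(len(read) - min_overlap + 1):
        tail = read[start:start + len(adapter)]
        if adapter.startswith(tail):
            return start
    return len(read)


long_insert = "ACGT" * 30   # 120 bp insert, longer than the 100 cycles
short_insert = "ACGT" * 20  # 80 bp insert, shorter than the 100 cycles

for insert in (long_insert, short_insert):
    read = sequence_read(insert, ADAPTER, cycles=100)
    cut = adapter_start(read, ADAPTER)
    print(f"insert={len(insert)} bp, read={len(read)} bp, kept after trimming={cut} bp")
```

With the long insert the read never reaches the adapter, so nothing is trimmed. With the short insert the last 20 cycles read into the adapter, and trimming recovers the 80 bp insert.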


Although not mentioned in the article, could authors upload already trimmed sequences to GEO?

That is possible, but you should not take it for granted.

0

Thank you for the detailed answer. If there are adapters, why don't they show up in the first picture (before trimming)?

1

FastQC samples your dataset to compute its metrics. I don't recall what fraction of the data it uses to check for adapters, but it does not look at the entire dataset. Generally this is sufficient.

1

Also note the difference in scale. I think FastQC does pick up the adapters; it's just quite difficult to see ~0.4% visually on a y-axis that runs to 100%. It looks like there is a very slight uptick at the end of the reads regardless.