Question

HiSeq2000 RNA-seq QC

2

Entering edit mode

10.4 years ago

Ming Tommy Tang ★ 4.3k

Hi biostarers,

I am relatively new to RNA-seq, and have some question about the HiSeq 2000 RNA-seq 100bp pair-end data.

I got the data from the lab next door, and I did some quality control with fastqc.

the reads quality look fine: http://postimg.org/image/ifu6jq8wr/

but the per base content looks wrong at the first several 30 bases http://postimg.org/image/ybke367gz/

and there are a lot of duplicate reads http://postimg.org/image/t4530u5zz/

what's wrong with the sequencing?

Do the reads still have adapters that I need to trim off?

Thank you,

Ming

rna-seq • 5.3k views

ADD COMMENT • link updated 2.7 years ago by Ram 44k • written 10.4 years ago by Ming Tommy Tang ★ 4.3k

1

Entering edit mode

Good for you for checking the quality of your data, but be advised that FastQC is extremely strict and will often flag patterns that are not the result of poor quality. Think of FastQC like a quality screen. If it flags something, it means you should take a closer look, but it doesn't necessarily mean that the item being flagged indicates bad data.

ADD REPLY • link 10.4 years ago by Dan D 7.4k

Ram · Answer 1 · 2014-04-29

7

Entering edit mode

10.4 years ago

Devon Ryan 104k

The weird per-base content is extremely normal and is typically referred to as the "random" hexamer effect, since things aren't actually amplified in a uniform manner. Most people will have this and it's nothing to worry about. BTW, don't let anyone talk you into trimming that off, it's the legitimate sequence.

Duplicates are expected in RNAseq. Firstly, your samples are probably full of rRNAs, even if you performed ribodepletion. Secondly, any highly expressed gene is going to have apparent PCR duplication due solely to it being highly expressed (there is a ceiling on the coverage you can get before running into this, it's pretty high with paired-end reads but not infinite).

You'll see information about the adapters at the end of the fastQC report, but it's probably good to assume that you need to trim them off (at least unless the people who did the sequencing tell you otherwise).

The output from FastQC is most meaningful for sequencing DNA, so you often get warnings and errors when you use it to look at an RNAseq dataset.

ADD COMMENT • link 10.4 years ago by Devon Ryan 104k

2

Entering edit mode

I might tolerate random hexamer priming artefacts in the first 6-12 nucleotides of a read, but there has to be adapter contamination in there. I've never seen an RNA-SEq dataset with that amount of distortion in.

ADD REPLY • link 10.4 years ago by User 59 13k

1

Entering edit mode

I agree that there is almost definitely adapter contamination. The second clue is that the quality scores are lower in the beginning of the read, go up (presumably after the adapter ends), then decrease gradually as you'd expect. Absent adapter contamination, you usually expect the beginning of the reads to have the highest quality and decrease from there. My guess is that the adapter sequences are synthesized with a lower accuracy rate than the reverse-transcribed RNA fragments.

In any case, you can just try to map it (or a subset) to your genome of interest and you'll quickly find out the answer.

ADD REPLY • link updated 4.8 years ago by Ram 44k • written 10.4 years ago by matted 7.8k

0

Entering edit mode

I have another fastq file from the same machine gave me the overpresented sequences

Overrepresented sequences

Sequence                                              Count    Percentage             Possible Source
AAGCAGTGGTATCAACGCAGAGTACATGGGAAGCAGTGGTATCAACGCAG    42089    0.41441946173421285    No Hit
AAGCAGTGGTATCAACGCAGAGTACATGGGGGATGTGAGGGCGATCTGGC    32502    0.32002331595631606    No Hit
AAGCAGTGGTATCAACGCAGAGTACATGGGCGCGACCTCAGATCAGACGT    23822    0.2345577328383288     No Hit
AAGCAGTGGTATCAACGCAGAGTACATGGGTACCTGGTTGATCCTGCCAG    20383    0.20069642634722756    No Hit
AAGCAGTGGTATCAACGCAGAGTACATGGGAGATTCTGAAACCATTTACT    16026    0.15779624827751895    No Hit
AAGCAGTGGTATCAACGCAGAGTACTGGGTCAATAAGATATGTTGATTTT    15612    0.15371989442834305    No Hit
AAGCAGTGGTATCAACGCAGAGTACATGGGGGGCGGAGGAAGCTCATCAG    15338    0.1510220177262315     No Hit
AAGCAGTGGTATCAACGCAGAGTACATGGGACTGACACGCTGTCCTTTCC    13227    0.13023655160156888    No Hit
AAGCAGTGGTATCAACGCAGAGTACTGGCCGTGAGTCTGTTCCAAGCTCC    12826    0.12628819920176326    No Hit
AAGCAGTGGTATCAACGCAGAGTACATGGGGGGGGTGTACTGGCTTCGAC    10313    0.10154453441195888    No Hit

=================================================================================

Overrepresented sequences

Sequence                                              Count    Percentage             Possible Source
AAGCAGTGGTATCAACGCAGAGTACATGGGTACCTGGTTGATCCTGCCAG    45741    0.38499099869649644    No Hit
AAGCAGTGGTATCAACGCAGAGTACATGGGAAGCAGTGGTATCAACGCAG    43533    0.36640679360430645    No Hit
AAGCAGTGGTATCAACGCAGAGTACATGGGCGCGACCTCAGATCAGACGT    39535    0.3327565889129225     No Hit
AAGCAGTGGTATCAACGCAGAGTACATGGGGGATGTGAGGGCGATCTGGC    32702    0.27524487088985433    No Hit
AAGCAGTGGTATCAACGCAGAGTACATGGGGACATTTGCTTCTGACATAG    24864    0.20927430951640075    No Hit
AAGCAGTGGTATCAACGCAGAGTACATGGGGGGCGGAGGAAGCTCATCAG    22127    0.18623763862087356    No Hit
AAGCAGTGGTATCAACGCAGAGTACTGGCCGTGAGTCTGTTCCAAGCTCC    19912    0.16759451621181518    No Hit
AAGCAGTGGTATCAACGCAGAGTACATGGGGCCGGGCGCGGTGGCGCACG    18570    0.15629922489219603    No Hit
AAGCAGTGGTATCAACGCAGAGTACTGGGTCAATAAGATATGTTGATTTT    16668    0.14029054822310844    No Hit
AAGCAGTGGTATCAACGCAGAGTACATGGGGAAACCAGAAGACCAACACG    12398    0.10435098493341123    No Hit
AAGCAGTGGTATCAACGCAGAGTACTTGGTGACGGCCTTGGTGCCCTCCG    12271    0.1032820564702282     No Hit

=================================================================================

Are those potential adapters? why there are so many different adapters? I think there should be only one or two there. How could I trim them off?

I will contact the sequencing service provider to get more information though.

Thank you,

ADD REPLY • link updated 4.8 years ago by Ram 44k • written 10.4 years ago by Ming Tommy Tang ★ 4.3k

2

Entering edit mode

If you look at the sequences, there is only one adapter here: AAGCAGTGGTATCAACGCAGAGTACATGGG. The rest of the reads are genomic, probably from high abundance RNAs like ribosomal RNAs. In fact, if you just BLAST the three sequences you showed below, two of them are rRNA (after the barcode above). The second one doesn't have the barcode, which is confusing though.

I would attempt to trim with that sequence and see how many sequences had it. I would also try to find out the source of the sequence with the service provider (since it doesn't match any Illumina primer I'm aware of).

ADD REPLY • link updated 4.8 years ago by Ram 44k • written 10.4 years ago by matted 7.8k

0

Entering edit mode

Thank you, I just contacted the service provider, and will update it here.

ADD REPLY • link 10.4 years ago by Ming Tommy Tang ★ 4.3k

0

Entering edit mode

Hi Matted, I got replied from the service provider:

Thanks you for your feedback. I am forwarding these information regarding to rRNA residual to our support team. You will hear from them soon. Sorry for the inconvenience.

Considering 'Random primer' method had been used to this project, there was no adapter in cDNA preparation step. The adapters appeared in fastq files, if any, should be standard illumina sequencing adapter at 3'end of reads (when fragments were shorter than 100bp). Usually FastQC can identify these sequences, and report them in overrepresented part.

To remove this illumina adapter, please apply following 13 bp sequence with cutadapt to both reads:

5' 'AGATCGGAAGAGC' 3'

you can find more details of illumina adapters here.

I then just grep this pattern to see how many sequences have this adpater

grep 'AGATCGGAAGAGC' my.fastq | wc -l
283607

I looked at several matched lines:

cat my.fastq | grep 'AGATCGGAAGAGC' --color=always | less -R

AAGCAGTGGTATCAACGCAGAGTACTTGGGAGGTGAGGTGGGAAGTAAATCGGGGTGCAGGAAATGGGAAGAGAATGGGACTGGG**AGATCGGAAGAGC**AC
CTCAGTTCCGAAAACCAACAAAATAGAACCGCGGTCCTATTCCATTATTCCTCCCATGTACTCTGCGTTGATACCACTGCTT**AGATCGGAAGAGC**ACACG
AAGCAGTGGTATCAACGCAGAGTACATGGGGGGCGGAGGAAGCTCATCAGCGGGGCCACGTGCTGAGTGCTC**AGATCGGAAGAGC**ACACGTCTGAACTCC
GTACTTGATGTATCCAACAGTTTTGCAGTACCTTTTGATGAAGATGACAAAGATGATTCTGTCTGGTTTTTAGACCATGATTA**AGATCGGAAGAGC**ACAC
CAATAAAGAAAGCGTTCAAGCTCAACATAAAATTTCAATTAATTCCATAATTTACACCAACT**AGATCGGAAGAGC**ACACGTCTGAACTCCAGTCACGTTT
CTCTTCACAAGGCTTTCCTTACCTCTATTTAAGATTGCAGCTTCCAAGTACTCTGCGTTGATACCACTGCTT**AGATCGGAAGAGC**ACACGTCTGAACTCC
AAGCAGTGGTATCAACGCAGAGTACATGGGCTCCCTAGGTGAACAGCCTCTGGAATGGGTTCGCCCCGAGAG**AGATCGGAAGAGC**ACACGTCTGAACTCC
AGGCAGTGGTATCAACGCAGAGTACTACATACCCTTGACCGAAGACCGGTCCTCCTCTATTCGGGGAAGGTCGTCCTCAAGGGGATA**AGATCGGAAGAGC**

(EDIT by @RamRS: Added `` to highlight location of adapter sequences, switched from highlighting using a `` tag to just using plain text)

Some showed up at the end of the read (3'end), some showed up in the middle of the reads. How the cutadapt works? it will remove the adapter in the middle also? I check the pdf file, and I do not see AGATCGGAAGAGC in the file, and it did not show up in my fastqc report either.

Any thoughts?

Thank you.
Ming

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 10.4 years ago by Ming Tommy Tang ★ 4.3k

0

Entering edit mode

Yeah, now that you mention it that is high and, in fact, does have a different sequence profile than the normal bias that I've run into. I suspect that you're right and there's still adapter in there.

ADD REPLY • link updated 4.8 years ago by Ram 44k • written 10.4 years ago by Devon Ryan 104k

0

Entering edit mode

Thank you Devon Ryan, the overpresented sequences reported by fastqc here http://postimg.org/image/n4xk7g60h/

No hit for the possible source. I planned to use trimmomatic to trim the adapter. along with the package, there is a folder named adapters containing the fasta files for the adapter sequences for HiSeq machine,

>PrefixPE/1
TACACTCTTTCCCTACACGACGCTCTTCCGATCT
>PrefixPE/2
GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT

and they do not look like the overpresented ones in my sequencing data.

first several lines of my reads:

@HWI-ST1129:511:H8V4NADXX:1:1101:3238:1984 1:N:0:GTTTCG
NAGCAGTGGTATCAACGCAGAGTACATGGGTACCTGGTTGATCCTGCCAGTAGCATATGCTTGTCTCAAAGATTAAGCCATGCATGTCTAAGTACGCACG
+
#1=DDFEDHFFDDHHJJJJGGGECHIIIJGCFHEHIJJJJJJIJFIEIIIGEHIGJJJJJJIJHJJJIIJFHHHHHFFFFFEEEEEEDEEDDADDDDDDD
@HWI-ST1129:511:H8V4NADXX:1:1101:3778:1986 1:N:0:GTTTCG
NCCATCTCCAAGACAGCGGTAGCACCCATCGAGAGGGTCAAGCTGCTGCTGCAGGTGCAGCATGCCAGCAAGCAAATCACGGCAGATAAGCAATACAAGG
+
#1=DFFFFHHHHHJJIJJJHIIJJJJJJJJJIJJJJJHIJJJJJJJJIJJJJJJJGHHFHHFFFFFFEDEEEDDDDDDDDDDDDDDDCDDDDDDDDDDDD
@HWI-ST1129:511:H8V4NADXX:1:1101:4730:1981 1:N:0:GTTTCG
NAGCAGTGGTATCAACGCAGAGTACATGGGTACCTGGTTGATCCTGCCAGTAGCATATGCTTGTCTCAAAGATTAAGCCATGCATGTCTAAGTACGCACG
+
#1:BDDDFFHHGDDHHIIJEGIEFFGBGIDHHIJIJGBFGHIIFGEGIIIHGIIBHIJJIJIE=DHEIJJGFHHHHFFFEFCAADCECEEDDDDDDDDBD

Thank you,
Ming

ADD REPLY • link updated 4.8 years ago by Ram 44k • written 10.4 years ago by Ming Tommy Tang ★ 4.3k

0

Entering edit mode

You're probably good to go then, but trimming won't hurt (you can get rid of the occasional low quality read, though you probably won't gain much since you seem to have decent quality).

ADD REPLY • link 10.4 years ago by Devon Ryan 104k

0

Entering edit mode

See the comment below from Daniel Swan as he's probably correct that there's still some adapter sequence that's leading to that.

ADD REPLY • link 10.4 years ago by Devon Ryan 104k