Question: HiSeq2000 RNA-seq QC
2
gravatar for Ming Tang
5.7 years ago by
Ming Tang2.5k
Houston/MD Anderson Cancer Center
Ming Tang2.5k wrote:

Hi biostarers,

I am relatively new to RNA-seq, and have some question about the HiSeq 2000 RNA-seq 100bp pair-end data.

I got the data from the lab next door, and I did some quality control with fastqc.

the reads quality look fine: http://postimg.org/image/ifu6jq8wr/

but the per base content looks wrong at the first several 30 bases http://postimg.org/image/ybke367gz/

and there are a lot of duplicate reads http://postimg.org/image/t4530u5zz/

what's wrong with the sequencing?

Do the reads still have adapters that I need to trim off?

Thank you,

Ming

 

rna-seq • 3.1k views
ADD COMMENTlink modified 5.7 years ago by Devon Ryan93k • written 5.7 years ago by Ming Tang2.5k
1

Good for you for checking the quality of your data, but be advised that FastQC is extremely strict and will often flag patterns that are not the result of poor quality. Think of FastQC like a quality screen. If it flags something, it means you should take a closer look, but it doesn't necessarily mean that the item being flagged indicates bad data.

ADD REPLYlink written 5.7 years ago by Dan D7.0k
7
gravatar for Devon Ryan
5.7 years ago by
Devon Ryan93k
Freiburg, Germany
Devon Ryan93k wrote:

The weird per-base content is extremely normal and is typically referred to as the "random" hexamer effect, since things aren't actually amplified in a uniform manner. Most people will have this and it's nothing to worry about. BTW, don't let anyone talk you into trimming that off, it's the legitimate sequence.

Duplicates are expected in RNAseq. Firstly, your samples are probably full of rRNAs, even if you performed ribodepletion. Secondly, any highly expressed gene is going to have apparent PCR duplication due solely to it being highly expressed (there is a ceiling on the coverage you can get before running into this, it's pretty high with paired-end reads but not infinite).

You'll see information about the adapters at the end of the fastQC report, but it's probably good to assume that you need to trim them off (at least unless the people who did the sequencing tell you otherwise).

The output from FastQC is most meaningful for sequencing DNA, so you often get warnings and errors when you use it to look at an RNAseq dataset.

ADD COMMENTlink written 5.7 years ago by Devon Ryan93k
2

I might tolerate random hexamer priming artefacts in the first 6-12 nucleotides of a read, but there has to be adapter contamination in there. I've never seen an RNA-SEq dataset with that amount of distortion in.

ADD REPLYlink written 5.7 years ago by Daniel Swan13k
1

I agree that there is almost definitely adapter contamination. The second clue is that the quality scores are lower in the beginning of the read, go up (presumably after the adapter ends), then decrease gradually as you'd expect. Absent adapter contamination, you usually expect the beginning of the reads to have the highest quality and decrease from there. My guess is that the adapter sequences are synthesized with a lower accuracy rate than the reverse-transcribed RNA fragments.

In any case, you can just try to map it (or a subset) to your genome of interest and you'll quickly find out the answer.

ADD REPLYlink modified 24 days ago by RamRS25k • written 5.7 years ago by matted7.2k

I have another fastq file from the same machine gave me the overpresented sequences

Overrepresented sequences

Sequence                                              Count    Percentage             Possible Source
AAGCAGTGGTATCAACGCAGAGTACATGGGAAGCAGTGGTATCAACGCAG    42089    0.41441946173421285    No Hit
AAGCAGTGGTATCAACGCAGAGTACATGGGGGATGTGAGGGCGATCTGGC    32502    0.32002331595631606    No Hit
AAGCAGTGGTATCAACGCAGAGTACATGGGCGCGACCTCAGATCAGACGT    23822    0.2345577328383288     No Hit
AAGCAGTGGTATCAACGCAGAGTACATGGGTACCTGGTTGATCCTGCCAG    20383    0.20069642634722756    No Hit
AAGCAGTGGTATCAACGCAGAGTACATGGGAGATTCTGAAACCATTTACT    16026    0.15779624827751895    No Hit
AAGCAGTGGTATCAACGCAGAGTACTGGGTCAATAAGATATGTTGATTTT    15612    0.15371989442834305    No Hit
AAGCAGTGGTATCAACGCAGAGTACATGGGGGGCGGAGGAAGCTCATCAG    15338    0.1510220177262315     No Hit
AAGCAGTGGTATCAACGCAGAGTACATGGGACTGACACGCTGTCCTTTCC    13227    0.13023655160156888    No Hit
AAGCAGTGGTATCAACGCAGAGTACTGGCCGTGAGTCTGTTCCAAGCTCC    12826    0.12628819920176326    No Hit
AAGCAGTGGTATCAACGCAGAGTACATGGGGGGGGTGTACTGGCTTCGAC    10313    0.10154453441195888    No Hit

=================================================================================

Overrepresented sequences

Sequence                                              Count    Percentage             Possible Source
AAGCAGTGGTATCAACGCAGAGTACATGGGTACCTGGTTGATCCTGCCAG    45741    0.38499099869649644    No Hit
AAGCAGTGGTATCAACGCAGAGTACATGGGAAGCAGTGGTATCAACGCAG    43533    0.36640679360430645    No Hit
AAGCAGTGGTATCAACGCAGAGTACATGGGCGCGACCTCAGATCAGACGT    39535    0.3327565889129225     No Hit
AAGCAGTGGTATCAACGCAGAGTACATGGGGGATGTGAGGGCGATCTGGC    32702    0.27524487088985433    No Hit
AAGCAGTGGTATCAACGCAGAGTACATGGGGACATTTGCTTCTGACATAG    24864    0.20927430951640075    No Hit
AAGCAGTGGTATCAACGCAGAGTACATGGGGGGCGGAGGAAGCTCATCAG    22127    0.18623763862087356    No Hit
AAGCAGTGGTATCAACGCAGAGTACTGGCCGTGAGTCTGTTCCAAGCTCC    19912    0.16759451621181518    No Hit
AAGCAGTGGTATCAACGCAGAGTACATGGGGCCGGGCGCGGTGGCGCACG    18570    0.15629922489219603    No Hit
AAGCAGTGGTATCAACGCAGAGTACTGGGTCAATAAGATATGTTGATTTT    16668    0.14029054822310844    No Hit
AAGCAGTGGTATCAACGCAGAGTACATGGGGAAACCAGAAGACCAACACG    12398    0.10435098493341123    No Hit
AAGCAGTGGTATCAACGCAGAGTACTTGGTGACGGCCTTGGTGCCCTCCG    12271    0.1032820564702282     No Hit

=================================================================================

Are those potential adapters? why there are so many different adapters? I think there should be only one or two there. How could I trim them off?

I will contact the sequencing service provider to get more information though.

Thank you,

ADD REPLYlink modified 24 days ago by RamRS25k • written 5.7 years ago by Ming Tang2.5k
2

If you look at the sequences, there is only one adapter here: AAGCAGTGGTATCAACGCAGAGTACATGGG. The rest of the reads are genomic, probably from high abundance RNAs like ribosomal RNAs. In fact, if you just BLAST the three sequences you showed below, two of them are rRNA (after the barcode above). The second one doesn't have the barcode, which is confusing though.

I would attempt to trim with that sequence and see how many sequences had it. I would also try to find out the source of the sequence with the service provider (since it doesn't match any Illumina primer I'm aware of).

ADD REPLYlink modified 24 days ago by RamRS25k • written 5.7 years ago by matted7.2k

Thank you, I just contacted the service provider, and will update it here.

ADD REPLYlink written 5.7 years ago by Ming Tang2.5k

Hi Matted, I got replied from the service provider:

Thanks you for your feedback. I am forwarding these information regarding to rRNA residual to our support team. You will hear from them soon. Sorry for the inconvenience.

Considering 'Random primer' method had been used to this project, there was no adapter in cDNA preparation step. The adapters appeared in fastq files, if any, should be standard illumina sequencing adapter at 3'end of reads (when fragments were shorter than 100bp). Usually FastQC can identify these sequences, and report them in overrepresented part.

To remove this illumina adapter, please apply following 13 bp sequence with cutadapt to both reads:

5' 'AGATCGGAAGAGC' 3'

you can find more details of illumina adapters here.

I then just grep this pattern to see how many sequences have this adpater

grep 'AGATCGGAAGAGC' my.fastq | wc -l
283607

I looked at several matched lines:

cat my.fastq | grep 'AGATCGGAAGAGC' --color=always | less -R

AAGCAGTGGTATCAACGCAGAGTACTTGGGAGGTGAGGTGGGAAGTAAATCGGGGTGCAGGAAATGGGAAGAGAATGGGACTGGG**AGATCGGAAGAGC**AC
CTCAGTTCCGAAAACCAACAAAATAGAACCGCGGTCCTATTCCATTATTCCTCCCATGTACTCTGCGTTGATACCACTGCTT**AGATCGGAAGAGC**ACACG
AAGCAGTGGTATCAACGCAGAGTACATGGGGGGCGGAGGAAGCTCATCAGCGGGGCCACGTGCTGAGTGCTC**AGATCGGAAGAGC**ACACGTCTGAACTCC
GTACTTGATGTATCCAACAGTTTTGCAGTACCTTTTGATGAAGATGACAAAGATGATTCTGTCTGGTTTTTAGACCATGATTA**AGATCGGAAGAGC**ACAC
CAATAAAGAAAGCGTTCAAGCTCAACATAAAATTTCAATTAATTCCATAATTTACACCAACT**AGATCGGAAGAGC**ACACGTCTGAACTCCAGTCACGTTT
CTCTTCACAAGGCTTTCCTTACCTCTATTTAAGATTGCAGCTTCCAAGTACTCTGCGTTGATACCACTGCTT**AGATCGGAAGAGC**ACACGTCTGAACTCC
AAGCAGTGGTATCAACGCAGAGTACATGGGCTCCCTAGGTGAACAGCCTCTGGAATGGGTTCGCCCCGAGAG**AGATCGGAAGAGC**ACACGTCTGAACTCC
AGGCAGTGGTATCAACGCAGAGTACTACATACCCTTGACCGAAGACCGGTCCTCCTCTATTCGGGGAAGGTCGTCCTCAAGGGGATA**AGATCGGAAGAGC**

(**EDIT by @RamRS: Added ** to highlight location of adapter sequences, switched from highlighting using a <span> tag to just using plain text)

Some showed up at the end of the read (3'end), some showed up in the middle of the reads. How the cutadapt works? it will remove the adapter in the middle also? I check the pdf file, and I do not see AGATCGGAAGAGC in the file, and it did not show up in my fastqc report either.

any thoughts?

Thank you.
Ming

ADD REPLYlink modified 22 days ago by RamRS25k • written 5.7 years ago by Ming Tang2.5k

Yeah, now that you mention it that is high and, in fact, does have a different sequence profile than the normal bias that I've run into. I suspect that you're right and there's still adapter in there.

ADD REPLYlink modified 24 days ago by RamRS25k • written 5.7 years ago by Devon Ryan93k

Thank you Devon Ryan, the overpresented sequences reported by fastqc here http://postimg.org/image/n4xk7g60h/

No hit for the possible source. I planned to use trimmomatic to trim the adapter. along with the package, there is a folder named adapters containing the fasta files for the adapter sequences for HiSeq machine,

>PrefixPE/1
TACACTCTTTCCCTACACGACGCTCTTCCGATCT
>PrefixPE/2
GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT

and they do not look like the overpresented ones in my sequencing data.

first several lines of my reads:

@HWI-ST1129:511:H8V4NADXX:1:1101:3238:1984 1:N:0:GTTTCG
NAGCAGTGGTATCAACGCAGAGTACATGGGTACCTGGTTGATCCTGCCAGTAGCATATGCTTGTCTCAAAGATTAAGCCATGCATGTCTAAGTACGCACG
+
#1=DDFEDHFFDDHHJJJJGGGECHIIIJGCFHEHIJJJJJJIJFIEIIIGEHIGJJJJJJIJHJJJIIJFHHHHHFFFFFEEEEEEDEEDDADDDDDDD
@HWI-ST1129:511:H8V4NADXX:1:1101:3778:1986 1:N:0:GTTTCG
NCCATCTCCAAGACAGCGGTAGCACCCATCGAGAGGGTCAAGCTGCTGCTGCAGGTGCAGCATGCCAGCAAGCAAATCACGGCAGATAAGCAATACAAGG
+
#1=DFFFFHHHHHJJIJJJHIIJJJJJJJJJIJJJJJHIJJJJJJJJIJJJJJJJGHHFHHFFFFFFEDEEEDDDDDDDDDDDDDDDCDDDDDDDDDDDD
@HWI-ST1129:511:H8V4NADXX:1:1101:4730:1981 1:N:0:GTTTCG
NAGCAGTGGTATCAACGCAGAGTACATGGGTACCTGGTTGATCCTGCCAGTAGCATATGCTTGTCTCAAAGATTAAGCCATGCATGTCTAAGTACGCACG
+
#1:BDDDFFHHGDDHHIIJEGIEFFGBGIDHHIJIJGBFGHIIFGEGIIIHGIIBHIJJIJIE=DHEIJJGFHHHHFFFEFCAADCECEEDDDDDDDDBD

Thank you,
Ming

ADD REPLYlink modified 24 days ago by RamRS25k • written 5.7 years ago by Ming Tang2.5k

You're probably good to go then, but trimming won't hurt (you can get rid of the occasional low quality read, though you probably won't gain much since you seem to have decent quality).

ADD REPLYlink written 5.7 years ago by Devon Ryan93k

See the comment below from Daniel Swan as he's probably correct that there's still some adapter sequence that's leading to that.

ADD REPLYlink written 5.7 years ago by Devon Ryan93k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 956 users visited in the last hour