Question: HiSeq2000 RNA-seq QC
2
gravatar for Ming Tang
5.0 years ago by
Ming Tang2.4k
Houston/MD Anderson Cancer Center
Ming Tang2.4k wrote:

Hi biostarers,

I am relatively new to RNA-seq, and have some question about the HiSeq 2000 RNA-seq 100bp pair-end data.

I got the data from the lab next door, and I did some quality control with fastqc.

the reads quality look fine: http://postimg.org/image/ifu6jq8wr/

but the per base content looks wrong at the first several 30 bases http://postimg.org/image/ybke367gz/

and there are a lot of duplicate reads http://postimg.org/image/t4530u5zz/

what's wrong with the sequencing?

Do the reads still have adapters that I need to trim off?

Thank you,

Ming

 

rna-seq • 2.8k views
ADD COMMENTlink modified 5.0 years ago by Devon Ryan89k • written 5.0 years ago by Ming Tang2.4k
1

Good for you for checking the quality of your data, but be advised that FastQC is extremely strict and will often flag patterns that are not the result of poor quality. Think of FastQC like a quality screen. If it flags something, it means you should take a closer look, but it doesn't necessarily mean that the item being flagged indicates bad data.

ADD REPLYlink written 5.0 years ago by Dan D6.7k
7
gravatar for Devon Ryan
5.0 years ago by
Devon Ryan89k
Freiburg, Germany
Devon Ryan89k wrote:

The weird per-base content is extremely normal and is typically referred to as the "random" hexamer effect, since things aren't actually amplified in a uniform manner. Most people will have this and it's nothing to worry about. BTW, don't let anyone talk you into trimming that off, it's the legitimate sequence.

Duplicates are expected in RNAseq. Firstly, your samples are probably full of rRNAs, even if you performed ribodepletion. Secondly, any highly expressed gene is going to have apparent PCR duplication due solely to it being highly expressed (there is a ceiling on the coverage you can get before running into this, it's pretty high with paired-end reads but not infinite).

You'll see information about the adapters at the end of the fastQC report, but it's probably good to assume that you need to trim them off (at least unless the people who did the sequencing tell you otherwise).

The output from FastQC is most meaningful for sequencing DNA, so you often get warnings and errors when you use it to look at an RNAseq dataset.

ADD COMMENTlink written 5.0 years ago by Devon Ryan89k
2

I might tolerate random hexamer priming artefacts in the first 6-12 nucleotides of a read, but there has to be adapter contamination in there. I've never seen an RNA-SEq dataset with that amount of distortion in.

ADD REPLYlink written 5.0 years ago by Daniel Swan13k
1

I agree that there is almost definitely adapter contamination.  The second clue is that the quality scores are lower in the beginning of the read, go up (presumably after the adapter ends), then decrease gradually as you'd expect.  Absent adapter contamination, you usually expect the beginning of the reads to have the highest quality and decrease from there.  My guess is that the adapter sequences are synthesized with a lower accuracy rate than the reverse-transcribed RNA fragments.

In any case, you can just try to map it (or a subset) to your genome of interest and you'll quickly find out the answer.

ADD REPLYlink written 5.0 years ago by matted7.0k

I have another fastq file from the same machine gave me the overpresented sequences 

Overrepresented sequences

Sequence Count Percentage Possible Source
AAGCAGTGGTATCAACGCAGAGTACATGGGAAGCAGTGGTATCAACGCAG 42089 0.41441946173421285 No Hit
AAGCAGTGGTATCAACGCAGAGTACATGGGGGATGTGAGGGCGATCTGGC 32502 0.32002331595631606 No Hit
AAGCAGTGGTATCAACGCAGAGTACATGGGCGCGACCTCAGATCAGACGT 23822 0.2345577328383288 No Hit
AAGCAGTGGTATCAACGCAGAGTACATGGGTACCTGGTTGATCCTGCCAG 20383 0.20069642634722756 No Hit
AAGCAGTGGTATCAACGCAGAGTACATGGGAGATTCTGAAACCATTTACT 16026 0.15779624827751895 No Hit
AAGCAGTGGTATCAACGCAGAGTACTGGGTCAATAAGATATGTTGATTTT 15612 0.15371989442834305 No Hit
AAGCAGTGGTATCAACGCAGAGTACATGGGGGGCGGAGGAAGCTCATCAG 15338 0.1510220177262315 No Hit
AAGCAGTGGTATCAACGCAGAGTACATGGGACTGACACGCTGTCCTTTCC 13227 0.13023655160156888 No Hit
AAGCAGTGGTATCAACGCAGAGTACTGGCCGTGAGTCTGTTCCAAGCTCC 12826 0.12628819920176326 No Hit
AAGCAGTGGTATCAACGCAGAGTACATGGGGGGGGTGTACTGGCTTCGAC 10313 0.10154453441195888

No Hit

 

=================================================================================

Overrepresented sequences

Sequence Count Percentage Possible Source
AAGCAGTGGTATCAACGCAGAGTACATGGGTACCTGGTTGATCCTGCCAG 45741 0.38499099869649644 No Hit
AAGCAGTGGTATCAACGCAGAGTACATGGGAAGCAGTGGTATCAACGCAG 43533 0.36640679360430645 No Hit
AAGCAGTGGTATCAACGCAGAGTACATGGGCGCGACCTCAGATCAGACGT 39535 0.3327565889129225 No Hit
AAGCAGTGGTATCAACGCAGAGTACATGGGGGATGTGAGGGCGATCTGGC 32702 0.27524487088985433 No Hit
AAGCAGTGGTATCAACGCAGAGTACATGGGGACATTTGCTTCTGACATAG 24864 0.20927430951640075 No Hit
AAGCAGTGGTATCAACGCAGAGTACATGGGGGGCGGAGGAAGCTCATCAG 22127 0.18623763862087356 No Hit
AAGCAGTGGTATCAACGCAGAGTACTGGCCGTGAGTCTGTTCCAAGCTCC 19912 0.16759451621181518 No Hit
AAGCAGTGGTATCAACGCAGAGTACATGGGGCCGGGCGCGGTGGCGCACG 18570 0.15629922489219603 No Hit
AAGCAGTGGTATCAACGCAGAGTACTGGGTCAATAAGATATGTTGATTTT 16668 0.14029054822310844 No Hit
AAGCAGTGGTATCAACGCAGAGTACATGGGGAAACCAGAAGACCAACACG 12398 0.10435098493341123 No Hit
AAGCAGTGGTATCAACGCAGAGTACTTGGTGACGGCCTTGGTGCCCTCCG 12271 0.1032820564702282 No Hit

=================================================================================

Are those potential adapters? why there are so many different adapters? I think there should be only one or two there. How could I trim them off?

I will contact the sequencing service provider to get more information though.

Thank you,

 

ADD REPLYlink written 5.0 years ago by Ming Tang2.4k
2

If you look at the sequences, there is only one adapter here: AAGCAGTGGTATCAACGCAGAGTACATGGG.  The rest of the reads are genomic, probably from high abundance RNAs like ribosomal RNAs.  In fact, if you just BLAST the three sequences you showed below, two of them are rRNA (after the barcode above).  The second one doesn't have the barcode, which is confusing though.

I would attempt to trim with that sequence and see how many sequences had it.  I would also try to find out the source of the sequence with the service provider (since it doesn't match any Illumina primer I'm aware of).

ADD REPLYlink modified 5.0 years ago • written 5.0 years ago by matted7.0k

Thank you, I just contacted the service provider, and will update it here.

ADD REPLYlink written 5.0 years ago by Ming Tang2.4k

Hi Matted, I got replied from the service provider:

"Thanks you for your feedback. I am forwarding these information regarding to rRNA residual to our support team. You will hear from them soon. Sorry for the inconvenience. 

Considering 'Random primer' method had been used to this project, there was no adapter in cDNA preparation step. The adapters appeared in fastq files, if any, should be standard illumina sequencing adapter at 3'end of reads (when fragments were shorter than 100bp). Usually FastQC can identify these sequences, and report them in overrepresented part. 

 

To remove this illumina adapter, please apply following 13 bp sequence with cutadapt to both reads:

5' 'AGATCGGAAGAGC' 3'

you can find more details of illumina adapters here: http://supportres.illumina.com/documents/myillumina/6378de81-c0cc-47d0-9281-724878bb1c30/2012-09-18_illuminacustomersequenceletter.pdf

"

I then just grep this pattern to see how many sequences have this adpater

grep 'AGATCGGAAGAGC'  my.fastq | wc -l  

283607

I looked at several matched lines:

cat my.fastq | grep 'AGATCGGAAGAGC' --color=always | less -R

 

AAGCAGTGGTATCAACGCAGAGTACTTGGGAGGTGAGGTGGGAAGTAAATCGGGGTGCAGGAAATGGGAAGAGAATGGGACTGGGAGATCGGAAGAGCAC
CTCAGTTCCGAAAACCAACAAAATAGAACCGCGGTCCTATTCCATTATTCCTCCCATGTACTCTGCGTTGATACCACTGCTTAGATCGGAAGAGCACACG
AAGCAGTGGTATCAACGCAGAGTACATGGGGGGCGGAGGAAGCTCATCAGCGGGGCCACGTGCTGAGTGCTCAGATCGGAAGAGCACACGTCTGAACTCC
GTACTTGATGTATCCAACAGTTTTGCAGTACCTTTTGATGAAGATGACAAAGATGATTCTGTCTGGTTTTTAGACCATGATTAAGATCGGAAGAGCACAC
CAATAAAGAAAGCGTTCAAGCTCAACATAAAATTTCAATTAATTCCATAATTTACACCAACTAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGTTT
CTCTTCACAAGGCTTTCCTTACCTCTATTTAAGATTGCAGCTTCCAAGTACTCTGCGTTGATACCACTGCTTAGATCGGAAGAGCACACGTCTGAACTCC
AAGCAGTGGTATCAACGCAGAGTACATGGGCTCCCTAGGTGAACAGCCTCTGGAATGGGTTCGCCCCGAGAGAGATCGGAAGAGCACACGTCTGAACTCC
AGGCAGTGGTATCAACGCAGAGTACTACATACCCTTGACCGAAGACCGGTCCTCCTCTATTCGGGGAAGGTCGTCCTCAAGGGGATAAGATCGGAAGAGC

Some showed up at the end of the read (3'end), some showed up in the middle of the reads. How the cutadapt works? it will remove the adapter in the middle also? I check the pdf file, and I do not see AGATCGGAAGAGC in the file, and it did not show up in my fastqc report either.

any thoughts?

Thank you.

Ming

 

ADD REPLYlink written 5.0 years ago by Ming Tang2.4k

Yeah, now that you mention it that is high and, in fact, does have a different sequence profile than the normal bias that I've run into. I suspect that you're right and there's still adapter in there.
 

ADD REPLYlink written 5.0 years ago by Devon Ryan89k

Thank you Devon Ryan, the overpresented sequences reported by fastqc here http://postimg.org/image/n4xk7g60h/

No hit for the possible source. I planned to use trimmomatic to trim the adapter http://www.usadellab.org/cms/?page=trimmomatic . along with the package, there is a folder named adapters containing the fasta files for the adapter sequences for HiSeq machine,

>PrefixPE/1
TACACTCTTTCCCTACACGACGCTCTTCCGATCT
>PrefixPE/2
GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT

 

and they do not look like the overpresented ones in my sequencing data.

first several lines of my reads: 

@HWI-ST1129:511:H8V4NADXX:1:1101:3238:1984 1:N:0:GTTTCG
NAGCAGTGGTATCAACGCAGAGTACATGGGTACCTGGTTGATCCTGCCAGTAGCATATGCTTGTCTCAAAGATTAAGCCATGCATGTCTAAGTACGCACG
+
#1=DDFEDHFFDDHHJJJJGGGECHIIIJGCFHEHIJJJJJJIJFIEIIIGEHIGJJJJJJIJHJJJIIJFHHHHHFFFFFEEEEEEDEEDDADDDDDDD
@HWI-ST1129:511:H8V4NADXX:1:1101:3778:1986 1:N:0:GTTTCG
NCCATCTCCAAGACAGCGGTAGCACCCATCGAGAGGGTCAAGCTGCTGCTGCAGGTGCAGCATGCCAGCAAGCAAATCACGGCAGATAAGCAATACAAGG
+
#1=DFFFFHHHHHJJIJJJHIIJJJJJJJJJIJJJJJHIJJJJJJJJIJJJJJJJGHHFHHFFFFFFEDEEEDDDDDDDDDDDDDDDCDDDDDDDDDDDD
@HWI-ST1129:511:H8V4NADXX:1:1101:4730:1981 1:N:0:GTTTCG
NAGCAGTGGTATCAACGCAGAGTACATGGGTACCTGGTTGATCCTGCCAGTAGCATATGCTTGTCTCAAAGATTAAGCCATGCATGTCTAAGTACGCACG
+
#1:BDDDFFHHGDDHHIIJEGIEFFGBGIDHHIJIJGBFGHIIFGEGIIIHGIIBHIJJIJIE=DHEIJJGFHHHHFFFEFCAADCECEEDDDDDDDDBD

Thank you,

Ming

 

 

ADD REPLYlink written 5.0 years ago by Ming Tang2.4k

You're probably good to go then, but trimming won't hurt (you can get rid of the occasional low quality read, though you probably won't gain much since you seem to have decent quality).

ADD REPLYlink written 5.0 years ago by Devon Ryan89k

See the comment below from Daniel Swan as he's probably correct that there's still some adapter sequence that's leading to that.

ADD REPLYlink written 5.0 years ago by Devon Ryan89k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1335 users visited in the last hour