Meaning of Per base sequence content in FastQC
6.7 years ago

Hi all,

Can anybody help me understand the meaning of "Per base sequence content" in a FastQC analysis? The manual defines it as "the proportion of each base position in a file for which each of the four normal DNA bases has been called", but I couldn't understand what that means. If anybody can explain the concept with a simple example, that would be a great help.

Thanks,

DeepS

6.7 years ago

It's surprisingly straightforward, but it's a difficult concept to put into a single coherent sentence :) The easiest explanation is to describe the steps used to generate it.

Let's take a small example set of sequences (without quality scores):

CATAAATTCATTTTTTAATAGCTGAGTAGTATTCCATTGTGTAAATGTAC
CGATTCTTGATCTTACAGCACAAGCCATTGCTGTTCTATTCAGGAATTTT
TCTTGATCTTACAGCACAAGCCATTGCTGTTCTATTCAGGAATTTTTCC
GTAAGTTTCAGTGTCTCTGGTTTTATGTGGAGTTCCTTAATCCACTT
ATAGGAATGGATCAATTCGCATTCTTCTACATGATAACAGCCAGTTGTGC
GTCAAAGATCAGGTGACCATAGGTGTGTGGATTCATCTCTGGGTCTTCA
GTGGATTCATCTCTGGGTCTTCAATTCTGTTACATTGGTCTACTTGTCTG
ACCATGCAGTTTTGATCACAATTGCTCTGTAGTACAGTTTTAGGTCCGGC
GATGAATCTGCCGATTGCCCTTTCTAATTCGTTGAAGAATTGAGTTGGAA
CTGGCTAGGACTTCAAGTACAATGTTGAATAGGTAGGGCGAGAGTGGA

To make life easier we'll just consider the first two positions of each read, rather than the whole thing. So we start out with 4 vectors of zeros (one vector for each nucleotide): A=[0,0], C=[0,0], G=[0,0], and T=[0,0].

We then read in one read at a time and increment these vectors according to the bases we see. The first read has a C in position 1 and an A in position 2, so we increment the first position of the C vector (resulting in C=[1,0]) and the second position of the A vector (so now A=[0,1]). We continue doing that for each additional read, which results in:

read2: A=[0,1], C=[2,0], G=[0,1], T=[0,0]
read3: A=[0,1], C=[2,1], G=[0,1], T=[1,0]
read4: A=[0,1], C=[2,1], G=[1,1], T=[1,1]
read5: A=[1,1], C=[2,1], G=[1,1], T=[1,2]
read6: A=[1,1], C=[2,1], G=[2,1], T=[1,3]
read7: A=[1,1], C=[2,1], G=[3,1], T=[1,4]
read8: A=[2,1], C=[2,2], G=[3,1], T=[1,4]
read9: A=[2,2], C=[2,2], G=[4,1], T=[1,4]
read10: A=[2,2], C=[3,2], G=[4,1], T=[1,5]
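
The counting steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not FastQC's actual code; the function name `count_bases` and the choice to skip anything that isn't A, C, G, or T are my own assumptions.

```python
# The ten example reads from above (no quality scores).
reads = [
    "CATAAATTCATTTTTTAATAGCTGAGTAGTATTCCATTGTGTAAATGTAC",
    "CGATTCTTGATCTTACAGCACAAGCCATTGCTGTTCTATTCAGGAATTTT",
    "TCTTGATCTTACAGCACAAGCCATTGCTGTTCTATTCAGGAATTTTTCC",
    "GTAAGTTTCAGTGTCTCTGGTTTTATGTGGAGTTCCTTAATCCACTT",
    "ATAGGAATGGATCAATTCGCATTCTTCTACATGATAACAGCCAGTTGTGC",
    "GTCAAAGATCAGGTGACCATAGGTGTGTGGATTCATCTCTGGGTCTTCA",
    "GTGGATTCATCTCTGGGTCTTCAATTCTGTTACATTGGTCTACTTGTCTG",
    "ACCATGCAGTTTTGATCACAATTGCTCTGTAGTACAGTTTTAGGTCCGGC",
    "GATGAATCTGCCGATTGCCCTTTCTAATTCGTTGAAGAATTGAGTTGGAA",
    "CTGGCTAGGACTTCAAGTACAATGTTGAATAGGTAGGGCGAGAGTGGA",
]


def count_bases(reads, n_positions):
    """Count A/C/G/T at each of the first n_positions of the reads.

    Hypothetical helper for illustration; returns a dict mapping each
    base to a list of per-position counts.
    """
    # One vector of zeros per nucleotide.
    counts = {base: [0] * n_positions for base in "ACGT"}
    for read in reads:
        for i, base in enumerate(read[:n_positions]):
            if base in counts:  # assumption: ignore N and other symbols
                counts[base][i] += 1
    return counts


counts = count_bases(reads, 2)
print(counts)
# {'A': [2, 2], 'C': [3, 2], 'G': [4, 1], 'T': [1, 5]}
```

The printed totals match the read10 line above.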

We then divide the counts by the number of reads (10 here) and plot the results. For position 1, for example, that gives A = 20%, C = 30%, G = 40%, and T = 10%. With a real dataset we expect to see roughly flat lines that represent the percentages of A, C, G, and T in the genome. However, there are often biases (particularly at the start of reads), so we perform this analysis to pick those up.
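
As a rough sketch of the division step, here are the final counts from the example turned into percentages. One assumption on my part: I divide by the number of bases actually counted at each position rather than by a fixed read count, since the example reads are not all the same length. For the first two positions that is the same as dividing by 10.

```python
# Final counts for the first two positions, taken from the example above.
counts = {"A": [2, 2], "C": [3, 2], "G": [4, 1], "T": [1, 5]}
n_positions = 2

# Number of bases counted at each position (10 and 10 here).
totals = [sum(counts[base][i] for base in "ACGT") for i in range(n_positions)]

# Percentage of each base at each position; these are the values that
# would be plotted, one line per base.
percent = {
    base: [100 * counts[base][i] / totals[i] for i in range(n_positions)]
    for base in "ACGT"
}

for base in "ACGT":
    print(base, percent[base])
# A [20.0, 20.0]
# C [30.0, 20.0]
# G [40.0, 10.0]
# T [10.0, 50.0]
```

With only ten reads the lines jump around a lot; with millions of reads the per-position percentages settle down, and any remaining bias stands out.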


Thanks, Ryan.

So what is the ideal case? Is it, say, that at position 1 each of the four bases accounts for 25% of the reads?


Well, genomes typically aren't composed of 25% of each base. The ideal situation would be 4 flat lines with reasonable percentages.
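
To give a rough feel for where those flat lines might sit, here is a small sketch that derives them from a GC content. It rests on assumptions that are mine, not FastQC's: reads sample the genome evenly from both strands, so A and T are about equally frequent, as are G and C. The function name and the 40% GC value are made up for illustration.

```python
def expected_base_percentages(gc_percent):
    """Rough expected level of each base for a given GC content (in %).

    Hypothetical helper: assumes A ~= T and G ~= C, which is only an
    approximation for real libraries.
    """
    gc_each = gc_percent / 2          # G and C split the GC share
    at_each = (100 - gc_percent) / 2  # A and T split the remainder
    return {"A": at_each, "C": gc_each, "G": gc_each, "T": at_each}


# Illustrative value only: a genome with 40% GC.
expected = expected_base_percentages(40)
print(expected)
# {'A': 30.0, 'C': 20.0, 'G': 20.0, 'T': 30.0}
```

So for such a genome you would hope for A and T lines near 30% and G and C lines near 20% across all positions, rather than four lines at 25%.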


Yes, I understand now. Thank you. You explained the concept clearly.

DeepS


Sorry for necro'ing this thread, but what exactly constitutes "reasonable percentages" here?


I'm totally new to sequencing. Could you suggest some reference papers or other material from which I can learn the concepts? Your help will be appreciated.

Thanks

6.7 years ago
Ian 5.7k

There is a handy YouTube video by the author of FastQC that explains the different concepts. Per base sequence content is described five minutes into the video.
