Question: GC content calculation
1
gravatar for pbigbig
2.5 years ago by
pbigbig190
United States
pbigbig190 wrote:

Hi everyone,

I have some confused results regarding GC content in transcriptome, I have made a transcriptome de novo assembly by Trinity, and exploited some prebuild scripts (i.e TrinityStats) in Trinity package to calculate basic statistics of assembled transcriptome, it returned GC% = 46.64 %.

Meanwhile, I tried to use Prinseq to get some summaries (although it is designed to use with reads), and get the result like: GC Content Distribution Mean GC content: 46.04 ± 6.75 %

Minimum GC content: 19 %

Maximum GC content: 74 %

GC content range: 56 %

Mode GC content: 45 % with 4,275 sequences

On the other hands, when I count number of G and C base in multifasta transcriptome (using simple 'grep' while omitting every fasta header), then calculate GC content by fomula (G+C)/(G+C+A+T), I got 49.96%.

Could anyone help me clarify these differences and suggest which GC content result is accurate? Thank you very much.

ADD COMMENTlink modified 2.5 years ago by Brian Bushnell16k • written 2.5 years ago by pbigbig190

maybe low quality nucleotides or reads are not included (or some very short reads)

ADD REPLYlink modified 2.5 years ago • written 2.5 years ago by stolarek.ir580

Do the sequences contain T's and not U's? Are the TrinityStats based on all contigs or just longest isoforms?

ADD REPLYlink written 2.5 years ago by 5heikki8.4k

Thanks, It doesn't contain U, below is the full output of TrinityStats, I think it based on all contigs for GC content calculation:

Counts of transcripts, etc.
Total trinity 'genes':  68518
Total trinity transcripts:  72229
Percent GC: 46.64

Stats based on ALL transcript contigs:

    Contig N10: 5470
    Contig N20: 4094
    Contig N30: 3301
    Contig N40: 2700
    Contig N50: 2201

    Median contig length: 606
    Average contig: 1175.68
    Total assembled bases: 84918026

Stats based on ONLY LONGEST ISOFORM per 'GENE':

    Contig N10: 5265
    Contig N20: 3946
    Contig N30: 3187
    Contig N40: 2593
    Contig N50: 2114

    Median contig length: 574
    Average contig: 1124.04
    Total assembled bases: 77017207
ADD REPLYlink modified 2.5 years ago • written 2.5 years ago by pbigbig190

Some GC% programs are stupid and do (G+C)/(G+C+A+T+N) which is to essentially call all Ns C or T. They often don't mention N specifically, they just divide by the length of the sequence, which is the same thing.

You can usually spot this because it gives you slightly lower GC percentages, which appears to be what you're seeing.

The 46.64 and 46.06 is probably the sum of some rounding errors, but I don't know for sure obviously. The only way to know for sure is to read the code. Can you read the code or?

ADD REPLYlink modified 2.5 years ago • written 2.5 years ago by John12k
1

Thanks for your suggestion, I would check the code to be sure

ADD REPLYlink written 2.5 years ago by pbigbig190
1

John, could you elaborate what you mean with this formula "essentially call[s] all Ns C or T", please?

ADD REPLYlink written 2.5 years ago by cschu1811.6k
1
seq = 'ATCGTGATGANAACT'
a = seq.count('A')
c = seq.count('C')
g = seq.count('G')
t = seq.count('T')
float(c + g) / len(seq) # returns 0.3333333 - note we didn't have to count a or t
float(c + g) / (a+c+g+t) # returns 0.3571428
ADD REPLYlink modified 2.5 years ago • written 2.5 years ago by John12k
1

Thank you, however I was referring to what you mean with "to essentially call all Ns C or T"?

Above you stated

Some GC% programs are stupid and do (G+C)/(G+C+A+T+N) which is to essentially call all Ns C or T.

, which I understand as "those programs treat Ns as C or T" and that doesn't make sense to me regarding the formula you provided.

Also, which would be the correct way to compute GC when ambiguity codes are involved? Assign probabilities of G/C observations (N,M,K,R,Y=0.5, W=0, S=1, B,V=2/3, D,H=1/3)? Or, like in your example 1) treat ambiguities as non-GC or 2) just calculate GC as fraction of non-ambiguous bases?

ADD REPLYlink written 2.5 years ago by cschu1811.6k
2

I just calculate GC from the unambiguous bases. Normal sequence doesn't usually contain ambiguous bases other than N; those tend to just be used to describe motifs and primers. When an N is present in normal sequence, it's best to ignore it because treating it as 0.5 will bias the calculation toward 50%.

ADD REPLYlink written 2.5 years ago by Brian Bushnell16k
1

Ah, you're right, i meant to say A or t, not C. Good spot. I won't fix it so your comment makes sense.

Regarding GC% with ambiguity codes, I think it's a recipe for troubles. I suppose if it's the C+G or A+T ambiguity code then it's OK, but it's a mistake to think "A or T or C" means "there is an equal probability of A/T/C". It doesn't. It means N. It's better to throw bad data in the bin than try and make allowances for it. If you try and squeeze all the milk out of a cow you end up with bad milk and a sore cow.

ADD REPLYlink written 2.5 years ago by John12k
4
gravatar for Brian Bushnell
2.5 years ago by
Walnut Creek, USA
Brian Bushnell16k wrote:

The BBMap package has a tool called "stats" which will accurately calculate GC content. You can use that as a tie-breaker. Usage:

stats.sh assembly.fasta
ADD COMMENTlink written 2.5 years ago by Brian Bushnell16k
1

Thanks, It is really a tie-breaker, here's what I have got:

A       C       G       T       N       IUPAC   Other   GC      GC_stdev
0.2773  0.2281  0.2383  0.2563  0.0000  0.0000  0.0000  0.4664  0.0675
ADD REPLYlink written 2.5 years ago by pbigbig190
2

Looks like Trinity was correct, then!

ADD REPLYlink written 2.5 years ago by Brian Bushnell16k

Looking forward to it generating a fourth GC% value :P heheh

ADD REPLYlink written 2.5 years ago by John12k
3

4 would be bad. With 3 or 5 he could use the median.

ADD REPLYlink written 2.5 years ago by Brian Bushnell16k

Hahahahaha xD

ADD REPLYlink written 2.5 years ago by John12k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 823 users visited in the last hour