Question: How calculate CG content for each contig in an transcript assembly?
0
gravatar for luzglongoria
7 months ago by
luzglongoria20
luzglongoria20 wrote:

Hi everyone,

I have a .fasta file that I have created by running trinity (so it is an transcript assembly). I want to know the CG% for each contig and keep those with low % of CG. My data belong to a bird infected by Plasmodium and and only want contig with low CG% (because Plasmodium spp have low %).

I have used BBMap for calculating the CG content for all the contigs:

 stats.sh trinity_gmap.fa
A        C        G      T        N     IUPAC   Other    GC   GC_stdev
0.3719  0.1285  0.1326  0.3670  0.0000  0.0000  0.0000  0.2611  0.1979

Main genome scaffold total:             37974
Main genome contig total:               37974
Main genome scaffold sequence total:    32.330 MB
Main genome contig sequence total:      32.330 MB   0.000% gap
Main genome scaffold N/L50:             7241/1.313 KB
Main genome contig N/L50:               7241/1.313 KB
Main genome scaffold N/L90:             26130/354
Main genome contig N/L90:               26130/354
Max scaffold length:                    14.632 KB
Max contig length:                      14.632 KB
Number of scaffolds > 50 KB:            0
% main genome in scaffolds > 50 KB:     0.00%

Minimum     Number          Number          Total           Total           Scaffold
Scaffold    of              of              Scaffold        Contig          Contig 
Length      Scaffolds       Contigs         Length          Length          Coverage
--------    --------------  --------------  --------------  --------------  --------
    All             37,974          37,974      32,329,550      32,329,550   100.00%
    100             37,974          37,974      32,329,550      32,329,550   100.00%
    250             33,820          33,820      31,398,555      31,398,555   100.00%
    500             19,529          19,529      26,343,945      26,343,945   100.00%
   1 KB             10,520          10,520      19,945,249      19,945,249   100.00%
 2.5 KB              1,873           1,873       6,692,233       6,692,233   100.00%
   5 KB                197             197       1,290,084       1,290,084   100.00%
  10 KB                  5               5          57,511          57,511   100.00%

So, the toal % of CG is 26% which is not too bad :)

however, I would like to know any way of calculating the CG content for each contig.

Any idea?

Thank you so much!

ADD COMMENTlink modified 7 months ago by lakhujanivijay4.3k • written 7 months ago by luzglongoria20
1
gravatar for ATpoint
7 months ago by
ATpoint21k
Germany
ATpoint21k wrote:

seqtk comp gives you the absolute numbers of each nucleotide in a (multi)fasta split by entries. GC content is then only an awk command away. See Percentage of bases in a fasta sequence

ADD COMMENTlink modified 7 months ago • written 7 months ago by ATpoint21k

Thank you so much!. Your comment has been very helpful.

I just read that when I run

seqtk comp hg38.fa

I get :

chr, length, #A, #C, #G, #T, #2, #3, #4, #CpG, #tv, #ts, #CpG-ts

I have been reading about what means each column :

ts transition ie. adacent A<=>G or C<=>T tv transversion - the other possible [AGTC]<=>[AGTC] ajdacent pairs CpG CG pair (revcom aware) CpG-ts CG pair (revcom aware) but allowing transitions in 1st (and/or 2nd) base 2 number of ambiguous IUPAC bases with 2 possibile values 3 ditto with 3 4 ditto with 4 (I assume this means "N")

I think that the one I am interesting in is CpG. Or may be I should just focus on C and G columns?

ADD REPLYlink written 7 months ago by luzglongoria20
1

If you want CG content, focus on G and C column and divide the sum of these two by the sum of all A, C, T, G.

ADD REPLYlink written 7 months ago by ATpoint21k
0
gravatar for Joseph Hughes
7 months ago by
Joseph Hughes2.7k
Scotland, UK
Joseph Hughes2.7k wrote:

You could use compseq from EMBOSS, you can get the observed and expected of each dinucleotide. Or if you are looking for CpG islands, use newcpgreport. EMBOSS accepts fasta as an input format.

ADD COMMENTlink written 7 months ago by Joseph Hughes2.7k
0
gravatar for lakhujanivijay
7 months ago by
lakhujanivijay4.3k
India
lakhujanivijay4.3k wrote:

Using seqkit .

 seqkit fx2tab -n -g <your_fasta_file.fa>
ADD COMMENTlink written 7 months ago by lakhujanivijay4.3k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 552 users visited in the last hour