Question

How calculate CG content for each contig in an transcript assembly?

0

Entering edit mode

5.3 years ago

luzglongoria ▴ 50

Hi everyone,

I have a .fasta file that I have created by running trinity (so it is an transcript assembly). I want to know the CG% for each contig and keep those with low % of CG. My data belong to a bird infected by Plasmodium and and only want contig with low CG% (because Plasmodium spp have low %).

I have used BBMap for calculating the CG content for all the contigs:

 stats.sh trinity_gmap.fa
A        C        G      T        N     IUPAC   Other    GC   GC_stdev
0.3719  0.1285  0.1326  0.3670  0.0000  0.0000  0.0000  0.2611  0.1979

Main genome scaffold total:             37974
Main genome contig total:               37974
Main genome scaffold sequence total:    32.330 MB
Main genome contig sequence total:      32.330 MB   0.000% gap
Main genome scaffold N/L50:             7241/1.313 KB
Main genome contig N/L50:               7241/1.313 KB
Main genome scaffold N/L90:             26130/354
Main genome contig N/L90:               26130/354
Max scaffold length:                    14.632 KB
Max contig length:                      14.632 KB
Number of scaffolds > 50 KB:            0
% main genome in scaffolds > 50 KB:     0.00%

Minimum     Number          Number          Total           Total           Scaffold
Scaffold    of              of              Scaffold        Contig          Contig 
Length      Scaffolds       Contigs         Length          Length          Coverage
--------    --------------  --------------  --------------  --------------  --------
    All             37,974          37,974      32,329,550      32,329,550   100.00%
    100             37,974          37,974      32,329,550      32,329,550   100.00%
    250             33,820          33,820      31,398,555      31,398,555   100.00%
    500             19,529          19,529      26,343,945      26,343,945   100.00%
   1 KB             10,520          10,520      19,945,249      19,945,249   100.00%
 2.5 KB              1,873           1,873       6,692,233       6,692,233   100.00%
   5 KB                197             197       1,290,084       1,290,084   100.00%
  10 KB                  5               5          57,511          57,511   100.00%

So, the toal % of CG is 26% which is not too bad :)

however, I would like to know any way of calculating the CG content for each contig.

Any idea?

Thank you so much!

RNA-Seq assembly contig transcriptome CG content • 3.6k views

ADD COMMENT • link updated 5.3 years ago by lakhujanivijay 5.8k • written 5.3 years ago by luzglongoria ▴ 50

1

Entering edit mode

5.3 years ago

lakhujanivijay 5.8k

Using seqkit .

 seqkit fx2tab -n -g <your_fasta_file.fa>

ADD COMMENT • link 5.3 years ago by lakhujanivijay 5.8k

0

Entering edit mode

5.3 years ago

Joseph Hughes ★ 3.0k

You could use compseq from EMBOSS, you can get the observed and expected of each dinucleotide. Or if you are looking for CpG islands, use newcpgreport. EMBOSS accepts fasta as an input format.

ADD COMMENT • link 5.3 years ago by Joseph Hughes ★ 3.0k

score 1 · Accepted Answer · 2019-01-16

1

Entering edit mode

5.3 years ago

ATpoint 82k

seqtk comp gives you the absolute numbers of each nucleotide in a (multi)fasta split by entries. GC content is then only an awk command away. See Percentage of bases in a fasta sequence

ADD COMMENT • link 5.3 years ago by ATpoint 82k

0

Entering edit mode

Thank you so much!. Your comment has been very helpful.

I just read that when I run

seqtk comp hg38.fa

I get :

chr, length, #A, #C, #G, #T, #2, #3, #4, #CpG, #tv, #ts, #CpG-ts

I have been reading about what means each column :

ts transition ie. adacent A<=>G or C<=>T tv transversion - the other possible [AGTC]<=>[AGTC] ajdacent pairs CpG CG pair (revcom aware) CpG-ts CG pair (revcom aware) but allowing transitions in 1st (and/or 2nd) base 2 number of ambiguous IUPAC bases with 2 possibile values 3 ditto with 3 4 ditto with 4 (I assume this means "N")

I think that the one I am interesting in is CpG. Or may be I should just focus on C and G columns?

ADD REPLY • link 5.3 years ago by luzglongoria ▴ 50

1

Entering edit mode

If you want CG content, focus on G and C column and divide the sum of these two by the sum of all A, C, T, G.

ADD REPLY • link 5.3 years ago by ATpoint 82k