Question

How To Calculate GC Content Of Each Contig In A Multifasta File

10.7 years ago

HG ★ 1.2k

Hi all, I have a multi-FASTA file containing around 200 contigs. I want to calculate the base composition, mainly the GC content, and the length of each contig. Can anyone suggest how to do this using awk or Perl?

Thank you.

perl awk • 16k views

updated 10.7 years ago by Hamish ★ 3.2k • written 10.7 years ago by HG ★ 1.2k

score 7 · Answer 1 · 2013-08-21

While you could write some Perl to do this, if you are only interested in basic information about the sequences, then using an existing tool such as the EMBOSS program infoseq is probably going to be easier. For example, to get the sequence length and GC composition:

$ infoseq -auto -only -accession -length -pgc em_rel_est_env
Accession      Length %GC    
AB446243       43     55.81  
AB446244       174    59.20  
AB446245       195    52.31  
AB446246       205    61.46  
AB446247       133    60.15  
AB446248       106    62.26  
AB446249       73     63.01  
AB446250       216    57.41  
...

While this example uses white-space padded columns, the '-nocolumns' and '-delimiter' options can be used to produce a delimited table for easier parsing, and the header line detailing the columns can be disabled using the '-noheading' option.
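Even the white-space padded output is straightforward to post-process with awk, since awk splits fields on runs of white-space by default. As a rough sketch, the helper below (the name gc_above is made up for illustration) keeps only rows whose %GC exceeds a threshold. It assumes the three-column layout shown above (accession, length, %GC) and would need adjusting if other infoseq columns are enabled.

```shell
# gc_above MIN: read an infoseq-style table on stdin and print, tab-separated,
# the rows whose third column (%GC) is greater than MIN.
# Hypothetical helper; assumes columns are: accession, length, %GC.
gc_above() {
  awk -v min="$1" '
    $3 == "%GC" { next }                 # skip the heading line if present
    NF >= 3 && $3 + 0 > min + 0 {        # ignore short lines such as "..."
      print $1 "\t" $2 "\t" $3
    }
  '
}
```

It could then be used as, for example, `infoseq -auto -only -accession -length -pgc contigs.fasta | gc_above 60`, where contigs.fasta is a placeholder for your own file.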

If you are interested in extracting other information from the sequences, as a starting point try looking at the other EMBOSS programs: http://emboss.open-bio.org/html/use/apbs02.html

From Perl you could use 'system()' to run an EMBOSS program externally, or you could use the EMBOSS support in BioPerl to run EMBOSS programs (see http://www.bioperl.org/wiki/HOWTO:Beginners#Using_EMBOSS_applications_with_Bioperl).

Alternatively, you could use the sequence information support available in BioPerl, see http://www.bioperl.org/wiki/HOWTO:Beginners#Obtaining_basic_sequence_statistics.
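If installing EMBOSS or BioPerl is not an option, the awk route mentioned in the question is also workable. The following is only a minimal sketch, not a replacement for infoseq: the function name fasta_gc is made up for illustration, the contig ID is taken to be the header text up to the first white-space, and %GC is computed as (G + C) / total letters, so N and other ambiguity codes count towards the length but not towards GC.

```shell
# fasta_gc FILE: print "id<TAB>length<TAB>%GC" for each record in a FASTA file.
# Hypothetical helper name; uses only standard POSIX awk features.
fasta_gc() {
  awk '
    # Print the summary line for the record collected so far, if any.
    function report() {
      if (have) {
        pct = (len > 0) ? 100 * gc / len : 0
        printf "%s\t%d\t%.2f\n", name, len, pct
      }
    }
    /^>/ {
      report()
      name = $0
      sub(/^>[ \t]*/, "", name)       # drop the ">" marker
      sub(/[ \t].*$/, "", name)       # keep the ID, drop the description
      len = 0; gc = 0; have = 1
      next
    }
    {
      seq = $0
      gsub(/[^A-Za-z]/, "", seq)      # drop white-space, CR, digits, gaps
      len += length(seq)
      gc  += gsub(/[GCgc]/, "", seq)  # gsub returns the number of matches
    }
    END { report() }
  ' "$1"
}
```

Running `fasta_gc contigs.fasta` (contigs.fasta being a placeholder for your own file) gives one tab-separated line per contig. Sequences wrapped over several lines are handled because length and GC counts are accumulated until the next header.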