How To Calculate Gc Content Of Each Contig In A Multifasta File
1
1
Entering edit mode
10.7 years ago
HG ★ 1.2k

Hi all, I have a multi fasta file containing around 200 contig. I want to calculate the base composition, mainly GC content and length of each contig. Can anyone suggest me how to do it using awk or perl??

Thank you.

perl awk • 16k views
ADD COMMENT
7
Entering edit mode
10.7 years ago
Hamish ★ 3.2k

While you could write some Perl to do this, if you are only interested in some basic information about the sequences then using a existing tool such as the EMBOSS program infoseq is probably going to be easier. For example, getting the sequence length and GC composition:

$ infoseq -auto -only -accession -length -pgc em_rel_est_env
Accession      Length %GC    
AB446243       43     55.81  
AB446244       174    59.20  
AB446245       195    52.31  
AB446246       205    61.46  
AB446247       133    60.15  
AB446248       106    62.26  
AB446249       73     63.01  
AB446250       216    57.41  
...

While this example uses white-space padded columns, the '-nocolumns' and '-delimiter' options can be used to produce a delimited table for easier parsing, and the header line detailing the columns can be disabled using the '-noheading' option.

If you are interested in extracting other information from the sequences, as a staring point try looking at the other EMBOSS programs: http://emboss.open-bio.org/html/use/apbs02.html

From Perl you could use 'system()' to run an EMBOSS program externally, or you could use the EMBOSS support in BioPerl to run EMBOSS programs (see http://www.bioperl.org/wiki/HOWTO:Beginners#Using_EMBOSS_applications_with_Bioperl).

Alternativly you could use the sequence information support available in BioPerl, see http://www.bioperl.org/wiki/HOWTO:Beginners#Obtaining_basic_sequence_statistics.

ADD COMMENT
0
Entering edit mode

infoseq is working fine . But a small problem my sequence are like this

>NODE_209_length_61_cov_14.6667_ID_976848
gaaggggggacagataattggggtgaagtcgtaacaaggtagccgtatcggaaggtgcgg
>NODE_210_length_61_cov_13.1667_ID_4045294
gaaggggggacagataattggggtgaagtcgtaacaaggtagccgtatcggaaggggcgg
>NODE_211_length_61_cov_860.667_ID_4107208
gaaggtgggacagataattggggtgaagtcgtaacaaggtagccgtatcggaaggtgcgg
>NODE_212_length_61_cov_93.1667_ID_4112696
cctaaagagtaacggaggcgcccaaagggtccctcagaatggatggaaatcattcgcaga

and i am getting result like

> Accession      Length %GC    
> -              145078 37.76  
> -              5241   39.48  
> -              257506 37.64  
> -              396    40.91

Could you please let me know how can i get name of each contig in place of accession number

ADD REPLY
0
Entering edit mode

From the 'infoseq' usage message (infoseq -help):

   -name               boolean    [@(!$(only))] Display 'name' column
   -accession          boolean    [@(!$(only))] Display 'accession' column

The default behaviour of 'infoseq' is to display all of the columns with the appropriate headers, so you can always use something like:

infoseq -auto em_rel_est_env

To see what all the columns look like, and then select the appropriate ones for your specific data.

ADD REPLY

Login before adding your answer.

Traffic: 2828 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6