Question: How To Calculate Gc Content Of Each Contig In A Multifasta File
1
gravatar for HG
5.3 years ago by
HG1.1k
Germany
HG1.1k wrote:

Hi all, I have a multi fasta file containing around 200 contig. I want to calculate the base composition, mainly GC content and length of each contig. Can anyone suggest me how to do it using awk or perl??

Thank you.

perl awk • 10k views
ADD COMMENTlink modified 5.3 years ago by Hamish3.0k • written 5.3 years ago by HG1.1k
7
gravatar for Hamish
5.3 years ago by
Hamish3.0k
UK
Hamish3.0k wrote:

While you could write some Perl to do this, if you are only interested in some basic information about the sequences then using a existing tool such as the EMBOSS program infoseq is probably going to be easier. For example, getting the sequence length and GC composition:

$ infoseq -auto -only -accession -length -pgc em_rel_est_env
Accession      Length %GC    
AB446243       43     55.81  
AB446244       174    59.20  
AB446245       195    52.31  
AB446246       205    61.46  
AB446247       133    60.15  
AB446248       106    62.26  
AB446249       73     63.01  
AB446250       216    57.41  
...

While this example uses white-space padded columns, the '-nocolumns' and '-delimiter' options can be used to produce a delimited table for easier parsing, and the header line detailing the columns can be disabled using the '-noheading' option.

If you are interested in extracting other information from the sequences, as a staring point try looking at the other EMBOSS programs: http://emboss.open-bio.org/html/use/apbs02.html

From Perl you could use 'system()' to run an EMBOSS program externally, or you could use the EMBOSS support in BioPerl to run EMBOSS programs (see http://www.bioperl.org/wiki/HOWTO:Beginners#Using_EMBOSS_applications_with_Bioperl).

Alternativly you could use the sequence information support available in BioPerl, see http://www.bioperl.org/wiki/HOWTO:Beginners#Obtaining_basic_sequence_statistics.

ADD COMMENTlink written 5.3 years ago by Hamish3.0k

infoseq is working fine . But a small problem my sequence are like this

>NODE_209_length_61_cov_14.6667_ID_976848
gaaggggggacagataattggggtgaagtcgtaacaaggtagccgtatcggaaggtgcgg
>NODE_210_length_61_cov_13.1667_ID_4045294
gaaggggggacagataattggggtgaagtcgtaacaaggtagccgtatcggaaggggcgg
>NODE_211_length_61_cov_860.667_ID_4107208
gaaggtgggacagataattggggtgaagtcgtaacaaggtagccgtatcggaaggtgcgg
>NODE_212_length_61_cov_93.1667_ID_4112696
cctaaagagtaacggaggcgcccaaagggtccctcagaatggatggaaatcattcgcaga

and i am getting result like

> Accession      Length %GC    
> -              145078 37.76  
> -              5241   39.48  
> -              257506 37.64  
> -              396    40.91

Could you please let me know how can i get name of each contig in place of accession number

ADD REPLYlink modified 5.3 years ago • written 5.3 years ago by HG1.1k

From the 'infoseq' usage message (infoseq -help):

   -name               boolean    [@(!$(only))] Display 'name' column
   -accession          boolean    [@(!$(only))] Display 'accession' column

The default behaviour of 'infoseq' is to display all of the columns with the appropriate headers, so you can always use something like:

infoseq -auto em_rel_est_env

To see what all the columns look like, and then select the appropriate ones for your specific data.

ADD REPLYlink modified 5.3 years ago • written 5.3 years ago by Hamish3.0k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1310 users visited in the last hour