Question

Determine % coding of a bacterial genome sequence

0

5.7 years ago

suzuBell ▴ 60

I am reading a paper with interesting statistics on bacterial genomes (here). It states that "The average protein coding content of a bacterial genome is 88% for the 2671 finished genomes in GenBank".

I have assembled the contigs of a bacterial genome and have a contigs.fasta file. Is there a quick & easy way to determine the % protein coding for this assembled genome?

assembly coding protein • 1.9k views

1

For something quick and easy, I would predict the open reading frames (ORFs) and write them to another file. Then count the base pairs in both files and calculate the percentage.
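A minimal sketch of that counting step in Python, assuming both files are plain FASTA. The function names `count_bases` and `percent_of` are made up for illustration, and the tiny in-memory records stand in for the real contigs and ORF files. Note that the ratio is only meaningful if the predicted regions do not overlap each other; otherwise the same base gets counted more than once.

```python
def count_bases(fasta_lines):
    """Sum the sequence characters in FASTA-formatted lines, skipping '>' headers."""
    total = 0
    for line in fasta_lines:
        line = line.strip()
        if line and not line.startswith(">"):
            total += len(line)
    return total


def percent_of(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


# Toy data standing in for contigs.fasta and the predicted-ORF file.
contig_lines = [">contig1", "ACGTACGTAC"]
orf_lines = [">contig1_1", "ACGTAC"]

genome_bp = count_bases(contig_lines)
orf_bp = count_bases(orf_lines)
print(f"{percent_of(orf_bp, genome_bp):.1f}% of bases fall in predicted ORFs")
```

For real files, the same functions can be fed an open file handle, for example `count_bases(open("contigs.fasta"))`.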

ADD REPLY • link 5.7 years ago by gb ★ 2.2k

0

Thanks @gb. I'm in the process of trying your advice now. To make sure I clearly understand these points: I am trying to figure out three values for this bacterial genome: 1) % coding, 2) number of genes, 3) number of protein coding genes. It sounds like % coding is just the total number of base pairs in ORFs divided by the number of base pairs in the assembled contigs. Would "number of genes" just be the number of ORFs? And if so, how could I determine the "number of protein coding genes"? I am asking because if I use the metric you recommended for objective 1 (% coding), doesn't it assume that all ORFs are not only genes, but also protein coding genes (i.e. objectives 2 and 3 are the same value)?

ADD REPLY • link 5.7 years ago by suzuBell ▴ 60

0

Thanks @gb. I also tried the recommendation. My ordered contigs.fasta file has 1,500,360 bases. To determine ORFs, I first tried NCBI ORFFinder, but it has a 50K limit. I then tried getORF from EMBOSS instead, running it on the Galaxy platform with all the defaults. When I downloaded the resulting ORF file and counted the bases, there were 7,313,741! I was definitely expecting fewer bases than in the ordered contigs.fasta file, so that % coding would be less than 100. But right now my coding percentage is ~500% (which makes no sense). I am sorry; I suspect this is a simple problem to solve, but I do not have much experience analyzing genomic data. Do you have any suggestions about what might be causing this problem?

ADD REPLY • link 5.7 years ago by suzuBell ▴ 60

0

Thanks @gb. This is an example of what the beginning of the output of getORF looks like for me:

>NODE_43_length_251_cov_0.839080_1 [2 - 40] 
tgcttgattgatagcataatagcggttattataagtggc
>NODE_43_length_251_cov_0.839080_2 [36 - 65] 
gtggctagggggtcttgcaaattcaccgca
>NODE_43_length_251_cov_0.839080_3 [90 - 119] 
aaggtctttactacccctcttttattatca
>NODE_43_length_251_cov_0.839080_4 [16 - 141] 
cataatagcggttattataagtggctagggggtcttgcaaattcaccgcataattataat
acccactatcttgaaaggtctttactacccctcttttattatcataagtgtaatttacat
ctacag
>NODE_43_length_251_cov_0.839080_5 [44 - 247] 
ggggtcttgcaaattcaccgcataattataatacccactatcttgaaaggtctttactac
ccctcttttattatcataagtgtaatttacatctacagtagcacgatagaaagttgtaaa
tccctctttattttgcgtgatacttgtatcaatcactctagtaatttctatcgtaataat
agaatctgcatctttttcacttgctag

To get my value of 7,313,741, I simply removed any lines that started with '>' (the headers) and counted the remaining characters (a, t, c, g). I do notice that there seems to be overlap between the sequences. For instance, the last five characters of the first sequence above ('gtggc') are the same as the first five characters of the second sequence.
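One way to avoid counting overlapping bases more than once is to work from the coordinates in the headers rather than the sequence lengths: collect the `[start - end]` intervals per contig, merge the ones that overlap, and sum the merged lengths. The sketch below assumes the header layout shown above (contig name, `_N` suffix, then `[start - end]`); the names `orf_intervals` and `covered_bases` are hypothetical. It swaps start and end when start is larger, on the assumption that reverse-strand ORFs are reported that way. For the five headers quoted above, the intervals add up to 429 positions but cover only 246 distinct positions on the contig.

```python
import re
from collections import defaultdict

# Matches headers like ">NODE_43_length_251_cov_0.839080_1 [2 - 40]".
# Assumption: the trailing "_<number>" is the ORF index added by the tool.
HEADER_RE = re.compile(r"^>(\S+)_\d+\s+\[(\d+)\s*-\s*(\d+)\]")


def orf_intervals(fasta_lines):
    """Collect (start, end) intervals per contig from ORF header lines."""
    intervals = defaultdict(list)
    for line in fasta_lines:
        match = HEADER_RE.match(line)
        if match:
            contig = match.group(1)
            a, b = int(match.group(2)), int(match.group(3))
            intervals[contig].append((min(a, b), max(a, b)))
    return intervals


def covered_bases(intervals):
    """Count distinct positions covered by at least one interval (1-based, inclusive)."""
    total = 0
    for spans in intervals.values():
        spans = sorted(spans)
        cur_start, cur_end = spans[0]
        for start, end in spans[1:]:
            if start <= cur_end + 1:  # overlapping or directly adjacent
                cur_end = max(cur_end, end)
            else:
                total += cur_end - cur_start + 1
                cur_start, cur_end = start, end
        total += cur_end - cur_start + 1
    return total


headers = [
    ">NODE_43_length_251_cov_0.839080_1 [2 - 40] ",
    ">NODE_43_length_251_cov_0.839080_2 [36 - 65] ",
    ">NODE_43_length_251_cov_0.839080_3 [90 - 119] ",
    ">NODE_43_length_251_cov_0.839080_4 [16 - 141] ",
    ">NODE_43_length_251_cov_0.839080_5 [44 - 247] ",
]
print(covered_bases(orf_intervals(headers)))
```

This only removes the double counting. It does not make every ORF a gene, so the result would still likely overestimate the true coding fraction.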

This is a screenshot showing the parameter fields available for getORF in Galaxy (I kept them all default).

Is this the expected format you are thinking of for ORFs? If not, do you have any suggestions from experience how I can obtain an appropriate basepair count from the ORF output?

ADD REPLY • link 5.7 years ago by suzuBell ▴ 60

1

I am no expert either =) My comment was just meant to give you a starting point. To predict genes I have used AUGUSTUS (http://augustus.gobics.de/) before. As you can see in the FASTA file, you have some very short sequences that probably cannot code for a protein. Maybe it would help to change the setting "What to output" to regions between start and stop codons. I don't know exactly how this tool works. After this you could run blastx to double-check, but then you go beyond "quick and easy". To me, quick and easy also means a rough overall result rather than a precise one.
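As a rough illustration of dropping the very short sequences, here is a small Python sketch that reads FASTA records and keeps only those at or above a minimum length. The cutoff of 300 nt is purely an assumption for the example, not a value from this thread or from the tool, and `read_fasta` and `filter_by_length` are made-up helper names.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(records, min_len):
    """Keep only records whose sequence is at least min_len characters long."""
    return [(h, s) for h, s in records if len(s) >= min_len]


MIN_ORF_LEN = 300  # assumed cutoff in nucleotides; adjust to taste

# Toy records: one short ORF and one long ORF split over two lines.
orf_lines = [">orf_short", "atgaaa", ">orf_long", "a" * 200, "c" * 150]
kept = filter_by_length(read_fasta(orf_lines), MIN_ORF_LEN)
for header, seq in kept:
    print(header, len(seq))
```

A length filter like this is only a crude screen; it will not tell you whether a remaining ORF is a real gene.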

ADD REPLY • link 5.7 years ago by gb ★ 2.2k