Question: Fasta length does not match Ensembl info
rmf (900) wrote, 16 months ago:

I am calculating the total genome length from a FASTA file using the following command:

zcat genome.fa.gz | grep -v ">" | wc | awk '{print $3-$1}'

For yeast, I get 12,157,105, and the Ensembl info indicates exactly 12,157,105, so that adds up.

For human, I get 56,917,651,860, but the Ensembl info indicates 3,609,003,417.

Anyone know why? I must be missing something.

Tags: annotation, fasta

From the ftp://ftp.ensembl.org/pub/release-94/fasta/homo_sapiens/dna/README:

TOPLEVEL
---------
These files contain all sequence regions flagged as toplevel in an Ensembl schema. This includes chromosomes, regions not assembled into chromosomes and N padded haplotype/patch regions.

written 16 months ago by WouterDeCoster (42k)
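The README excerpt above mentions N padding, so one rough way to see how much of a file is padding is to count N and non-N characters separately. This is only a sketch: it assumes a gzipped FASTA with Unix line endings, and the function names and genome.fa.gz are placeholders rather than anything provided by Ensembl.

```shell
# Hypothetical helpers, not part of any Ensembl tooling.

# Count N/n characters in the sequence lines of a gzipped FASTA.
count_n_bases() {
    # tr -cd 'Nn' deletes everything except N and n (newlines included).
    gzip -dc "$1" | grep -v '^>' | tr -cd 'Nn' | wc -c | awk '{print $1}'
}

# Count every other sequence character (newlines excluded).
count_non_n_bases() {
    gzip -dc "$1" | grep -v '^>' | tr -d 'Nn\n' | wc -c | awk '{print $1}'
}

# Example calls (placeholder file name from the question):
# count_n_bases genome.fa.gz
# count_non_n_bases genome.fa.gz
```

The two numbers should add up to the total obtained with the commands discussed in this thread. Each function decompresses the file once, so on a large genome this takes two full passes.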

Ah, right! The Ensembl count for human is based on the primary assembly, and some organisms don't have a primary assembly, just the top-level.

written 16 months ago by rmf (900), modified 16 months ago
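The gap between the two figures is easier to inspect when lengths are reported per sequence rather than as a single total. A minimal sketch, assuming a gzipped FASTA whose header lines start with > followed by the sequence name; fasta_lengths is a hypothetical helper name, and genome.fa.gz is the placeholder from the question:

```shell
# Hypothetical helper: print "<name><TAB><length>" for each FASTA record,
# then a TOTAL line. gzip -dc is equivalent to zcat for .gz files.
fasta_lengths() {
    gzip -dc "$1" | awk '
        /^>/ {
            # A header starts a new record: flush the previous one first.
            if (name != "") print name "\t" len
            name = substr($1, 2)   # first word after ">" is taken as the name
            len = 0
            next
        }
        {
            # Sequence line: awk has already stripped the trailing newline.
            len += length($0)
            total += length($0)
        }
        END {
            if (name != "") print name "\t" len
            print "TOTAL\t" (total + 0)
        }'
}

# Example call, largest entries first:
# fasta_lengths genome.fa.gz | sort -k2,2nr | head
```

Run on a toplevel file, the per-sequence listing should show which regions beyond the chromosomes contribute to the total.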
Nicolas Rosewick (8.6k; Belgium, Brussels) wrote, 16 months ago:

Here you are counting the number of lines.

You should try wc -m, as mentioned in the wc man page:

-m    displays a character count. You cannot specify this option with -c.

Thus:

zcat genome.fa.gz | grep -v ">" | wc -m

Edit: to count only the characters and not the newlines:

zcat genome.fa.gz | grep -v ">" | tr -d '\n' | wc -m

I think wc -m counts the newline characters as well. What we actually want is total_chars - newline_chars = bases.

written 16 months ago by rmf (900), modified 14 months ago
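The arithmetic in the comment above (total characters minus newline characters) can be checked on a tiny made-up FASTA. The sketch below assumes ASCII sequence data with Unix line endings; demo_fasta, count_wc_awk and count_tr_wc are hypothetical names used only for this illustration.

```shell
# Tiny made-up FASTA: two records, 10 bases spread over 3 sequence lines.
demo_fasta() {
    printf '>seq1\nACGT\nAC\n>seq2\nNNNN\n'
}

# Approach from the question: wc prints lines, words and bytes;
# bytes minus lines removes one newline per sequence line.
count_wc_awk() {
    grep -v ">" | wc | awk '{print $3 - $1}'
}

# Approach from the edited answer: delete newlines, then count characters.
# The final awk only strips the padding some wc implementations add.
count_tr_wc() {
    grep -v ">" | tr -d '\n' | wc -m | awk '{print $1}'
}

demo_fasta | count_wc_awk   # prints 10
demo_fasta | count_tr_wc    # prints 10
```

Both pipelines give the same base count here, which suggests the command in the question was already subtracting the newlines correctly.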

I edited my answer accordingly. You have to use tr -d '\n' to remove the newlines.

written 16 months ago by Nicolas Rosewick (8.6k)