Question: Fasta length does not match Ensembl info
9 weeks ago, rmf wrote:

I am calculating the total genome length from a FASTA file using the following command:

zcat genome.fa.gz | grep -v ">" | wc | awk '{print $3-$1}'

For Yeast, I get 12,157,105, and the Ensembl info indicates exactly 12,157,105. So, that adds up.

For Human, I get 56,917,651,860, but the Ensembl info indicates 3,609,003,417.

Anyone know why? I must be missing something.

annotation fasta • 119 views
modified 9 weeks ago by Nicolas Rosewick • written 9 weeks ago by rmf

From the ftp://ftp.ensembl.org/pub/release-94/fasta/homo_sapiens/dna/README:

TOPLEVEL
---------
These files contains all sequence regions flagged as toplevel in an Ensembl schema. This includes chromsomes, regions not assembled into chromosomes and N padded haplotype/patch regions.

written 9 weeks ago by WouterDeCoster

Ah, right! The Ensembl count for human is based on the primary assembly, while the file I downloaded is the toplevel one. And some organisms don't have a primary assembly, just the toplevel.

modified 9 weeks ago • written 9 weeks ago by rmf
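
To see which toplevel records account for the extra length, a per-sequence tally can help. The following is a minimal sketch, not an Ensembl tool: `fasta_lengths` is a hypothetical helper name, and it assumes header lines start with `>` and that the sequence ID is the first whitespace-delimited token of the header.

```shell
# Print "<sequence name><TAB><length>" for every record in a FASTA stream.
# fasta_lengths is a hypothetical helper name, not part of any toolkit.
fasta_lengths() {
  awk '
    /^>/ {                                  # header line starts a new record
      if (name != "") print name "\t" len
      name = substr($1, 2)                  # ID up to the first whitespace
      len = 0
      next
    }
    { sub(/\r$/, ""); len += length($0) }   # sequence line; ignore a stray CR
    END { if (name != "") print name "\t" len }
  '
}

# Usage (file name taken from the question):
#   zcat genome.fa.gz | fasta_lengths | sort -k2,2nr | head
```

Sorting by the second column should put the chromosomes first, with the haplotype/patch regions mentioned in the README listed after them.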
9 weeks ago, Nicolas Rosewick (Belgium, Brussels) wrote:

Here you are subtracting the line count ($1) from the byte count ($3) that wc reports.

You should try wc -m, as mentioned in the wc man page:

-m 
displays a character count. You cannot specify this option with -c.

Thus:

zcat genome.fa.gz | grep -v ">" | wc -m

-

Edit: to count only characters and not newlines:

zcat genome.fa.gz | grep -v ">" | tr -d '\n' | wc -m

modified 9 weeks ago • written 9 weeks ago by Nicolas Rosewick

I think wc -m counts the newline characters as well. What we actually want is total_chars - newline_chars = bases.

modified 8 hours ago • written 9 weeks ago by rmf
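
The two ways of counting discussed in this thread can be compared side by side. This is a sketch under stated assumptions: the function names `count_bases_wc` and `count_bases_tr` are hypothetical, and the input is assumed to be plain ASCII FASTA with Unix line endings, so bytes and characters coincide. Plain `wc` prints lines, words, and bytes, so `$3 - $1` is bytes minus newlines, which should agree with deleting the newlines first and then counting characters.

```shell
# Two hypothetical helpers that read FASTA on stdin and print the base count.

# Approach from the question: wc prints lines, words, bytes; bytes minus
# lines removes exactly one newline per sequence line.
count_bases_wc() {
  grep -v '>' | wc | awk '{print $3 - $1}'
}

# Approach from the answer: delete newlines, then count characters.
# The trailing tr strips the padding some wc implementations add.
count_bases_tr() {
  grep -v '>' | tr -d '\n' | wc -m | tr -d ' '
}

# Usage (file name taken from the question):
#   zcat genome.fa.gz | count_bases_wc
#   zcat genome.fa.gz | count_bases_tr
```

If both functions print the same number for a given file, the mismatch with the Ensembl figure is more likely to come from which sequences the file contains than from how the characters are counted.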

I edited my answer accordingly. You have to use tr -d '\n' to remove the newlines.

written 9 weeks ago by Nicolas Rosewick