Question: Fasta length does not match Ensembl info
12 days ago, rmf wrote:

I am calculating the total genome length from a FASTA file using the following code:

zcat genome.fa.gz | grep -v ">" | wc | awk '{print $3-$1}'

For yeast, I get 12,157,105, and the Ensembl info indicates exactly 12,157,105, so that adds up.

For human, I get 56,917,651,860, but the Ensembl info indicates 3,609,003,417.

Does anyone know why? I must be missing something.

Tags: annotation, fasta

From the ftp://ftp.ensembl.org/pub/release-94/fasta/homo_sapiens/dna/README:

TOPLEVEL
---------
These files contains all sequence regions flagged as toplevel in an Ensembl schema. This includes chromsomes, regions not assembled into chromosomes and N padded haplotype/patch regions.

written 12 days ago by WouterDeCoster

Ah, right! The Ensembl count for human is based on the primary assembly, not the toplevel sequences. And some organisms don't have a primary assembly, just the toplevel one.

written 12 days ago by rmf
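
To see how much each sequence region contributes to a toplevel total, the per-record lengths can be listed. This is only a minimal sketch, assuming a gzipped FASTA file with plain Unix line endings; `per_seq_lengths` is a hypothetical helper name (not an Ensembl tool), and `gzip -dc` is used in place of `zcat` purely for portability.

```shell
# Hypothetical helper: print "name<TAB>length" for every record
# in a gzipped FASTA file given as the first argument.
per_seq_lengths() {
    gzip -dc "$1" | awk '
        /^>/ {
            # New header: flush the previous record, if any.
            if (name != "") print name "\t" len
            name = substr($1, 2)   # first word of the header, without ">"
            len = 0
            next
        }
        { len += length($0) }      # sequence line: add its length
        END { if (name != "") print name "\t" len }'
}
```

For example, `per_seq_lengths genome.fa.gz | sort -k2,2nr | head` would list the longest records first, and summing the second column gives the total for the whole file.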
12 days ago, Nicolas Rosewick (Belgium, Brussels) wrote:

Here you are counting the number of lines.

You should try wc -m, as mentioned in the wc man page:

-m displays a character count. You cannot specify this option with -c.

Thus:

zcat genome.fa.gz | grep -v ">" | wc -m

-

Edit: to count only the characters and not the newlines:

zcat genome.fa.gz | grep -v ">" | tr -d '\n' | wc -m

I think wc -m counts the newline characters as well. What we actually want is total chars - newline chars = bases.

written 12 days ago by rmf

I edited my answer accordingly. You have to use tr -d '\n' to remove the newlines.

written 12 days ago by Nicolas Rosewick
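
The two ways of counting discussed in this thread can be put side by side. With no options, `wc` prints lines, words and bytes, so `$3-$1` is bytes minus newline characters, which should give the same number as stripping newlines with `tr` before counting. The sketch below assumes an ASCII FASTA file with Unix line endings; the function names are hypothetical, and `gzip -dc` stands in for `zcat` for portability.

```shell
# Hypothetical helper: bases = bytes left after header lines and
# newline characters are removed.
count_bases_tr() {
    gzip -dc "$1" | grep -v '^>' | tr -d '\n' | wc -c | tr -d ' '
}

# Hypothetical helper: bases = total bytes minus newline count,
# as in the command from the question.
count_bases_wc() {
    gzip -dc "$1" | grep -v '^>' | wc | awk '{print $3 - $1}'
}
```

Running both on the same file, for example `count_bases_tr genome.fa.gz` and `count_bases_wc genome.fa.gz`, is a quick way to check that they agree. For plain ASCII sequence data `wc -c` and `wc -m` report the same value.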