Question: Fasta length does not match Ensembl info
rmf wrote, 4 months ago:

I am calculating the total genome length from a FASTA file using the following code:

zcat genome.fa.gz | grep -v ">" | wc | awk '{print $3-$1}'

For yeast, I get 12,157,105, and the Ensembl info indicates exactly 12,157,105. So that adds up.

For human, I get 56,917,651,860, but the Ensembl info indicates 3,609,003,417.

Anyone know why? I must be missing something.


From the ftp://ftp.ensembl.org/pub/release-94/fasta/homo_sapiens/dna/README:

TOPLEVEL --------- These files contains all sequence regions flagged as toplevel in an Ensembl schema. This includes chromsomes, regions not assembled into chromosomes and N padded haplotype/patch regions.

Comment written 4 months ago by WouterDeCoster

Ah, right! The Ensembl count for human is based on the primary assembly, not the toplevel file. And some organisms don't have a primary assembly, just the toplevel.

Comment written 4 months ago by rmf
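
To see which sequence regions make up the difference between a toplevel and a primary assembly file, it can help to print the length of each record rather than only the grand total. The following is a minimal sketch on a toy FASTA; the file contents and record names are made up for illustration, and the awk program is just one way to do this.

```shell
# Toy FASTA standing in for a toplevel file (record names are made up).
fasta=$(mktemp)
printf '>1 chromosome\nACGTAC\nGT\n>patch1 scaffold\nNNNN\nAC\n' > "$fasta"

# Print "name<TAB>length" for every record in the file.
lengths=$(awk '
  /^>/ { if (name != "") print name "\t" len   # flush the previous record
         name = substr($1, 2); len = 0; next } # start a new record
       { len += length($0) }                   # add up the sequence lines
  END  { if (name != "") print name "\t" len } # flush the last record
' "$fasta")

# Sum the per-record lengths to get the overall total.
total=$(printf '%s\n' "$lengths" | awk -F '\t' '{ sum += $2 } END { print sum }')

echo "$lengths"
echo "total: $total"
rm -f "$fasta"
```

For a real genome, the same awk program could read from `zcat genome.fa.gz` instead of the temporary file. Note that N characters are counted like any other base here.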
Nicolas Rosewick (Belgium, Brussels) wrote, 4 months ago:

Here you are counting the number of lines.

You should try wc -m, as mentioned in the wc man page:

-m 
displays a character count. You cannot specify this option with -c.

Thus:

zcat genome.fa.gz | grep -v ">" | wc -m

Edit: to count only the characters and not the newlines:

zcat genome.fa.gz | grep -v ">" | tr -d '\n' | wc -m

I think wc -m counts the newline characters as well. What we actually want is total_chars - newline_chars = bases.

Comment written 4 months ago by rmf

I edited my answer accordingly. You have to use tr -d '\n' to remove the newlines.

Comment written 4 months ago by Nicolas Rosewick
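
The difference between the counting approaches discussed in this thread can be checked on a tiny example. This is a sketch on a made-up FASTA with 8 bases spread over 3 sequence lines; it assumes every line ends with a plain Unix newline (no carriage returns).

```shell
# Toy FASTA: two records, 8 bases in total, on 3 sequence lines.
fasta=$(mktemp)
printf '>seq1\nACGT\nAC\n>seq2\nGT\n' > "$fasta"

# wc -m alone also counts the newline ending each sequence line: 8 + 3 = 11.
# (The trailing tr strips the padding some wc implementations add.)
with_newlines=$(grep -v '>' "$fasta" | wc -m | tr -d ' ')

# Deleting the newlines first leaves only the bases: 8.
bases_tr=$(grep -v '>' "$fasta" | tr -d '\n' | wc -m | tr -d ' ')

# The command from the question: bytes ($3) minus lines ($1): 11 - 3 = 8.
bases_awk=$(grep -v '>' "$fasta" | wc | awk '{print $3-$1}')

echo "$with_newlines $bases_tr $bases_awk"   # prints: 11 8 8
rm -f "$fasta"
```

On this input the `tr -d '\n'` pipeline and the `wc | awk` subtraction agree, while plain `wc -m` overcounts by the number of sequence lines.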