How is contig length calculated ?
1
0
Entering edit mode
22 months ago
Alewa ▴ 170

I'm confused how the contig length is calculated. should it not be equal to the total number of nucleotides in the Mycobacterium tuberculosis genome? seems that's not the case given the demo below

Here's example

Give me snippet of Mycobacterium tuberculosis reference genome

$ esearch -db nuccore -query 'Mycobacterium tuberculosis H37Rv[Organism] AND NC_000962.3[ACCN]' | efetch -format fasta | head

>NC_000962.3 Mycobacterium tuberculosis H37Rv, complete genome
TTGACCGATGACCCCGGTTCAGGCTTCACCACAGTGTGGAACGCGGTCGTCTCCGAACTTAACGGCGACC
CTAAGGTTGACGACGGACCCAGCAGTGATGCTAATCTCAGCGCTCCGCTGACCCCTCAGCAAAGGGCTTG
GCTCAATCTCGTCCAGCCATTGACCATCGTCGAGGGGTTTGCTCTGTTATCCGTGCCGAGCAGCTTTGTC
CAAAACGAAATCGAGCGCCATCTGCGGGCCCCGATTACCGACGCTCTCAGCCGCCGACTCGGACATCAGA
TCCAACTCGGGGTCCGCATCGCTCCGCCGGCGACCGACGAAGCCGACGACACTACCGTGCCGCCTTCCGA
AAATCCTGCTACCACATCGCCAGACACCACAACCGACAACGACGAGATTGATGACAGCGCTGCGGCACGG
GGCGATAACCAGCACAGTTGGCCAAGTTACTTCACCGAGCGCCCGCACAATACCGATTCCGCTACCGCTG
GCGTAACCAGCCTTAACCGTCGCTACACCTTTGATACGTTCGTTATCGGCGCCTCCAACCGGTTCGCGCA
CGCCGCCGCCTTGGCGATCGCAGAAGCACCCGCCCGCGCTTACAACCCCCTGTTCATCTGGGGCGAGTCC

how many characters (Nucleotides) are in the reference genome

$ esearch -db nuccore -query 'Mycobacterium tuberculosis H37Rv[Organism] AND NC_000962.3[ACCN]' | efetch -format fasta | grep -v "^>" | wc -c
4474554

To check contig lemgth, create fasta dictionary and open it to check contig length

    $ esearch -db nuccore -query 'Mycobacterium tuberculosis H37Rv[Organism] AND NC_000962.3[ACCN]' | efetch -format fasta > tb_Ref_NC_000962_3.fasta 
    $ samtools faidx /local/workdir/sam/results/vcf2fasta/tb_Ref_NC_000962_3.fasta
    $ cat /local/workdir/sam/results/vcf2fasta/tb_Ref_NC_000962_3.dict
    @HD     VN:1.6
    @SQ     SN:NC_000962.3  LN:4411532      M5:57b12ff2773c5fd3a0f879972f176e50     UR:file:/local/workdir/sam/results/vcf2fasta/tb_Ref_NC_000962_3.fasta
tuberculosis Contig assembly Genome Mycobacterium • 702 views
ADD COMMENT
3
Entering edit mode
22 months ago
GenoMax 142k

You need to remove formatting characters that make the fasta sequence screen readable.

$ esearch -db nuccore -query 'Mycobacterium tuberculosis H37Rv[Organism] AND NC_000962.3[ACCN]' | efetch -format fasta | grep -v "^>" | tr -d '[:space:]' | wc -c
4411532
ADD COMMENT
0
Entering edit mode

that clarifies! thanks GenoMax

ADD REPLY

Login before adding your answer.

Traffic: 1441 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6