Gffread extracts the wrong nucleotides from genome
3 months ago

Long Story Short: gffread is extracting the wrong sequences from the genome but providing the correct length. I don't understand this at all.

Long Story Long:

I have a set of unannotated genes that I need to BLAST to find similar genes. I need the sequence of each gene, so I decided to run gffread on my genome and GFF3 file (a coral genome). I would like the genes exported as a FASTA file.

To do this I used the following code:

gffread -w Output.fa -g Reference_Genome.fa Reference_Genome_GFF3.gff3 

When I observe the output I notice that the lengths are correct (the CDS lengths match up as expected from the gff3 file). However, the nucleotides from the sequences are incorrect! What could I have done wrong to get this result?

For example: gene 21533 (g21533) is on sequence name "1", from the 22,321st nucleotide to the 27,293rd nucleotide (i.e., 4,973 nucleotides long). Within this gene there are two CDS segments: the first from 22,321 to 22,608 (288 nucleotides) and the second from 26,301 to 27,293 (993 nucleotides). Gffread outputs a sequence with ATG at the start and TGA at the end. The length is correct (1,281 nucleotides). However, if I look at the raw genome sequence of sequence name "1", the region starting at position 22,321 begins with CAT, and the region ending at position 27,293 ends with ATG.
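The length arithmetic above can be checked with a short Python sketch. It assumes only that GFF3 coordinates are 1-based and inclusive; the function and variable names are illustrative and not part of gffread.

```python
# GFF3 coordinates are 1-based and inclusive, so length = end - start + 1.
def feature_length(start: int, end: int) -> int:
    return end - start + 1

# Coordinates quoted in the post for g21533 on sequence "1".
gene = (22321, 27293)
cds_segments = [(22321, 22608), (26301, 27293)]

gene_length = feature_length(*gene)
cds_lengths = [feature_length(start, end) for start, end in cds_segments]
spliced_length = sum(cds_lengths)

print(gene_length, cds_lengths, spliced_length)  # 4973 [288, 993] 1281
```

The spliced length (1,281) is the sum of the two CDS segments, which is why the output is much shorter than the 4,973-nucleotide gene span.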

I used grep to find the sequence that gffread output for gene 21533. It began at position ~449,832,178, on a completely different sequence name. This makes no sense to me. Could anyone help me understand? Below is an image of the GFF header for the gene I am discussing.

[Image: GFF entry for gene g21533]

Here is the sequence gffread identified:

ATGTGGTCTGAGAACTTAGCTAATGGTCGGATGGTTCTCTTTGTGTGGTTGTTGGCCAGATTTACCACCGTTAACCTTTCCTACGGACGTTTTGCTACAGAACTGACTACCGGCCTCGCTGTGAGCCGCCATCTTGGTCCAGCTAATTTAACTAATTGCCCAAGAGTTGTCTACAGCAGTTCTCAGTTGAGGTTAATTAGAAAGAATCAACCCAAGCAGCATATAGACAGTGATCTATGGAAAAATCTCGGCGATTTGGGAATCCGTAAAAGGTTTAGAGGCAAACGTGGAGGTCGAAACAGGAATTCAATCGGTATTTCTGTGGCCCAGAGCACAGTGAGCCCGCATTCCATTGTGATAACGAACGGTGCACCAACAATCAGGAATCCGGCCGCCGTCCCCACTCTTTTGGTGAGCAACGCTCGCTCACTTGCTCCGAAAATCAGCGAGCTCCAATGTGTTGCGATTCAGAACTCAGCAGACATTGTTTGTATTACGGAGACGTGGCTGACGGATAATATCATAAACGATGCGGTTGTGTTGAGTGGCTATAATCTATTCCGGAAAGACCGTGGATCGCGTGGTGGTGGAATCACTGTTTACATTAGTTCCTCTATTCGAGCGAAGCGGCTTGAAGACCAAGAGTTGTGCGAGGCAGTGTCAGAGTCACTGTGGATTGAGCTGAGACCTACCAGACTTCCTCGGCCGATATCTGCCGTGCTTATTGGGACAGTTTACCATCCCCCACATGCCAAAGCAGAGGACAACAACAGGCTGCGCGATCACATTCAAGAAGTCGTGGACTCACGTCTCCTCCAATACCCTGATTGCCTGGTCTGCGTTGTGGGTGATTTTAACCCAGCATCTACGAATTTTTCTGCGTCGGCCCTCAAACAGTGTTGTGGCCTTACACAAATTGTGCGAATAAAAACACGTGATACTGGAATACTCGATTGGTGCCTTACCAATAAACCCAAGTGTCTGTCTACCCCGCAATTTTACTGTGCCGCTGAACTCTACTCTCCGGCCCATTTTTCGCTGAAAGGTGAAGTCGAGCAATTGCAGCGGCGAGCCAAATTAAAAGGACGTAAGAACGGGGGTTTAGAGATGGGTCCTGGTGTTATAGAGAATAGATTGGAGAAATATCTTTTTACCGTAGACTGCAAAGAAACAGCGAGGTTTTGGAGTCACATCGCTTTATATAGCGCTTCATTTCGAAATAAGCCCAAAACAATTGGTGCTGCGTTTTCACTCTGCCGAAGAGATCCAGTGATTGAATGA

Here is the actual sequence of the entire gene:

CATTCAATCACTGGATCTCTTCGGCAGAGTGAAAACGCAGCACCAATTGTTTTGGGCTTATTTCGAAATGAAGCGCTATATAAAGCGATGTGACTCCAAAACCTCGCTGTTTCTTTGCAGTCTACGGTAAAAAGATATTTCTCCAATCTATTCTCTATAACACCAGGACCCATCTCTAAACCCCCGTTCTTACGTCCTTTTAATTTGGCTCGCCGCTGCAATTGCTCGACTTCACCTTTCAGCGAAAAATGGGCCGGAGAGTAGAGTTCAGCGGCACAGTAAAATTGCTATTTATGAGCAGCATCGACAAAATTCTTCATTTAAGCGGTCGAAAAAGTACAATAATATTTAACAATAGCCCATGAGTCGAAGCCGAATGGGCTATTGTCCCGTGGCCCTTGAAGGCGAATTGTTTTAGTATCACCCAACTAGTCGGACAGAAAAGGCAATAATAAAGTTAGCAAATGCAAGTTGAAGAAATATTTATTTGGGAATAAAACGAAAGAAAGAGTCACGCGTTTCGCTACTCGAGGACTATCGGCACTAATAGTCCTCTAGTAGCGTAGCCAATCAAAATGCAGGATTTGCATTAGTCCACTAGTTGGGTGATACTAAAGGTACTTAGTTGAAGTATTACAGGAACCCATGAGTAGGAGCAAGCAACTTCCATTTATTCTTGTCACGGCGTTTTCACAGACCGATTTATTTTTAGATTGAATTTTCTGCGAATGAGGCTCCCGCAGGAGCCCGATGACCAATCACAGGAAACAAACTTGACGTCATCGTGTCACCGAACCGGAACTGCCTTTGTTGTTTTGCGGCAAAGGTAGTCAGATTGGTCTGTAAAAATGCCGTGACATAAGTTTAATATGGGAGTTGCTCGCTCCTGAGCAGTGAAAATCCTGTTTAGATGGTATAGGGATGGAATCTTGAATGACAAACACACAGGAAGCACAGCAGATCACAAGAGCTATTATTACCTAAACTTATCACATTGTAATCAAACGAACAACCGTAAAGAGGTGTCATGTCTTTGGCAGGGAGATGACAGCTGTACAACCCGGCATACAAACAAACTACTTTCGGAGATGAAGTGTAAAGCATCGCCACACTATCCAACTGACCGTGTGTATAGTGTGGCAACTTAAGTCCAGCTCAGTCTCTCTCCCTTGTGATTACTCAGTGAGAGCAACCCTAAGATCGTAGTAGCAAAACAACCGTAAGTGTTTTTTTTTTTTTCTGTGATCATGCACATTTGGTTTGAAGGTTGACTGTTCTTACTTAGAACCGAAAATTTGCAGTCTTAGTGATCTCTATTTGTTGTTTATTTTTCATATAGCAAGGAGGCTAAGAACATAAACTTAAAATTGTCTTAAATATAATAGTGAGACGGAAAGCAGAAGAGGACATTAGGAACCTCCATAAGGTGAAAAACAAAAAGTATTTTTTAGTTTCATATGCTGAATGCCAGGCTCTTGGGAGTTTTAAACCTTAAAACTTAAAGCACCAAAGAGGTCTAAATAAGTCATCCAACTTTGGATAGCACGCGCACTCTCATTTGTCAATAGCTGTGTTTAGATGGGAGTGATAGTATGGAACACGGCTGTGACATCACACGAATTTTGATTGCTTATGTGTTGTCAGACGCGCGTTTTGATCGGCTAGTAGGAAATATGAGCGTGTATCAAGAACATCTGTTTCAATGATTTGATTGATTGATTTGTTTATTAATCACTTTCGCAGCCACTGGCTGAATTACAGTGAAGGTCAACAATAAATTTATACACAAGAAAACAGTTCCATAATAAATTACTAGATAATAAATAATACTAATAGGTATATATATATTATCTCAATGTTATAAGGATATAACCAAGGACAAAGACTAGTTAATGAGCGCAGGCGCATATTTCACGCTGACAAGACCGCCAAAGCGGTCGGTTTTTGTCGCCATAGTGGTATGTGCTTTTGGTCTCAAATTATAGCGAGAGGACGGTCCG
ATGATCCGTGAATGTATAAGAGGGTGCAGGGGGTTCCCAGGAATGATTTTGCTTACAAACTTGGCACATGCCACATCCCTGCGCTCCTCTAGGGTGGAAATCCCAGCCTTAGCCAACGCGCTTGCATATGGTAGGAAAGGAAATATGATTGCCAGAGCGCGCTTCTGAACCCTTTCAAGATCATGTGACAGGTATTTGGGAAGGTTGGCGAACACGACACAGGCGTATTCTAAAATAGAACGCACAAGAGAACAGTAGATGCATACCAGGTCTACTGGCGAAACACGGCATTTCTTGAGTTGTCTGATTGCGTAAAGACGCCTGTTAGCCTTCTTCACCACGTACTCGCAGTGGACAGCCCATGAGAGATCGTTAGAGATGTACACCCCGAGTAACTTAAATGATGTCACTTCCTCAATATAGATGCCACCACTGGCAATGGGCTGAAGTTCACAGCTATTGTAGTGTAGAAAGCTAACACGCATCTCCTTGCACTTTTTCGGATTAAGCTGCATGTTGTTATTGCTAGCAAATTCTTGAACATCCGATACAATATGACACATTACTGATGGTGAATTTCTAGGTATTACCTCCAATAGAGTTAGGTCGTCAACAAATTTTGCTCGAGGCCTCCAATCATTAACAAGGTCATTTACCATTATTGCAAAAAGTAAGGGCGCTAATTTCGTGCCTTGTGGGATACCCCCATTAAGATGTTTTGGCAGGGATGAAAACGAGCCAATTTGTACGAACTGTGACCTCCCTAGTAAGAACGCGGCAACCCACCTTACTAGGCTAGGGTGTAGGTCGAAACAGGATAGCTTTGACAATAAAATCTTATGATCAATTAAGTCAAAGCCTTTTTTGAAGTCGGCGAAAAAGAACCGGATAGTACAGTTGCCTCTGTCTAGTGCTTCAAGGGCTAAGTGGAGCAAATAGACAAGGGCATGATCTGTAGATCGCCCGGCAACTGCAAATTGGTTACTATCTAGTTCTGGGACCACTTTAGGTAGCATTCTAGTCAGGGTAAAACTTTCTAATACCTTGGAAATTTGGCATGTTAGTGATATAGGTCTTAGGTCACTTTCTATAGCTCTTGGTGGTCTTTCCTTGGGGATGGGAGAGACAGCAGCCGACTTGAGTAGAGGTGGGAGGTAGCCTTCACATAATGATGTATTGTAAATATCAGCGATAACTGGGGCGAGTTCAAATGAGAAGGTCTTCAGAATGATGTTTGGGATCCCATCAGGGCCTCCTGCTTTTCTCAGCTTGATCGACCGTAAGGCGATATCAGCTTCCCGCGCTGTTACAAACAGGTCAGCTGGTACCTCAGACACGTCCACAGGGATCCCAAGTACATCGTCAACAGTGAGAGGATCGAACGCTGACGTTAGGCTACAAAAGAAGTTGTTAATTCTTTCACATAGCGAGGCAACTGACTCCGTTTCCCCAATGAGTTGGTGAAACCACTGACTATCAGCACTCGAGACACCAGATAGGTTTTTCACCTCTTTCCACCATCGCGCGACGTTTGTTTCCTTTAAAGTTTTCACCTTGGTCTCATAGAACGTCTCCTTGCATTCTTTCATCAGCCTCTGAACTTTATTCCTCCATATCTTAAAGAGAACGGATTCCTTCCCATACTTAGACAAAAATCGTTGACGTTTTGATATTGCAGACTTAATTGCAACAGTGATCCAAGGCTTGTCTGTCGCGTGCATGCGCACTGATCGTACAGGAAGAAATTTGTCGATGGCTTCTGATATCCTACTGTAAAACCACTCAAACTTCTCTTTGCATGATGTTAAGTTGTACAGCTCGTCCCAAGAGTATGATGTTATCCATTGGCCGAATGATCGAATATTGCTGGCCCTTGTGTCTCTTTTGGTAATGGTCGTCTTGGAAGGTTTTAGAGGTGGTTGATTATTCTTAACCAGGAAACAATAATGGTCACTGGTCCCAAGTTTTGGTAGCTGGACCGGGGTAGACAGACACTTGGG
TTTATTGGTAAGGCACCAATCGAGTATTCCAGTATCACGTGTTTTTATTCGCACAATTTGTGTAAGGCCACAACACTGTTTGAGGGCCGACGCAGAAAAATTCGTAGATGCTGGGTTAAAATCACCCACAACGCAGACCAGGCAATCAGGGTATTGGAGGAGACGTGAGTCCACGACTTCTTGAATGTGATCGCGCAGCCTGTTGTTGTCCTCTGCTTTGGCATGTGGGGGATGGTAAACTGTCCCAATAAGCACGGCAGATATCGGCCGAGGAAGTCTGGTAGGTCTCAGCTCAATCCACAGTGACTCTGACACTGCCTCGCACAACTCTTGGTCTTCAAGCCGCTTCGCTCGAATAGAGGAACTAATGTAAACAGTGATTCCACCACCACGCGATCCACGGTCTTTCCGGAATAGATTATAGCCACTCAACACAACCGCATCGTTTATGATATTATCCGTCAGCCACGTCTCCGTAATACAAACAATGTCTGCTGAGTTCTGAATCGCAACACATTGGAGCTCGCTGATTTTCGGAGCAAGTGAGCGAGCGTTGCTCACCAAAAGAGTGGGGACGGCGGCCGGATTCCTGATTGTTGGTGCACCGTTCGTTATCACAATGGAATGCGGGCTCACTGTGCTCTGGGCCACAGAAATACCGATTGAATTCCTGTTTCGACCTCCACGTTTGCCTCTAAACCTTTTACGGATTCCCAAATCGCCGAGATTTTTCCATAGATCACTGTCTATATGCTGCTTGGGTTGATTCTTTCTAATTAACCTCAACTGAGAACTGCTGTAGACAACTCTTGGGCAATTAGTTAAATTAGCTGGACCAAGATGGCGGCTCACAGCGAGGCCGGTAGTCAGTTCTGTAGCAAAACGTCCGTAGGAAAGGTTAACGGTGGTAAATCTGGCCAACAACCACACAAAGAGAACCATCCGACCATTAGCTAAGTTCTCAGACCACATG

3 months ago

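Figured it out! I was not taking into account that the gene is on the - strand. For minus-strand features, gffread writes the reverse complement of the reference at those coordinates, so the output reads in the opposite direction from the raw genome sequence: the ATG at the start of the output comes from the far end of the gene's coordinate range, not from position 22,321.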
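For anyone who wants to confirm this on their own data, here is a minimal Python sketch using the sequences pasted in the question. It takes the first 21 bases of the gffread output and the last 24 bases of the pasted genomic excerpt, then checks whether the reverse complement of the former occurs in the latter. The helper name `reverse_complement` is my own and not part of gffread. In a GFF3 file the strand is given in the seventh column, so it is worth checking that column first.

```python
# Map each base to its complement; lowercase is handled for soft-masked genomes.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

# First 21 bases of the gffread output pasted in the question.
output_start = "ATGTGGTCTGAGAACTTAGCT"

# Last 24 bases of the pasted genomic (forward-strand) excerpt.
genomic_end = "TTAGCTAAGTTCTCAGACCACATG"

rc = reverse_complement(output_start)

# A substring test is used rather than an exact suffix match, because the
# pasted excerpt's boundaries may not line up exactly with the GFF coordinates.
print(reverse_complement("ATG"))  # CAT
print(rc in genomic_end)          # True
```

So the start of the gffread output is the reverse complement of the end of the forward-strand region, which is what you would expect for a gene annotated on the - strand.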
