Long Story Short: gffread is extracting the wrong sequences from the genome but providing the correct length. I don't understand this at all.
Long Story Long:
I have a set of unannotated genes that I need to BLAST to find similar genes. To do that I need the sequence of each gene, so I decided to use gffread with my genome FASTA and GFF3 file (a coral genome). I would like the genes exported as a FASTA file.
To do this I used the following command:
gffread -w Output.fa -g Reference_Genome.fa Reference_Genome_GFF3.gff3
When I inspect the output, the lengths are correct (the CDS lengths match what I expect from the GFF3 file). However, the nucleotide sequences themselves are incorrect! What could I have done wrong to get this result?
For example: gene 21533 (g21533) is on sequence "1" from nucleotide 22,321 to nucleotide 27,293 (i.e., 4,973 nucleotides long). Within this gene there are two CDS features: the first from 22,321 to 22,608 (288 nucleotides) and the second from 26,301 to 27,293 (993 nucleotides). gffread outputs a sequence with ATG at the start and TGA at the end, and its length is correct (1,281 nucleotides). However, when I look at the raw genome, sequence "1" starts with CAT at position 22,321 and ends with ATG at position 27,293.
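To double-check the lengths quoted above, here is a minimal Python sketch of the arithmetic. It relies on GFF3 coordinates being 1-based and inclusive; the function and variable names are just for illustration.

```python
# GFF3 coordinates are 1-based and inclusive, so length = end - start + 1.
def feature_length(start, end):
    return end - start + 1

# Coordinates of g21533 on sequence "1", as quoted in the post.
gene = (22321, 27293)
cds = [(22321, 22608), (26301, 27293)]

gene_len = feature_length(*gene)
cds_lens = [feature_length(s, e) for s, e in cds]

print(gene_len)        # 4973
print(cds_lens)        # [288, 993]
print(sum(cds_lens))   # 1281
```

So the 1,281 nucleotides reported by gffread are exactly the sum of the two CDS features, not the full 4,973-nucleotide gene span.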
I used grep to find the sequence that gffread output for gene 21533. It began at position ~449,832,178, on a completely different sequence name. This makes no sense to me. Could anyone help me understand? Below is an image of the GFF header for the gene I am discussing.
Here is the sequence gffread identified:
ATGTGGTCTGAGAACTTAGCTAATGGTCGGATGGTTCTCTTTGTGTGGTTGTTGGCCAGATTTACCACCGTTAACCTTTCCTACGGACGTTTTGCTACAGAACTGACTACCGGCCTCGCTGTGAGCCGCCATCTTGGTCCAGCTAATTTAACTAATTGCCCAAGAGTTGTCTACAGCAGTTCTCAGTTGAGGTTAATTAGAAAGAATCAACCCAAGCAGCATATAGACAGTGATCTATGGAAAAATCTCGGCGATTTGGGAATCCGTAAAAGGTTTAGAGGCAAACGTGGAGGTCGAAACAGGAATTCAATCGGTATTTCTGTGGCCCAGAGCACAGTGAGCCCGCATTCCATTGTGATAACGAACGGTGCACCAACAATCAGGAATCCGGCCGCCGTCCCCACTCTTTTGGTGAGCAACGCTCGCTCACTTGCTCCGAAAATCAGCGAGCTCCAATGTGTTGCGATTCAGAACTCAGCAGACATTGTTTGTATTACGGAGACGTGGCTGACGGATAATATCATAAACGATGCGGTTGTGTTGAGTGGCTATAATCTATTCCGGAAAGACCGTGGATCGCGTGGTGGTGGAATCACTGTTTACATTAGTTCCTCTATTCGAGCGAAGCGGCTTGAAGACCAAGAGTTGTGCGAGGCAGTGTCAGAGTCACTGTGGATTGAGCTGAGACCTACCAGACTTCCTCGGCCGATATCTGCCGTGCTTATTGGGACAGTTTACCATCCCCCACATGCCAAAGCAGAGGACAACAACAGGCTGCGCGATCACATTCAAGAAGTCGTGGACTCACGTCTCCTCCAATACCCTGATTGCCTGGTCTGCGTTGTGGGTGATTTTAACCCAGCATCTACGAATTTTTCTGCGTCGGCCCTCAAACAGTGTTGTGGCCTTACACAAATTGTGCGAATAAAAACACGTGATACTGGAATACTCGATTGGTGCCTTACCAATAAACCCAAGTGTCTGTCTACCCCGCAATTTTACTGTGCCGCTGAACTCTACTCTCCGGCCCATTTTTCGCTGAAAGGTGAAGTCGAGCAATTGCAGCGGCGAGCCAAATTAAAAGGACGTAAGAACGGGGGTTTAGAGATGGGTCCTGGTGTTATAGAGAATAGATTGGAGAAATATCTTTTTACCGTAGACTGCAAAGAAACAGCGAGGTTTTGGAGTCACATCGCTTTATATAGCGCTTCATTTCGAAATAAGCCCAAAACAATTGGTGCTGCGTTTTCACTCTGCCGAAGAGATCCAGTGATTGAATGA
Here is the actual sequence of the entire gene:
CATTCAATCACTGGATCTCTTCGGCAGAGTGAAAACGCAGCACCAATTGTTTTGGGCTTATTTCGAAATGAAGCGCTATATAAAGCGATGTGACTCCAAAACCTCGCTGTTTCTTTGCAGTCTACGGTAAAAAGATATTTCTCCAATCTATTCTCTATAACACCAGGACCCATCTCTAAACCCCCGTTCTTACGTCCTTTTAATTTGGCTCGCCGCTGCAATTGCTCGACTTCACCTTTCAGCGAAAAATGGGCCGGAGAGTAGAGTTCAGCGGCACAGTAAAATTGCTATTTATGAGCAGCATCGACAAAATTCTTCATTTAAGCGGTCGAAAAAGTACAATAATATTTAACAATAGCCCATGAGTCGAAGCCGAATGGGCTATTGTCCCGTGGCCCTTGAAGGCGAATTGTTTTAGTATCACCCAACTAGTCGGACAGAAAAGGCAATAATAAAGTTAGCAAATGCAAGTTGAAGAAATATTTATTTGGGAATAAAACGAAAGAAAGAGTCACGCGTTTCGCTACTCGAGGACTATCGGCACTAATAGTCCTCTAGTAGCGTAGCCAATCAAAATGCAGGATTTGCATTAGTCCACTAGTTGGGTGATACTAAAGGTACTTAGTTGAAGTATTACAGGAACCCATGAGTAGGAGCAAGCAACTTCCATTTATTCTTGTCACGGCGTTTTCACAGACCGATTTATTTTTAGATTGAATTTTCTGCGAATGAGGCTCCCGCAGGAGCCCGATGACCAATCACAGGAAACAAACTTGACGTCATCGTGTCACCGAACCGGAACTGCCTTTGTTGTTTTGCGGCAAAGGTAGTCAGATTGGTCTGTAAAAATGCCGTGACATAAGTTTAATATGGGAGTTGCTCGCTCCTGAGCAGTGAAAATCCTGTTTAGATGGTATAGGGATGGAATCTTGAATGACAAACACACAGGAAGCACAGCAGATCACAAGAGCTATTATTACCTAAACTTATCACATTGTAATCAAACGAACAACCGTAAAGAGGTGTCATGTCTTTGGCAGGGAGATGACAGCTGTACAACCCGGCATACAAACAAACTACTTTCGGAGATGAAGTGTAAAGCATCGCCACACTATCCAACTGACCGTGTGTATAGTGTGGCAACTTAAGTCCAGCTCAGTCTCTCTCCCTTGTGATTACTCAGTGAGAGCAACCCTAAGATCGTAGTAGCAAAACAACCGTAAGTGTTTTTTTTTTTTTCTGTGATCATGCACATTTGGTTTGAAGGTTGACTGTTCTTACTTAGAACCGAAAATTTGCAGTCTTAGTGATCTCTATTTGTTGTTTATTTTTCATATAGCAAGGAGGCTAAGAACATAAACTTAAAATTGTCTTAAATATAATAGTGAGACGGAAAGCAGAAGAGGACATTAGGAACCTCCATAAGGTGAAAAACAAAAAGTATTTTTTAGTTTCATATGCTGAATGCCAGGCTCTTGGGAGTTTTAAACCTTAAAACTTAAAGCACCAAAGAGGTCTAAATAAGTCATCCAACTTTGGATAGCACGCGCACTCTCATTTGTCAATAGCTGTGTTTAGATGGGAGTGATAGTATGGAACACGGCTGTGACATCACACGAATTTTGATTGCTTATGTGTTGTCAGACGCGCGTTTTGATCGGCTAGTAGGAAATATGAGCGTGTATCAAGAACATCTGTTTCAATGATTTGATTGATTGATTTGTTTATTAATCACTTTCGCAGCCACTGGCTGAATTACAGTGAAGGTCAACAATAAATTTATACACAAGAAAACAGTTCCATAATAAATTACTAGATAATAAATAATACTAATAGGTATATATATATTATCTCAATGTTATAAGGATATAACCAAGGACAAAGACTAGTTAATGAGCGCAGGCGCATATTTCACGCTGACAAGACCGCCAAAGCGGTCGGTTTTTGTCGCCATAGTGGTATGTGCTTTTGGTCTCAAATTATAGCGAGAGGACGGTCCG
ATGATCCGTGAATGTATAAGAGGGTGCAGGGGGTTCCCAGGAATGATTTTGCTTACAAACTTGGCACATGCCACATCCCTGCGCTCCTCTAGGGTGGAAATCCCAGCCTTAGCCAACGCGCTTGCATATGGTAGGAAAGGAAATATGATTGCCAGAGCGCGCTTCTGAACCCTTTCAAGATCATGTGACAGGTATTTGGGAAGGTTGGCGAACACGACACAGGCGTATTCTAAAATAGAACGCACAAGAGAACAGTAGATGCATACCAGGTCTACTGGCGAAACACGGCATTTCTTGAGTTGTCTGATTGCGTAAAGACGCCTGTTAGCCTTCTTCACCACGTACTCGCAGTGGACAGCCCATGAGAGATCGTTAGAGATGTACACCCCGAGTAACTTAAATGATGTCACTTCCTCAATATAGATGCCACCACTGGCAATGGGCTGAAGTTCACAGCTATTGTAGTGTAGAAAGCTAACACGCATCTCCTTGCACTTTTTCGGATTAAGCTGCATGTTGTTATTGCTAGCAAATTCTTGAACATCCGATACAATATGACACATTACTGATGGTGAATTTCTAGGTATTACCTCCAATAGAGTTAGGTCGTCAACAAATTTTGCTCGAGGCCTCCAATCATTAACAAGGTCATTTACCATTATTGCAAAAAGTAAGGGCGCTAATTTCGTGCCTTGTGGGATACCCCCATTAAGATGTTTTGGCAGGGATGAAAACGAGCCAATTTGTACGAACTGTGACCTCCCTAGTAAGAACGCGGCAACCCACCTTACTAGGCTAGGGTGTAGGTCGAAACAGGATAGCTTTGACAATAAAATCTTATGATCAATTAAGTCAAAGCCTTTTTTGAAGTCGGCGAAAAAGAACCGGATAGTACAGTTGCCTCTGTCTAGTGCTTCAAGGGCTAAGTGGAGCAAATAGACAAGGGCATGATCTGTAGATCGCCCGGCAACTGCAAATTGGTTACTATCTAGTTCTGGGACCACTTTAGGTAGCATTCTAGTCAGGGTAAAACTTTCTAATACCTTGGAAATTTGGCATGTTAGTGATATAGGTCTTAGGTCACTTTCTATAGCTCTTGGTGGTCTTTCCTTGGGGATGGGAGAGACAGCAGCCGACTTGAGTAGAGGTGGGAGGTAGCCTTCACATAATGATGTATTGTAAATATCAGCGATAACTGGGGCGAGTTCAAATGAGAAGGTCTTCAGAATGATGTTTGGGATCCCATCAGGGCCTCCTGCTTTTCTCAGCTTGATCGACCGTAAGGCGATATCAGCTTCCCGCGCTGTTACAAACAGGTCAGCTGGTACCTCAGACACGTCCACAGGGATCCCAAGTACATCGTCAACAGTGAGAGGATCGAACGCTGACGTTAGGCTACAAAAGAAGTTGTTAATTCTTTCACATAGCGAGGCAACTGACTCCGTTTCCCCAATGAGTTGGTGAAACCACTGACTATCAGCACTCGAGACACCAGATAGGTTTTTCACCTCTTTCCACCATCGCGCGACGTTTGTTTCCTTTAAAGTTTTCACCTTGGTCTCATAGAACGTCTCCTTGCATTCTTTCATCAGCCTCTGAACTTTATTCCTCCATATCTTAAAGAGAACGGATTCCTTCCCATACTTAGACAAAAATCGTTGACGTTTTGATATTGCAGACTTAATTGCAACAGTGATCCAAGGCTTGTCTGTCGCGTGCATGCGCACTGATCGTACAGGAAGAAATTTGTCGATGGCTTCTGATATCCTACTGTAAAACCACTCAAACTTCTCTTTGCATGATGTTAAGTTGTACAGCTCGTCCCAAGAGTATGATGTTATCCATTGGCCGAATGATCGAATATTGCTGGCCCTTGTGTCTCTTTTGGTAATGGTCGTCTTGGAAGGTTTTAGAGGTGGTTGATTATTCTTAACCAGGAAACAATAATGGTCACTGGTCCCAAGTTTTGGTAGCTGGACCGGGGTAGACAGACACTTGGG
TTTATTGGTAAGGCACCAATCGAGTATTCCAGTATCACGTGTTTTTATTCGCACAATTTGTGTAAGGCCACAACACTGTTTGAGGGCCGACGCAGAAAAATTCGTAGATGCTGGGTTAAAATCACCCACAACGCAGACCAGGCAATCAGGGTATTGGAGGAGACGTGAGTCCACGACTTCTTGAATGTGATCGCGCAGCCTGTTGTTGTCCTCTGCTTTGGCATGTGGGGGATGGTAAACTGTCCCAATAAGCACGGCAGATATCGGCCGAGGAAGTCTGGTAGGTCTCAGCTCAATCCACAGTGACTCTGACACTGCCTCGCACAACTCTTGGTCTTCAAGCCGCTTCGCTCGAATAGAGGAACTAATGTAAACAGTGATTCCACCACCACGCGATCCACGGTCTTTCCGGAATAGATTATAGCCACTCAACACAACCGCATCGTTTATGATATTATCCGTCAGCCACGTCTCCGTAATACAAACAATGTCTGCTGAGTTCTGAATCGCAACACATTGGAGCTCGCTGATTTTCGGAGCAAGTGAGCGAGCGTTGCTCACCAAAAGAGTGGGGACGGCGGCCGGATTCCTGATTGTTGGTGCACCGTTCGTTATCACAATGGAATGCGGGCTCACTGTGCTCTGGGCCACAGAAATACCGATTGAATTCCTGTTTCGACCTCCACGTTTGCCTCTAAACCTTTTACGGATTCCCAAATCGCCGAGATTTTTCCATAGATCACTGTCTATATGCTGCTTGGGTTGATTCTTTCTAATTAACCTCAACTGAGAACTGCTGTAGACAACTCTTGGGCAATTAGTTAAATTAGCTGGACCAAGATGGCGGCTCACAGCGAGGCCGGTAGTCAGTTCTGTAGCAAAACGTCCGTAGGAAAGGTTAACGGTGGTAAATCTGGCCAACAACCACACAAAGAGAACCATCCGACCATTAGCTAAGTTCTCAGACCACATG
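One thing that may be worth ruling out is strand. grep only matches a sequence as written, so a feature annotated on the minus strand (column 7 of the GFF3) would not be found at its annotated position with a plain search. Below is a minimal Python sketch of that check, using the first 20 bases of the gffread output and the last 31 bases of the genomic region pasted above. The helper name `reverse_complement` is my own, and whether g21533 is actually on the minus strand would still need to be confirmed from the GFF3 itself.

```python
# Translation table for complementing DNA (upper and lower case, plus N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]

# First 20 bases of the sequence gffread wrote for g21533.
gffread_start = "ATGTGGTCTGAGAACTTAGC"
# Last 31 bases of the genomic region pasted above for the same gene.
genome_region_end = "CCGACCATTAGCTAAGTTCTCAGACCACATG"

rc = reverse_complement(gffread_start)
found_on_plus = gffread_start in genome_region_end
found_on_minus = rc in genome_region_end

print(rc)                             # GCTAAGTTCTCAGACCACAT
print(found_on_plus, found_on_minus)  # False True
```

A result of `False True` would be consistent with the gffread output being the reverse complement of the annotated region rather than a sequence from somewhere else. In this sketch the match also ends one base before the end of the pasted region, which might mean my manual extraction was shifted by one position.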