I am looking for the best way to find the sequences of introns that are located within 5'UTRs of a genome.
- Annotation file from MAKER (maker_annotation.gff3)
- Assembled genome (scaffold_genome_assembly.fasta)
So far, I've used GenomeTools to add intron features to my annotation file:
gt gff3 -addintrons maker_annotation.gff3 > intron_annotation.gff3
I created separate .gff3 files for the 5'UTRs and introns using grep:
cat intron_annotation_file.gff3 | grep five_prime_UTR > five_prime_UTR.gff3 cat intron_annotation_file.gff3 | grep intron > introns.gff3
With these files, I used bedtools to get a .gff3 of the intersection between introns and 5'UTRs:
bedtools intersect -wa -a introns.gff3 -b five_prime_UTR.gff3 > introns_five_prime.gff3
And used this last output to get the sequence from the genome:
bedtools getfasta -name -fo introns_five_seq.fasta -fi scaffold_genome_assembly.fasta -bed introns_five_prime.gff3
However, the sequences I get in introns_five_seq.fasta don't all start with GT and end in AG. Is there anything wrong with this process? I'm new to bioinformatics, any feedback is appreciated!
May be you want to remove the
bedtools intersectif you only want the intron fragments that are located within 5'UTRs as you have stated above otherwise you get the entire intron sequence.