I have to analyze exon and intron sequences from many organisms.
My question is: what is the most efficient way to retrieve all those sequences in FASTA format?
Or in other words: which database provides exon and intron sequences directly?
I thought of downloading the GFF files for all the organisms, filtering for exons and introns, and then getting their sequences with bedtools getfasta.
But this process requires downloading many genomes and does not seem efficient.
Any suggestions?
You can extract the intron features from a GFF3 file using the script shown here:
Run it as follows:
./gff3_to_introns.py -i Homo_sapiens.GRCh38.94.chr.gff3 -o introns_file.tsv
The output file will have introns in the following format:
#chrom  intron_start  intron_end  strand  gene_id          tx_acc           intron_num  intron_ct
1       12228         12612       +       ENST00000456328  ENST00000456328  1           2
1       12722         13220       +       ENST00000456328  ENST00000456328  2           2
1       12058         12178       +       ENST00000450305  ENST00000450305  1           5
1       12228         12612       +       ENST00000450305  ENST00000450305  2           5
A few notes:
- This script works fine for RefSeq GFF3 and Ensembl GFF3 files. I have not tested it with files from other sources.
- For Ensembl GFF3 files, the gene_id column does not have the actual gene_id; it has the transcript accession instead.
- The coordinates are 1-based, just as they are in GFF3 files. BED files are 0-based, so the start coordinate has to be shifted by one if you convert the output to BED.
Once you have the coordinates for introns, you should be able to use bedtools getfasta to fetch the exact sequence.
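Since bedtools getfasta expects BED-style intervals, the intron table needs a small conversion first. Here is a minimal sketch of that step in Python. It assumes the output file is tab-separated with the eight columns shown above and that intron_end is inclusive, as in GFF3; the function name and the naming scheme for the BED name column are my own choices, not part of the script.

```python
def intron_tsv_to_bed(lines):
    """Yield BED6 lines from intron TSV rows.

    Assumes tab-separated rows with the columns
    chrom, intron_start, intron_end, strand, gene_id, tx_acc, intron_num, intron_ct
    and 1-based, inclusive coordinates (an assumption based on the GFF3 convention).
    """
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and the '#chrom ...' header line.
        if not line or line.startswith("#"):
            continue
        chrom, start, end, strand, _gene_id, tx_acc, intron_num, _intron_ct = line.split("\t")
        # BED is 0-based and half-open: shift the start by one, keep the end as is.
        bed_start = int(start) - 1
        # Hypothetical naming scheme: transcript accession plus intron number.
        name = f"{tx_acc}_intron{intron_num}"
        yield "\t".join([chrom, str(bed_start), end, name, "0", strand])
```

Write the yielded lines to a file, for example introns.bed, and then run something like bedtools getfasta -fi genome.fa -bed introns.bed -s -name -fo introns.fa, where genome.fa is a placeholder for the matching genome FASTA. The -s option makes bedtools reverse-complement features on the minus strand, which is why the strand column is kept.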
I think the Ensembl API gives you a way to query the sequence database and fetch the FASTA sequences programmatically. It requires a bit of Perl programming, but it isn't too bad. For organisms that are not in Ensembl, you'll have to download the FASTA and GFF files and fetch the sequences locally, probably with BioPerl.
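If you would rather avoid Perl, Ensembl also has a REST service, which is a different interface from the Perl API mentioned above. The sketch below uses its sequence/id endpoint with only the Python standard library. The function names are hypothetical, and it assumes the ID you pass is a valid Ensembl stable ID (for example a transcript accession like the ones in the intron table).

```python
import urllib.parse
import urllib.request

ENSEMBL_REST = "https://rest.ensembl.org"


def sequence_url(stable_id, seq_type="genomic"):
    """Build the URL for the REST sequence/id endpoint.

    seq_type is passed as the 'type' query parameter
    (e.g. genomic, cdna, cds, protein).
    """
    query = urllib.parse.urlencode({"type": seq_type})
    return f"{ENSEMBL_REST}/sequence/id/{urllib.parse.quote(stable_id)}?{query}"


def fetch_fasta(stable_id, seq_type="genomic"):
    """Fetch one sequence as FASTA text. Needs network access."""
    request = urllib.request.Request(
        sequence_url(stable_id, seq_type),
        # Ask the service for FASTA instead of the default JSON.
        headers={"Content-Type": "text/x-fasta"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8")
```

Calling fetch_fasta("ENST00000456328", "cdna") should return the spliced transcript sequence as FASTA text. For thousands of features across many organisms, one request per ID will be slow, so downloading the files and extracting locally is probably still the more practical route.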