Question

Finding exon and intron sequences

0

6.0 years ago

elisheva ▴ 120

Hello everybody!!
I have to analyze the exon and intron sequences of many organisms.
My question is: what is the most efficient way to retrieve all those sequences in fasta format?
Or in other words: which database holds accessible information about exon and intron sequences?
I thought of downloading the gff files of all the organisms, filtering for exons and introns, and then getting their sequences with bedtools getfasta.
But this process requires downloading many genomes and does not seem efficient.
Any suggestions for this purpose?

genome sequence exon intron ensembl • 4.3k views

ADD COMMENT • link updated 5.9 years ago by vkkodali_ncbi ★ 3.7k • written 6.0 years ago by elisheva ▴ 120

0

How many genomes are you working with? Will you be downloading all intron and exon sequences for each genome? If so, where are you getting the coordinates from? For RefSeq data, you can download gff3 files, parse them for intron and exon coordinates, and use edirect to download sequences in fasta format. But edirect would be an inefficient way to do this if you want the sequences of all the introns and exons. Downloading the entire genome sequence in fasta format to disk first would be much more efficient.

ADD REPLY • link 6.0 years ago by vkkodali_ncbi ★ 3.7k

0

I am working on 80 organisms from the Ensembl db, so it is not practical to download their full genomes. Besides, on their gff3/gtf files, there are no introns at all ):

ADD REPLY • link 5.9 years ago by elisheva ▴ 120

0

Besides, on their gff3/gtf files, there are no introns at all

Prokaryotes?

ADD REPLY • link 5.9 years ago by Pierre Lindenbaum 164k

0

Of course not. I am looking only at mammals.

ADD REPLY • link 5.9 years ago by elisheva ▴ 120

1

6.0 years ago

Vitis ★ 2.5k

I think the Ensembl API provides an entry point to tap into the sequence database and fetch the fasta sequences programmatically. It requires a bit of Perl programming, but it isn't too bad. For organisms that are not in Ensembl, you'll have to download the fasta and gff files and fetch the sequences locally, probably with BioPerl.

See this:

https://uswest.ensembl.org/info/docs/api/index.html

and this:

https://bioperl.org/howtos/SeqIO_HOWTO.html

ADD COMMENT • link 6.0 years ago by Vitis ★ 2.5k

0

I tried it, but I have to say it's quite unclear for someone who is totally new to Perl. Besides, I only tried to get the transcript IDs of one chromosome and it took about 3 minutes, so I guess downloading full sequences will take too much time.

ADD REPLY • link 5.9 years ago by elisheva ▴ 120

0

Ensembl could be slow, depending on your connection speed. But working with all exons and introns of 80 species is also a very big endeavor. Downloading 80 genomes and extracting sequences based on GFF3 would still be an option if I were doing this, because at least the extraction part can be very fast and efficient; see this:

https://bioperl.org/howtos/Local_Databases_HOWTO.html

ADD REPLY • link 5.9 years ago by Vitis ★ 2.5k

0

Thank you for your response. But as I mentioned above, the gff3 doesn't include intron coordinates for some reason, so I can't see how it will be helpful. One more thing: assuming I do have the coordinates and the complete genomes, I guess bedtools getfasta will be more efficient for this case than BioPerl.

ADD REPLY • link 5.9 years ago by elisheva ▴ 120

0

I think intron coordinates could be inferred from GFF3 with some carefully designed calculations. Introns are just sequences between the exons. You only need to be careful with the tricky ones involving UTRs.
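
To illustrate the idea, here is a minimal Python sketch of that calculation. It is not a tested tool: the function name infer_introns is made up for this example, and it assumes every exon line carries a Parent= attribute naming its transcript (as Ensembl and RefSeq GFF3 files generally do). Because it works from exon features rather than CDS features, introns that fall inside UTRs are picked up as well. Coordinates stay 1-based and inclusive, as in GFF3.

```python
from collections import defaultdict


def infer_introns(gff3_lines):
    """Return introns as (seqid, start, end, strand, transcript, number, count).

    Introns are the gaps between consecutive exons of the same transcript.
    Coordinates are 1-based and inclusive, like the GFF3 input.
    """
    exons = defaultdict(list)
    for line in gff3_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "exon":
            continue
        attrs = dict(f.split("=", 1) for f in cols[8].split(";") if "=" in f)
        # An exon shared by several transcripts lists them comma-separated.
        for parent in attrs.get("Parent", "").split(","):
            if parent:
                key = (cols[0], cols[6], parent)
                exons[key].append((int(cols[3]), int(cols[4])))

    introns = []
    for (seqid, strand, transcript), spans in exons.items():
        spans.sort()
        # Gap between the end of one exon and the start of the next one;
        # adjacent or overlapping exons leave no intron and are skipped.
        gaps = [
            (prev_end + 1, next_start - 1)
            for (_, prev_end), (next_start, _) in zip(spans, spans[1:])
            if next_start - prev_end > 1
        ]
        if strand == "-":
            gaps.reverse()  # number introns in transcript (5' to 3') order
        for number, (start, end) in enumerate(gaps, 1):
            introns.append(
                (seqid, start, end, strand, transcript, number, len(gaps))
            )
    return introns
```

Usage would be something like infer_introns(open("annotation.gff3")), where annotation.gff3 is a placeholder for whichever file you downloaded.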

ADD REPLY • link 5.9 years ago by Vitis ★ 2.5k

score 4 · Accepted Answer · 2018-11-01

You can extract the intron features from a GFF3 file using the script shown here:

Run it as follows:

./gff3_to_introns.py -i Homo_sapiens.GRCh38.94.chr.gff3 -o introns_file.tsv

The output file will have introns in the following format:

#chrom  intron_start  intron_end  strand  gene_id          tx_acc           intron_num  intron_ct
1       12228         12612       +       ENST00000456328  ENST00000456328  1           2
1       12722         13220       +       ENST00000456328  ENST00000456328  2           2
1       12058         12178       +       ENST00000450305  ENST00000450305  1           5
1       12228         12612       +       ENST00000450305  ENST00000450305  2           5

A few notes:

This script works fine for RefSeq GFF3 and Ensembl GFF3 files. I did not test with others though.
For Ensembl GFF3 files, the gene_id column does not have the actual gene_id; it has the transcript accession instead.
The coordinates are 1-based just like they are for GFF3 files. BED files are 0-based.
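
Because of that last point, the start coordinate has to be shifted by one before the rows can be used as BED. A minimal Python sketch of the conversion, assuming the eight-column layout shown above; the function name introns_tsv_to_bed and the "<transcript>_intron<number>" naming scheme are made up for this example:

```python
def introns_tsv_to_bed(lines):
    """Convert 1-based intron rows into 0-based, half-open BED6 lines."""
    bed = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the header
        chrom, start, end, strand, _gene_id, tx_acc, intron_num = line.split()[:7]
        name = f"{tx_acc}_intron{intron_num}"  # illustrative naming scheme
        # BED start = GFF3 start - 1; the end coordinate stays the same.
        bed.append("\t".join([chrom, str(int(start) - 1), end, name, "0", strand]))
    return bed
```

For the first row of the example output, this gives a BED interval of 12227 to 12612 on chromosome 1.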

Once you have the coordinates for introns, you should be able to use bedtools getfasta to fetch the exact sequence.
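
For completeness, a hedged sketch of how that call could be driven from Python. It assumes bedtools is installed and on the PATH, that the intron coordinates are already in a BED file, and that the genome is available as a local fasta file; the helper name getfasta_command and all file names are placeholders. The flags used (-fi, -bed, -fo, -name, -s) are standard bedtools getfasta options, though the exact header format produced by -name differs between bedtools versions.

```python
def getfasta_command(genome_fasta, bed_file, out_fasta, stranded=True):
    """Build the argument list for a bedtools getfasta call."""
    cmd = [
        "bedtools", "getfasta",
        "-fi", genome_fasta,  # genome sequence in fasta format
        "-bed", bed_file,     # intervals to extract (0-based BED)
        "-fo", out_fasta,     # output fasta file
        "-name",              # use the BED name column in the fasta header
    ]
    if stranded:
        cmd.append("-s")      # reverse-complement minus-strand intervals
    return cmd


# To actually run it (requires bedtools):
#   import subprocess
#   subprocess.run(getfasta_command("genome.fa", "introns.bed", "introns.fa"),
#                  check=True)
```

The -s flag matters here: without it, introns of minus-strand transcripts come back as the forward-strand sequence.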