Question: Finding exons and introns sequences
elisheva wrote:

Hello everybody!
I need to analyze the exon and intron sequences of many organisms.
My question is: what is the most efficient way to retrieve all those sequences in FASTA format?
In other words: which database provides accessible exon and intron sequences?
I thought of downloading the GFF files of all the organisms, filtering for exons and introns, and then getting their sequences with bedtools getfasta.
But this process requires downloading many genomes and does not seem efficient.
Any suggestions?

modified 21 months ago by vkkodali • written 21 months ago by elisheva

How many genomes are you working with? Will you be downloading all intron and exon sequences for each genome? If so, where are you getting the coordinates from? For RefSeq data, you can download GFF3 files, parse them for intron and exon coordinates, and use edirect to download the sequences in FASTA format. But edirect would be an inefficient way to do this if you want the sequences for all of the introns and exons. Downloading the entire genome sequence in FASTA format to disk first would be much more efficient.

written 21 months ago by vkkodali

I am working with 80 organisms from the Ensembl database, so it is not practical to download their full genomes. Besides, their gff3/gtf files contain no intron features at all ):

written 21 months ago by elisheva

"Besides, on their gff3/gtf files, there are no introns at all"

Prokaryotes?

written 21 months ago by Pierre Lindenbaum

Of course not. I am looking only at mammals.

written 21 months ago by elisheva
vkkodali wrote:

You can extract the intron features from a GFF3 file using the script shown here:

Run it as follows:

./gff3_to_introns.py -i Homo_sapiens.GRCh38.94.chr.gff3 -o introns_file.tsv

The output file will have introns in the following format:

#chrom  intron_start  intron_end  strand  gene_id          tx_acc           intron_num  intron_ct
1       12228         12612       +       ENST00000456328  ENST00000456328  1           2
1       12722         13220       +       ENST00000456328  ENST00000456328  2           2
1       12058         12178       +       ENST00000450305  ENST00000450305  1           5
1       12228         12612       +       ENST00000450305  ENST00000450305  2           5

A few notes:

  • This script works fine for RefSeq GFF3 and Ensembl GFF3 files. I did not test with others though.
  • For Ensembl GFF3 files, the gene_id column does not have the actual gene_id; it has the transcript accession instead.
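  • The coordinates are 1-based and inclusive, just like they are in GFF3 files. BED files are 0-based and half-open, so subtract 1 from the start (and leave the end unchanged) when converting.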

Once you have the coordinates for introns, you should be able to use bedtools getfasta to fetch the exact sequence.
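
The 1-based to 0-based conversion mentioned in the notes can be sketched in Python as follows. This is a minimal sketch, not part of the script above: it assumes the intron table has the whitespace-separated columns shown in the sample output, and the function name `introns_tsv_to_bed` and the `<tx_acc>_intron<N>` naming scheme are made up for illustration.

```python
def introns_tsv_to_bed(lines):
    """Yield tab-separated BED6 lines from the intron table.

    Assumes columns in this order (as in the sample output):
    chrom, intron_start, intron_end, strand, gene_id, tx_acc, intron_num, intron_ct
    Input coordinates are 1-based inclusive; BED is 0-based half-open,
    so only the start is shifted by one.
    """
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the header
        fields = line.split()
        chrom, start, end, strand = fields[0], int(fields[1]), int(fields[2]), fields[3]
        tx_acc, intron_num = fields[5], fields[6]
        name = f"{tx_acc}_intron{intron_num}"  # hypothetical naming scheme
        # BED6 columns: chrom, start, end, name, score, strand
        yield "\t".join([chrom, str(start - 1), str(end), name, "0", strand])
```

Writing these lines to a file such as `introns.bed` (a hypothetical name) should allow a call like `bedtools getfasta -fi genome.fa -bed introns.bed -s -name -fo introns.fa`, where `-s` reverse-complements features on the minus strand and `-name` uses the BED name column in the FASTA header.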

modified 21 months ago • written 21 months ago by vkkodali
Vitis wrote:

I think the Ensembl API provides a way to tap into the sequence database and fetch the FASTA sequences programmatically. It requires a bit of Perl programming, but it isn't too bad. For organisms that are not in Ensembl, you'll have to download the FASTA and GFF files and fetch the sequences locally, probably with BioPerl.

See this:

https://uswest.ensembl.org/info/docs/api/index.html

and this:

https://bioperl.org/howtos/SeqIO_HOWTO.html
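
For readers who would rather avoid Perl, Ensembl also runs a REST server, which is a different interface from the Perl API linked above. The sketch below is only an illustration under assumptions: it uses the `sequence/region/:species/:region` endpoint with 1-based inclusive coordinates, the helper names `region_sequence_url` and `fetch_fasta` are made up here, and the public server is rate-limited, so this is better suited to spot checks than to bulk retrieval.

```python
import urllib.request

ENSEMBL_REST = "https://rest.ensembl.org"

def region_sequence_url(species, chrom, start, end, strand="+"):
    """Build the REST URL for one region (1-based, inclusive coordinates)."""
    strand_code = -1 if strand == "-" else 1
    return f"{ENSEMBL_REST}/sequence/region/{species}/{chrom}:{start}..{end}:{strand_code}"

def fetch_fasta(url, timeout=30):
    """Request the region as FASTA text. Needs network access."""
    request = urllib.request.Request(url, headers={"Content-Type": "text/x-fasta"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

For example, `fetch_fasta(region_sequence_url("human", "1", 12228, 12612))` would request the first intron listed in the answer above.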

written 21 months ago by Vitis

I tried it, but I have to say it's quite unclear for someone who is totally new to Perl. Besides, I only tried to get the transcript IDs of one chromosome and it took about 3 minutes, so I guess downloading full sequences will take too much time.

written 21 months ago by elisheva

Ensembl could be slow, depending on your connection speed. But working with all exons and introns of 80 species is also a very big endeavor. Downloading 80 genomes and extracting sequences based on GFF3 would still be an option if I were doing this, because at least the extraction part can be very fast and efficient. See this:

https://bioperl.org/howtos/Local_Databases_HOWTO.html

written 21 months ago by Vitis

Thank you for your response. But as I mentioned above, the gff3 files don't include intron coordinates for some reason, so I can't see how that will be helpful. One more thing: assuming I do have the coordinates and the complete genomes, I guess bedtools getfasta will be more efficient for this case than BioPerl.

modified 21 months ago • written 21 months ago by elisheva

I think intron coordinates could be inferred from GFF3 with some carefully designed calculations. Introns are just sequences between the exons. You only need to be careful with the tricky ones involving UTRs.
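
A minimal Python sketch of that inference, assuming a GFF3 file whose `exon` rows carry a `Parent=` attribute naming their transcript (the function names are made up for illustration). Because `exon` features already include the UTRs, taking the gaps between consecutive exons of one transcript should give the introns; taking the gaps between `CDS` features would not.

```python
from collections import defaultdict

def parse_attributes(field):
    """Turn a GFF3 attribute column ("key=value;key=value") into a dict."""
    attrs = {}
    for item in field.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key] = value
    return attrs

def introns_from_gff3(lines):
    """Return (chrom, start, end, strand, transcript, intron_num, intron_count)
    tuples, with 1-based inclusive coordinates as in GFF3."""
    exons = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "exon":
            continue
        chrom, start, end, strand = cols[0], int(cols[3]), int(cols[4]), cols[6]
        # An exon shared by several transcripts lists them comma-separated.
        for parent in parse_attributes(cols[8]).get("Parent", "").split(","):
            if parent:
                exons[(chrom, strand, parent)].append((start, end))

    introns = []
    for (chrom, strand, parent), spans in exons.items():
        spans.sort()
        # An intron is the gap between two consecutive exons of one transcript.
        gaps = [(left_end + 1, right_start - 1)
                for (_, left_end), (right_start, _) in zip(spans, spans[1:])
                if right_start - left_end > 1]
        count = len(gaps)
        for i, (start, end) in enumerate(gaps):
            # Number introns from the 5' end, which is the right-hand side on "-".
            number = count - i if strand == "-" else i + 1
            introns.append((chrom, start, end, strand, parent, number, count))
    return introns
```

Calling `introns_from_gff3(open("annotation.gff3"))` (file name hypothetical) would return one tuple per intron, ready to be converted to BED for bedtools getfasta.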

written 21 months ago by Vitis