Question

How to get number of exons for each transcript in biomart

5.2 years ago

lauren.fehrman • 0

I would like to get a table from Ensembl that includes the number of exons for each transcript, but I can't find any option for this in BioMart.

biomart • 4.4k views

updated 4.5 years ago by asrmpr • 0 • written 5.2 years ago by lauren.fehrman • 0

I don't think BioMart stores such aggregate data. If you are not set on using Ensembl, you should be able to write a custom query against the UCSC MySQL tables. If you need Ensembl, you might have to get the CDS information and count the exons yourself.

5.2 years ago by Ram 43k

In addition to RamRS's suggestion: if you can convert the GTF to exons, for example using this, you can groupBy the transcript and count the number of exons per transcript.

5.2 years ago by GouthamAtla 12k

If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can accept more than one if they work.

updated 4.5 years ago by GenoMax 142k • written 5.2 years ago by Ram 43k

score 1 · Answer 1 · 2019-02-13

I wouldn't use biomaRt; try AWK instead. If you download the annotation GTF file from Ensembl, you can try something like this:

awk '$3=="exon" {print $0}' Homo_sapiens.GRCh38.78.gtf | awk '{ count[$10]++ } END { for (word in count) print word, count[word]}' > numberOfExonsPerGene.txt
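Note that with awk's default whitespace splitting, `$10` in the command above appears to hold the gene_id value, so it counts exon lines per gene rather than per transcript. For per-transcript counts, here is a minimal Python sketch. It assumes a tab-separated GTF whose ninth column contains a `transcript_id "..."` attribute; the function name `count_exons_per_transcript` and the sample lines are made up for illustration.

```python
import re
from collections import Counter

# Matches the transcript_id attribute in the ninth GTF column.
TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def count_exons_per_transcript(lines):
    """Count 'exon' features per transcript_id in an iterable of GTF lines."""
    counts = Counter()
    for line in lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue  # keep only well-formed exon records
        match = TRANSCRIPT_ID.search(fields[8])
        if match:
            counts[match.group(1)] += 1
    return counts


# Made-up GTF lines (hypothetical IDs), only to show the expected input shape.
sample = [
    "#!genome-build example",
    '1\thavana\tgene\t100\t900\t.\t+\t.\tgene_id "G1";',
    '1\thavana\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
    '1\thavana\texon\t300\t400\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
    '1\thavana\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T2";',
]

for transcript, n in sorted(count_exons_per_transcript(sample).items()):
    print(transcript, n, sep="\t")
```

To run it on a real annotation, pass an open file handle instead of `sample`, e.g. `count_exons_per_transcript(open("Homo_sapiens.GRCh38.78.gtf"))`. Matching the attribute by name rather than by position keeps it working if the attribute order differs between releases.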

score 0 · Answer 2 · 2019-02-13

Using UCSC (not BioMart):

$ mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -P 3306 -D hg38 -e 'select chrom,name,exonCount from wgEncodeGencodeBasicV28;'
+-------+-------------------+-----------+
| chrom | name              | exonCount |
+-------+-------------------+-----------+
| chr1  | ENST00000619216.1 |         1 |
| chr1  | ENST00000473358.1 |         3 |
| chr1  | ENST00000469289.1 |         2 |
| chr1  | ENST00000607096.1 |         1 |
| chr1  | ENST00000417324.1 |         3 |
| chr1  | ENST00000641515.2 |         3 |
| chr1  | ENST00000335137.4 |         1 |
| chr1  | ENST00000466430.5 |         4 |
| chr1  | ENST00000495576.1 |         2 |
| chr1  | ENST00000610542.1 |         4 |
| chr1  | ENST00000493797.1 |         2 |
| chr1  | ENST00000484859.1 |         2 |
| chr1  | ENST00000466557.6 |         8 |
| chr1  | ENST00000410691.1 |         1 |
| chr1  | ENST00000496488.1 |         2 |
| chr1  | ENST00000612080.1 |         1 |
| chr1  | ENST00000635159.1 |         2 |
| chr1  | ENST00000426406.3 |         1 |
(...)
+-------+-------------------+-----------+

score 0 · Answer 3 · 2019-02-13

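Or use R with a transcript database (you can build your own from any GTF with the makeTxDbFromGFF function from the GenomicFeatures package):

## load the packages
library(GenomicFeatures)
library(TxDb.Mmusculus.UCSC.mm10.ensGene)

## set up the transcript database
txdb <- TxDb.Mmusculus.UCSC.mm10.ensGene
## get exon locations for each gene
## (use exonsBy(txdb, by = 'tx', use.names = TRUE) to group by transcript instead)
exons <- exonsBy(txdb, by = 'gene')

## print number of exons for each gene - just look at the top 6 with 'head'
head(sapply(exons,length))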
ENSMUSG00000000001 ENSMUSG00000000003 ENSMUSG00000000028 ENSMUSG00000000031 
                 9                  9                 24                 15 
ENSMUSG00000000037 ENSMUSG00000000049 
                41                 16

score 0 · Answer 4 · 2019-02-14

I cannot see a way either to get the number of exons in each transcript directly via BioMart.

But you can choose Exon stable ID as an output attribute and then use a simple awk script to count the exons per transcript. Let's assume you have chosen the attributes Gene stable ID, Transcript stable ID and Exon stable ID for the output. Make sure you tick the option "Unique results only" and download the results as a TSV.

$ awk -v FS="\t" -v OFS="\t" 'NR>1 {transcript[$2]++;} END { for(t in transcript) print t, transcript[t] }' mart_export.txt

This builds an associative array keyed by the transcript ID in column 2 and increments the count for every line with that transcript ID (NR>1 skips the header line). At the end we iterate over the array and print each count. If the transcript ID is in a different column in your output, change the $2 accordingly.
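If you would rather not depend on column positions or on having ticked "Unique results only", the same count can be sketched in Python. This is only an illustration: it assumes the export has a header row with the columns "Transcript stable ID" and "Exon stable ID", and the function name `count_unique_exons` and the sample rows are made up.

```python
import csv
import io
from collections import defaultdict


def count_unique_exons(rows,
                       transcript_col="Transcript stable ID",
                       exon_col="Exon stable ID"):
    """Count distinct exon IDs per transcript ID from dict-like rows."""
    exons = defaultdict(set)
    for row in rows:
        # A set ignores repeated transcript/exon pairs.
        exons[row[transcript_col]].add(row[exon_col])
    return {transcript: len(ids) for transcript, ids in exons.items()}


# Made-up export (hypothetical IDs) with one duplicated line.
export = (
    "Gene stable ID\tTranscript stable ID\tExon stable ID\n"
    "G1\tT1\tE1\n"
    "G1\tT1\tE2\n"
    "G1\tT1\tE2\n"
    "G1\tT2\tE1\n"
)

reader = csv.DictReader(io.StringIO(export), delimiter="\t")
counts = count_unique_exons(reader)
for transcript, n in sorted(counts.items()):
    print(transcript, n, sep="\t")
```

For a real download, replace `io.StringIO(export)` with `open("mart_export.txt", newline="")`.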

fin swimmer

score 0 · Answer 5 · 2019-10-21

4.5 years ago

asrmpr • 0

Exon Number Finder v.1

A tool to find genes with a user-specified number of exons

https://github.com/CyPH3R-ASR/exonNum

4.5 years ago by asrmpr • 0

Thanks for contributing, but please note:

1) this does not answer the top-level question, as the OP asked about BioMart, and

2) it is sufficient to post your tool once, not as both an answer and a comment. Removed the comment.

4.5 years ago by ATpoint 82k