Question: How to get number of exons for each transcript in biomart
9 months ago, lauren.fehrman wrote:

I would like to get a table from Ensembl that includes the number of exons for each transcript. I can't find any option for this in BioMart, though.


I don't think BioMart stores such aggregate data. If you are not tied to Ensembl, you should be able to write a custom query against the UCSC MySQL tables. If you need Ensembl, you might have to get the CDS information and count the exons yourself.

Reply written 9 months ago by RamRS

In addition to RamRS's suggestion: if you can convert the GTF to exons, for example using this, you can groupBy the transcript and count the number of exons per transcript.

Reply written 9 months ago by geek_y

If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can accept more than one if they work.

Reply written 9 months ago by RamRS, modified 4 weeks ago by genomax

9 months ago, Benn (Netherlands) wrote:

I wouldn't use biomaRt, but would try AWK instead. If you download the annotation GTF file from Ensembl, you can try something like this:

awk '$3 == "exon" { count[$10]++ } END { for (id in count) print id, count[id] }' Homo_sapiens.GRCh38.78.gtf > numberOfExonsPerGene.txt
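
In an Ensembl GTF, whitespace-delimited field $10 holds the gene_id value, so the one-liner above counts exons per gene. For per-transcript counts, a minimal sketch could pull the transcript_id out of the attribute column instead of relying on a fixed field position. It assumes a tab-separated GTF whose ninth column contains a transcript_id "..." entry on exon lines; the function name count_exons_per_transcript is made up for illustration.

```shell
# Count exons per transcript in a GTF file (hypothetical helper).
# Assumes tab-separated columns and a transcript_id "..." entry in column 9.
count_exons_per_transcript() {
  awk -F'\t' '
    $3 == "exon" && match($9, /transcript_id "[^"]+"/) {
      # skip the 15-character prefix (transcript_id, space, quote) and the closing quote
      id = substr($9, RSTART + 15, RLENGTH - 16)
      count[id]++
    }
    END { for (id in count) print id "\t" count[id] }
  ' "$1" | sort
}
```

It could be called as count_exons_per_transcript Homo_sapiens.GRCh38.78.gtf > numberOfExonsPerTranscript.txt. Matching on the attribute name avoids depending on where transcript_id sits in the attribute list, which may differ between annotation releases.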

9 months ago, Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote:

Using UCSC (not BioMart):

$ mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -P 3306 -D hg38 -e 'select chrom,name,exonCount from wgEncodeGencodeBasicV28;'
+-------+-------------------+-----------+
| chrom | name              | exonCount |
+-------+-------------------+-----------+
| chr1  | ENST00000619216.1 |         1 |
| chr1  | ENST00000473358.1 |         3 |
| chr1  | ENST00000469289.1 |         2 |
| chr1  | ENST00000607096.1 |         1 |
| chr1  | ENST00000417324.1 |         3 |
| chr1  | ENST00000641515.2 |         3 |
| chr1  | ENST00000335137.4 |         1 |
| chr1  | ENST00000466430.5 |         4 |
| chr1  | ENST00000495576.1 |         2 |
| chr1  | ENST00000610542.1 |         4 |
| chr1  | ENST00000493797.1 |         2 |
| chr1  | ENST00000484859.1 |         2 |
| chr1  | ENST00000466557.6 |         8 |
| chr1  | ENST00000410691.1 |         1 |
| chr1  | ENST00000496488.1 |         2 |
| chr1  | ENST00000612080.1 |         1 |
| chr1  | ENST00000635159.1 |         2 |
| chr1  | ENST00000426406.3 |         1 |
(...)
+-------+-------------------+-----------+

9 months ago, benformatics (ETH Zurich) wrote:

Or use R with a transcript database (you can make your own from any GTF using the makeTxDbFromGFF function from the GenomicFeatures package):

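## setup transcriptDb
library(GenomicFeatures)
library(TxDb.Mmusculus.UCSC.mm10.ensGene)
txdb <- TxDb.Mmusculus.UCSC.mm10.ensGene
## get exon locations for each gene
## (for per-transcript counts, use exonsBy(txdb, by = 'tx', use.names = TRUE))
exons <- exonsBy(txdb, 'gene')

## print number of exons for each gene - just look at the top 6 with 'head'
head(sapply(exons, length))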
ENSMUSG00000000001 ENSMUSG00000000003 ENSMUSG00000000028 ENSMUSG00000000031 
                 9                  9                 24                 15 
ENSMUSG00000000037 ENSMUSG00000000049 
                41                 16

9 months ago, finswimmer (Germany) wrote:

I cannot see a way either to get the number of exons in each transcript directly via BioMart.

But you can choose Exon stable ID as an output attribute and then use a simple awk script to count the exons per transcript. Let's assume you have chosen the attributes Gene stable ID, Transcript stable ID and Exon stable ID for the output. Make sure you tick the option "Unique results only" and download the result as TSV.

$ awk -v FS="\t" -v OFS="\t" 'NR>1 {transcript[$2]++;} END { for(t in transcript) print t, transcript[t] }' mart_export.txt

This builds an associative array keyed on the transcript ID in column 2 and counts the lines for each transcript. At the end, we iterate over the array and print each count. If the transcript ID is in a different column in your output, change the $2 accordingly.
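
The counting step can also be made robust to repeated rows. The variant below is only a sketch: it counts each transcript/exon pair once, so the totals stay correct even if "Unique results only" was not ticked. It assumes the column order gene ID, transcript ID, exon ID with a single header line; the function name count_exons_from_biomart is hypothetical.

```shell
# Count distinct exons per transcript in a BioMart TSV export (hypothetical helper).
# Assumed columns: 1 = gene ID, 2 = transcript ID, 3 = exon ID; line 1 is the header.
count_exons_from_biomart() {
  awk -F'\t' -v OFS='\t' '
    NR > 1 && !seen[$2, $3]++ { n[$2]++ }   # count each transcript/exon pair once
    END { for (t in n) print t, n[t] }
  ' "$1" | sort
}
```

Called as count_exons_from_biomart mart_export.txt, it prints one tab-separated line per transcript, sorted by transcript ID.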

fin swimmer


Thank you. This worked perfectly for what I needed to do.

Reply written 9 months ago by lauren.fehrman

4 weeks ago, asrmpr wrote:

Exon Number Finder v.1

A tool to find genes with a user-specified exon number

https://github.com/CyPH3R-ASR/exonNum


Thanks for contributing, but please note:

1) this does not answer the top-level question, as OP asked about BioMart, and

2) it is sufficient if you add your tool once, not as an answer and a comment. Removed the comment.

Reply written 4 weeks ago by ATpoint