Question

How to know intron lengths

1

7.6 years ago

jmramos.bio ▴ 10

Hello,

I have an RNA-seq experiment and I would like to use STAR as the aligner. I took an RNA-seq course where we were told that passing the length of the longest intron in the genome as a parameter would save time. But they forgot to tell us how to obtain this information, and we forgot to ask. Could you tell me how I can do that?

Thank you! J

RNA-Seq STAR intron • 3.0k views

ADD COMMENT • link updated 7.6 years ago by i.sudbery 19k • written 7.6 years ago by jmramos.bio ▴ 10

1

I don't know how providing the longest intron length will help the aligner, but there are a lot of things I don't know. Either way, you can find some nice transcriptome summary statistics from http://genomewiki.ucsc.edu/index.php/Gene_Set_Summary_Statistics.

ADD REPLY • link 7.0 years ago by spvensko ▴ 240

0

If you know which genome was sequenced, you can write a script that takes your annotation file (GFF) as input and looks for the longest intron.
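A minimal sketch of such a script in Python, assuming a tab-delimited GFF3 file in which each exon (or CDS) row carries a `Parent=` attribute naming its transcript. The function name `intron_lengths` and the `feature_type` parameter are made up for this illustration. Introns are taken to be the gaps between consecutive features of the same parent, so the result is only as good as the annotation.

```python
from collections import defaultdict


def intron_lengths(gff_lines, feature_type="exon"):
    """Yield intron lengths inferred from gaps between consecutive
    features of one type that share the same Parent."""
    spans = defaultdict(list)
    for line in gff_lines:
        # Skip blank lines and comment/pragma lines.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != feature_type:
            continue
        # Column 9 holds key=value pairs separated by semicolons.
        attrs = dict(
            field.split("=", 1) for field in cols[8].split(";") if "=" in field
        )
        # A feature may list several parents, separated by commas.
        for parent in attrs.get("Parent", "").split(","):
            if parent:
                spans[(cols[0], parent)].append((int(cols[3]), int(cols[4])))
    for parts in spans.values():
        parts.sort()
        for (_, prev_end), (next_start, _) in zip(parts, parts[1:]):
            gap = next_start - prev_end - 1  # GFF coordinates are inclusive
            if gap > 0:
                yield gap


# The two exon records of Rs216420.1 posted elsewhere in this thread,
# written here with tab separators.
example = [
    "RUS05606\tVer1.2.2\texon\t1770\t1996\t.\t+\t.\tID=Rs216420.1.exon1;Parent=Rs216420.1",
    "RUS05606\tVer1.2.2\texon\t2197\t2753\t.\t+\t.\tID=Rs216420.1.exon2;Parent=Rs216420.1",
]
print(max(intron_lengths(example)))  # 200
```

For a real annotation you would pass an open file handle instead of the list, for example `max(intron_lengths(open("yourGffFile.gff")))`.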

ADD REPLY • link 7.6 years ago by glihm ▴ 660

score 4 · Answer 1 · 2016-09-20

4

7.6 years ago

Medhat 9.7k

You can use this script as follows:

intron-length.awk TYPE=CDS yourGffFile.gff

It reports the minimum intron length, the maximum intron length, and the maximum sum of intron lengths among all mRNA features.

ADD COMMENT • link 7.6 years ago by Medhat 9.7k

0

I am trying to use the above script but I get an error. Please suggest how I can resolve it.

[root@psgl genome]# awk intron-length.awk TYPE=CDS Rs_1.0.Gene.LFY.gff >intron_statistics
awk: cmd. line:1: intron-length.awk
awk: cmd. line:1: ^ syntax error

ADD REPLY • link 7.3 years ago by Bioinfonext ▴ 460

0

cat file.gff
RUS05596 Ver1.2.2 CDS 2580 2690 . + 0 ID=Rs462540.1.cds3;Parent=Rs462540.1
RUS05596 Ver1.2.2 three_prime_UTR 2691 2973 . + . ID=Rs462540.1.utr3;Parent=Rs462540.1

Rs216420 RUS05606 1938 2500 +

RUS05606 Ver1.2.2 gene 1770 2753 . + . ID=Rs216420;Name=Rs216420
RUS05606 Ver1.2.2 promoter 270 1769 . + . Note=promoter region
RUS05606 Ver1.2.2 mRNA 1770 2753 . + . ID=Rs216420.1;Parent=Rs216420;Product=Unknown protein
RUS05606 Ver1.2.2 protein 1938 2500 . + . ID=Rs216420.1.protein1;Name=Rs216420.1;Derives_from=Rs216420.1;Product=Unknown protein
RUS05606 Ver1.2.2 exon 1770 1996 . + . ID=Rs216420.1.exon1;Parent=Rs216420.1
RUS05606 Ver1.2.2 five_prime_UTR 1770 1937 . + . ID=Rs216420.1.utr5;Parent=Rs216420.1
RUS05606 Ver1.2.2 CDS 1938 1996 . + 0 ID=Rs216420.1.cds1;Parent=Rs216420.1
RUS05606 Ver1.2.2 exon 2197 2753 . + . ID=Rs216420.1.exon2;Parent=Rs216420.1
RUS05606 Ver1.2.2 CDS 2197 2500 . + 2 ID=Rs216420.1.cds2;Parent=Rs216420.1
RUS05606 Ver1.2.2 three_prime_UTR 2501 2753 . + . ID=Rs216420.1.utr3;Parent=Rs216420.1

ADD REPLY • link 7.3 years ago by Bioinfonext ▴ 460

score 2 · Answer 2 · 2016-09-20

Specifying the maximum intron length helps because it limits the search space for the "other end" of a read when it is being aligned to the genome. If the second half of your read maps several Mb away from the first, it is unlikely to represent a valid, biologically relevant splice junction and is probably the result of a misalignment. In that case it makes no sense to spend time looking megabases away for the mapping position of the second half of a split read.

It is also the case that some reference genomes contain gene models with unreasonably long introns, often ones that merge two genes together (i.e. one half of the junction is in one gene and the other half is in a different gene, usually a different member of the same protein family).

A little bit of knowledge about your genome of interest can help here. In humans we use 2 Mb as our maximum intron length because there is a gene with an intron that long that we are pretty confident is real (I don't remember which one right now).

Otherwise you could trust the reference annotation and use the method outlined by Medhat.
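The two approaches can also be combined. The sketch below is an illustration, not part of STAR: it takes intron lengths derived from the annotation, drops any above a sanity cap (defaulting to the 2 Mb human figure mentioned above, which is an assumption to adjust for other genomes), and formats the result for STAR's `--alignIntronMax` option. The function name `star_intron_args` and the `cap` parameter are hypothetical.

```python
def star_intron_args(intron_lengths, cap=2_000_000):
    """Return STAR command-line arguments that limit intron size to the
    longest annotated intron not exceeding `cap`.

    `cap` is an assumed sanity limit for implausibly long annotated introns.
    """
    plausible = [n for n in intron_lengths if n <= cap]
    # Fall back to the cap itself if no annotated intron passes the filter.
    limit = max(plausible) if plausible else cap
    return ["--alignIntronMax", str(limit)]


# Made-up lengths: the 5 Mb "intron" is treated as a merged-gene artefact.
print(star_intron_args([200, 15_000, 5_000_000]))
# ['--alignIntronMax', '15000']
```

The returned list can be appended to the rest of your STAR command line.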