Question: Generating Internal Exon Based On Known mRNA Isoform
Puriney (New York City) wrote, 6.4 years ago:

This is kind of a coding strategy question.

A given gene has two isoforms built from three exons. Isoform_A is exon1-exon2-exon3, while Isoform_B is exon1-exon3. Thus exon2 is what I want to filter out, as the internal exon.

I have downloaded all the exon data from the UCSC Genes track of the UCSC Genome Browser (using selected fields from primary and related tables), and I want to filter out every "internal exon" in the sense described above.

The input looks something like this:

#isoform_name    chr    strand    ex_start    ex_end    gene_name
isoformA    chr1    +    10,30,    15,35    geneM
isoformB    chr1    +    10,20,30,    15,25,35    geneM
isoformC    chr1    +    40,50,    45,55    geneM

Thus the exon [20-25] is called the internal exon.

The key is to deal with two strings, the ex_start string and the ex_end string. Can anyone give me a hint about how to handle this efficiently?
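One way to cope with the two strings is to split each on commas, discard the empty piece left by a trailing comma, and pair the values up. The following is only a minimal Python sketch, assuming the column layout of the example above; `parse_exons` and `internal_exons` are hypothetical helper names, not part of any library.

```python
def parse_exons(ex_start, ex_end):
    """Pair comma-separated start and end strings into (start, end) tuples."""
    # "if s" drops the empty string produced by a trailing comma.
    starts = [int(s) for s in ex_start.split(",") if s]
    ends = [int(e) for e in ex_end.split(",") if e]
    if len(starts) != len(ends):
        raise ValueError("ex_start and ex_end have different lengths")
    return list(zip(starts, ends))


def internal_exons(exons):
    """Return every exon except the first and the last one."""
    return exons[1:-1]


exons = parse_exons("10,20,30,", "15,25,35")
print(internal_exons(exons))  # [(20, 25)]
```

An isoform with only one or two exons yields an empty list, so it needs no special handling.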

P.S. I know that HEXEvent and BioMart can provide such a data set, but I am curious how to do it with local code. Thanks a lot!

splicing • 1.3k views

Why are several ex_start and ex_end values provided per row?

written 6.4 years ago by Biomonika (Noolean)

isoformC has a missing comma in ex_start

written 6.4 years ago by JC

Missing comma added, as @JC mentioned.

written 6.4 years ago by Puriney
Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote, 6.4 years ago:

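Filtering out the internal exons, using awk:

 # Assumes tab-separated columns; "#" header lines are skipped.
 awk 'BEGIN { FS = OFS = "\t" } /^#/ { next }
      { sub(/,$/, "", $4); sub(/,$/, "", $5)            # drop trailing commas
        Sn = split($4, S, ","); En = split($5, E, ",")
        $4 = S[1] "," S[Sn]; $5 = E[1] "," E[En]; print  # keep first and last exon
      }' input.txt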


isoformA       chr1       +       10,30       15,35       geneM
isoformB       chr1       +       10,30       15,35       geneM
isoformC       chr1       +       40,50       45,55       geneM
JC (Mexico) wrote, 6.4 years ago:

Perl option (don't forget to fix the comma in isoformC):

 perl -plane 's/(\d+,).*?(\d+,)(\s+)(\d+,).*?(\d+\s+)/$1$2$3$4$5/' < in > out
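
The one-liners above keep only the first and last exon of each isoform. To list the skipped exon itself (the [20-25] case from the question), one possible approach is to compare isoforms of the same gene. The Python sketch below makes a loud assumption about the definition: an exon counts as "skipped" when its two neighbouring exons are joined directly in another isoform of the same gene. The function names and the embedded `EXAMPLE` data are hypothetical, and the input is assumed to be tab-separated.

```python
import csv
import io
from collections import defaultdict

# Hypothetical example data mirroring the table in the question.
EXAMPLE = (
    "#isoform_name\tchr\tstrand\tex_start\tex_end\tgene_name\n"
    "isoformA\tchr1\t+\t10,30,\t15,35\tgeneM\n"
    "isoformB\tchr1\t+\t10,20,30,\t15,25,35\tgeneM\n"
    "isoformC\tchr1\t+\t40,50,\t45,55\tgeneM\n"
)


def read_isoforms(handle):
    """Group exon lists by (gene, chr, strand); skip '#' header lines."""
    genes = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue
        name, chrom, strand, ex_start, ex_end, gene = row[:6]
        starts = [int(s) for s in ex_start.split(",") if s]
        ends = [int(e) for e in ex_end.split(",") if e]
        genes[(gene, chrom, strand)].append(list(zip(starts, ends)))
    return genes


def skipped_exons(genes):
    """Return exons whose two neighbours are directly joined in another isoform."""
    found = set()
    for key, isoforms in genes.items():
        # Every pair of adjacent exons seen in any isoform of this gene.
        junctions = {pair for exons in isoforms for pair in zip(exons, exons[1:])}
        for exons in isoforms:
            # Slide a window of three consecutive exons over the isoform.
            for up, mid, down in zip(exons, exons[1:], exons[2:]):
                if (up, down) in junctions:
                    found.add(key + mid)
    return sorted(found)


for hit in skipped_exons(read_isoforms(io.StringIO(EXAMPLE))):
    print(*hit, sep="\t")  # geneM, chr1, +, 20, 25 (tab-separated)
```

To run it on a real file, `io.StringIO(EXAMPLE)` could be replaced with an opened file handle. Exons are matched by exact coordinates here, so alternative 5' or 3' splice sites would not be detected.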