Question: Help with finding introns
0
gravatar for elisheva
23 months ago by
elisheva70
Israel
elisheva70 wrote:

Hello everyone!!

I have a question that I guess the answer is simple:
I have a fasta file of human cDNA that looks like this:
(for example these 2 genes)

>ENST00000415118.1 cdna chromosome:GRCh38:14:22438547:22438554:1 gene:ENSG00000223997.1 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene gene_symbol:TRDD1 description:T-cell receptor delta diversity 1 [Source:HGNC Symbol;Acc:HGNC:12254]
GAAATAGT
>ENST00000631435.1 cdna chromosome:GRCh38:CHR_HSCHR7_2_CTG6:142847306:142847317:1 gene:ENSG00000282253.1 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene gene_symbol:TRBD1 description:T-cell receptor beta diversity 1 [Source:HGNC Symbol;Acc:HGNC:12158]
GGGACAGGGGGC

My target is to find the introns of these genes.
How can I do it?
Thanks!!!

introns cdna genome • 793 views
ADD COMMENTlink modified 7 days ago by Biostar ♦♦ 20 • written 23 months ago by elisheva70
1
gravatar for finswimmer
23 months ago by
finswimmer11k
Germany
finswimmer11k wrote:

Hello,

here is a more comfortable solution with python using ensembl's REST-Api.

You will need a file with one transcript id per line. You can get it the way I described before.

Save the following code as intron2fasta.py

Now you can call the script in this way (python3 is required):

python intron2fasta.py -l transcripts.txt -o introns.fa

This produces the file introns.fa containing all intron sequences of the transcripts given in transcripts.txt.

But be careful. A lot of error handling is still missing. Also the REST-Api is called for every single transcript and intron sequence instead of doing this once.

fin swimmer

ADD COMMENTlink modified 13 months ago • written 23 months ago by finswimmer11k
0
gravatar for finswimmer
23 months ago by
finswimmer11k
Germany
finswimmer11k wrote:

Hello,

is it more a general question like "I have an Transcript XY and I like to find out the introns (sequence?)?" Or do you want to know the introns of the two genes you gave us? Should this solved automaticly by a program or do you want to do this by hand.

For the two transcript above you can just go to ensembl an can search for the transcript id. Than you will find out, that these two transcripts have no Introns.

fin swimmer

ADD COMMENTlink modified 23 months ago • written 23 months ago by finswimmer11k

Thanks for the quick answer! And my question is a general question' I have to do it for all the genes. So I need to solve it by program.

ADD REPLYlink written 23 months ago by elisheva70

You might not be able to get the intron sequences from a fasta file with only cDNA. You need a bed/gff file with gene boundaries for the cDNA you are interested in and then you extract the regions that are not cDNA. Just keep in mind that your gene boundaries need to start and end with the CDS you are interested in and not include UTR regions or other adjoining CDS.

But if you just have a handful, then it would be easier to do it manually as finswimmer77 suggested.

ADD REPLYlink written 23 months ago by Rohit1.3k

doesn't it help that the fasta header has the coordinates? (just asking....... maybe I'm wrong)

ADD REPLYlink written 23 months ago by elisheva70

The questions is, what do you excatly want. Are you realy just interested in the intron sequence? Or do you want the whole sequence for the given transcript including the intron? What is your goal?

fin swimmer

ADD REPLYlink written 23 months ago by finswimmer11k

I'm only want the introns - preference to the second one.

ADD REPLYlink written 23 months ago by elisheva70
1

Ok, here is a quick-and-dirty solution:

First we need to extract the transcript numbers in your fasta file.

grep "^>" input.fa|cut -f1 -d " "|sed -r 's/^.{1}//' > transcripts.txt

Now we can use ensembls REST-API to fetch the sequence for the transcript. If we set the mask_feature parameters the introns are in lower case and the exons in upper case. You can than split/regex for the lower case parts to extract the introns.

In example for python3 and a BRCA Transcript:

import re
import requests, sys

server = "http://rest.ensembl.org"
ext = "/sequence/id/ENST00000415118.1?mask_feature=1"

r = requests.get(server+ext, headers={ "Content-Type" : "text/plain"})

if not r.ok:
  r.raise_for_status()
  sys.exit()

sequence = r.text
introns = re.findall('[atgc]+', sequence)
print(introns[0])
ADD REPLYlink modified 23 months ago • written 23 months ago by finswimmer11k

Thanks so much!! I have one more question: What does it mean: "mask_feature"? And Is there is a short way to generate fasta file of all the introns:

>transcript.index_of_intron    
sequence....

.

ADD REPLYlink modified 23 months ago • written 23 months ago by elisheva70

What does it mean: "mask_feature"?

Take a look at the link to the REST-API manual I gave to you above. Typically the sequence is given in upper case letters. With the parameter mask_feature set to 1 the introns sequence is in lower case letters.

And Is there is a short way to generate fasta file of all the introns

In my script re.findall returns a list of all introns. Iterate over it and format your output as you like.

fin swimmer

ADD REPLYlink modified 23 months ago • written 23 months ago by finswimmer11k

Thank you so much!!
I saw you wrote another solution But I allready used th first one.(Mainly because I use python 2.7)
The problem is when I run this transcript:

ENST00000537694.1

I get an error:

Traceback (most recent call last):
  File "introns.py", line 44, in <module>
    main(line)
  File "introns.py", line 15, in main
    r.raise_for_status()
  File "/usr/lib/python2.7/dist-packages/requests/models.py", line 844, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 400 Client Error: Bad Request for url: http://rest.ensembl.org/sequence/id/ENST00000537694.1%0D%0A?mask_feature=1

I guess it might be because the sequence is to long.
Is there is a way to fix it?

ADD REPLYlink modified 23 months ago • written 23 months ago by elisheva70

Hello,

the error results from the %0D%0A in the url. %0D%0A encodes a line break. Remove it and it should work.

fin swimmer

ADD REPLYlink written 23 months ago by finswimmer11k

O.K. thanks. How can I ensure That it wont repet this error? After all I downloaded these transcripts from ensembl. So I don't get why the url is wrong.

ADD REPLYlink written 23 months ago by elisheva70

O.K. thanks. How can I ensure That it wont repet this error?

It depends on how you generate the url. Without knowing it, it is quite hard to help.

ADD REPLYlink written 23 months ago by finswimmer11k

Sorry for all this mess, but now I see that the error was this:

Traceback (most recent call last):
  File "introns.py", line 44, in <module>
    main(line)
  File "introns.py", line 15, in main
    r.raise_for_status()
  File "/usr/lib/python2.7/dist-packages/requests/models.py", line 844, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 504 Server Error: Gateway Time-out for url: http://rest.ensembl.org/sequence/id/ENST00000529724.1%0A?mask_feature=1
ADD REPLYlink written 23 months ago by elisheva70

Gateway Time-Out normaly means something take to long. Just retry it. I cannont reproduce it here this time, even if your linke still contains characters that should not be there (%0A).

Again the question: How do yoi create these urls?

fin swimmer

ADD REPLYlink written 23 months ago by finswimmer11k

I copied the transcripts to a new file and now it works.
I have no idea why it didn't work before....
As for your question I created these urls using some commands - (as you wrote above) on cDNAs fasta file to extract the transcripts ids.
It worked fine till this specific transcript. I don't know why..........

ADD REPLYlink written 23 months ago by elisheva70

Intron needs to be defined according to the gene boundary the cdna falls in, so you would need the gene boundary along with the cdna coordinates

ADD REPLYlink written 23 months ago by Rohit1.3k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 546 users visited in the last hour