Question: How to Run the Script for Extracting Splice Sites from the Gene Annotation File
0
gravatar for LP
18 months ago by
LP10
South Africa
LP10 wrote:

Hi Biostars Community,

I struggle with running the script extract_splice_sites.py Sbicolor_313_v3.1.gene.gtf > Sbicolor_313_v3.1.splicesites.txt) for extracting the splice sites from my gene annotation file. I was initially advidsed to convert my gene annotation file (Sbicolor_313_v3.1.gff3) to gtf format using gffread script, but still the script did not run. I really have no idea where the problem lies in the script. Please help!

Another problem I have is how to view the output file of HISAT. When I ran HISAT (hisat2 -p 8 -x Sbicolor_313_v3.0.index -1 RSL200mMMn0_R1_paired.fastq -2 RSL200mMmM0_R2_paired.fastq -S RSL200mMMn0.sam | samtools -bS -> RSL200mMMn0.bam) directly after creating the index files from my sorghum (Sbicolor_313_v3.0) genome, I was unable to view or locate my HISAT output file. How does one view the HISAT output file? Is there a command/script for viewing HISAT output files to see how many reads mapped?

I am still new in this field - thanks. Lekgolwa

ADD COMMENTlink modified 18 months ago • written 18 months ago by LP10

Where did you get extract_splice_sites.py? Do you get an error when you run it? What does your command look like?

Is -> how you're trying to create the output file in your HISAT command? If you have a bam file you can view it with samtools view or IGV.

ADD REPLYlink written 18 months ago by st.ph.n2.4k
1
gravatar for Renesh
18 months ago by
Renesh1.6k
United States
Renesh1.6k wrote:

For your first question, the extract_splice_sites.py looks for the term gene_id or transcript_id for exons in attribute section (last description column) of off or gtf file. You need to check the file if these terms are present in attribute section. Generally, gff3 file from Phytozome database has ID and Parent term instead of gene_id and transcript_id. Therefore, if you run the code, you will not get any output.

Developers of HISAT need to fix this issue and should include ID and Parent terms for extracting splice sites.

Alternative,

You can download gtf file for S. biocolor from EnsemblPlants database ftp://ftp.ensemblgenomes.org/pub/release-37/plants/gtf/sorghum_bicolor/

ADD COMMENTlink modified 18 months ago • written 18 months ago by Renesh1.6k

This is technically an issue with a malformed GTF due to the GFF-->GTF conversion, rather than with the script per se - the GTF format is defined here, and should always include gene_id and transcript_id (it should also end each line with a ';' character).

It'll be simplest to just replace the ID and Parent definitions with sed: sed 's/ID=/transcript_id /;s/Parent=/gene_id /' yourfile > yournewfile

If the lines don't end with ';' then also: sed 's/$/;/' somefile > properlyformatted

ADD REPLYlink written 18 months ago by george.ry1.1k

Thanks for your input @ george.ry. So how does sed work? Does it mean direct subtitution of "_ID" with "_id". I was intending to just manually replace "_ID" with "_id" hoping the implemented changes would enable the script to work. I haven't done that yet. So I am really not sure what to do right now, because even the gene annotation file with the gtf format downloaded from Ensembl plants is old version. The

ADD REPLYlink written 18 months ago by LP10

sed is a stream editor. It is replacing pattern delimited in first set of / with the one in second (e.g. ID= is being replaced by transcript_id in example above). This should probably be done as sed 's/ID=/transcript_id /g;s/Parent=/gene_id /g' yourfile > yournewfile (note the g at the end of those patterns) to replace all instances.

ADD REPLYlink modified 18 months ago • written 18 months ago by genomax65k

Thanks genomax for your explanation of what "sed" is. So let me try to use it and see what happens.

ADD REPLYlink written 18 months ago by LP10
1
gravatar for LP
18 months ago by
LP10
South Africa
LP10 wrote:

Thanks guys for your answers - very much appreciated.

st.ph.n - to answer your question as to where I got the extract_splice_sites.py, I would say the script comes with the HISAT package. I think it basically helps to improve your mapping. One has to perfom this step although in my opinion performing this step is not a prerequisite, but I might be wrong. I wish I could go further to explain how it works, but I do not want to say this that I am not sure about - I am still new in this field of bioinformatics. I am just using bioinformatics tools to understand molecular biology at another level or from a different perspective. Thanks st.ph.n again for answering my other question about how to view the ouput file of hisat. Viewing the ouputfile is not just the only problem I have. I still struggle with production of the output file itself after running the script. In other words I still need to be able to instruct the script to produce the output file from which I can then be able to use IGV or samtools to view the mapped reads. So I really don't know how to deal with this (first) problem. Why is my script not producing the output file (please see my very first post).

To Renesh, thanks so much for answering the question that I never hoped I could overcome one day. So I will look at my gene annotation file (.gff3) to see if things are in order. If not, I will then have to change the "ID" to "id" and so on, and hopefully the script will run. If however the script will not run, I then drop the use of my .gff3 file downloaded from phytozome in favour of the .gtf file from EnsemblPlants database. So I have plenty of options and it's all thanks to you.

Just one more thing; what do I have to do in order to receive posts directly from my registerd e-mail so that I can make follow-ups almost immediately. I was lucky to come across your reply by just clicking on "messages". Thanks once again.

ADD COMMENTlink written 18 months ago by LP10

You can follow the posts via emails to keep up to date.

ADD REPLYlink written 18 months ago by Renesh1.6k

Lekgolwa : Any top level posts that you create should be automatically set to "Follow via email" so you are notified when comments/answers are posted in that thread. If this post is not showing up as that then you can try selecting that option in the drop-down at the bottom of your original post.

ADD REPLYlink modified 18 months ago • written 18 months ago by genomax65k

Thanks Renesh and genomax - very much appreciated!

ADD REPLYlink written 18 months ago by LP10
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1415 users visited in the last hour