Question

Hisat2 splice sites extract blank files

0

Entering edit mode

6.1 years ago

lucas.caue.jacintho ▴ 30

Hi everybody,

I was trying to extract splices sites of my -gtf file but I had no success with neither "hisat2_splice_sites.py" nor "awk" command. For both options I just had a blank file as output. My final goal is use the splice sites forhisat2-build --ss --exon. For some reason, I could do that with human gtf files but HIV. I tried to download several files of many data site such as NCBI, UCSC genome Browser, Esembl and ENA. Someone could help with another data site for retrieve gff/gtf files, or even how to make a new one by myself? OBs: I downloaded gff files and then converted with gffread command

Thank you guys in advance!

assembly alignment • 3.5k views

ADD COMMENT • link updated 2 hours ago by Ales ▴ 50 • written 6.1 years ago by lucas.caue.jacintho ▴ 30

0

Entering edit mode

Could you provide one example of a gtf that doesn't work? And the commands you used?

ADD REPLY • link 6.1 years ago by h.mon 35k

0

Entering edit mode

Hi! Sure.! I used:

gffread -E hiv.gff -T  -o hiv.gtf #Conversion.                  

hisat2_extract_splice_sites.py hiv.gtf > splice_sites.txt     #Extract

These same commands worked perfectly with other human gff/gtf files and. I got that "splice_sitex.txt" as a blank file ( Zero bytes)

Awk Commands:

awk '{if ($3=="exon") {print $1"\t"$4-1"\t"$5-1}}' hiv.gtf > exonsFile.txt.        
awk '{if ($3=="intron") {print $1"\t"$4-2"\t"$5}}' hiv.gtf > ssFile.txt

About gtf, I tested tons of them. Sending some of them below:

https://www.ncbi.nlm.nih.gov/nuccore/KY112585.1

https://www.ncbi.nlm.nih.gov/genome/genomes/10319? (this one contains I list of possibles assemblies)

Any insight about this issue is welcome!

ADD REPLY • link updated 6.1 years ago by h.mon 35k • written 6.1 years ago by lucas.caue.jacintho ▴ 30

0

Entering edit mode

hisat2_extract...py scripts make several assumptions about the input GTF files. For example, they require gene_id field to be present for all exon records. However, GTF format is not very strict and "gene_id" is not a required field for the exon/CDS records. Unless the attribute is already present in the input, methods like gffread will not add "gene_id" to exons and CDS records. And for the HIV files such as those you retrieve from genbank, you will have to either manually add gene_id attribute for each exon/CDS to match the parent transcript or write a simple script to automate the task.

Also, the hisat2 scripts require presence of actual "exon" records in the file, but since HIV annotations only include CDS records, exons, for some reason, are omitted and have to be re-introduced. Thankfully, gffread does help remedy this and several other standardization issues. To make sure there are no issues with downstream analysis, I like to run the following command on pretty much all of my GTF/GFF files (not just HIV):

gffread --force-exons -T -F -o <output.gtf> <input.gff/gtf>

In short: run gffread to standardize, and then make sure each exon record has gene_id attribute. Then hisat2 extraction scripts should work.

On a side note, I already replied in a thread below, but HIV genome annotations are very limited and those you find on LANL genbank only contain CDS records for the 9 ORFs with a single intron in the tat/rev region. You can use our new resource: https://ccb.jhu.edu/HIV_Atlas/ to obtain comprehensive reference-grade annotations of all major splice sites and alternatively spliced transcripts. You will still need to run gffread on them (for now), but you will have a complete list of all splicing for thousands of HIV/SIV isolates out there allowing you to analyze based on a less mutationally distant genome assembly.

Hope this helps

ADD REPLY • link 2 hours ago by Ales ▴ 50

score 0 · Answer 1 · 2019-08-16

0

Entering edit mode

6.1 years ago

h.mon 35k

The annotation you are using is from a virus, which has an extremely packed genome, and contains no introns. Thus, the ssFile.txt will be empty.

ADD COMMENT • link 6.1 years ago by h.mon 35k

0

Entering edit mode

That's fair. So what is the solution? use a txt file made by myself with --known-splicesite-infile <path> ? Because, I think I am not supposed to have novel splice junctions or new junctions annotations if I do not provide any file like those ones.

ADD REPLY • link 6.1 years ago by lucas.caue.jacintho ▴ 30

0

Entering edit mode

What, are you trying to do though? If you believe there are novel splice sites that the annotation file has misssed then you dont find those novel splice sites by passing in the annotation file..

If you dont care about novel splice sites and want to boost the accuracy using the annotation file, then if the annotation file says there are no introns then therefore there will be an empty file thats not something you solve, that is not an error it means the virus as we know it has no introns and therefore cannot splice them out...

ADD REPLY • link 16 months ago by a1ultima ▴ 890

0

Entering edit mode

Interestingly, HIV/SIV specifically does have quite an elaborate splicing system which regulates translation. However, only one of the splice junctions (D4-A7) is contained within an ORF (tat/rev) and the rest of them are in the UTRs. And all available transcriptome annotations for HIV/SIV (such as Genbank or LANL) only contain the ORFs and not all tens (or even hundreds by some accounts) of alternative isoforms documented for the virus but available only as figures and tables in manuscripts.

To address this specific issue we have recently completed reference-grade annotation for several thousand HIV-1 genomes including the HXB2 (K03455.1), 89.6 and NL4-3 reference genomes: https://ccb.jhu.edu/HIV_Atlas/. Each annotation features full set of US, PS and FS messages along with protein assignment and major donor and acceptor sites. The annotations are provided in GTF and GFF formats, you can browse them directly on the web interface via integrated JBrowse2 and use them with any transcriptomic utilities you would use for human or other genomes (assembly, quantification, gene/tx expression, etc).

This project started from my personal need to improve spliced alignment with HISAT2/STAR and minimap around splice sites and to be able to compute transcript and junction expression effectively. Eventually, this took me down the rabbit hole of creating several methods for annotation transfer in HIV and annotation of thousands of LANL complete genome assemblies. You can also read more about the resource in our preprint: https://www.biorxiv.org/content/10.1101/2025.09.24.675449v1 Hope this helps!

ADD REPLY • link 2 hours ago by Ales ▴ 50