Hi everybody,
I was trying to extract splices sites of my -gtf file but I had no success with neither "hisat2_splice_sites.py" nor "awk" command. For both options I just had a blank file as output. My final goal is use the splice sites forhisat2-build --ss --exon. For some reason, I could do that with human gtf files but HIV. I tried to download several files of many data site such as NCBI, UCSC genome Browser, Esembl and ENA. Someone could help with another data site for retrieve gff/gtf files, or even how to make a new one by myself? OBs: I downloaded gff files and then converted with gffread command
Thank you guys in advance!
Could you provide one example of a gtf that doesn't work? And the commands you used?
Hi! Sure.! I used:
These same commands worked perfectly with other human gff/gtf files and. I got that "splice_sitex.txt" as a blank file ( Zero bytes)
Awk Commands:
About gtf, I tested tons of them. Sending some of them below:
https://www.ncbi.nlm.nih.gov/nuccore/KY112585.1
https://www.ncbi.nlm.nih.gov/genome/genomes/10319? (this one contains I list of possibles assemblies)
Any insight about this issue is welcome!
hisat2_extract...py scripts make several assumptions about the input GTF files. For example, they require gene_id field to be present for all exon records. However, GTF format is not very strict and "gene_id" is not a required field for the exon/CDS records. Unless the attribute is already present in the input, methods like gffread will not add "gene_id" to exons and CDS records. And for the HIV files such as those you retrieve from genbank, you will have to either manually add gene_id attribute for each exon/CDS to match the parent transcript or write a simple script to automate the task.
Also, the hisat2 scripts require presence of actual "exon" records in the file, but since HIV annotations only include CDS records, exons, for some reason, are omitted and have to be re-introduced. Thankfully, gffread does help remedy this and several other standardization issues. To make sure there are no issues with downstream analysis, I like to run the following command on pretty much all of my GTF/GFF files (not just HIV):
gffread --force-exons -T -F -o <output.gtf> <input.gff/gtf>
In short: run gffread to standardize, and then make sure each exon record has gene_id attribute. Then hisat2 extraction scripts should work.
On a side note, I already replied in a thread below, but HIV genome annotations are very limited and those you find on LANL genbank only contain CDS records for the 9 ORFs with a single intron in the tat/rev region. You can use our new resource: https://ccb.jhu.edu/HIV_Atlas/ to obtain comprehensive reference-grade annotations of all major splice sites and alternatively spliced transcripts. You will still need to run gffread on them (for now), but you will have a complete list of all splicing for thousands of HIV/SIV isolates out there allowing you to analyze based on a less mutationally distant genome assembly.
Hope this helps