Question

HISAT2 index creation

0

Entering edit mode

6.4 years ago

devbt15 ▴ 30

Dear all, I am trying to create index file using the splice site and exon information from the .gtf file. However, using the hisat2 python commands in Ubuntu, I get empty output files. I verified hisat2 command with Saccharomyces gtf file and it produces respective files properly. In conclusion, I concluded that my .gtf file may have some format differences with the Saccharomyces gtf file. So I compared both but it seems correct in format too.

Here is a preview of my .gff3 file:

chloro  .   exon    5158    6606    .   -   .   ID=Ljchlorog3v0000040.1.exon.1;Parent=Ljchlorog3v0000040.1;sequencetype=Protein coding
chloro  .   CDS 5158    6585    .   -   0   ID=Ljchlorog3v0000040.1.CDS.1;Parent=Ljchlorog3v0000040.1;sequencetype=Protein coding

I feel that the description in the last column may have something to do with the splice site information extraction. However, for splice site info, exon number description in the last column maybe needed. I wonder if instead of exon.1 in the last column, I need to state it as "exon number = 1", for the splice site information extraction using the hisat2 python command.

FYI: I extracted the exons file by simply selecting exon rows and subtracting 1 from the positions in column 4 & 5.

Please help. Regards.

RNA-Seq hisat2 splice site index • 3.0k views

ADD COMMENT • link 6.4 years ago by devbt15 ▴ 30

2

Entering edit mode

6.4 years ago

Istvan Albert 100k

While you seem to want to use the types interchangeably the unfortunate reality is that a GTF2 file is only remotely similar to a GFF3 file. The way the transcripts are encoded is different. Hence, unless the program was written to handle both you can't substitute one for the other.

A GTF2 file will have a tag called transcript_id that connects exons to transcripts. In a GFF3 file the parent_id is supposed to connect transcripts. It appears that the hisat2_extract_splice_sites.py program (I suspect is the one you are asking about though never explicitly mention) requires a GTF2 format and will not work with GFF3.

The solution is to convert your GFF3 to GFT2

http://mblab.wustl.edu/GTF22.html

https://github.com/infphilo/hisat2/blob/master/hisat2_extract_splice_sites.py

http://gmod.org/wiki/GFF3#GFF3_Format

ADD COMMENT • link 6.4 years ago by Istvan Albert 100k

0

Entering edit mode

Dear Istavan, Sure, the problem is the non-availability of .gtf files in our organism database and so I had to get it converted from .gff3 to .gtf. I have already tried this .gtf file with the usegalaxy.org server for RNA-Seq and it seems to work properly.

**Here is a preview of my .gtf file

chloro  .   CDS 14009   14228   .   +   0   transcript_id "Ljchlorog3v0000100.1"; gene_id "Ljchlorog3v0000100"; gene_name "AV774975.path1";

And, I have already tried this .gtf file as well with the python script and does not seem to work (sorry for not mentioning it in the original post). It still returns a blank output. The same command, however, works properly for the Saccharomyces .gtf file. Does it need to mention the exon number in the last column? So I am lost as to why the python script is not able to extract splice site and exon information from thisfile while it works nicely for the yeast one. In the end, for the exon file output, the exon positions are mentioned properly in this file as well.

Thank you and regards, Debatosh Das.

ADD REPLY • link updated 6.4 years ago by Istvan Albert 100k • written 6.4 years ago by devbt15 ▴ 30

0

Entering edit mode

The script here

https://github.com/infphilo/hisat2/blob/master/hisat2_extract_splice_sites.py

is a fairly simple line-oriented processing, you should be able to read through it and see what it does. It appears that it only requires the exon to be in the type then transcript_id and gene_id to be listed.

Add a few print() commands here and there and you can follow through what the tool does. As far a tool troubleshooting goes this is an easier case.

It is likely that the junctions variable is empty at line 88 and that's why you don't get any output. Now why that is empty, is anybody's guess. No one can really troubleshoot this without seeing your entire file or at least the first few columns that do list a full transcript.

ADD REPLY • link 6.4 years ago by Istvan Albert 100k

0

Entering edit mode

Dear Istvan, 1) Thank you for replying so promptly. I will surely try print commands to see if it yields something... 2) I did not get your suggestion of checking junctions variable at line 88? You mean splice junctions? 2) As for the file contents, I will try to paste first few columns of Lotus japonicus.gtf (converted from .gff3 downloaded from LOTUS BASE website) and Saccharomyces.gtf (downloaded from ENSEMBL website).

Preview of Lotus japonicus.gtf (column-wise contents, each row has been separated by space while pasting):

chloro  .   exon    519 1371    100 -   .   transcript_id "Ljchlorog3v0000020.1"; gene_id "Ljchlorog3v0000020"; gene_name "NODE_65561_length_799_cov_134.377975.path1";    
chloro  .   CDS 520 1365    .   -   0   transcript_id "Ljchlorog3v0000020.1"; gene_id "Ljchlorog3v0000020"; gene_name "NODE_65561_length_799_cov_134.377975.path1";    
chloro  .   exon    1323    1431    99  +   .   transcript_id "Ljchlorog3v0000030.1"; gene_id "Ljchlorog3v0000030"; gene_name "NODE_120091_length_55_cov_10.472727.path1";    
chloro  .   exon    5158    6606    .   -   .   transcript_id "Ljchlorog3v0000040.1"; gene_id "Ljchlorog3v0000040"; gene_name "gene.g67805.t1";    
chloro  .   CDS 5158    6585    .   -   0   transcript_id "Ljchlorog3v0000040.1"; gene_id "Ljchlorog3v0000040"; gene_name "gene.g67805.t1";
chloro  .   exon    6864    7201    100 -   .   transcript_id "Ljchlorog3v0000050.1"; gene_id "Ljchlorog3v0000050"; gene_name "AV412967.path1";    
chloro  .   CDS 6937    7083    .   -   0   transcript_id "Ljchlorog3v0000050.1"; gene_id "Ljchlorog3v0000050"; gene_name "AV412967.path1";    
chloro  .   exon    7413    9107    100 +   .   transcript_id "Ljchlorog3v0000060.1"; gene_id "Ljchlorog3v0000060"; gene_name "NODE_22089_length_1641_cov_77.516151.path1";

Preview of Saccharomyces cerevisiae .gtf (column-wise contents, each row has been separated by space while pasting):

IV  SGD gene    1802    2953    .   +   .   gene_id "YDL248W"; gene_name "COS7"; gene_source "SGD"; gene_biotype "protein_coding";
IV  SGD transcript  1802    2953    .   +   .   gene_id "YDL248W"; transcript_id "YDL248W"; gene_name "COS7"; gene_source "SGD"; gene_biotype "protein_coding"; transcript_name "COS7"; transcript_source "SGD"; transcript_biotype "protein_coding";
IV  SGD exon    1802    2953    .   +   .   gene_id "YDL248W"; transcript_id "YDL248W"; exon_number "1"; gene_name "COS7"; gene_source "SGD"; gene_biotype "protein_coding"; transcript_name "COS7"; transcript_source "SGD"; transcript_biotype "protein_coding"; exon_id "YDL248W.1";
IV  SGD CDS 1802    2950    .   +   0   gene_id "YDL248W"; transcript_id "YDL248W"; exon_number "1"; gene_name "COS7"; gene_source "SGD"; gene_biotype "protein_coding"; transcript_name "COS7"; transcript_source "SGD"; transcript_biotype "protein_coding"; protein_id "YDL248W"; protein_version "1";
IV  SGD start_codon 1802    1804    .   +   0   gene_id "YDL248W"; transcript_id "YDL248W"; exon_number "1"; gene_name "COS7"; gene_source "SGD"; gene_biotype "protein_coding"; transcript_name "COS7"; transcript_source "SGD"; transcript_biotype "protein_coding";
IV  SGD stop_codon  2951    2953    .   +   0   gene_id "YDL248W"; transcript_id "YDL248W"; exon_number "1"; gene_name "COS7"; gene_source "SGD"; gene_biotype "protein_coding"; transcript_name "COS7"; transcript_source "SGD"; transcript_biotype "protein_coding";
IV  SGD gene    3762    3836    .   +   .   gene_id "YDL247W-A"; gene_source "SGD"; gene_biotype "protein_coding";
IV  SGD transcript  3762    3836    .   +   .   gene_id "YDL247W-A"; transcript_id "YDL247W-A"; gene_source "SGD"; gene_biotype "protein_coding"; transcript_source "SGD"; transcript_biotype "protein_coding";

Both files seem to have the same contents and structured in a similar way. What I don't understand is using the python script which you have also mentioned in your reply above, I get a properly written output for the yeast .gtf but not for Lotus one?

Thank you. Regards, Debatosh Das.

ADD REPLY • link updated 6.4 years ago by Istvan Albert 100k • written 6.4 years ago by devbt15 ▴ 30

score 3 · Accepted Answer · 2017-11-22

3

Entering edit mode

6.4 years ago

Istvan Albert 100k

In your example, there are no splice junctions.

Every exon is labeled as a distinct transcript id. Hence the script cannot produce splice junctions.

I changed a few transcripts to have the same id, and now the script does produce splice sites on your data.

ADD COMMENT • link 6.4 years ago by Istvan Albert 100k