Hello,
I'm trying to build an index for a reference genome that NCBI provides as both a .gff and a .fasta file, but hisat2-build accepts only one file format. How can I combine the two files to create a single, complete reference index?
That does not really make sense. The FASTA file provides the actual DNA sequence, while the annotation file lists the positions of genomic elements such as exons, transcripts and coding sequences. One typically uses the GTF to extract splice sites, e.g. with hisat2_extract_splice_sites.py. What is the aim of your analysis?
If you want to incorporate the annotation into the index, you have to use the --ss and --exon options of hisat2-build.
--ss <path>
    Note this option should be used with the following --exon option.
    Provide a list of splice sites (in the HISAT2's own format) as follows (four columns).
        chromosome name <tab> zero-offset based genomic position of the flanking base on the left side of an intron <tab> zero-offset based genomic position of the flanking base on the right <tab> strand
    Use hisat2_extract_splice_sites.py (in the HISAT2 package) to extract splice sites from a GTF file.

--exon <path>
    Note this option should be used with the above --ss option.
    Provide a list of exons (in the HISAT2's own format) as follows (three columns).
        chromosome name <tab> zero-offset based left genomic position of an exon <tab> zero-offset based right genomic position of an exon
    Use hisat2_extract_exons.py (in the HISAT2 package) to extract exons from a GTF file.
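
To make the two column layouts above concrete, here is a minimal Python sketch of how GTF exon records (1-based, inclusive coordinates) map onto them. It is only an illustration, not the code of the HISAT2 scripts: the function names and the toy GTF lines are made up for this example, and it assumes the right exon coordinate is written zero-based and inclusive. It also skips details the real scripts may handle, such as overlapping exons within a transcript. For actual work, use the scripts shipped with HISAT2.

```python
from collections import defaultdict


def read_exons(gtf_lines):
    """Group exon records by transcript_id (hypothetical helper, for illustration)."""
    transcripts = defaultdict(list)
    for line in gtf_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        chrom, strand = fields[0], fields[6]
        start, end = int(fields[3]), int(fields[4])  # GTF: 1-based, inclusive
        # Parse the 'key "value";' attribute pairs in column 9.
        attrs = {}
        for item in fields[8].split(";"):
            key, _, value = item.strip().partition(" ")
            if key:
                attrs[key] = value.strip('"')
        if "transcript_id" in attrs:
            transcripts[attrs["transcript_id"]].append((chrom, strand, start, end))
    return transcripts


def splice_sites(gtf_lines):
    """Four columns: chromosome, left flanking base, right flanking base, strand."""
    sites = set()
    for exons in read_exons(gtf_lines).values():
        exons.sort(key=lambda e: e[2])
        for prev, nxt in zip(exons, exons[1:]):
            # Last base of the upstream exon and first base of the
            # downstream exon, both shifted to zero-based positions.
            sites.add((prev[0], prev[3] - 1, nxt[2] - 1, prev[1]))
    return sorted(sites)


def exon_intervals(gtf_lines):
    """Three columns: chromosome, zero-based left position, zero-based right position."""
    intervals = set()
    for exons in read_exons(gtf_lines).values():
        for chrom, _strand, start, end in exons:
            intervals.add((chrom, start - 1, end - 1))
    return sorted(intervals)


# Toy annotation: one transcript with two exons (invented data).
example_gtf = [
    'chr1\ttoy\texon\t100\t200\t.\t+\t.\tgene_id "g1"; transcript_id "t1";',
    'chr1\ttoy\texon\t300\t400\t.\t+\t.\tgene_id "g1"; transcript_id "t1";',
]

for row in splice_sites(example_gtf):
    print("\t".join(map(str, row)))
for row in exon_intervals(example_gtf):
    print("\t".join(map(str, row)))
```

In this toy case the intron between the two exons is flanked by GTF positions 200 and 300, which become 199 and 299 once shifted to zero-based coordinates.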
You may need to convert the GFF to GTF to use these scripts, though.
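
Putting the steps from this thread together, the workflow could look roughly like the sketch below. It only assembles and prints the command lines; it does not run anything. The file names (genome.gff, genome.gtf, genome.fasta, genome_index) are placeholders, and the conversion step assumes gffread, a separate tool that is not mentioned in this thread and is not part of HISAT2. Any other GFF-to-GTF converter would do, so treat that line as one possible choice and check the tool's own help before relying on it.

```python
import shlex


def build_index_commands(gff, fasta, index_base):
    """Return (argv, stdout_file) pairs for each step; nothing is executed."""
    gtf = index_base + ".gtf"
    ss = index_base + ".ss"
    exon = index_base + ".exon"
    return [
        # Assumption: gffread is installed; -T requests GTF output.
        (["gffread", gff, "-T", "-o", gtf], None),
        # The HISAT2 helper scripts write to standard output.
        (["hisat2_extract_splice_sites.py", gtf], ss),
        (["hisat2_extract_exons.py", gtf], exon),
        # --ss and --exon are used together, as the manual text says.
        (["hisat2-build", "--ss", ss, "--exon", exon, fasta, index_base], None),
    ]


def as_shell_line(argv, stdout_file):
    """Render one step as a shell command line with safe quoting."""
    line = " ".join(shlex.quote(arg) for arg in argv)
    if stdout_file is not None:
        line += " > " + shlex.quote(stdout_file)
    return line


commands = build_index_commands("genome.gff", "genome.fasta", "genome_index")
for argv, stdout_file in commands:
    print(as_shell_line(argv, stdout_file))
```

If the annotation is already available as GTF, the first step can simply be dropped.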