Hello,
I have a .gtf.gz file that I want to use in Python code. Does the pysam module require an indexed file for a .gtf.gz?
If so, how can I index that file? Thank you in advance.
I have solved the question!
First, sort your GTF file (a freshly downloaded one is usually unsorted) and compress it with bgzip. You can do both like this:
(grep ^"#" in.gtf; grep -v ^"#" in.gtf | sort -k1,1 -k4,4n) | bgzip > in.sorted.gtf.gz
Then create the .tbi index for the sorted .gtf.gz with a command like this:
tabix -p gff in.sorted.gtf.gz
You can try it.
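If you would rather do the sorting step from Python, here is a minimal sketch of the same idea: comment lines first, then records ordered by sequence name (column 1) and numeric start (column 4). The function name sort_gtf_lines and the example records are made up for illustration, and the ordering of sequence names may differ slightly from what sort gives under your locale. The result still has to be compressed with bgzip before tabix can index it, because plain gzip output is not accepted.

```python
def sort_gtf_lines(lines):
    """Return comment lines first, then records ordered by seqname and start."""
    lines = [ln for ln in lines if ln.strip()]  # drop blank lines
    headers = [ln for ln in lines if ln.startswith("#")]
    records = [ln for ln in lines if not ln.startswith("#")]

    def key(line):
        fields = line.split("\t")
        # Column 1 is the sequence name, column 4 the start position.
        return (fields[0], int(fields[3]))

    return headers + sorted(records, key=key)


# Tiny made-up example (hypothetical records), only to show the resulting order.
example = [
    "#!genome-build GRCh38\n",
    'chr2\tsrc\tgene\t500\t900\t.\t+\t.\tgene_id "g3";\n',
    'chr1\tsrc\tgene\t2000\t3000\t.\t-\t.\tgene_id "g2";\n',
    'chr1\tsrc\tgene\t100\t400\t.\t+\t.\tgene_id "g1";\n',
]
sorted_lines = sort_gtf_lines(example)
```

For a real file you would read the lines from in.gtf, write the sorted lines to a new file, and then run bgzip and tabix on it as above.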
In addition: tabix can be used for indexing and querying any tab-separated data that has one column with a sequence name and one with a position, and that is sorted by these two columns. Use the options -s, -b and, optionally, -e during indexing to define which columns hold the name, the start and, optionally, the end position.
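As a rough sketch of how the -s, -b and -e options fit together when you call tabix from Python, the helper below only assembles the command line and hands it to subprocess. The function names and the file name calls.tsv.gz are hypothetical; the flags are the ones listed in the tabix usage text, and the column numbers are 1-based as on the command line. It assumes tabix is on your PATH and that the input is already sorted and bgzip-compressed.

```python
import subprocess


def tabix_index_command(path, seq_col, start_col, end_col=None, force=False):
    """Build the tabix argument list for a generic tab-separated file."""
    cmd = ["tabix", "-s", str(seq_col), "-b", str(start_col)]
    if end_col is not None:
        # -e is optional; tabix allows it to be the same column as -b.
        cmd += ["-e", str(end_col)]
    if force:
        cmd.append("-f")  # overwrite an existing index
    cmd.append(path)
    return cmd


def run_tabix_index(path, seq_col, start_col, end_col=None, force=False):
    """Run tabix; raises CalledProcessError if tabix exits with an error."""
    subprocess.run(
        tabix_index_command(path, seq_col, start_col, end_col, force),
        check=True,
    )
```

For a GTF, tabix_index_command("in.sorted.gtf.gz", 1, 4, 5) spells out the same columns that the gff preset uses by default.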
This fails on RefSeq GTFs from NCBI.
You may also need to set the field delimiter explicitly and sort on the end position as well. I've had better luck with the -V (natural version sort) ordering on chromosome names, but I think that is a matter of personal preference: it depends on whether you want alt contigs interleaved with the primary chromosomes or placed at the end.
(grep ^"#" in.gtf; grep -v ^"#" in.gtf | sort -t $'\t' -k1,1V -k4,4n -k5,5n) | bgzip > in.sorted.gtf.gz
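To see what a natural ordering of chromosome names looks like compared with a plain text sort, here is a small Python sketch. The natural_key helper is my own illustration and only approximates what sort -V does; it is not a reimplementation of it. The contig names are just examples.

```python
import re


def natural_key(name):
    """Split a name into text and number chunks so digits compare numerically."""
    parts = re.split(r"(\d+)", name)
    # re.split with a capturing group puts the digit runs at odd indices.
    return [int(p) if i % 2 else p for i, p in enumerate(parts)]


contigs = ["chr10", "chr2", "chr1", "chrX", "chr1_KI270706v1_random"]

plain_order = sorted(contigs)                     # character-by-character order
natural_order = sorted(contigs, key=natural_key)  # numeric-aware order
```

With this key the alt contig ends up right after chr1, that is, interleaved with the primary chromosomes, and chr2 comes before chr10. In the plain order chr10 comes before chr2.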
Usage: tabix <in.tab.bgz> [region1 [region2 [...]]]
Options: -p STR preset: gff, bed, sam, vcf, psltbl [gff]
-s INT sequence name column [1]
-b INT start column [4]
-e INT end column; can be identical to '-b' [5]
-S INT skip first INT lines [0]
-c CHAR symbol for comment/meta lines [#]
-r FILE replace the header with the content of FILE [null]
-B region1 is a BED file (entire file will be read)
-0 zero-based coordinate
-h print also the header lines
-H print only the header lines
-l list chromosome names
-f force to overwrite the index
Yes, that was my first guess. But judging from this, it does not take GTF as input.
Interesting !
Because their usage line does not show that it works with GTF:
tabix [-0lf] [-p gff|bed|sam|vcf] [-s seqCol] [-b begCol] [-e endCol] [-S lineSkip] [-c metaChar] in.tab.bgz [region1 [region2 [...]]
I tried using it anyway:
$ tabix -p gtf human.gtf.gz
[main] unrecognized preset 'gtf'
does not work for me
https://meta.stackexchange.com/questions/147616/what-do-you-mean-it-doesnt-work
Another option:
Yes, you are right! And you have given a better introduction to this data format!
Please use ADD REPLY instead of the answer field.