How to make a .tbi index for a .gtf.gz file?
6.3 years ago
KVC_bioinfo ▴ 590

Hello,

I have a .gtf.gz file that I am going to use in Python code. The pysam module requires an index for the .gtf.gz file.

How can I index that file? Thank you in advance.

tbi index • 12k views

Another option:

$ gunzip -c foo.gtf.gz | gtf2bed - | bgzip -c > foo.bed.gz
$ tabix -p bed foo.bed.gz

Yes, you are right! And you have given a better introduction to this data format!


Please use ADD REPLY instead of the answer field.

5.2 years ago
Caizhaoqing ▴ 30

I have solved the question!

First, sort your GTF file (a freshly downloaded file is usually unsorted). You can sort it like this:

 (grep ^"#" in.gtf; grep -v ^"#" in.gtf | sort -k1,1 -k4,4n) | bgzip  > in.sorted.gtf.gz

Then you can create the .tbi index by running tabix on the sorted, bgzip-compressed file, for example: tabix -p gff in.sorted.gtf.gz.

You can try it.
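
Since the question is about pysam, the same two steps can also be sketched in Python. This is a minimal sketch, not the method used above: the helper names `sort_gtf_lines` and `sort_compress_index` are made up for illustration, the whole GTF is assumed to fit in memory, and sequence names are sorted in plain string order rather than in the shell's locale order. pysam is imported lazily so the sorting helper can be tried without it installed.

```python
import os


def sort_gtf_lines(lines):
    """Header lines ('#') first, then records sorted by seqname and numeric start."""
    header = [ln for ln in lines if ln.startswith("#")]
    records = [ln for ln in lines if ln.strip() and not ln.startswith("#")]

    def key(ln):
        fields = ln.split("\t")
        return fields[0], int(fields[3])  # column 1 = seqname, column 4 = start

    return header + sorted(records, key=key)


def sort_compress_index(in_gtf, out_gz):
    """Hypothetical helper: write a sorted, bgzip-compressed GTF plus its .tbi index."""
    import pysam  # third-party; only needed for compression and indexing

    with open(in_gtf) as fh:
        lines = sort_gtf_lines(fh.readlines())

    tmp = out_gz + ".tmp"  # uncompressed intermediate file
    with open(tmp, "w") as out:
        # make sure every line is newline-terminated after reordering
        out.writelines(ln if ln.endswith("\n") else ln + "\n" for ln in lines)

    pysam.tabix_compress(tmp, out_gz, force=True)        # bgzip, not plain gzip
    pysam.tabix_index(out_gz, preset="gff", force=True)  # writes out_gz + ".tbi"
    os.remove(tmp)
    return out_gz + ".tbi"
```

Calling `sort_compress_index("in.gtf", "in.sorted.gtf.gz")` should leave `in.sorted.gtf.gz` and `in.sorted.gtf.gz.tbi` next to each other, which is the layout pysam looks for by default.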


you are the best


In addition: tabix can index and query any tab-separated data that has a column with a sequence name and a column with a numeric position, as long as the file is sorted by these two columns. Use the options -s, -b and, optionally, -e during indexing to define which columns hold the name, the start and, optionally, the end position.
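
To make the sorting requirement concrete, here is a small Python check, offered as a sketch under assumptions rather than a copy of tabix's own logic. The function name `is_tabix_sorted` is hypothetical; it takes 1-based column numbers like -s and -b, and it treats a file as acceptable when each sequence name forms one contiguous block with non-decreasing start positions.

```python
def is_tabix_sorted(lines, seq_col=1, start_col=4, meta_char="#"):
    """Check sort order using 1-based column numbers, as with tabix -s and -b."""
    finished = set()   # sequence names whose block has already ended
    current = None     # sequence name of the block being read
    last_start = None  # most recent start position within that block

    for ln in lines:
        if not ln.strip() or ln.startswith(meta_char):
            continue  # skip blank and comment/meta lines
        fields = ln.rstrip("\n").split("\t")
        name = fields[seq_col - 1]
        start = int(fields[start_col - 1])

        if name != current:
            if name in finished:
                return False  # the same sequence shows up in two separate blocks
            if current is not None:
                finished.add(current)
            current, last_start = name, start
        elif start < last_start:
            return False      # start positions go backwards within a block
        else:
            last_start = start
    return True
```

For a custom table with the name in column 1 and the position in column 2, `is_tabix_sorted(lines, seq_col=1, start_col=2)` mirrors `tabix -s 1 -b 2 -e 2`. As far as I know, pysam exposes the same choice as `pysam.tabix_index(path, seq_col=0, start_col=1, end_col=1)`, where the column numbers are 0-based, unlike on the command line.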


This fails on RefSeq GTFs from NCBI.

You may also need to set the field delimiter explicitly and add the end position as a sort key. I've had better luck with version sort (-V, natural sort of version numbers) on chromosome names, but that is a matter of personal preference: it depends on whether you want alt contigs interleaved with the primary chromosomes or placed at the end.

(grep ^"#" in.gtf; grep -v ^"#" in.gtf | sort  -t $'\t' -k1,1V -k4,4n -k5,5n) | bgzip  > in.sorted.gtf.gz
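
For anyone doing the sort in Python instead, the ordering can be approximated with a natural-sort key. This is only a sketch: `natural_key` and `sort_records` are hypothetical names, and the key is not guaranteed to order every contig name exactly as GNU `sort -V` does.

```python
import re


def natural_key(name):
    """Split 'chr10' into ['chr', 10, ''] so digit runs compare as numbers."""
    parts = re.split(r"(\d+)", name)
    # with a capturing group, odd positions always hold the digit runs
    return [int(p) if i % 2 else p for i, p in enumerate(parts)]


def sort_records(lines):
    """Header lines first, then records by natural seqname, start, end."""
    header = [ln for ln in lines if ln.startswith("#")]
    records = [ln for ln in lines if ln.strip() and not ln.startswith("#")]

    def key(ln):
        fields = ln.split("\t")  # split on tabs only, like sort -t $'\t'
        return natural_key(fields[0]), int(fields[3]), int(fields[4])

    return header + sorted(records, key=key)
```

With this key, chr2 sorts before chr10, and records with the same start are ordered by end position.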
6.3 years ago
GenoMax 141k

Using tabix from samtools.

Usage:   tabix <in.tab.bgz> [region1 [region2 [...]]]

Options: -p STR     preset: gff, bed, sam, vcf, psltbl [gff]
         -s INT     sequence name column [1]
         -b INT     start column [4]
         -e INT     end column; can be identical to '-b' [5]
         -S INT     skip first INT lines [0]
         -c CHAR    symbol for comment/meta lines [#]
         -r FILE    replace the header with the content of FILE [null]
         -B         region1 is a BED file (entire file will be read)
         -0         zero-based coordinate
         -h         print also the header lines
         -H         print only the header lines
         -l         list chromosome names
         -f         force to overwrite the index

Yes, that was my first guess. But judging from this usage message, it does not accept GTF as input.


From this page:

Firstly, tabix directly works with a lot of widely used TAB-delimited formats such as GFF/GTF and BED.


In addition, GTF is essentially the same as GFF2.


Interesting !

Because their command synopsis does not show that it works with GTF:

tabix [-0lf] [-p gff|bed|sam|vcf] [-s seqCol] [-b begCol] [-e endCol] [-S lineSkip] [-c metaChar] in.tab.bgz [region1 [region2 [...]]]

I tried using it anyway:

$ tabix -p gtf  human.gtf.gz
[main] unrecognized preset 'gtf'
tabix -p gff human.gtf.gz

formats such as GFF/GTF

probably means the same preset can be used for both GFF and GTF files.
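
A short Python sketch of why one preset covers both formats: according to the usage text, the gff preset reads the sequence name from column 1, the start from column 4 and the end from column 5, and GTF uses the same columns. The helper names below are hypothetical. The pysam part assumes an existing `human.gtf.gz.tbi` and, as I understand the pysam documentation, `fetch` takes 0-based, half-open coordinates.

```python
def gff_preset_interval(line):
    """Return (seqname, start, end) from the columns the gff preset reads."""
    fields = line.rstrip("\n").split("\t")
    return fields[0], int(fields[3]), int(fields[4])  # columns 1, 4 and 5


def fetch_features(path, contig, start, end):
    """Hypothetical helper: query a tabix-indexed GTF through pysam."""
    import pysam  # third-party; imported lazily

    with pysam.TabixFile(path) as tbx:
        # asGTF parses each record so fields are available as attributes
        return [
            (rec.contig, rec.start, rec.end, rec.feature)
            for rec in tbx.fetch(contig, start, end, parser=pysam.asGTF())
        ]
```

For example, `fetch_features("human.gtf.gz", "chr1", 10000, 20000)` would list the features overlapping that window, provided the contig name matches the one used in the file.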


does not work for me
