Question: tabix indexing on a non vcf/bed/sam txt file
18 days ago
I'm struggling to build a tabix index for a simple three-column bgzipped text file:

#chr    pos     score
1       1       0.061011
1       2       0.061011
1       3       0.061011

Oddly, the indexing step is really fast (about 2 seconds) considering the file size (9 GB), and when I query a position I get no result and no warning. Has anyone faced a similar issue?

tabix -s1 -b2 file.txt.bgz

tabix file.txt.bgz 1:2-3 -> empty result

This works for me. Some troubleshooting questions:

  • Are there any messages during the index creation?
  • Is the file tab delimited?
  • Is the file sorted by the first and second column?
  • Is the file compressed by bgzip?
  • Have you tried your little example as well, or just your large data file?
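In case it helps, here is a rough sketch of how some of those checks could be scripted. The function names are made up for illustration and the file name in the usage comment is a placeholder. It only relies on od, gzip and awk, and it assumes the layout shown in the question (three tab-separated columns, comment lines starting with #). It is not a substitute for tabix's own checks.

```shell
# Hypothetical helper functions; none of these names come from tabix or htslib.

# Is the file BGZF (bgzip output) rather than plain gzip?
# A BGZF block starts with bytes 1f 8b 08 04 and has 'B' 'C' (42 43) at bytes 12-13.
is_bgzf() {
    magic=$(od -An -tx1 -N14 "$1" | tr -d ' \n')
    case $magic in
        1f8b0804????????????????4243) return 0 ;;
        *) return 1 ;;
    esac
}

# Do the first 1000 lines have exactly three tab-separated fields?
# Lines starting with '#' are skipped.
has_three_tab_columns() {
    gzip -dc "$1" | head -n 1000 |
        awk -F '\t' '!/^#/ && NF != 3 { bad = 1 } END { exit bad + 0 }'
}

# Are the records grouped by sequence name (column 1), with non-decreasing
# positions (column 2) inside each group?
is_sorted_for_tabix() {
    gzip -dc "$1" | awk -F '\t' '
        /^#/ { next }
        $1 != chr {
            if ($1 in seen) { bad = 1; exit }   # sequence name reappears later
            seen[$1] = 1; chr = $1; prev = -1
        }
        $2 + 0 < prev { bad = 1; exit }         # position went backwards
        { prev = $2 + 0 }
        END { exit bad + 0 }'
}

# Usage (placeholder file name):
#   f=file.txt.bgz
#   is_bgzf "$f" && has_three_tab_columns "$f" && is_sorted_for_tabix "$f" \
#       && echo "looks indexable"
```

If tabix is installed, `tabix -l file.txt.bgz` lists the sequence names stored in the index, which is a quick way to see whether the index covers every chromosome.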

Indeed, it worked for my little example. I'm now running a large sort on the file (sort -V -k1,1 -k2,2) to see if this was the problem, although I wasn't expecting that: I zcatted all the chromosome files in the proper order, and in theory I downloaded them already sorted.

Thanks for the suggestions, I'll let you know how it went.

@finswimmer, it didn't work, unfortunately.

These are my full commands; if you see any possible source of error, let me know. This "wrong" index takes 2 seconds to be created. This has never happened before.

header_file=$(head -n1 $files) 
zcat $header_file | head -1 | cut -f1,2,3 | bgzip > fitcons_v1.01_header.txt.bgz
srun cat $files | xargs zcat | grep -v "^#" | sort -V -k1,1 -k2,2 | awk -v OFS='\t' '{print $1,$2,$3}' |  bgzip > fitcons_v1.01.txt.gz
srun cat fitcons_v1.01_header.txt.bgz fitcons_v1.01.txt.gz > fitcons_v1.01.txt.bgz
tabix -s 1 -b 2 fitcons_v1.01.txt.bgz
