Can't find chr1 in a large fasta file using samtools/htsjdk
1
0
Entering edit mode
2.8 years ago
ljw1001 • 0

I'm working with a fasta file that isn't directly organized by chromosome. In the fasta, the chromosome info is held in the range= section:

>hg19_refGene_NM_032291 range=chr1:66999725-67210868 5'pad=0 3'pad=0 strand=+ repeatMasking=none
ggagatgcaggctggct etc.

The index file looks like this:

hg19_refGene_NM_032291 4894 97 60 61

so that when I search for a chromosome:position, i get an error message saying 'chr1' cannot be found. I've had this problem with a tool built on the c version of samtools, and in htsjdk. I know java so I tried working through the htsjdk code, and it is trying to find 'chr1' in a dictionary where the keys all look like the index entries, ie hg19_refGene_NM_032291.

any help is greatly appreciated. thanks.

Assembly fasta reference samtools • 744 views
ADD COMMENT
0
Entering edit mode

The index file looks like this: hg19_refGene_NM_032291 4894 97 60 61

you can clearly see that the entries don't contain the word 'chr1' , why would you expect to find such word ?

ADD REPLY
0
Entering edit mode

BTWn, most indexers (samtools faidx, stop reading the name after the first blank)

ADD REPLY
0
Entering edit mode

Thanks. I don't expect to find it directly (now that I understand how the file is indexed). I had hoped that someone would be familiar with this kind of index organization and explain how I could locate a coordinate in the reference,

  • Using some kind of intermediate map from chr/pos to index key.
  • By way of function that could access the data in the "range" attribute.
  • By reorganizing the file if nothing else works
  • or some other thing I hadn't thought of.
ADD REPLY
0
Entering edit mode
2.8 years ago
Eric Lim ★ 1.7k

Instead of working with reference derived fasta, download it from UCSC. Because sequences are based on chromosomes, you can use samtools to index and query with coordinates.

$ wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
$ pigz -d hg38.fa.gz
$ samtools faidx hg38.fa
$ samtools faidx hg38.fa chr1:1000-2000
ADD COMMENT

Login before adding your answer.

Traffic: 1763 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6