Quickly retrieve reference genome sequence within python

1

Entering edit mode

5.6 years ago

gewa ▴ 20

Hi all, For a project I'm working on, I need to be able to quickly retrieve the sequence at a given 2kb window of the reference genome hg38 within python. The windows I need might not be consecutive (i.e. one thread might need to get the sequence for chr1:2500-4500 and another might need chrX:0-2000). Currently, the best way I've found is by making system tools from within python to samtools faidx, but this is slowing down my program quite a bit. I'd like to avoid a ton of file i/o as this slows things down, but I don't know an efficient way to have everything I need in memory, either. Any suggestions for an efficent way to do this from within python would be really appreciated. Thanks!

sequence python • 2.4k views

ADD COMMENT • link updated 5 months ago by Feng Tian ▴ 20 • written 5.6 years ago by gewa ▴ 20

1

Entering edit mode

Pyfaidx?

ADD REPLY • link 5.6 years ago by WouterDeCoster 47k

2

Entering edit mode

I would recommend pysam instead. This project is more active.

You just have to create the fasta index file (.fai) by your own using e.g. samtools faidx to have random access.

ADD REPLY • link 5.6 years ago by finswimmer 16k

0

Entering edit mode

Agree, use pysam.FastaFile and fetch.

ADD REPLY • link 5 months ago by Feng Tian ▴ 20

0

Entering edit mode

When I've tried using pyfaidx in the past, it's used nearly all my memory to load in the fasta. is this unexpected?