Getting Genomic Sequences And Phastcons Scores Using Python From Ensembl/Ucsc?
3
3
Entering edit mode
13.8 years ago
User 9996 ▴ 840

How can I fetch genomic sequence efficiently using Python? For example, from a .fa file or some other easily obtained format? I basically want an interface fetch_seq(chrom, strand, start, end) which will return the sequence [start, end] on the given chromosome on the specified strand.

Analogously, is there a programmatic python interface for getting phastCons scores?

thanks.

python biopython ucsc ensembl sequence • 8.5k views
ADD COMMENT
8
Entering edit mode
13.8 years ago

For extracting sequences from Fasta files use SeqIO in Biopython along with standard Python slicing of sequence regions. The random subsequence example in the documentation is pretty close to what you are looking for:

If you're dealing with alignments from UCSC or Ensembl, as you imply by mentioning phastCons, bx-python is very useful for dealing with the precomputed alignments in MAF and AXT formats:

For phastCons here is some code I wrote for a project that calls phastCons and retrieves scores for regions of interest. It's not all bundled up and ready to go, but could be a useful starting place.

ADD COMMENT
2
Entering edit mode
8.1 years ago
hdashnow ▴ 20

I found pysam worked well for this. It uses the htslib C-API (which is in very common usage in bioinformatics) so it's not ready the fasta file into memory, making it very efficient.

import pysam

fastafile = pysam.Fastafile("reference_genome.fa")
fastafile.fetch(chrom, start, end)
ADD COMMENT
0
Entering edit mode
13.2 years ago
Chronos ▴ 610

Unfortunately, Biopython's SeqIO was slowish in my scenario of randomly accessing chromosome positions in a human genome fasta file: on the order of up to several seconds per records[chrom].seq[start:end] call (on a 3GHz CPU with file already cached, where records is a SeqIO.index). These calls also visibly consume RAM (memory use grows until a proper segment is found, then returns to base level).

I found calling samtools faidx human_37.fasta 1:1500001-1500101 with subprocess module much faster - up to dozens of milliseconds per call, and unnoticeable memory use. The file has to be indexed by samtools first.

Important things to know about samtools:

  • version 0.1.12-10 (r896) had problems returning proper sequences of chrs 15 and beyond from a gzipped fasta file; uncompressing the file solved this problem. You can also recompress with razip, which is a component of samtools and works nicely.
  • you need to start += 1 before calling samtools
  • check actual samtools output before calling it (it returns fasta record header, for example, and wraps sequence lines; fasta headers allow calling samtools with many ranges at once - then one could use SeqIO to store and further access fasta produced by samtools)
ADD COMMENT
0
Entering edit mode

Samtools is definitely useful, but it is disappointing to see non-qualified statements such as "Biopython is dead-slow and uses lots of memory." This depends entirely on your usage. You haven't defined what records is, but if it's a SeqIO.index object then you are parsing the record from disk on every call. Try getting the record with 'records[chrom]' then slicing it from memory. If you want to access random sequences on disk with an index from Python, try the 2bit support in bx-python.

ADD REPLY
0
Entering edit mode

Yes, an index, and I figured it is seeking to the slice on every call (there's no disk I/O though - after first call data is cached and is only using CPU cycles). Thanks for suggesting bx-python, I'll have a look at that. I've also edited my answer a bit to not be so peremptory.

ADD REPLY

Login before adding your answer.

Traffic: 2539 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6