I want to fetch some sequences from reference genome by coordinates. I've downloaded Homo_sapiens.GRCh38.dna.primary_assembly.fa
from the Ensembl's FTP server.
I've tried both pysam and Biopython to get the sequence from the reference genome like so:
import pysam
ref_genome = pysam.FastaFile("data/Homo_sapiens.GRCh38.dna.primary_assembly.fa")
ref_genome.fetch("1", 37566812, 37595985)
from Bio import SeqIO
chrs = {}
for seq in SeqIO.parse("data/Homo_sapiens.GRCh38.dna.primary_assembly.fa", "fasta"):
chrs[seq.id] = seq.seq
str(chrs['1'][37566812:37595985])
Both fetch the same sequence. However, when I look up the sequence at NCBI https://www.ncbi.nlm.nih.gov/nuccore/NC_000001.11?report=fasta&from=37566812&to=37595985&strand=true I get different data, while I assumed they're supposed to match. Am I doing something wrong?
Your link above (with
strand=true
) brings back sequence that is this (abbreviated):If I take a chunk of that sequence and blat it at UCSC then the top hit is (truncated)
The alignment is on the
-
strand (truncated)If you remove
strand=true
in the URL then you get this sequence (truncated):Which when blatted at UCSC produces a hit that starts at right base pair (truncated query)
And results in the alignment on
+
strand:My guess would be that the python code is zero based coordinate system, the NCBI server is 1 based, thus you are off by one.
The one I get when using python starts as:
And the one at NCBI is:
Doesn't seem to be off just by one, unless I'm missing something.