Fasta Format Of Chromosomes And Biopython
1
0
Entering edit mode
12.4 years ago
Ma ▴ 140

Hi: I am a newbie in this thing of bioinformatics and what I need is to extract a portion of a sequence in FASTA format, for that I have downloaded from ftp.ncbi.nih.gov/genomes/ the fasta file of chromosome 22, which is hsaltHuRef_chr22.fa.gz; and from that file I need to extract from nucleotide 23522552 to nucleotide 23660200 aproximately; how I can do that using BioPython? Also what does these headers refer to inside the fasta file?

gi|157812454|ref|NW001838719.2| Homo sapiens chromosome 22 genomic contig, alternate assembly HuRef DEGEN1103279049253, whole genome shotgun sequence

gi|157697908|ref|NW001838720.1| Homo sapiens chromosome 22 genomic contig, alternate assembly HuRef DEGEN1103279105977, whole genome shotgun sequence

Specially to those numbers, they are making reference to the nucleotide position that I am at that moment?

Thanks for your help

chromosome biopython • 6.0k views
ADD COMMENT
0
Entering edit mode

StackExchange works better with one question at a time - you only get to "accept" one person's answer.

ADD REPLY
4
Entering edit mode
12.4 years ago

You have the human assembly from craig venter institute. It's an alternate assembly from the NCBI assembly. You might as well just stick with the NCBI version unless you specifically want to use this version.

To get a region from the fasta file with BioPython:

from Bio import SeqIO

inFile = open('path to your fasta file','r')

for record in SeqIO.parse(inFile,'fasta'):
    if record.id == "sequence id you want to extract from":
        print str(record.seq)[startCoordinates:endCoordinates + 1]

So let's say I want to extract nucleotide 100 to 200 from one of those sequences you listed: gi|157697908|ref|NW_001838720.1| Homo sapiens chromosome 22 genomic contig, alternate assembly HuRef DEGEN_1103279105977, whole genome shotgun sequence

I would:

from Bio import SeqIO

inFile = open('hs_alt_HuRef_chr22.fa','r')

for record in SeqIO.parse(inFile,'fasta'):
    if record.id == "gi|157697908|ref|NW_001838720.1|":
        print str(record.seq)[100:201]
ADD COMMENT
1
Entering edit mode

You might find the Bio.SeqIO.index(...) or index_db(...) functions more concise than that loop approach.

ADD REPLY
0
Entering edit mode

thank for your help @DK, by the way one question does record.seq takes into consideration the lines starting with >? I want to know this because if I got one record.id that is between the sequences that I want to retrieve, then it will count the characters include in the comment also

ADD REPLY
0
Entering edit mode

record seq will only look at the sequence. It will not include the header line.

ADD REPLY

Login before adding your answer.

Traffic: 2480 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6