Question: Fasta Format Of Chromosomes And Biopython
0
gravatar for Ma
7.2 years ago by
Ma140
Ma140 wrote:

Hi: I am a newbie in this thing of bioinformatics and what I need is to extract a portion of a sequence in FASTA format, for that I have downloaded from ftp.ncbi.nih.gov/genomes/ the fasta file of chromosome 22, which is hsaltHuRef_chr22.fa.gz; and from that file I need to extract from nucleotide 23522552 to nucleotide 23660200 aproximately; how I can do that using BioPython? Also what does these headers refer to inside the fasta file?

gi|157812454|ref|NW001838719.2| Homo sapiens chromosome 22 genomic contig, alternate assembly HuRef DEGEN1103279049253, whole genome shotgun sequence

gi|157697908|ref|NW001838720.1| Homo sapiens chromosome 22 genomic contig, alternate assembly HuRef DEGEN1103279105977, whole genome shotgun sequence

Specially to those numbers, they are making reference to the nucleotide position that I am at that moment?

Thanks for your help

biopython chromosome • 3.5k views
ADD COMMENTlink written 7.2 years ago by Ma140

StackExchange works better with one question at a time - you only get to "accept" one person's answer.

ADD REPLYlink written 7.2 years ago by Peter5.7k
4
gravatar for Damian Kao
7.2 years ago by
Damian Kao15k
USA
Damian Kao15k wrote:

You have the human assembly from craig venter institute. It's an alternate assembly from the NCBI assembly. You might as well just stick with the NCBI version unless you specifically want to use this version.

To get a region from the fasta file with BioPython:

from Bio import SeqIO

inFile = open('path to your fasta file','r')

for record in SeqIO.parse(inFile,'fasta'):
    if record.id == "sequence id you want to extract from":
        print str(record.seq)[startCoordinates:endCoordinates + 1]

So let's say I want to extract nucleotide 100 to 200 from one of those sequences you listed: gi|157697908|ref|NW_001838720.1| Homo sapiens chromosome 22 genomic contig, alternate assembly HuRef DEGEN_1103279105977, whole genome shotgun sequence

I would:

from Bio import SeqIO

inFile = open('hs_alt_HuRef_chr22.fa','r')

for record in SeqIO.parse(inFile,'fasta'):
    if record.id == "gi|157697908|ref|NW_001838720.1|":
        print str(record.seq)[100:201]
ADD COMMENTlink written 7.2 years ago by Damian Kao15k
1

You might find the Bio.SeqIO.index(...) or index_db(...) functions more concise than that loop approach.

ADD REPLYlink written 7.2 years ago by Peter5.7k

thank for your help @DK, by the way one question does record.seq takes into consideration the lines starting with >? I want to know this because if I got one record.id that is between the sequences that I want to retrieve, then it will count the characters include in the comment also

ADD REPLYlink written 7.2 years ago by Ma140

record seq will only look at the sequence. It will not include the header line.

ADD REPLYlink written 7.2 years ago by Damian Kao15k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1532 users visited in the last hour