Question

Retrieving Genomic Sequence Based On Coordinate Information From Annotation File

1

Entering edit mode

10.3 years ago

es55spam ▴ 10

I have a large FASTA file (a genetic sequence, an entire chromosome), where each line contains 50 characters (bases a,g,t, and c). There are about 4 million lines in this file. It is downloaded from UCSC downloads page.

I want to reorganize the file so that each character of a line in the FASTA file is placed in its own line of a new file. That is, turn each 50-character line in the original file into 50, single-character lines. This will result in the entire sequence rewritten as a single column. Ultimately, I want the sequence as a single column so I can then place an adjacent column containing the genomic coordinate position for each base. The purpose is to use the new file to retrieve sequences of genetic elements, the coordinates of which I find in other files that contain genomic annotation information (where specific genes are in terms of coordinates). These two kinds of information, sequence and coordinates, do not seem to be combined in a single file from the UCSC downloads page. Any advice?

perl database sequence • 5.3k views

ADD COMMENT • link updated 14 months ago by Ram 43k • written 10.3 years ago by es55spam ▴ 10

score 2 · Answer 1 · 2013-12-31

2

Entering edit mode

10.3 years ago

Pierre Lindenbaum 161k

grep  -v '^>' your.fasta  | tr -d '\n\r' | sed 's/\(.\)/\1#/g' | tr "#" "\n"

but you're reinventing the wheel.

Have a look a the simple samtools faidx algorithm

ADD COMMENT • link 10.3 years ago by Pierre Lindenbaum 161k

1

Entering edit mode

Agree.

If you want to get the genome sequence for given coordinates, please try samtools faidx chromosome.fa chr1:1000-11000 as Pierre suggests.

As you mentioned gene annotation file, if you want to get the transcripts file you can try gffread from cufflinks like gffread -w output.fa -g gename_assembl.fa refgene.gtf

ADD REPLY • link 10.3 years ago by chentong.biology ▴ 50

0

Entering edit mode

Thanks for the advice. I am new to this sort of genomic analysis, and I am nearly positive I am reinventing wheels frequently. Part of the learning curve. I will check out samtools. Can I ask a question? In the FASTA file for the chromsome I downloaded from UCSC, there are many more characters than what appears in the browser's sliding window. The FASTA file contains large stretches of N's, especially at the beginning but interspersed throughout. Do I "cut" those out of the FASTA file to obtain the genomic coordinates of bases as they appear in the sliding window?

ADD REPLY • link 10.3 years ago by es55spam ▴ 10

1

Entering edit mode

Don't trim off any of the N's, that will make any mapping results completely meaningless to anyone else anywhere in the world. If you observe a length difference between the file you downloaded and the UCSC genome browser then there's likely a genome version difference between what you have and what UCSC is displaying. When in doubt, you might post that discrepancy here for input.

Edit: Also, post exactly how you determined the lengths in case there's an error there.

ADD REPLY • link 10.3 years ago by Devon Ryan 104k

0

Entering edit mode

@dpryan79, the length of the FASTA file I observed by opening it in MacVIM, which reports the number of lines and number of characters. The number of characters for the chr1 FASTA file I have is 201,139,347. The length of chr 1 according to the browser is 197,195,432 bases. (This is chr 1 for mouse, mm9 version, not the most recent version but the version I require).

ADD REPLY • link 10.3 years ago by es55spam ▴ 10

0

Entering edit mode

The discrepancy appears to be in the invisible end of line characters in the FASTA file, which make it longer than the actual length of the sequence per se. The N's must stay in the sequence, as you say.

ADD REPLY • link 10.3 years ago by es55spam ▴ 10

0

Entering edit mode

Yeah, forgetting to account for line endings will cause this sort of confusion.

ADD REPLY • link 10.3 years ago by Devon Ryan 104k

0

Entering edit mode

I recommend keep the FASTA file unchanged. I do not understand what do you mean by many more characters than what appears in the browser's sliding window. Do you means the length of your FASTA sequence of one chromosome larger than the max coordinates of that chromosome in UCSC genome browser?

ADD REPLY • link 10.3 years ago by chentong.biology ▴ 50

0

Entering edit mode

yes, the FASTA file is longer. That is at least in part due to the presence of N's in the sequence. If I cut those, will I obtain the exact sequence found in the browser window? In the end, what I want is a file with the sequence with each base matched to its coordinate position.

ADD REPLY • link 10.3 years ago by es55spam ▴ 10

score 0 · Answer 2 · 2013-12-31

Here is a python script assuming your coordinate file is a tab delimited file where the columns are: sequence id, feature name, start coordinate, end coordinate. You'll need to modify it depending on whether your coordinates are 0 or 1 based.

import sys
def readFasta(fileName):
    _id = ''
    _seq = ''
    _file = open(fileName,'r')
    for line in _file:
        if line[0] == ">":
            if _id:
                yield (_id,_seq)

            _id = line.strip()[1:]
            _seq = ''
        else:
            _seq += line.strip()
    yield (_id,_seq)

coordFile = open(sys.argv[1],'r')
coords = {}
for line in coordFile:
    data = line.strip().split()
    data[2] = int(data[2])
    data[3] = int(data[3])
    if not coords.has_key(data[0]):
        coords[data[0]] = []
    coords[data[0]].append(data[1:])

stream = readFasta(sys.argv[2])
for record in stream:
    header = record[0]
    seq = record[1]
    if coords.has_key(header):
        for feature in coords[header]:
            print '>' + feature[0]
            print seq[feature[1]:feature[2] + 1]