Question

How To Pick Out Refgene Parts Of Chromosome In Fasta Format

0

11.1 years ago

Click downvote ▴ 720

I have the refGene regions of the Rattus norvegicus chromosomes, given as start and end positions.

I also have the chromosomes of interest as fasta files.

Example:

> chr11
aaactaatcgtcttggcaccaaaacaaagagaatgaaagcacacaaacat
aacctcacatccaaatatgaatataaagggaaacaataatcactattcct
caatcctaaatatctatgccccaaatacaagggcacctacatacgtaaaa

What I want to do is to pick out the refgene regions of the chromosome files.

The way I do it now is simply to load the chromosome into a string in Python like this:

chrom_string = ''' '''.strip()
for line in input_file:
  chrom_string +=line.rstrip()

Then I pick out the regions of interest by substring indexing:

chrom_string[current_start:current_end]

The problem is that, doing it this way, I get plenty of FASTA records like

>chr1+758657
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNgagagagagagagagagagagagagag`

I guess it is unlikely that a known refgene region should contain N's so I must be doing something wrong.

Is there a library that does what I want to do?

python fasta chromosome • 2.6k views

ADD COMMENT • link updated 11.1 years ago by Neilfws 49k • written 11.1 years ago by Click downvote ▴ 720

0

I wouldn't be too surprised if you were using a masked genome. If so, you could try the rn5.2bit version available from UCSC.

ADD REPLY • link 11.1 years ago by fo3c ▴ 450

score 3 · Answer 1 · 2013-03-15

3

11.1 years ago

Istvan Albert 100k

Make sure you are using the right genome build, and account for the fact that Python strings are zero-indexed.

In general I would recommend that you use tools such as bedtools or seqtk to extract your sequences. They work well, they are faster and more flexible, and you should have them installed anyhow because they solve many other tasks.
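
If you do stay in plain Python for a quick check, here is a minimal sketch of the extraction. It is not what bedtools or seqtk do internally. The helper names `read_fasta` and `extract_region` are hypothetical, and it assumes the coordinates are 0-based and half-open, as in UCSC refGene tables and BED files (pass `one_based=True` otherwise). Note that it skips the `>` header line, so the header text cannot end up inside the sequence and shift every coordinate.

```python
import io


def read_fasta(handle):
    """Return {record name: sequence} for every record in a FASTA handle."""
    records = {}
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Store the previous record before starting a new one.
            if name is not None:
                records[name] = "".join(chunks)
            # Keep only the first word after '>' as the record name.
            fields = line[1:].split()
            name = fields[0] if fields else ""
            chunks = []
        else:
            chunks.append(line)
    if name is not None:
        records[name] = "".join(chunks)
    return records


def extract_region(sequence, start, end, one_based=False):
    """Slice a region; by default start is 0-based and end is exclusive."""
    if one_based:
        # Convert a 1-based inclusive start to a 0-based start.
        start -= 1
    return sequence[start:end]


# Tiny in-memory example; a real run would pass an open chromosome file.
example = io.StringIO(">chr11\naaactaatcg\ntcttggcacc\n")
genome = read_fasta(example)
print(extract_region(genome["chr11"], 8, 12))
```

Joining the lines once with `"".join(...)` also avoids the repeated string concatenation, which gets slow on whole chromosomes.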

0

Bedtools was incredible. Thanks.

ADD REPLY • link 11.1 years ago by Click downvote ▴ 720

score 1 · Answer 2 · 2013-03-15

I don't know the state of the rat genome, but these could be gaps or regions that are difficult to assemble. It looks like you are in a low-complexity region of the genome, so it may help to use the map location to see if you are near a centromere/telomere or some other repetitive region. This will also tell you if your script is working correctly (if you see the mapped genes). I agree with you though, that does not look like a gene.
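
One way to check whether an extracted region falls in a gap is to measure how much of it is N and where the runs of N sit. The sketch below is only an illustration; the function names `n_fraction` and `n_runs` are hypothetical, and the returned intervals are 0-based and half-open relative to the extracted sequence.

```python
import re


def n_fraction(sequence):
    """Return the fraction of bases in the sequence that are N (or n)."""
    if not sequence:
        return 0.0
    return sequence.upper().count("N") / len(sequence)


def n_runs(sequence, min_length=1):
    """Return (start, end) intervals of consecutive N's of at least min_length."""
    return [
        (match.start(), match.end())
        for match in re.finditer(r"[Nn]+", sequence)
        if match.end() - match.start() >= min_length
    ]


region = "NNNNNgagagNNac"
print(n_fraction(region))
print(n_runs(region))
```

A region that is mostly N at one end, like the one in the question, suggests that the coordinates run into an assembly gap or belong to a different build.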

score 1 · Answer 3 · 2013-03-15

1

11.1 years ago

Matt Shirley 10k

If you are set on doing this in Python, I suggest you take a look at the fastahack-python module. You can use the faidx FASTA index created by samtools to query subsequences like so:

>>> f.get_sub_sequence('1', 0, 10)
'TAACCCTAACC'

You'll still want to make sure you're not using a masked FASTA file.
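
To show why the index helps, here is a rough sketch that reads a `.fai` file directly and seeks to a region without loading the whole chromosome. It does not use fastahack-python, and the names `load_fai` and `fetch` are hypothetical. It assumes an uncompressed FASTA file, the five tab-separated `.fai` columns written by `samtools faidx` (name, length, offset, bases per line, bytes per line), and 0-based half-open coordinates.

```python
def load_fai(fai_path):
    """Parse a .fai index into {name: (length, offset, linebases, linewidth)}."""
    index = {}
    with open(fai_path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 5:
                continue
            length, offset, linebases, linewidth = map(int, fields[1:5])
            index[fields[0]] = (length, offset, linebases, linewidth)
    return index


def fetch(fasta_path, index, name, start, end):
    """Return bases [start, end) of record `name` using the index."""
    length, offset, linebases, linewidth = index[name]
    start = max(0, start)
    end = min(end, length)
    if start >= end:
        return ""

    def byte_pos(pos):
        # Record offset + bytes of the full lines before pos + column in line.
        return offset + (pos // linebases) * linewidth + pos % linebases

    first = byte_pos(start)
    last = byte_pos(end - 1) + 1
    with open(fasta_path, "rb") as handle:
        handle.seek(first)
        raw = handle.read(last - first)
    # Drop the line breaks that fall inside the requested span.
    return raw.decode("ascii").replace("\n", "").replace("\r", "")
```

With an index built by `samtools faidx genome.fa`, a call would look like `fetch("genome.fa", load_fai("genome.fa.fai"), "chr11", 8, 12)`, where the file names are placeholders.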