How To Pick Out Refgene Parts Of Chromosome In Fasta Format
3
0
Entering edit mode
11.1 years ago

I have the refgene regions of the rattus norvegicus chromosomes, given by the indexes of the start and end positions.

I also have the chromosomes of interest as fasta files.

Example:

> chr11
aaactaatcgtcttggcaccaaaacaaagagaatgaaagcacacaaacat
aacctcacatccaaatatgaatataaagggaaacaataatcactattcct
caatcctaaatatctatgccccaaatacaagggcacctacatacgtaaaa

What I want to do is to pick out the refgene regions of the chromosome files.

The way I do it now is simply to load the chromosome into a string in Python like this:

chrom_string = ''' '''.strip()
for line in input_file:
  chrom_string +=line.rstrip()

Then I pick out the regions of interest by substring indexing:

chrom_string[current_start:current_end]

Problem is, doing it this way I get plenty of fasta reads like

>chr1+758657
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNgagagagagagagagagagagagagag`

I guess it is unlikely that a known refgene region should contain N's so I must be doing something wrong.

Is there a library that does what I want to do?

python fasta chromosome • 2.6k views
ADD COMMENT
0
Entering edit mode

I wouldn't be too surprised if you were using a masked genome. If so, you could try rn5.2bit version found on UCSC

ADD REPLY
3
Entering edit mode
11.1 years ago

Make sure to account for the right genomic builds and also the fact that the python string is zero based index.

In general I would recommend that you use tools such as bedtools or seqtk to extract your sequences. They work well and are faster and more flexible and you should already have them installed anyhow because they solve many other tasks.

ADD COMMENT
0
Entering edit mode

Bedtools was incredible. Thanks.

ADD REPLY
1
Entering edit mode
11.1 years ago
SES 8.6k

I don't know the state of the rat genome, but these could be gaps or regions that are difficult to assemble. It looks like you are in a low-complexity region of the genome, so it may help to use the map location to see if you are near a centromere/telomere or some other repetitive region. This will also tell you if your script is working correctly (if you see the mapped genes). I agree with you though, that does not look like a gene.

ADD COMMENT
1
Entering edit mode
11.1 years ago

If you are set on doing this in Python, I suggest you take a look at the fastahack-python module. You can use the faidx fast index created by samtools to query like so:

>>> f.get_sub_sequence('1', 0, 10)
'TAACCCTAACC'

You'll still want to make sure you're not using a masked fast file.

ADD COMMENT

Login before adding your answer.

Traffic: 1942 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6