Question: Extract Sequences From Fasta File
0
gravatar for sbb
4.9 years ago by
sbb10
sbb10 wrote:

I have a fasta file of approx 3 Megabasepairs (Accession no. CM001634.1) and want to extract the sequences of all regions given below and then save the extracted sequences in fasta format in new text file. Also want that new file should be containing accession of file and the name of repeat such as low_complexity, simple repeat.
Kindly suggest me perl script.

SW perc perc perc query position in query matching repeat position in repeat score div. del. ins. sequence begin end (left) repeat class/family begin end (left) ID

21 3.6 0.0 0.0 gi|411024077|gb|CM001634.1| 11554 11581 (28623556) + AT_rich Low_complexity 1 28 (0) 1
22 0.0 0.0 0.0 gi|411024077|gb|CM001634.1| 12224 12245 (28622892) + AT_rich Low_complexity 1 22 (0) 2
22 8.0 0.0 0.0 gi|411024077|gb|CM001634.1| 12609 12658 (28622479) + AT_rich Low_complexity 1 50 (0) 3
22 5.6 0.0 0.0 gi|411024077|gb|CM001634.1| 12691 12726 (28622411) + AT_rich Low_complexity 1 36 (0) 4
26 3.0 0.0 0.0 gi|411024077|gb|CM001634.1| 13965 13997 (28621140) + AT_rich Low_complexity 1 33 (0) 5
222 3.7 0.0 0.0 gi|411024077|gb|CM001634.1| 16373 16399 (28618738) + (TA)n Simple_repeat 1 27 (0) 6
219 12.5 0.0 0.0 gi|411024077|gb|CM001634.1| 17247 17286 (28617851) + (G)n Simple_repeat 1 40 (0) 7
198 0.0 0.0 0.0 gi|411024077|gb|CM001634.1| 20074 20095 (28615042) + (CAAAAA)n Simple_repeat 3 24 (0) 8
189 0.0 0.0 0.0 gi|411024077|gb|CM001634.1| 20344 20364 (28614773) + (TA)n Simple_repeat 1 21 (0) 9
23 0.0 0.0 0.0 gi|411024077|gb|CM001634.1| 21437 21459 (28613678) + AT_rich Low_complexity 1 23 (0) 10
198 0.0 0.0 0.0 gi|411024077|gb|CM001634.1| 22420 22441 (28612696) + (GAA)n Simple_repeat 2 23 (0) 11
189 0.0 0.0 0.0 gi|411024077|gb|CM001634.1| 27191 27211 (28607926) + (TTTTTG)n Simple_repeat 4 24 (0) 12
21 0.0 0.0 0.0 gi|411024077|gb|CM001634.1| 27481 27501 (28607636) + AT_rich Low_complexity 1 21 (0) 13
24 16.1 0.0 0.0 gi|411024077|gb|CM001634.1| 33144 33174 (28601963) + AT_rich Low_complexity 1 31 (0) 14
186 4.3 0.0 0.0 gi|411024077|gb|CM001634.1| 34503 34525 (28600612) + (CGAAT)n Simple_repeat 2 24 (0) 15

perl • 1.2k views
ADD COMMENTlink modified 2.9 years ago by Biostar ♦♦ 20 • written 4.9 years ago by sbb10
3

What have you tried so far? This would seem to be solved simply by iterating over the rows (you can use samtools faidx to retrieve an arbitrary region of an indexed fasta file).

ADD REPLYlink written 4.9 years ago by Devon Ryan88k
1
gravatar for Matt Shirley
4.9 years ago by
Matt Shirley8.8k
Cambridge, MA
Matt Shirley8.8k wrote:

First I would echo what Devon suggested - both stating what you've attempted as well as using samtools faidx. I'm not sure about a Perl solution, but you can do this using my python fasta indexing module Pyfaidx: Efficient, "Pythonic" Random Access To Fasta Files Using Samtools-Compatible Indexing:

from pyfaidx import Fasta

with open("regions.txt") as regions, Fasta("sequence.fasta") as fasta:
  for line in regions:
    fields = line.rstrip().split()
    rname, start, end = fields[4:7]
    repeat = ' '.join(fields[9:11])
    seq = fasta[rname][int(start)-1:int(end)-1]
    printseq.name + repeat)
    print(seq.seq)

Which produces:

gi|411024077|gb|CM001634.1|AT_rich Low_complexity
TTTATATATATAATAATAATAGTAATA
gi|411024077|gb|CM001634.1|AT_rich Low_complexity
TATATTTATTTATTTTTTATT
gi|411024077|gb|CM001634.1|AT_rich Low_complexity
TTTTAAATGTAATAAAAAATTAAAACTACTTTCATTTTTATTATAATAT
gi|411024077|gb|CM001634.1|AT_rich Low_complexity
TATTATCATTTAAATAAATTTACAATATATATATA
gi|411024077|gb|CM001634.1|AT_rich Low_complexity
AATAAAATAATTAAAAATGTAAAAAATAAAAA
gi|411024077|gb|CM001634.1|(TA)n Simple_repeat
TATATATATATATATATATATATACA
gi|411024077|gb|CM001634.1|(G)n Simple_repeat
GGGGGGTAGGGGGGGGGGGGGGGATTGGGGGGGGGGGGG
...

Here is the Gist

ADD COMMENTlink modified 4.9 years ago • written 4.9 years ago by Matt Shirley8.8k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 902 users visited in the last hour