How to extract short sequence from FASTA file with certain step size?
1
0
Entering edit mode
8.9 years ago

The code below extract short sequence in every sequence with the window size 100. The window will shift by step size one and extract the sequence. I would like to extract the short sequence with every step size 50. Can anyone help me?

from Bio import SeqIO

with open("B.fasta","w") as f:
    for seq_record in SeqIO.parse("A.fasta", "fasta"):
        for I in range(len(seq_record.seq) - 99) :
            f.write(str(">"+seq_record.id) + "\n")
            f.write(str(seq_record.seq[i:i+100]) + "\n")

Example of fasta file:

>hg17_ct_ER_ER_142    CTAAAAAAGTAAAAAAGAAAAAAAGAGAAAGAAAGAATATAGAAGCAACAAGTGTAGATTTACATTCTATTAGACAGTGACCCATTAGACCCGGACAAGGGG

Example output:

>hg17_ct_ER_ER_142    CTAAAAAAGTAAAAAAGAAAAAAAGAGAAAGAAAGAATATAGAAGCAACAAGTGTAGATTTACATTCTATTAGACAGTGACCCATTAGACCCGGACAAGG
>hg17_ct_ER_ER_142    TAAAAAAGTAAAAAAGAAAAAAAGAGAAAGAAAGAATATAGAAGCAACAAGTGTAGATTTACATTCTATTAGACAGTGACCCATTAGACCCGGACAAGGG
>hg17_ct_ER_ER_142    AAAAAAGTAAAAAAGAAAAAAAGAGAAAGAAAGAATATAGAAGCAACAAGTGTAGATTTACATTCTATTAGACAGTGACCCATTAGACCCGGACAAGGGG

Expected output:

>hg17_ct_ER_ER_142
CTAAAAAAGTAAAAAAGAAAAAAAGAGAAAGAAAGAATATAGAAGCAACA
>hg17_ct_ER_ER_142
AGTGTAGATTTACATTCTATTAGACAGTGACCCATTAGACCCGGACAAGG
fasta alignment biopython sequence • 3.2k views
ADD COMMENT
2
Entering edit mode

You can actually specify in the step size in the range function. The full range function takes in: start, end, step arguments.

For example:

for i in range(0, len(seq_record.seq) - 99, 50)
ADD REPLY
1
Entering edit mode

Hi !

Have you considered using a "while" loop instead of your second "for" loop ? Just like :

i = 0
while (i < len(seq_record.seq)) :
  f.write(">" + str(seq_record.id) + "\n")
  f.write(str(seq_record.seq[i:i+50]) + "\n")
  i += 50
ADD REPLY
0
Entering edit mode

It solved perfectly. Thanks for reply. But there's one issue there are few base pairs left out in this case.

ADD REPLY
0
Entering edit mode

My fault, I've edited my last post.

ADD REPLY
0
Entering edit mode

the codes are not equivalent - the original code generates overlapping window shifted by one base, this code creates non-overlapping windows that are shifted by a window size.

ADD REPLY
0
Entering edit mode

I agree. However, I tried to fit the expected output.

ADD REPLY
1
Entering edit mode
8.9 years ago

The BBMap package (latest version, 35.07) also has a tool for this, Shred:

shred.sh in=file.fa out=shredded.fa length=100 overlap=50 minlength=51

Those settings will give you 100bp sequences, every 50bp (length-overlap=50), and the last sequence will be as short as 51bp (assuming the main sequence is not divisible by 50), ensuring that every base is covered at least once.

ADD COMMENT

Login before adding your answer.

Traffic: 2855 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6