Question: How to extract short sequence from FASTA file with certain step size?
0
gravatar for allyson1115ar
3.9 years ago by
allyson1115ar20 wrote:

The code below extract short sequence in every sequence with the window size 100. The window will shift by step size one and extract the sequence. I would like to extract the short sequence with every step size 50. Can anyone help me?

     from Bio import SeqIO

     with open("B.fasta","w") as f:
             for seq_record in SeqIO.parse("A.fasta", "fasta"):
                 for i in range(len(seq_record.seq) - 99) :
                    f.write(str(">"+seq_record.id) + "\n")
                    f.write(str(seq_record.seq[i:i+100]) + "\n")

Example of fasta file:

    >hg17_ct_ER_ER_142    CTAAAAAAGTAAAAAAGAAAAAAAGAGAAAGAAAGAATATAGAAGCAACAAGTGTAGATTTACATTCTATTAGACAGTGACCCATTAGACCCGGACAAGGGG

Example output:

    >hg17_ct_ER_ER_142    CTAAAAAAGTAAAAAAGAAAAAAAGAGAAAGAAAGAATATAGAAGCAACAAGTGTAGATTTACATTCTATTAGACAGTGACCCATTAGACCCGGACAAGG
    >hg17_ct_ER_ER_142    TAAAAAAGTAAAAAAGAAAAAAAGAGAAAGAAAGAATATAGAAGCAACAAGTGTAGATTTACATTCTATTAGACAGTGACCCATTAGACCCGGACAAGGG
    >hg17_ct_ER_ER_142    AAAAAAGTAAAAAAGAAAAAAAGAGAAAGAAAGAATATAGAAGCAACAAGTGTAGATTTACATTCTATTAGACAGTGACCCATTAGACCCGGACAAGGGG

Expected output:<br>

    >hg17_ct_ER_ER_142
    CTAAAAAAGTAAAAAAGAAAAAAAGAGAAAGAAAGAATATAGAAGCAACA
    >hg17_ct_ER_ER_142
    AGTGTAGATTTACATTCTATTAGACAGTGACCCATTAGACCCGGACAAGG

ADD COMMENTlink modified 3.9 years ago by Brian Bushnell16k • written 3.9 years ago by allyson1115ar20
2

You can actually specify in the step size in the range function. The full range function takes in: start, end, step arguments.

For example:

for i in range(0, len(seq_record.seq) - 99, 50)
ADD REPLYlink written 3.9 years ago by Damian Kao15k
1

Hi !

Have you considered using a "while" loop instead of your second "for" loop ? Just like :

i = 0
while (i < len(seq_record.seq)) :
  f.write(">" + str(seq_record.id) + "\n")
  f.write(str(seq_record.seq[i:i+50]) + "\n")
  i += 50
ADD REPLYlink modified 3.9 years ago • written 3.9 years ago by Thibault D.510

It solved perfectly. Thanks for reply. But there's one issue there are few base pairs left out in this case. 

ADD REPLYlink modified 3.9 years ago • written 3.9 years ago by allyson1115ar20

My fault, I've edited my last post.

ADD REPLYlink modified 3.9 years ago • written 3.9 years ago by Thibault D.510

the codes are not equivalent - the original code generates overlapping window  shifted by one base, this code creates non-overlapping windows that are shifted by a window size.

ADD REPLYlink written 3.9 years ago by Istvan Albert ♦♦ 80k

I agree. However, I tried to fit the expected output.

ADD REPLYlink written 3.9 years ago by Thibault D.510
1
gravatar for Brian Bushnell
3.9 years ago by
Walnut Creek, USA
Brian Bushnell16k wrote:

The BBMap package (latest version, 35.07) also has a tool for this, Shred:

shred.sh in=file.fa out=shredded.fa length=100 overlap=50 minlength=51

Those settings will give you 100bp sequences, every 50bp (length-overlap=50), and the last sequence will be as short as 51bp (assuming the main sequence is not divisible by 50), ensuring that every base is covered at least once.

 

ADD COMMENTlink written 3.9 years ago by Brian Bushnell16k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1174 users visited in the last hour