Question

How to loop on two fasta files for hetero-dimerization analysis

0

Entering edit mode

3.0 years ago

Apex92 ▴ 280

I am taking an approach but have not been able to get my expected output. I have two fasta files as below - what I do is I make reverse-complement of 5mers of each record in file1 and then I look for any record in file2 if has or does not have any of reverse-complement 5mers.

file1.fa

>seq1
ACCGGT
>seq2
AATTTC

file2.fa

>seq3
CCGGT
>seq9
GGGGGGCCC

So based on the dummy examples above - the reverse-complement of ACCGG in seq1 (file1) exists in the seq3 of file2.fa (the CCGGT) - So then the seq1 should be marked against the seq3in the output (like having a counter - I am not sure how to do it).

At the end I would like to have a table as below showing every seq in file1 against every seq in file2 with a mark representing if they hetero-dimerized or not. Based on my example only seq1 from file1 hetero-dimerizes with the seq3 in the file2.

     file2   seq3   seq9
file1 

seq1          1         0

seq2          0         0

This is the code that I have tried so far - it does not give expected output - I also could not set the script in a way to write output in table format - any help to fix the code is appreciated.

Thanks

with open('file1.fa', 'r') as fw, open('file2.fa', 'r') as rv:
    for fw_record in SeqIO.parse(fw, 'fasta'):
        fcnt = 0
        lst = []
        for i in range(len(fw_record.seq)):
            kmer = str(fw_record.seq[i:i + 5])
            if len(kmer) == 5:
                C_kmer = Seq(kmer).complement()
                lst.append(C_kmer[::-1])
        #print(lst)

        for rv_record in SeqIO.parse(rv, 'fasta'):
            rcnt = 0
            if any(items in rv_record.seq for items in lst):
                fcnt+=1
                rcnt+=1

        if fcnt == 0 and rcnt == 0:
            print('>>' + fw_record.id)
            print('>>' + fw_record.seq)

        if fcnt > 0 and rcnt > 0:
            print('>>>' + fw_record.id)
            print('>>>' + fw_record.seq)

fasta script python sequence • 3.1k views

ADD COMMENT • link updated 3.0 years ago by cpad0112 21k • written 3.0 years ago by Apex92 ▴ 280

0

Entering edit mode

Does the order of sequences matter here? i.e. are you only looking for the reverse-complement of File 1, Seq 1 in File 2, Seq 1 (numbered sequentially, not by your IDs here).

Or are you looking for the R.C. of Seq1 wherever it occurs in File 2?

ADD REPLY • link 3.0 years ago by Joe 21k

0

Entering edit mode

Thank you for your reply: the order of sequences does not matter - I make a reverse complement of 5mers of each record in file1 and look for them in all records of file2. If any exist then both of the records (from file1 and file2) should be marked - it is not like seq1 from file1 should be checked only with the seq1 from file2. The reverse complement 5mers coming from every record in file1 should be checked against all records of file2

ADD REPLY • link 3.0 years ago by Apex92 ▴ 280

score 2 · Answer 1 · 2021-04-08

2

Entering edit mode

3.0 years ago

cpad0112 21k

Check if this code works and edit as per your needs. Same precautions from earlier post apply:

#! /usr/bin/env python3
import os
from Bio import SeqIO
from Bio.Seq import Seq

records=list(SeqIO.parse("test1.fa","fasta"))

window_size = 5
step_size = 1

target_records=list(SeqIO.parse("test2.fa","fasta"))

for i in records:
    for j in range(0, len(i.seq)-window_size+1):
        for k in (i.seq.reverse_complement()[j: j+5].split()):
            for l in target_records:
                if l.seq.find(k)!=-1:
                    print('---\n{}\n>{}\n{}\n{}\n>{}\n{}'.format ("file1.seq", i.id,i.seq, "file2.seq", l.id,l.seq))

ADD COMMENT • link 3.0 years ago by cpad0112 21k

0

Entering edit mode

Thank you a lot - yes this code works as expected :) - now I have to change the output in a table format (if you have some hints it would be great if you share) - thank you again :)

ADD REPLY • link 3.0 years ago by Apex92 ▴ 280

0

Entering edit mode

table format? Post example output.

ADD REPLY • link 3.0 years ago by cpad0112 21k

0

Entering edit mode

yes exactly - something like below:

     file2   seq3   seq9
file1 

seq1          1         0

seq2          0         0

ADD REPLY • link 3.0 years ago by Apex92 ▴ 280

0

Entering edit mode

I still get seq1 and seq3 as output which is correct but then I do not have the table format. My current output is:

---
file1.seq
>seq1
ACCGGT
file2.seq
>seq3
CCGGT

ADD REPLY • link 3.0 years ago by Apex92 ▴ 280

0

Entering edit mode

I need to have the other sequences in the report as well - so basically in the table format I have all of the seq.ids against each other - if they heterodimerize then should be marked 1 and if they do not hetero-dimerize then should be marked 0

as the table I shared above.

ADD REPLY • link 3.0 years ago by Apex92 ▴ 280

score 0 · Answer 2 · 2021-04-08

0

Entering edit mode

3.0 years ago

cpad0112 21k

This is not a python solution, OP wants.. But this can be done with seqkit in following way:

input:

$ cat test1.fa                                                                                                                                                                          

>seq1
ACCGGT
>seq2
AATTTC

$ cat test2.fa                                                                                                                                                                          
>seq3
CCGGT
>seq9
GGGGGGCCC

output:

$ seqkit sliding -W 5 -s 1 test1.fa | seqkit --quiet seq -rp -t dna  | seqkit seq -s  | parallel 'seqkit grep -Psrp {} test2.fa'                                                        

>seq3
CCGGT

ADD COMMENT • link 3.0 years ago by cpad0112 21k

0

Entering edit mode

I make reverse complement of every 5mer in the file1 (for each seq) and then look if they exist inside the records in file2.
I only look if any reverse-complement from records of file1 exists in the records of file 2 - I do not need to make reverse-complement from the records of file2
each sequence in the file1 has two 5mers - so for the seq1 in file 1 there are ACCGG and CCGGT as 5mers

ADD REPLY • link 3.0 years ago by Apex92 ▴ 280

0

Entering edit mode

Thanks for replying. I have edited the reply based on your feedback. However note that seq1 is tricky. Both the 5 mers from seq1 (ACCGG and CCGGT) matches with CCGGT of seq3. ACCGG is rc match of seq3.

ADD REPLY • link 3.0 years ago by cpad0112 21k

0

Entering edit mode

Actually not - the seq1 has two 5mers as you mentioned (ACCGG and CCGGT) and their respective reverse complements are (CCGGT and ACCGG) that then only CCGGT can be found in seq3. ACCGG is not in seq3.

ADD REPLY • link 3.0 years ago by Apex92 ▴ 280

0

Entering edit mode

Pentamers from seq1 hexamer, ACCGG and CCGGT are reverse complementary to each other. seqkit searches are in both strands, by default. Function I wrote above searches only positive strands (-P). If I remove -P option it would print seq3 twice

ADD REPLY • link 3.0 years ago by cpad0112 21k

0

Entering edit mode

actually, I need to mention that I do not make reverse complement 5mer from the sequences in file2 _ I only want to look if the reverse complement 5mer that I have from file 1 in file2. With that being said the CCGGT is the reverse complement of the first 5mer in seq1 (ACCGG) and CCGGT is in the file2 so that's it :)

ADD REPLY • link 3.0 years ago by Apex92 ▴ 280

0

Entering edit mode

thank you for your input - I would like to have all sequence ids against each other (2*2) showing if they heterodimerize (as I made an expected output example). Can I run your command on MacOS?

ADD REPLY • link 3.0 years ago by Apex92 ▴ 280

0

Entering edit mode

seqkit is available for mac as I understand. I am not sure how big fasta files (number and length of sequences). you can do a serialization, but it would take long time, but it would not hang the machine. Btw, is the output from example files correct ?

ADD REPLY • link 3.0 years ago by cpad0112 21k

0

Entering edit mode

yes the output is correct :)

ADD REPLY • link 3.0 years ago by Apex92 ▴ 280

0

Entering edit mode

dear @cpad0112 for some unknown reason I can not find your python script on this post - I copy-pasted it from the notification section - I am pasting it here please let me know if this is correct. I ran it - it gives the seq3 as output (multiple times) - however I would like to know which sequences hetero-dimerizes with the seq3 (both should be printed).

import os
from Bio import SeqIO
from Bio.Seq import Seq
records=list(SeqIO.parse("test1.fa","fasta"))
window_size = 5
step_size = 1
motifs=[]
for i in records:
    for j in range(0, len(i.seq)-window_size+1):
        motifs.append(i.seq.reverse_complement()[j:j+window_size])
        motif_find=(r for r in SeqIO.parse("test2.fa", "fasta") if r.seq in motifs)
        for i in motif_find:
            print('>{0}\n{1}'.format(i.id, i.seq), end="")

ADD REPLY • link 3.0 years ago by Apex92 ▴ 280

0

Entering edit mode

Earlier code had an issue. Issue is that it would print if sequence is exactly as motif i.e if motif is part of larger sequence it would not print. I updated the code. May be some one else here will update it and post a better code.

ADD REPLY • link 3.0 years ago by cpad0112 21k

score 0 · Answer 3 · 2021-04-08

I am not a good python programmer. Try following with input files first:

#! /usr/bin/env python3
import os
from Bio import SeqIO
from Bio.Seq import Seq

records=list(SeqIO.parse("test1.fa","fasta"))

window_size = 5
step_size = 1

motifs=[]

for i in records:
    for j in range(0, len(i.seq)-window_size+1):
        motifs.append(i.seq.reverse_complement()[j:j+window_size])

target_records=list(SeqIO.parse("test2.fa","fasta"))

for i in motifs:
    for j in target_records:
        if j.seq.find(i)!=-1:
            print ('>{}\n{}'.format(j.id,j.seq))

test1.fasta is file1.fa and test2.fasta is file2.fa from above. Let me know if code works.