Question: Parsing dataframe and sort a fasta file
0
gravatar for Darill
17 months ago by
Darill30
Darill30 wrote:

Hi all, I need some help. In fact in the contexte of my work I have a dataframe of seq names paired together, in each row there is one column for the seq_id of the sp1 and one fore the seq _idof the sp2. In another hand I have two fasta files which contain all these sequences (same seq Id) + the sequences in fasta format. But in these files sequences are totaly mixed and what I need to do is to reorganize two new fasta file by parsing my dataframe and say, ok for each row, put the seqx_A in fasta file 1 and seqx_B in fasta file 2. By keeping the order in the dataframe. Here is an exemple:

I actually have one dataframe with sequences in order such :

Seq_1.id    Seq_2.id
seq1_A     seq8_B
seq2_A     Seq9_B
seq3_A     Seq10_B
seq4_A     Seq11_B

and two fasta files such : first one:

>Seq11_B
ACTG
>seq8_B
ATGC
>seq3_A
ACTG
>seq2_A
ATGC

second one: 
>seq4_A
 ACTG
>seq1_A
 ACTG
>Seq10_B
 ATGC
>Seq9_B
 ATCG

As you can see _A and _B are mixed in bot fasta file but I would like to order my fastafiles by creating a new ones and put all seq A in a file and all seqB in another file in the same order as in the dataframe (paires sequence as always to be added in the same time in the file). here would be the output of the exemple:

fasta1:

>seq1_A
ATGC
>seq2_A 
ATGG
>seq3_A 
ATGC
>seq4_A 
ATGC

and fasta2:

>seq8_B
ATGc
>Seq9_B
ATGC
>Seq10_B
ATGC
>Seq11_B
ATGC

Here would be the name of the files:

candidate_df.read_csv("dn_ds.out_test",sep='\t')

#--------------------------------------
#Load the sequences comming from the cluster filtering and range them into ordered files per species

#here is the two columns of the dataframe
seq1_id=candidate_df["seq1_id"]
seq2_id=candidate_df["seq2_id"]

#Here is the output desired files:
output_aa_sp1 = open('candidates_aa_0042.fasta','w')
output_aa_sp2 = open('candidates_aa_0035.fasta','w')


#Here are the 2 fasta file to be modified
record_dict_sp1_aa = SeqIO.to_dict(SeqIO.parse("result1_aa.fasta", "fasta"))
record_dict_sp2_aa = SeqIO.to_dict(SeqIO.parse("result2_aa.fasta", "fasta"))

Does someone have an idea?

Thank you :)

pandas python fasta • 1.7k views
ADD COMMENTlink modified 17 months ago by Bastien Hervé4.5k • written 17 months ago by Darill30
3
gravatar for Bastien Hervé
17 months ago by
Bastien Hervé4.5k
Limoges, CBRS, France
Bastien Hervé4.5k wrote:

Biopython is powerful !

import pandas as pd
from Bio import SeqIO

output_handle1 = open("new_fasta1.fasta", "a")
output_handle2 = open("new_fasta2.fasta", "a")

records1 = SeqIO.index("fasta1.fasta", "fasta")
records2 = SeqIO.index("fasta2.fasta", "fasta")

candidate_df=pd.read_csv("dispatch.csv",sep='\t')

for i in candidate_df['Seq_1.id']:
    if i in records1:
        SeqIO.write(records1[i], output_handle1, 'fasta')
    elif i in records2:
        SeqIO.write(records2[i], output_handle1, 'fasta')

for i in candidate_df['Seq_2.id']:
    if i in records1:
        SeqIO.write(records1[i], output_handle2, 'fasta')
    elif i in records2:
        SeqIO.write(records2[i], output_handle2, 'fasta')
ADD COMMENTlink modified 17 months ago • written 17 months ago by Bastien Hervé4.5k

Hi think it is working, do you know how can I count how many record I have in a fasta file with biopython?

ADD REPLYlink written 17 months ago by Darill30

You mean in new_fasta1 and new_fasta2 ?

  • Put a count for each file

Or in general ?

records = SeqIO.index("your_file.fasta", "fasta")
print(len(records))
ADD REPLYlink modified 17 months ago • written 17 months ago by Bastien Hervé4.5k

BioPython is powerful but slow and, perhaps, overkill for this task.

ADD REPLYlink written 17 months ago by Alex Reynolds29k

Indeed, if time really matter to him, he should swap to high level language as C. But he mentions python so I gave him this solution. He also could have write something in python from scratch like you did but i'm not even sure that would be quicker. I know, by testing, that Biopython is memory consuming. I'm just curious, do you have a paper on hand on this biopython downside ? Thanks !

ADD REPLYlink written 17 months ago by Bastien Hervé4.5k

I do not have a paper on this, I'm sorry. I'm just relating past (anecdotal) experience processing FASTA files with this library.

ADD REPLYlink modified 17 months ago • written 17 months ago by Alex Reynolds29k
1
gravatar for Alex Reynolds
17 months ago by
Alex Reynolds29k
Seattle, WA USA
Alex Reynolds29k wrote:

If you must use Python, it's a two-step procedure. First, write the headers into ordered lists ("arrays", but really lists) and then use a hash table/dictionary to lookup sequences from the headers in the ordered lists.

First, read the columns of the first file into a list:

seq1 = []
seq2 = []
with open("ordered_sequences.txt", "r") as osh:
  # skip header
  osh.readline() 
  for line in osh:
    elems = line.rstrip().split('\t')
    seq1.append(elems[0])
    seq2.append(elems[1])

Once you have built seq1 and seq2, read each FASTA file separately into a hash table ("dictionary" in Python-speak).

For example, for the first FASTA file first.fa:

records = {}
sequence = ""
with open("first.fa", "r") as fh:
  for line in fh:
    line = line.rstrip()
    if line.startswith('>'):
      if len(sequence) > 0:
        records[header] = sequence
        sequence = ""
      header = line[1:]
    else
      sequence += line
# read last record into dictionary
records[header] = sequence

(Repeat for the second FASTA file, to a separate hash table or run instance.)

Then write out records from the hash table records in order, using the headers in the ordered lists.

For example, using headers in seq1:

for seq in seq1:
  sys.stdout.write(">%s\n%s\n" % (seq, records[seq]))

Or for seq2, say:

for seq in seq2:
  sys.stdout.write(">%s\n%s\n" % (seq, records[seq]))

Use the redirection operator > when running the Python script to write standard output to a file:

$ python someScript.py > orderedOutput.fa

Or use open and write to a writable file handle.

ADD COMMENTlink modified 17 months ago • written 17 months ago by Alex Reynolds29k
2
gravatar for shenwei356
17 months ago by
shenwei3564.8k
China
shenwei3564.8k wrote:

You've finished half the code, next step is just iterating the seq1_id and check if one id exists in any of record_dict_sp1_aa or record_dict_sp2_aa and printing the result.


Actually, there's no need to write scripts.

# concat two files

cat f1.fasta f2.fasta > all.fasta

# export ids for seq1 and seq2

cut -f 1 id.tsv | sed 1d > seq1_id.txt
cut -f 2 id.tsv | sed 1d > seq2_id.txt

# extract seqs by every id in ids file, using samtools faidx

cat seq1_id.txt  | while read id; do samtools faidx all.fasta $id; done > result1.fasta
cat seq2_id.txt  | while read id; do samtools faidx all.fasta $id; done > result2.fasta

# verify

$ grep '>' result1.fasta 
>seq1_A
>seq2_A
>seq3_A
>seq4_A

$ grep '>' result2.fasta 
>seq8_B
>Seq9_B
>Seq10_B
>Seq11_B
ADD COMMENTlink modified 17 months ago • written 17 months ago by shenwei3564.8k

Hi, thank you, I think it will be usefull for other peoples but I actually need to code it with python because all my pipeline is coded with it..

ADD REPLYlink written 17 months ago by Darill30

I've just updated the answer.

You've finished half the code, next step is just iterating the seq1_id and check if one id exists in any of record_dict_sp1_aa or record_dict_sp2_aa and printing the result.

ADD REPLYlink written 17 months ago by shenwei3564.8k

I wrote an answer to show you what I tried

ADD REPLYlink written 17 months ago by Darill30
0
gravatar for Darill
17 months ago by
Darill30
Darill30 wrote:

Hi tried this :

for a, b in zip(seq1_id,seq2_id): 
    if a in record_dict_sp1_aa:

        print(">",record_dict_sp1_aa[a].id,file=output_aa_sp1,sep="")
        print(record_dict_sp1_aa[a].seq,file=output_aa_sp1)

        print(">",record_dict_sp2_aa[b].id,file=output_aa_sp2,sep="")
        print(record_dict_sp2_aa[b].seq,file=output_aa_sp2)

    else:

        print(">",record_dict_sp2_aa[a].id,file=output_aa_sp1,sep="")
        print(record_dict_sp2_aa[a].seq,file=output_aa_sp1)

        print(">",record_dict_sp1_aa[b].id,file=output_aa_sp2,sep="")
        print(record_dict_sp1_aa[b].seq,file=output_aa_sp2)

But in my first dataframa I have 196 rows, so there should be 196 seq in the fasta1 and 196 in the fasta2 but I actually get 191 in fasta1 and 184 in fasta2.. And I check theses sequences are well present in the two first fasta files...

ADD COMMENTlink written 17 months ago by Darill30

Bad logic. You need to iterate seq1_id and seq2_id separately.

ADD REPLYlink written 17 months ago by shenwei3564.8k
1
for id in seq1_id:
    if id in record_dict_sp1_aa:
        # output record_dict_sp1_aa[id]
        pass
    elif id in record_dict_sp2_aa:
        # output record_dict_sp2_aa[id]
        pass
    else:
        # reporting a id not in any files

for id in seq2_id:
    pass
ADD REPLYlink written 17 months ago by shenwei3564.8k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1721 users visited in the last hour