Question: split seq in fasta file in groups according to seq similarity?
0
gravatar for Mostafa Mahmoud
2.8 years ago by
china
Mostafa Mahmoud0 wrote:

hi i made py script to split uniq and identical seq in fasta file but it actully too slow it finish 100mb file in 3 days and my data is about 1gb 18000000 seq if there is faster solution.

f=open('example2.fas','r')
f1 = open(r'identical.fasta','w')
f2 = open(r'uniq.fasta','w')

seq1=[]
ID=[]

i=0
for line in f.readlines():
    i+=1
    if i%2 != 0:
        ID.append(line)
    else:
        seq1.append(line)

for i in range(0,len(seq1)):
    if seq1.count(seq1[i])>=2:
        f1.write(ID[i]+seq1[i])
    else:
        f2.write(ID[i]+seq1[i])

f.close()
f1.close()
f2.close()
python • 826 views
ADD COMMENTlink modified 6 months ago by Biostar ♦♦ 20 • written 2.8 years ago by Mostafa Mahmoud0

You are storing everything in memory, do except if you have a lot of RAM I think runtime would not be your biggest problem for the full dataset.

ADD REPLYlink written 2.8 years ago by WouterDeCoster44k

i dont quit understand i am beginner in py and i run thise script on HPC 125000gb ram and 46 core. could you give me and example or any kind of tool do the same jop.

ADD REPLYlink modified 2.8 years ago • written 2.8 years ago by Mostafa Mahmoud0

I think your approach is naive but not too bad. Given that you have 125000gb RAM you might have enough.

A few suggestions:

  • don't use f.readlines(). Just use for line in f. The file itself is an iterator, no need to read the full file in memory first
  • I would try to use hashing of the sequences you get. In a first iteration, hash all sequences. Then, search all hashes which are duplicates, only keep those. In the second iteration, check if the sequence is a duplicate yes or no and write to the appropriate file. That might keep memory usage low, and I'd expect it to be faster although I have never done anything like that. I also don't have an appropriate dataset to test on.
ADD REPLYlink written 2.8 years ago by WouterDeCoster44k

still the same problem also i dont know how to hash the seq.

ADD REPLYlink modified 2.8 years ago • written 2.8 years ago by Mostafa Mahmoud0

If only we had method to search the internet, googling "python hash a string" gave me https://www.pythoncentral.io/hashing-strings-with-python/

Without hashing you could still modify seq1 to only contain the duplicate elements, that will make searching much faster.

ADD REPLYlink written 2.8 years ago by WouterDeCoster44k

what if i convert the data into tabular ID $1 and seq $2 there is possibility to do the job awk or grep

ADD REPLYlink written 2.8 years ago by Mostafa Mahmoud0

Do you have a multi-line or single line fasta? What is the output of grep -c '>' example2.fas?

ADD REPLYlink written 2.8 years ago by st.ph.n2.5k

full dataset 18255486

ADD REPLYlink written 2.8 years ago by Mostafa Mahmoud0

what if i convert the data into tabular ID $1 and seq $2 there is possibility to do the job awk or grep

ADD REPLYlink modified 2.8 years ago • written 2.8 years ago by Mostafa Mahmoud0

awk or grep will also work with FASTA. It is not the format of your data, it is the quantity, and efficiency of your script. Do you know the average length of your sequences? Again, do you have a multi-line or single-line FASTA? This will help in the efficacy of reading the file.

ADD REPLYlink written 2.8 years ago by st.ph.n2.5k

You need a much more sophisticated approach to reduce run time. But I would first check whether you are really looking for identical sequences, such that one base error would be a new sequence or are you looking for similarity only.

ADD REPLYlink written 2.8 years ago by jomo018610
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1255 users visited in the last hour