Split the fasta file based on sequence type
16 months ago
Ayish • 0

Hello,

I have a large fasta file containing both nucleotide and protein sequences. I need to separate the sequences into two files based on sequence type. Is there a Python module that can do this?

Thanks in advance.

python Fasta biopython • 761 views

Do the sequence identifier lines tell you whether each record is DNA or protein? Or will you have to guess, even for a peptide made only of glycine, alanine, cysteine and threonine (G, A, C, T)?


Unfortunately, no. It would be guesswork, I think.


Here is a snippet in Python. You could also try the Biopython module. Be aware that this guesswork can go very wrong if your sequences contain IUPAC nucleotide symbols other than A, T, C and G: https://www.bioinformatics.org/sms/iupac.html

https://colab.research.google.com/drive/1XSQBDoLIyQUGwUJvXRtZHkcXxsVU6oRH?usp=sharing

# Write a small example file (note: no trailing newline after the last record)
with open('example.fasta', 'w') as f:
    f.write('>seq1DNA\nATCG\n>seq2DNA\nTCGT\nTCTC\n>seq1Protein\nSSTCG\n>seq2Protein\nHYRN\nKQES')

from collections import defaultdict

def fastaparser(fasta):
    '''
    Read a fasta file and return a dict mapping each header line
    to the list of sequence lines that follow it.
    '''
    records = defaultdict(list)
    key = None
    with open(fasta, 'r') as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith('>'):
                key = line
                continue
            if key is not None:  # ignore any text before the first header
                records[key].append(line)
    return records

DNA_alphabet = {'A', 'T', 'C', 'G'}
records = fastaparser('example.fasta')
# Open both outputs once in write mode so that re-running does not append duplicates
with open('protein.fa', 'w') as protein, open('DNA.fa', 'w') as dna:
    for k, seqs in records.items():
        seq = ''.join(seqs)
        # Any letter outside A/T/C/G means we guess protein
        out = protein if set(seq.upper()) - DNA_alphabet else dna
        out.write(f"{k}\n{seq}\n")
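
To soften the IUPAC caveat a little, here is a rough sketch of a more tolerant guess. It is only a heuristic, not an established method: the function name `guess_type`, the labels it returns and the 0.9 threshold are all my own assumptions. The idea is that E, F, I, L, P, Q (plus Z, J, O, X) are amino acid codes that are not IUPAC nucleotide codes, so any of them means protein; otherwise the sequence is called nucleotide only if it is mostly A, C, G, T, U or N.

```python
# Amino acid letters that are NOT IUPAC nucleotide codes
PROTEIN_ONLY = set("EFILPQZJOX")
# Unambiguous bases plus N; the other IUPAC codes (R, Y, S, W, K, M, B, D, H, V)
# are also amino acid letters, so they cannot settle the question by themselves
CORE_NUCLEOTIDES = set("ACGTUN")

def guess_type(seq, threshold=0.9):
    """Guess 'nucleotide', 'protein' or 'unknown' for one sequence string.

    threshold is a hypothetical cut-off: the minimum fraction of
    A/C/G/T/U/N letters needed to call the sequence nucleotide.
    """
    letters = [c for c in seq.upper() if c.isalpha()]
    if not letters:
        return "unknown"
    if PROTEIN_ONLY.intersection(letters):
        return "protein"
    core = sum(c in CORE_NUCLEOTIDES for c in letters)
    return "nucleotide" if core / len(letters) >= threshold else "protein"
```

Short sequences stay hard to call either way: `SSTCG` from the example is valid under both alphabets, and only the threshold pushes it to protein here.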
16 months ago

Does the solution need to be Python? If not, you could use seqkit grep or seqkit fish to search for non-nucleotide letters in the sequences.
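
If it does have to be Python, the same "search for non-nucleotide letters" idea could be sketched with a regular expression, reading one record at a time so a large file is never held in memory. This is an illustration rather than a port of what seqkit does: the names `read_fasta` and `split_fasta` are made up, and treating only A, C, G, T, U and N as nucleotide letters is an assumption you may want to change.

```python
import re

# Assumption: anything other than A, C, G, T, U, N counts as "non-nucleotide"
NON_NUCLEOTIDE = re.compile(r"[^ACGTUN]", re.IGNORECASE)

def read_fasta(handle):
    """Yield (header, sequence) pairs one record at a time."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def split_fasta(in_path, nucl_path, prot_path):
    """Send records containing a non-nucleotide letter to prot_path, the rest to nucl_path."""
    with open(in_path) as src, open(nucl_path, "w") as nucl, open(prot_path, "w") as prot:
        for header, seq in read_fasta(src):
            out = prot if NON_NUCLEOTIDE.search(seq) else nucl
            out.write(f"{header}\n{seq}\n")
```

Usage would be something like `split_fasta('example.fasta', 'DNA.fa', 'protein.fa')`. Sequences that use ambiguity codes such as R or Y would land in the protein file with this pattern, so widen the character class if your data contains them.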
