I am trying to remove sequences from a FASTA file based on a txt file with IDs that need to be removed. The FASTA file contains 44306447 reads (RNA-seq data). I have tried the following script from another thread: How To Remove Certain Sequences From A Fasta File
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys
from Bio import SeqIO

fasta_file = sys.argv[1]   # Input fasta file
remove_file = sys.argv[2]  # Input file with IDs to remove, one per line
result_file = sys.argv[3]  # Output fasta file

remove = set()
with open(remove_file) as f:
    for line in f:
        line = line.strip()
        if line != "":
            remove.add(line)

fasta_sequences = SeqIO.parse(open(fasta_file), 'fasta')
with open(result_file, "w") as f:
    for seq in fasta_sequences:
        nam = seq.id
        nuc = str(seq.seq)
        if nam not in remove and len(nuc) > 0:
            SeqIO.write([seq], f, "fasta")
But unfortunately, counting headers with grep ">" shows that the output contains exactly the same number of reads as the input, so I conclude there must be a mistake in the code somewhere, though I cannot figure out where. Can someone help me get a script running that removes reads from a FASTA file based on a text file with headers, and that can handle LARGE files?