I have big fasta files (about 1-2 GB ) received by bedtools getfasta command each file has a header with information (chromosome, coordinates, and strand) and a sequence of 10 nucleotides.
I have to select only the sequences which have pyrimidines in 4-5 columns and generate fasta and bed files for them.
I wrote a python script for it, but the problem is it takes a lot of time. here is my code:
#This script generates new files which contain only pyrimidines. from Bio import SeqIO import sys import traceback import warnings import argparse PYRIMIDINES = ["TT","TC","CT","CC"] def bedFormat(Id): """This function manipulates the fasta file header for getting appropriate bed file """ chromosome = Id[Id.find(':') +2:Id.rfind(':')] start = Id[Id.rfind(':') +1:Id.find('-')] end = Id[Id.find('-') +1:Id.find('()')] strand = Id[Id.find('>') +1:Id.find(':')] lineBed = (chromosome,start,end,strand) line = "\t".join(lineBed) + "\n" return line def writeFiles(path_to_directory,path_to_file,name): """ This function gets fasta file and writes new filtered bed and fasta files according to the original file""" bed_file = (open(path_to_directory + "/filtered_bed/" + name + ".bed", "wb+")) #filtered bed file. fasta_file = (open(path_to_directory + "/filtered_fasta/" + name, "wb+")) #filtered fasta file. with open(path_to_file, mode='r') as handle: for record in SeqIO.parse(handle, 'fasta'): seq = record.seq.upper() if seq[3:5] in PYRIMIDINES: #If there are pyrimidines on the 4 and 5 columns. bed_file.writebedFormatrecord.id)) SeqIO.write(record, fasta_file, "fasta") bed_file.close() fasta_file.close() def main(): path_to_directory = sys.argv #The directory path path_to_file = sys.argv #The file's path name = sys.argv #The file's name writeFiles(path_to_directory,path_to_file,name) if __name__ == '__main__': main()
For example, if this is the input file:
>+::chrY:59351341-59351351() tttcctagcc >+::chrY:59352450-59352460() agctttagta >+::chrY:59352832-59352842() ggttttaggg >-::chrY:59358570-59358580() GCTGAGAGCC >-::chrY:59363409-59363419() ggttaggggt
the desired output fasta file is:
>+::chrY:59351341-59351351() tttcctagcc >+::chrY:59352450-59352460() agctttagta >+::chrY:59352832-59352842() ggttttaggg
without the sequences:
>-::chrY:59358570-59358580() GCTGAGAGCC >-::chrY:59363409-59363419() ggttaggggt
which have GA or TA in 4,5 columns.
and the matching desired output bed file - contains only the sequences in the fasta file is:
chrY 59351341 59351351 + chrY 59352450 59352460 + chrY 59352832 59352842 +
Can anyone suggest a better solution for this goal?