So I need to find any instances where two LTRs are within a certain distance of each other in the genome. I started Python recently and wrote the first thing I could think of that would achieve this purpose. It's a very stupid way of going about it, I think. It uses three for loops to first find an instance of three fasta records with the same ID, so I think that means it has to check for any possible combination of elements between three lists. It takes forever at larger list lengths in the thousands.
This entire script is actually meant to retrieve the alleged flanks of full-length proviruses and single LTRs and put them into separate files but the relevant part is just the three for loops starting at "for hit_record1 in hits".
"""This script retrieves 100bp flanking sequences on either side of single LTR and full-length retroviral elements in the human genome. The script relies on two separate fasta files: One with aligned hit sequences and one with complete genomic sequences. Flanks for full-length proviruses are filtered into a separate file by comparing the distance between every pair of LTRs within the same genomic region. Any hits which do not possess 100bp sequence information on either side of the viral element are filtered out.""" import os from Bio import SeqIO print "Enter dataset" dataset = raw_input() flist = list("") completes = list(SeqIO.parse("complete" + str(dataset) + ".fasta", "fasta")) hits = list(SeqIO.parse("hit" + str(dataset) + ".fasta", "fasta")) flanks = open("flanks" + str(dataset) + ".fasta", "w") fullflanks = open("fullflanks" + str(dataset) + ".fasta", "w") for hit_record1 in hits: for hit_record2 in hits: for complete_record in completes: if hit_record1.id == complete_record.id and hit_record2.id == complete_record.id): start1 = complete_record.seq.find(hit_record1.seq) start2 = complete_record.seq.find(hit_record2.seq) if (7000 < start2-start1 and start2-start1 < 8600): flist.append(hit_record1.seq) flist.append(hit_record2.seq) if(start1>100 and (len(complete_record.seq)-start2)>100): print >> fullflanks, '>' + strcomplete_record.id) print >> fullflanks, complete_record.seq[start1-100:start1]+complete_record.seq[start2:start2+100] for index, hit_record in enumerate(hits): for complete_record in completes: if hit_record.id == complete_record.id and hit_record.seq not in flist): start = complete_record.seq.find(hit_record.seq) if (start>100 and (len(complete_record.seq)-(start+len(hit_record.seq)))>100): print >> flanks, '>' + strhit_record.id) + str(index) print >> flanks, complete_record.seq[start-100:start]+complete_record.seq[start+len(hit_record):start+len(hit_record)+100] flanks.close() fullflanks.close() sltrs = list(SeqIO.parse("flanks" + str(dataset) + ".fasta", "fasta")) fulls = list(SeqIO.parse("fullflanks" + str(dataset) + ".fasta", "fasta")) print "Total hits:" print len(hits) print "LTR elements with flanks:" print len(sltrs)+len(fulls) print "Single LTRs with flanks:" print len(sltrs) print "Full-length elements with flanks:" print len(fulls) for record in sltrs: if len(record.seq)<200: print record.id, str(len(record.seq))+"bp", "unflanked sLTR" for record in fulls: if len(record.seq)<200: print record.id, str(len(record.seq))+"bp", "unflanked full-length element"