Question

How to efficiently find any instances of two hits (LTRs) in close proximity to one another in the genome? (Finding full-length proviruses)

0

Entering edit mode

6.8 years ago

rmartson ▴ 30

So I need to find any instances where two LTRs are within a certain distance of each other in the genome. I started Python recently and wrote the first thing I could think of that would achieve this purpose. It's a very stupid way of going about it, I think. It uses three for loops to first find an instance of three fasta records with the same ID, so I think that means it has to check for any possible combination of elements between three lists. It takes forever at larger list lengths in the thousands.

This entire script is actually meant to retrieve the alleged flanks of full-length proviruses and single LTRs and put them into separate files but the relevant part is just the three for loops starting at "for hit_record1 in hits".

"""This script retrieves 100bp flanking sequences on either side of single LTR and full-length retroviral elements in the human genome.
The script relies on two separate fasta files: One with aligned hit sequences and one with complete genomic sequences.
Flanks for full-length proviruses are filtered into a separate file by comparing the distance between every pair of LTRs within the same genomic region.
Any hits which do not possess 100bp sequence information on either side of the viral element are filtered out."""

import os
from Bio import SeqIO

print "Enter dataset"
dataset = raw_input()

flist = list("")
completes = list(SeqIO.parse("complete" + str(dataset) + ".fasta", "fasta"))
hits = list(SeqIO.parse("hit" + str(dataset) + ".fasta", "fasta"))
flanks = open("flanks" + str(dataset) + ".fasta", "w")
fullflanks = open("fullflanks" + str(dataset) + ".fasta", "w")

for hit_record1 in hits:
    for hit_record2 in hits:
        for complete_record in completes:
            if hit_record1.id == complete_record.id and hit_record2.id == complete_record.id):
                start1 = complete_record.seq.find(hit_record1.seq)
                start2 = complete_record.seq.find(hit_record2.seq)              
                if (7000 <  start2-start1 and start2-start1 < 8600):
                    flist.append(hit_record1.seq)
                    flist.append(hit_record2.seq)
                    if(start1>100 and (len(complete_record.seq)-start2)>100): 
                        print >> fullflanks, '>' + strcomplete_record.id)
                        print >> fullflanks, complete_record.seq[start1-100:start1]+complete_record.seq[start2:start2+100]

for index, hit_record in enumerate(hits):
    for complete_record in completes:
        if hit_record.id == complete_record.id and hit_record.seq not in flist):
            start = complete_record.seq.find(hit_record.seq)
            if (start>100 and (len(complete_record.seq)-(start+len(hit_record.seq)))>100):
                print >> flanks, '>' + strhit_record.id) + str(index)
                print >> flanks,  complete_record.seq[start-100:start]+complete_record.seq[start+len(hit_record):start+len(hit_record)+100]


flanks.close()
fullflanks.close()

sltrs = list(SeqIO.parse("flanks" + str(dataset) + ".fasta", "fasta"))
fulls = list(SeqIO.parse("fullflanks" + str(dataset) + ".fasta", "fasta"))

print "Total hits:"
print len(hits)
print "LTR elements with flanks:"
print len(sltrs)+len(fulls)
print "Single LTRs with flanks:"
print len(sltrs)
print "Full-length elements with flanks:"
print len(fulls)

for record in sltrs:
    if len(record.seq)<200:
         print record.id, str(len(record.seq))+"bp", "unflanked sLTR"

for record in fulls:
    if len(record.seq)<200:
      print record.id, str(len(record.seq))+"bp", "unflanked full-length element"

Biopython Blast • 1.5k views

ADD COMMENT • link 6.8 years ago by rmartson ▴ 30

1

Entering edit mode

Could you be more specific in your question. What type of data do you have ? I guess a reference genome file and a LTR sequences file. Both in fasta.

If so, my first idea would be to align the LTR sequence file to the reference genome. To extract their alignment position and to analyze them in R using GenomicRanges (distance or distanceToNearest function for example)

I guess you could also use bedops or bedtools if you have a bed file as results for your alignment.

ADD REPLY • link 6.8 years ago by Nicolas Rosewick 11k

0

Entering edit mode

Yes, one reference genome file (many complete sequences) and one aligned sequences file both in fasta format as you said.

This whole thing would be much easier if there were some format of BLAST output that gave complete sequences as well as the locations of hit starts. The best I get is the location of "features".

So how would I go about aligning thousands of LTR sequences to their many genomic locations? And what kind of results would that give me?

Since I'm dealing with two files full of different IDs, I still need a step to match aligned sequence IDs to complete sequence IDs before alignment can take place I think

ADD REPLY • link 6.8 years ago by rmartson ▴ 30