Question

Looking for sequence within an ORF

0

Entering edit mode

8.0 years ago

emilia.skirmuntt ▴ 10

Hi,

I have sequences in two text files. One of them contains sequences in a number of contigs and the other have positions for ORFs in same contigs. I need to compare those two checking if sequences are within ORFs. I made two dictionaries from both files with contig as a key. Then I wanted to compare positions of my sequences and my ORFs to check if sequence is within an orf. And I'm kind of stuck and not sure how to tackle it right now. Thanks in advance for any help!

I'm using Python 3

Part of my seq.txt (contig, query id, query start, query end, subject start, subject end - I concentrate on contig as a key, query id because I need those for the output, and subject start and end becuase this is the thing I am comparing with orfs positions ) :

NW_005871048.1  gi|63108247|dbj|BAD98245.1| 347 549 24077230    24077775
NW_005871048.1  gi|334705|gb|AAA47712.1|    327 504 24077224    24077706
NW_005871048.1  gi|288549222|gb|ADC52789.1| 320 479 17303856    17304323
NW_005871048.1  gi|288549222|gb|ADC52789.1| 336 500 24077257    24077706
NW_005871049.1  gi|8272468|gb|AAF74215.1| 267   412 17303904    17304350
NW_005871049.1  gi|21070180|gb|AAM34209.1|  478 622 5615586 5616002
NW_005871049.1  gi|3309126|gb|AAC31806.1|   531 661 5615616 5615993
NW_005871048.1  gi|55979255|gb|AAV68489.1|  470 600 5615616 5615993
NW_005871050.1  gi|334705|gb|AAA47712.1|    443 561 5615637 5615993

Part of my orf.txt (contig - as a key, start, end, seq len, frame, seq start, seq end -I need only last two values for my comparison):

NW_005871048.1  NNLQMNLSVLLPMQELHPWPHPRLMHFSPV  KRT 223 -1  0   666
NW_005871048.1  GNSFSSSCGVTTGNGCPWLPCHSCSGVRTG  VGR 103 1   135 447
NW_005871048.1  ESGGTLSTLPDLSDCLTFQPWDGCMQVQQS  ESC 124 1   836 1211
NW_005871048.1  LSDLPALGWLHAGSAESPASSPHHPHNFLL  LRK 113 1   877 1219
NW_005871048.1  YRQIYMPRPRRTANTMNNHGNEDDQKENEK  HSL 156 1   2204    2675
NW_005871048.1  IFESKEQTEYLVTSLQRGPFQVPLVLKLSK  PQL 136 1   3020    3431
NW_005871048.1  QSGSKTKLNHMLAAKIHLNYKDKVKVNGWE  IVH 116 1   3923    4274
NW_005871049.1  LANGRCSEGMVNYHVCLLLDRIDDPLYICD  LVL 111 -1  5510    5840
NW_005871049.1  VPSNGVKPAGPGQEQRVKAGEAVESKEGGW  ISF 171 1   6367    6883
NW_005871049.1  QWSEAGWPGTGAASEGRRGSGEQRGGMGGF  FWL 149 1   6375    6825
NW_005871049.1  NTVIGITKCDTDGSGHVKGEGQKGTHGLPF  AVC 169 1   6931    7441
NW_005871050.1  IFHPWNIPFPRESALIPQPGPGGAAQAVGT  KKF 103 -1  7103    7409
NW_005871050.1  ASAPWWPVRIIAIGHSAVQSMCILAFYYYY  EPL 110 -1  8475    8802
NW_005871050.1  ATAAGFPLLLPGAPAGFPLPLPGAVAGFPS  ISP 362 -1  9247    10330

my script so far:

vir_seq = open("seq.txt", "r")
orf = open("ORF.txt", "r")
results = open("output.txt", "w")

#files to dictionaries with contig as a key

dict_orf={}
with open('ORF.txt') as file1:
    for line in file1:
        lineSplit=line.split('\t')
        dict_orf.setdefault(lineSplit[0],[]).append((lineSplit[5],lineSplit[6].split('\n')[0]))

dict_virseq = {}
with open('seq.txt') as file2:
    for line in file2:
        lineSplit = line.split('\t')
        dict_virseq.setdefault(lineSplit[0], []).append((lineSplit[1], lineSplit[4], lineSplit[5].split('\n')[0]))

# check if seq is within an ORF : start of the seq smaller than end of orf and end of seq bigger than start of orf
def is_overlap():
        for k in dict_virseq.keys():
            for v in dict_virseq[k]:
                [v[1]>i for i in [orf[0] for orf in dict_orf[k]]]
                [v[2]<i for i in [orf[1] for orf in dict_orf[k]]]
            return True
        else:
            return False

python orfs • 2.0k views

ADD COMMENT • link updated 8.0 years ago by harold.smith.tarheel ★ 4.9k • written 8.0 years ago by emilia.skirmuntt ▴ 10

score 1 · Accepted Answer · 2016-04-15

1

Entering edit mode

8.0 years ago

harold.smith.tarheel ★ 4.9k

Do you have a compelling reason for using Python 3? If not, then interval comparison is what BEDtools is designed for. Convert your two tables to contig, start, end (minimal requirements for BED format), and name=query ID (for seq. txt), then intersect the two (see here for BEDtools download and command parameters).

ADD COMMENT • link 8.0 years ago by harold.smith.tarheel ★ 4.9k

0

Entering edit mode

Ok, I will try that. Seems easy enough for me;) I don't need to use Python 3. I was just using it before while doing some tutorials and continued with it because of that. I can use Python 2.7 for that too. Thanks for your help!

ADD REPLY • link 8.0 years ago by emilia.skirmuntt ▴ 10