Question

I have start and end position of sequences, want to remove redundancy

0

Entering edit mode

7.7 years ago

liningouc • 0

I have a list of start and end position information from protein database, and some sequences are overlapped with each other because of the existence of isoforms. I want to remove overlapped sequences and keep the longest one. How could I achieve this?

Data is like:

KN150702.1  512 66743 
KN150702.1  4526    75660 
KN150702.1  51685   52551 
KN150702.1  75503   111816 
KN150702.1  126256  146772 
KN150702.1  155049  175903 
KN150702.1  177161  211884 
KN150703.1  4605    14526 
KN150703.1  16536   18921 
KN150703.1  16536   18879 
KN150703.1  23158   47525 
KN150703.1  36969   40261 
KN150703.1  42415   46815

And the results should be:

KN150702.1  4526    75660
KN150702.1  126256  146772
KN150702.1  155049  175903
KN150702.1  177161  211884
KN150703.1  4605    14526
KN150703.1  16536   18921
KN150703.1  23158   47525

genome sequence • 1.2k views

ADD COMMENT • link updated 7.6 years ago by GouthamAtla 12k • written 7.7 years ago by liningouc • 0

0

Entering edit mode

7.6 years ago

second_exon ▴ 210

If you are still looking for a solution, here is my solution based on Python 3.x Pandas

Your data file with headers (dat.txt)

name    start   end
KN150702.1  512 66743 
KN150702.1  4526    75660 
KN150702.1  51685   52551 
KN150702.1  75503   111816 
KN150702.1  126256  146772 
KN150702.1  155049  175903 
KN150702.1  177161  211884 
KN150703.1  4605    14526 
KN150703.1  16536   18921 
KN150703.1  16536   18879 
KN150703.1  23158   47525 
KN150703.1  36969   40261 
KN150703.1  42415   46815

Python Script

import pandas as pd
df = pd.read_table('dat.txt')
df = df.sort_values('name', ascending=1)
df['diff'] = df['end']-df['start']
final = pd.DataFrame(columns=list(df.columns.values))
while True:
    mask = df['name'] == df.iloc[0, 0]
    df1, df = df[mask], df[~mask]
    df1 = df1.sort_values('start', ascending=1)
    i = 0
    while True:
        if i == len(df1)-1: break
        if int(df1.iloc[i+1, 1]) <= int(df1.iloc[i,2]):
            if int(df1.iloc[i+1, 3]) < int(df1.iloc[i,3]):
                df1.drop(df1.index[[i+1]], inplace=True)
            elif int(df1.iloc[i+1, 3]) > int(df1.iloc[i,3]):
                df1.drop(df1.index[[i]], inplace=True)
        else:
            i+=1
    final = final.append(df1)
    if len(df)==0: break
del final['diff']
print(final)
final.to_csv("non_overlapping.csv", index=False)

This saves the output to a csv file (non_overlapping.csv) in the same working directory.

Output:

name,start,end
KN150702.1,4526.0,75660.0
KN150702.1,126256.0,146772.0
KN150702.1,155049.0,175903.0
KN150702.1,177161.0,211884.0
KN150703.1,4605.0,14526.0
KN150703.1,16536.0,18921.0
KN150703.1,23158.0,47525.0

ADD COMMENT • link 7.6 years ago by second_exon ▴ 210

score 4 · Accepted Answer · 2016-08-31

4

Entering edit mode

7.7 years ago

GouthamAtla 12k

cat in.bed | sort -k1,1 -k2,2n | mergeBed -i - > out.bed

Makes sure that its a tab delimited file.

Update: I do not want to merge them. I only need to keep the longest one and remove the others

clusterBed -i test.bed | \
awk -v OFS="\t" '{ print $0,$3-$2}' | \
sort -k4,4 -k5,5nr | \
groupBy -full -i - -g 4 -c 5 -o max | \
cut -f1,2,3

output:

KN150702.1  4526    75660
KN150702.1  126256  146772
KN150702.1  155049  175903
KN150702.1  177161  211884
KN150703.1  4605    14526
KN150703.1  16536   18921
KN150703.1  23158   47525

ADD COMMENT • link 7.6 years ago by GouthamAtla 12k

1

Entering edit mode

Very minor nitpick: the pipe character can finish the line, that is, it does not need to be followed by an escaped newline.

ADD REPLY • link 7.6 years ago by Charles Plessy ★ 2.9k

0

Entering edit mode

Thanks for your answer, but actually I do not want to merge them. I only need to keep the longest one and remove the others.

ADD REPLY • link 7.7 years ago by liningouc • 0

0

Entering edit mode

I update the answer. Accept the answer if it works so that it won't bump again in future.

ADD REPLY • link 7.6 years ago by GouthamAtla 12k