Compare two files to get the position
6.6 years ago
skjobs1234 ▴ 40

I have two files, and I want to know the start and end positions at which the sequence in File_2 matches the sequence in File_1. For example:

File_1

VMLLVHYAIIGPGLQAKATREAQKRTAAGIMKNPTVDGITVIDLEPISYD PKFEKQLGQVMLLVLCAGQLLLMRTTWAFCEVLTLATGPILTLWEGNP FWNTTIAVSTANIFRGSYLAGAGLAFSLIKNAQTPRRGTGTTGETLGE KRQLNSLDRKEFEEYKRSGILEVDRTEAKSALKDGSKIKHAVSRGSS RWIVERGMVKPKGKVVDLGCGRGGWSYYMATLKNVTEVKGYTKGGP

File_2

FWNTTIAVSTANIFRGSYLAGAGLAFSLIKNAQTPRRGTGTTGETLGE KRQLNSLDRKEFEEYKRSGILEVDRTEAKSALKDGSKIKHAVSRGSS RWIVERGMVKPKGKVVDLGCGRGGWSYYMATLKNVTEVKGYTKGGP

Here the File_2 sequence, which starts with FWNTTIAVST..., matches File_1 starting at position 100 and ending (at YTKGGP) at position 250. I want to print this start and end position as 100-250.

The script should be in Python or Perl.

Python Perl shoujun.gu • 1.3k views
6.6 years ago
5heikki 11k

Assuming those are actually FASTA-formatted files:

blastp -query File_2.faa -subject File_1.faa -outfmt '6 qlen length nident sstart send' \
| awk 'BEGIN{OFS=FS="\t"}{if($1==$2 && $1==$3){print $4,$5}}'
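
For readers who prefer to do the filtering step in Python, here is a minimal sketch of what the awk filter above does: it keeps only hits where the alignment length and the number of identical residues both equal the query length (an exact, full-length match) and reports the subject start and end. The function name full_length_hits and the sample rows are hypothetical, made up purely for illustration; the column order is assumed to be the one requested in the -outfmt string.

```python
def full_length_hits(blast_lines):
    """Yield (sstart, send) for hits covering the whole query at 100% identity.

    Assumes tab-separated columns in this order: qlen, length, nident, sstart, send.
    """
    for line in blast_lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        qlen, length, nident, sstart, send = (int(x) for x in line.split("\t"))
        # Same test as the awk filter: $1==$2 && $1==$3
        if qlen == length and qlen == nident:
            yield sstart, send


# Made-up rows for illustration only (not real BLAST output)
sample = [
    "10\t10\t10\t21\t30",  # full-length, identical: kept
    "10\t8\t7\t3\t10",     # partial alignment: dropped
]
hits = list(full_length_hits(sample))
for sstart, send in hits:
    print("%d-%d" % (sstart, send))
```

In practice the lines would come from the blastp output (for example, piped in on standard input) rather than from a hard-coded list.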
6.6 years ago

Hello,

You can use Python's str.index() to find the position of a given substring. Once you know the start position, you can calculate the end position by taking the length of your sequence into account.

fin swimmer


Hi Fin Swimmer, I have no experience with Python scripting. If you could write the script, that would be great. Otherwise, please show me how to do the equivalent of str.index() in Perl; I know a little Perl.


For python2:

file1 = "VMLLVHYAIIGPGLQAKATREAQKRTAAGIMKNPTVDGITVIDLEPISYDPKFEKQLGQVMLLVLCAGQLLLMRTTWAFCEVLTLATGPILTLWEGNPFWNTTIAVSTANIFRGSYLAGAGLAFSLIKNAQTPRRGTGTTGETLGEKRQLNSLDRKEFEEYKRSGILEVDRTEAKSALKDGSKIKHAVSRGSSRWIVERGMVKPKGKVVDLGCGRGGWSYYMATLKNVTEVKGYTKGGP"
file2 = "FWNTTIAVSTANIFRGSYLAGAGLAFSLIKNAQTPRRGTGTTGETLGEKRQLNSLDRKEFEEYKRSGILEVDRTEAKSALKDGSKIKHAVSRGSSRWIVERGMVKPKGKVVDLGCGRGGWSYYMATLKNVTEVKGYTKGGP"

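# str.index() returns the 0-based offset of the first match (ValueError if file2 is not found)
start_index = file1.index(file2)

# Report 1-based, inclusive start and end positions as "start-end"
print("%d-%d" % (start_index + 1, start_index + len(file2)))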
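
The snippet above hard-codes both sequences. As a rough sketch of how the same idea could be applied to the actual files, the following Python 3 code reads each file, drops any whitespace (the sequences in the question contain spaces) and any FASTA header lines, and reports the 1-based, inclusive span of the first exact match. The helper names load_sequence, match_span and report are hypothetical, and the file names passed to report are assumed to be whatever the two files are actually called.

```python
def load_sequence(path):
    """Read a sequence file, skipping FASTA header lines and removing all whitespace."""
    parts = []
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                continue  # FASTA header, not part of the sequence
            parts.append("".join(line.split()))
    return "".join(parts)


def match_span(subject, query):
    """Return the 1-based inclusive (start, end) of query in subject, or None."""
    if not query:
        return None
    start = subject.find(query)  # 0-based offset, or -1 if absent
    if start == -1:
        return None
    return start + 1, start + len(query)


def report(subject_path, query_path):
    """Print "start-end" for the first exact match of the query file in the subject file."""
    span = match_span(load_sequence(subject_path), load_sequence(query_path))
    if span is None:
        print("no exact match")
    else:
        print("%d-%d" % span)


# Toy example: "CDE" occupies positions 3 to 5 of "ABCDEFG"
print(match_span("ABCDEFG", "CDE"))  # (3, 5)
```

Calling report("File_1", "File_2") would then print the positions for the two files from the question. Note that this only finds exact matches; for sequences with mismatches or gaps, an alignment tool such as blastp is the better choice.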
6.6 years ago

A one-liner using bash :-)

sdiff \
     <(echo "VMLLVHYAIIGPGLQAKATREAQKRTAAGIMKNPTVDGITVIDLEPISYD PKFEKQLGQVMLLVLCAGQLLLMRTTWAFCEVLTLATGPILTLWEGNP FWNTTIAVSTANIFRGSYLAGAGLAFSLIKNAQTPRRGTGTTGETLGE KRQLNSLDRKEFEEYKRSGILEVDRTEAKSALKDGSKIKHAVSRGSS RWIVERGMVKPKGKVVDLGCGRGGWSYYMATLKNVTEVKGYTKGGP" | tr -d ' ' | grep -o .) \
     <(echo "FWNTTIAVSTANIFRGSYLAGAGLAFSLIKNAQTPRRGTGTTGETLGE KRQLNSLDRKEFEEYKRSGILEVDRTEAKSALKDGSKIKHAVSRGSS RWIVERGMVKPKGKVVDLGCGRGGWSYYMATLKNVTEVKGYTKGGP" |   tr -d ' ' | grep -o .)  | \
   awk '($1==$2)' | cut -f1 | tr -d '\n' | fold -w 20 


FWNTTIAVSTANIFRGSYLA
GAGLAFSLIKNAQTPRRGTG
TTGETLGEKRQLNSLDRKEF
EEYKRSGILEVDRTEAKSAL
KDGSKIKHAVSRGSSRWIVE
RGMVKPKGKVVDLGCGRGGW
SYYMATLKNVTEVKGYTKGG

I only want the start and end positions of the File_2 sequence within File_1, that is, where the shared sequence sits in File_1. Suppose that:

File_1

VMLLVHYAIIGPGLQAKATREAQKRTAAGIMKNPTVDGITVIDLEPISYD PKFEKQLGQVMLLVLCAGQLLLMRTTWAFCEVLTLATGPILTLWEGNP FWNTTIAVSTANIFRGSYLAGAGLAFSLIKNAQTPRRGTGTTGETLGE KRQLNSLDRKEFEEYKRSGILEVDRTEAKSALKDGSKIKHAVSRGSS RWIVERGMVKPKGKVVDLGCGRGGWSYYMATLKNVTEVKGYTKGGP

File_2

FWNTTIAVSTANIFRGSYLAGAGLAFSLIKNAQTPRRGTGTTGETLGE KRQLNSLDRKEFEEYKRSGILEVDRTEAKSALKDGSKIKHAVSRGSS RWIVERGMVKPKGKVVDLGCGRGGWSYYMATLKNVTEVKGYTKGGP

Here the File_2 sequence, which starts with FWNTTIAVST..., matches File_1 starting at position 100 and ending (at YTKGGP) at position 250. I want to print this start and end position as 100-250, not the sequences. I just want the positions as numbers, like start 100 and end 250.
