Question: How to extract a region or small domain (e.g. 150-300) from long protein sequences (up to 2000 aa) on the command line?
sanjeet00001992, 3.7 years ago, United States, wrote:

Hello all,

I have a large number of proteins, each 1000 to 2000 amino acids long. From each of them I want to extract only a small region, for example 150-300. How can I do this from the terminal, given that I need to extract a different region from each protein?


Well, do you know which sequences are involved and which regions you want to splice out? If yes, then:

 

perl -lne 'if(/^>(.*)/){ if(@a){print splice(@a,0,100); @a=()} $l = ($1 =~ /my_seq_id/) ? 1 : 0; print if $l } elsif($l){ push @a, split // } END{ print splice(@a,0,100) if @a }' in_file

 

NOTE: code not tested!!

 

The idea is to locate a sequence by matching its header ($1 =~ /my_seq_id/) and then extract a region starting at position 0 with length 100 (print splice(@a, 0, 100)); the END block flushes the final record at end of file. Note that the same one-liner can extract the same region from every sequence whose header shares some regex.

written 3.7 years ago by mxs530
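For reference, here is a rough Python equivalent of the one-liner above (a sketch added for illustration, not part of the original reply; my_seq_id, the 100-residue length, and the input path are placeholders):

import re
import sys

pattern = re.compile(r"my_seq_id")   # regex tested against each FASTA header
keep, seq = False, []

with open(sys.argv[1]) as fasta:
    for line in fasta:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if keep and seq:
                print("".join(seq)[:100])    # like: print splice @a, 0, 100
            keep, seq = bool(pattern.search(line[1:])), []
            if keep:
                print(line)                  # echo the matching header
        elif keep:
            seq.append(line)                 # also handles wrapped sequence lines

if keep and seq:
    print("".join(seq)[:100])                # flush the last record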

Thank you sir, but if I need to extract a different region from each protein, what should I do?

written 3.7 years ago by sanjeet00001992

Repeat the procedure for each sequence and region (a more general, single-pass script is sketched below, after Pallab Bhowmick's reply).

Extract 122-222 (a length of 101) from sequence xxcx:

 
perl -lne 'if(/^>(.*)/){ if(@a){print splice(@a,121,101); @a=()} $l = ($1 =~ /xxcx/) ? 1 : 0; print if $l } elsif($l){ push @a, split // } END{ print splice(@a,121,101) if @a }' in_file

 

Extract 22-522 (a length of 501) from sequence yyy:

 

perl -lne 'if(/^>(.*)/){ if(@a){print splice(@a,21,501); @a=()} $l = ($1 =~ /yyy/) ? 1 : 0; print if $l } elsif($l){ push @a, split // } END{ print splice(@a,21,501) if @a }' in_file

 

written 3.7 years ago by mxs530

Write a script in Python or Perl and run it from the terminal; it will do the job. A minimal sketch follows.

written 3.7 years ago by Pallab Bhowmick
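A minimal sketch of such a script in Python (added for illustration, not from the original thread): it assumes a hypothetical tab-separated file regions.tsv with one line per protein (sequence id, 1-based start, 1-based end, inclusive), which handles the different-region-per-protein case in one pass:

import sys

# Sketch only; the file names and regions.tsv format below are assumptions.
# regions.tsv format: <sequence id> <start> <end>   (1-based, inclusive)
regions = {}
with open(sys.argv[2]) as tsv:
    for line in tsv:
        if line.strip():
            name, start, end = line.split()
            regions[name] = (int(start), int(end))

def emit(header, seq):
    name = header[1:].split()[0]       # id = first word of the header
    if name in regions:
        start, end = regions[name]
        print(f"{header} ({start}-{end})")
        print(seq[start - 1:end])      # 1-based inclusive -> Python slice

header, chunks = None, []
with open(sys.argv[1]) as fasta:
    for line in fasta:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                emit(header, "".join(chunks))
            header, chunks = line, []
        else:
            chunks.append(line)        # also handles wrapped sequence lines
if header is not None:
    emit(header, "".join(chunks))      # flush the last record

Invoked, for example, as python extract_regions.py proteins.faa regions.tsv > domains.faa (all file names here are placeholders).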
Lesley Sitter, 3.7 years ago, Netherlands, wrote:

In Python you could do something like this (note that it assumes each sequence sits on a single line; the region-table sketch above also parses wrapped FASTA):

 

import sys

# assumes each record's sequence sits on a single (unwrapped) line
with open(sys.argv[1]) as fasta, open("outputfile.txt", "w") as out:
    for line in fasta:
        line = line.rstrip("\n")
        if line.startswith(">"):
            out.write(line + "\n")
        else:
            out.write(line[149:300] + "\n")  # 0-based slice = residues 150-300

 

You can then run the script with

python my_script_name.py my_input_file.fasta

and the extracted regions are written to outputfile.txt.

written 3.7 years ago by Lesley Sitter
Pierre Lindenbaum, 3.7 years ago, France/Nantes/Institut du Thorax - INSERM UMR1087, wrote:

Prepare a tab-separated list of the regions you want to extract (FASTA header, start, end):

$ cat list.tsv
>gi|72201761    10    20
>gi|14666145    11    40
>gi|83671954    13    50

 

Linearize the FASTA, join both sorted files, and extract the substring with awk (substr takes a 1-based start and a length of end - start, so the end coordinate is effectively exclusive):

join -t $'\t' -1 1 -2 1 <(awk '/>/ {printf("\n%s\t",$0);next;} {printf("%s",$0);} END {printf("\n");}' input.fasta | sort -k1,1) <(sort -k1,1 list.tsv) | awk -F '\t' '{printf("%s(%s-%s)\n%s\n",$1,$3,$4,substr($2,int($3),int($4)-int($3)));}'
>gi|14666145(11-40)
CATTCATCATGATAACAGCACCCTAAATA
>gi|72201761(10-20)
GTGCCATTTA
>gi|83671954(13-50)
CCGGAATTCCCGGGATATCGTCGACCCACGCGTCCGC

 

written 3.7 years ago by Pierre Lindenbaum
venu, 3.7 years ago, Germany, wrote:

You can use this Perl script (select_sites.pl); it also works if the input is an MSA file:

$ perl select_sites.pl  file_name.faa  <start_position>  <end_position>

or

grep -v '^>' file_name.faa | tr -d "\n" | cut -c 150-300

This works well only when there is one sequence per file, but you can loop it over all your files, as sketched below.
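For example, a small Python driver could apply the same slicing to every single-sequence file in a directory (a sketch added for illustration; the *.faa glob and the 150-300 coordinates are assumptions):

import glob

# Python equivalent of `grep -v '^>' | tr -d "\n" | cut -c 150-300`,
# applied to every .faa file in the current directory (glob pattern assumed)
for path in sorted(glob.glob("*.faa")):
    with open(path) as fasta:
        seq = "".join(line.strip() for line in fasta
                      if not line.startswith(">"))
    print(f"{path}: {seq[149:300]}")   # cut -c 150-300 is 1-based, inclusive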

written 3.7 years ago by venu