how to remove multiple redundent sequence id of protein
2
0
Entering edit mode
7.0 years ago

Hi everyone...

I have a list of id in which many are redundant (not similar) like i have following id

Traes_1DL_9344DD117.1
Traes_1DL_9344DD117.2
Traes_1DL_9344DD117.3
Traes_1DL_9344DD117.4
Traes_1DL_BDDF1198A.1
Traes_1DL_BDDF1198A.2
Traes_1DL_BDDF1198A.3
Traes_1DL_BDDF1198A.4
Traes_1DL_BDDF1198A.5
Traes_1DL_C91636A55.1
Traes_1DL_C91636A55.2
Traes_1DL_C91636A55.3
Traes_1DL_C91636A55.4
Traes_1DL_C91636A55.5
Traes_1DL_C91636A55.7
Traes_1BL_43408C9B0.2
Traes_2AL_48B239DD6.1
Traes_2AL_48B239DD6.2
Traes_2AL_48B239DD6.3

Traes_2AL_48B239DD6.4
Traes_2AL_C00552444.1
Traes_2AL_C00552444.2
Traes_2AL_C00552444.3
Traes_2AS_146D16572.1
Traes_2AS_146D16572.2
Traes_2AS_146D16572.3
Traes_2AS_146D16572.4
Traes_2AS_146D16572.5

i am assuming that the id having similar words before point (.) but differ later like

(Traes_2AS_146D16572.1
Traes_2AS_146D16572.2
Traes_2AS_146D16572.3
Traes_2AS_146D16572.4
Traes_2AS_146D16572.5) having only one significant id i.e Traes_2AS_146D16572.1, and i want to retain that only from rest.

so how i can do it throught programme.

in sort, i want to extract only id's which have .1 in last from all among.  and if .1 is not present but .2 or .3 or so on is present and they are uniq means initial id is different then i also want to print them. like Traes_1BL_43408C9B0.2 is only present and i want to print them...

thank you..

sequence perl programme protein • 1.6k views
1
Entering edit mode

grep '\.1$' input.txt ADD REPLY 1 Entering edit mode what if there is only version .2 of the entry in the input? ADD REPLY 1 Entering edit mode he said "i want to extract only id's which have .1 in last from all among. " ADD REPLY 0 Entering edit mode Thanks nterhoeven, I have put this as a separate question, but the moderator closed it telling me to search not put as a new question. The grep gives a pattern,so for 10,000 or so I cannot do grep always for individual, that's y I was searching for awk command as I mentioned earlier. Any other suggestions are welcome... thanks a lot RV ADD REPLY 0 Entering edit mode you can use the --file option with grep to search for multiple pattern, but this can be quite slow, if your fasta file is large. A faster solution would be a perl script using a hash and the "exists" function. BTW: you should use answers only if you are answering the question on the top. Other stuff (discussions, etc) should be done in comments ADD REPLY 0 Entering edit mode Ok Thanks nterhoeven | ADD REPLY 0 Entering edit mode 7.0 years ago nterhoeven ▴ 120 ## Edits: An edit after discussing in the comments of this answer to give a clear solution: ### Scenario 1: You only need the part before the "." cat input.txt | cut -d"." -f1 | sort | uniq  will go through the IDs, remove the part after the "." and make the list unique cat input.txt | cut -d"." -f1 | sort | uniq -c  will additionally show how often the ID was seen. ### Scenario 2: You want to keep the part after the "." cat input.txt | sort | perl -ne 'print$_ unless $_=~/^$patt\.\d+$/; ($patt)=$_=~/(^\S+)\./'  This will read the list of IDs, sort them and print an ID, if the part before the "." has not been seen before. You should try all three commands I posted here and maybe make some changes to your test data (removing the ID with ".1", etc) to see how they work and then decide, which command suits you the best. ## Original answer: This should work for you perl -ne 'print$_ unless $_=~/^$patt\.\d+$/; ($patt)=$_=~/(^\S+)\./'  echo "Traes_1DL_9344DD117.1 Traes_1DL_9344DD117.2 Traes_1DL_9344DD117.3 Traes_1DL_9344DD117.4 Traes_1DL_BDDF1198A.1 Traes_1DL_BDDF1198A.2 Traes_1DL_BDDF1198A.3 Traes_1DL_BDDF1198A.4 Traes_1DL_BDDF1198A.5 Traes_1DL_C91636A55.1 Traes_1DL_C91636A55.2 Traes_1DL_C91636A55.3 Traes_1DL_C91636A55.4 Traes_1DL_C91636A55.5 Traes_1DL_C91636A55.7 Traes_2AL_48B239DD6.1 Traes_2AL_48B239DD6.2 Traes_2AL_48B239DD6.3 Traes_2AL_48B239DD6.4 Traes_2AL_C00552444.1 Traes_2AL_C00552444.2 Traes_2AL_C00552444.3 Traes_2AS_146D16572.1 Traes_2AS_146D16572.2 Traes_2AS_146D16572.3 Traes_2AS_146D16572.4 Traes_2AS_146D16572.5" > input.txt cat input.txt | perl -ne 'print$_ unless $_=~/^$patt\.\d+$/; ($patt)=$_=~/(^\S+)\./' Traes_1DL_9344DD117.1 Traes_1DL_BDDF1198A.1 Traes_1DL_C91636A55.1 Traes_2AL_48B239DD6.1 Traes_2AL_C00552444.1 Traes_2AS_146D16572.1  ADD COMMENT 0 Entering edit mode Let me know, if I should explain the perl command ADD REPLY 0 Entering edit mode Thank u sir, it works. but what i have to do if .1 is not available so in this case i have to take .2 as well, so how i can extract .1 and if .1 is not their then .2, or others unique in single command. ADD REPLY 0 Entering edit mode do u need .1 or .2 as well or just the base name ? like Traes_2AS_146D16572 ADD REPLY 0 Entering edit mode Thank you sir, I think the way I want to get id is not possible (but here is my elaborate question- sir at the time of annotation they are named id Traes_2AS_146D16572.1 Traes_2AS_146D16572.2 Traes_2AS_146D16572.3 Traes_2AS_146D16572.4  like this and i assumed that the id .1 is important, but if .1 is not present then I can take .2 like Traes_1BL_43408C9B0.2 is present but Traes_1BL_43408C9B0.1 is not present so I can take Traes_1BL_43408C9B0.2 as well.) but it seems it is not possible if it is possible then please help me) thank you for your explanation.. sir can I get unique initial id means before "." like Traes_1DL_9344DD117 Traes_1DL_BDDF1198A Traes_1DL_C91636A55 Traes_2AL_48B239DD6 Traes_2AL_C00552444 Traes_2AS_146D16572 Traes_1BL_43408C9B0  so I can get idea and count this also thank u once again ADD REPLY 1 Entering edit mode if you only need the part before the ".", you can use cut -d"." -f1 | sort | uniq  adding the -c option to unique will give you the number of occurrences. ADD REPLY 0 Entering edit mode ok sir. thank you very much...... ADD REPLY 0 Entering edit mode The command I posted takes care of this. Here is an explanation how it works: perl -ne #start reading the input file line by line 'print$_ unless $_=~/^$patt\.\d+$/; #print the line, if the part before the . does not match the the content of$patt
($patt)=$_=~/(^\S+)\./'    #set $patt to the current string before the .  So, it always prints the first occurence of an ID and skips all following. Note, that this only works if the input is sorted so you might want to run it as cat input.txt | sort | perl -ne ...  ADD REPLY 0 Entering edit mode I edited my answer to include some of the comments made here ADD REPLY 0 Entering edit mode 7.0 years ago Hi, I am stuck in same kind of problem, as a newbie in bioinformatics, I am trying to extract the fasta sequences from a file_2 with similar ids from another list file_1 File_1.fasta >comp148_c0_seq1 >comp169_c0_seq1 >comp258_c0_seq1 >comp285_c0_seq1 >comp350_c0_seq1 >comp424_c0_seq1 >comp783_c0_seq1 >comp1089_c0_seq1  File_2.fasta >comp6_c1_seq1 -1 22 237 MAILRFMDSWVVGVNVCGKRPRRFVDPINMIRETIIRVHVRPFGVWISIICLIISLTSQCWKEWRRLLIRRF >comp11_c0_seq1 -2 35 358 MKEDDKVIEDDEKAEGSKGDIQKEEPGADDETEESNKLIGDNQGKDEANADEDDPQNEETIDKSEENKQREEQQQITLLHFIGRSFASLLKNLLKKTCPSAAEGNNYY >comp42_c0_seq1 +3 114 305 MPSTLKFLLIYESHWYDKNSEKLVNEFLSLLAHCTQLRYMPILLEDYDLLKLIEEKNTRQFDKI >comp43_c0_seq1 -2 38 298 MSSSNWAFYIGVSNGHVHDNLLVCKAPNCYCFPTRMDHCYIGGTQHHFEWPTDAISRPQWNGIGDTLGCGILLNPKNELAIFFTANG >comp48_c0_seq1 +3 18 242 MLYKSLINSKSLRGKTPAEVVNMFANDGQRIFDAVTFAPLVLIGPLVLVGGLIYLLRVIGPVSLLAVSVFLIFDF >comp53_c0_seq1 -1 55 312 MIGNRLRVKRDKVTLKMEISHCLHSQIIGRGGRNTQKIMRDTGCHIHFPDSNKCLTNPVTMPQQAKNDQVSISGCAKDVEKAREML >comp56_c0_seq1 -2 110 379 MNNARLNAEINELHAAIHANVHYGRPFKPSHISMNKSQATDRSDNNVCGQLATIDNKNENDHDNDNDDNEANDETRERRRFTVADYMPGG >comp56_c1_seq1 -1 52 408 MWIWSGPIFSTTILFGHISPFLKPGWRRRAYFCPNFPRRLRVYWATTALLSLLIITFLLGRHFFLGVPSQLTTSPEAIFPSTLGLDCGNVTLSFTAAANDEKVPSNQTLAADKVVASFA >comp72_c0_seq1 +2 14 271 MYALTWNGLMELKLGADTCPDVNVNWEVFGERGLKSISLFAVADKVFIFSTPNELLVYDRGSATISQFPIASPPLKTLLAVSQSSQ >comp74_c0_seq1 +3 96 305 MVIPFIFLKAQTIQLEAKDESNFCQNRVDLLAHEVEVRVRIGKRQKRALASSAGCCSCGRGPMGERGAPG >comp79_c0_seq1 -1 31 213 MASAMFCCQVCMLLSSAYHIFGCHSPNRRKRWLRADLFGVSAGLIGLYLSGLYTSFYHFPV  I used awk: diff <(cat file_1 | grep ">" | sort) <(cat file_2 | grep ">" | sort) | grep "^<" | awk -F\> '{print$2}'


I m just getting ids only, I need sequences to corresponding ID also.

Can anyone suggest awk or perl code for the same?

Thanks
RV

0
Entering edit mode

Hi Ruchi, welcome to Biostars.

It is better to open a new question for this, but here is a quick answer:

Assuming, your fasta file has no empty lines and only one line per sequence, you can use the -A option of grep.

This allows you to display the lines following a match. For example:

grep -A 1 'comp72_c0_seq1' File_2.fasta

will return:

>comp72_c0_seq1 +2      14      271
MYALTWNGLMELKLGADTCPDVNVNWEVFGERGLKSISLFAVADKVFIFSTPNELLVYDRGSATISQFPIASPPLKTLLAVSQSSQ

You can grep for a list of Patterns with the --file option (check 'man grep' for details)