Question: Remove lines of BLAST output based on two criteria
Asked 22 months ago by c.tansley

Hello!

I am using BLAST+ in a Linux terminal. I have concatenated multiple protein sequences from the same protein family of one organism into a single input file, and I am using BLASTp to align them against the predicted protein sequences of another organism in order to find proteins of the same family. The output is the top 50 hits for each input sequence, with the subject sequence ID and the subject sequence. I have sorted this output (sort -u) and used sed to remove the gap characters (sed 's/-//g'). I have also added > to the start of each sequence ID (sed 's/^/>/') for FASTA format, as I intend to pipe the result into a multiple sequence alignment.

However, because the matched portion of the sequence differs from hit to hit for the same ID, the lines are not identical and I can't use uniq -u to remove the repeats.

What I want to do is take the longest matching sequence for each ID and remove all of the shorter sequences, but I'm very new to this kind of computing and don't know which tool to use. I need something that will group the hits by sequence ID and then select within each group based on the associated sequence length.

Any advice will be appreciated.

output blast • 764 views

please, give us a sample of your input/output

written 22 months ago by Pierre Lindenbaum

take the longest matching sequence for each ID

The longest sequence by actual length, or alignment length to query?

written 22 months ago by st.ph.n
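
Until a sample of the real output is posted, here is a minimal Python sketch of the "group by ID, keep the longest" step described in the question. It rests on several assumptions: each hit is one line with the subject ID and the subject sequence separated by a tab, gaps are written as "-", and "longest" means the length of the matched sequence after gaps are removed (not the full length of the subject protein, which is the distinction raised in the comment above). The function names and the example hits are made up for illustration.

```python
def longest_per_id(lines):
    """Map each subject ID to its longest gap-free sequence.

    Assumes 'id<TAB>sequence' lines; anything else is skipped.
    """
    best = {}
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 2:
            continue  # blank or malformed line
        seq_id = parts[0].lstrip(">")       # tolerate an already-added '>'
        seq = parts[1].replace("-", "")     # drop alignment gaps
        # Strict '>' keeps the first sequence seen when lengths tie.
        if len(seq) > len(best.get(seq_id, "")):
            best[seq_id] = seq
    return best


def to_fasta(best):
    """Format the {id: sequence} mapping as FASTA text."""
    return "".join(f">{seq_id}\n{seq}\n" for seq_id, seq in best.items())


# Hypothetical hits, only to show the behaviour.
example_hits = [
    "protA\tMKV-LLA",
    "protA\tMKVLLAGGT",
    "protB\tMST--PQ",
    "protB\tMS",
]
print(to_fasta(longest_per_id(example_hits)), end="")
```

With the made-up hits above this prints one FASTA record per ID: protA with MKVLLAGGT and protB with MSTPQ. To run it on real output, the list could be replaced by an open file handle, since the function only needs an iterable of lines. If the intended criterion is alignment length to the query instead, the comparison would need that value as an extra column.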