Is there any way to extract the longest ORF from blastx output?
Hi all,

I read in a paper that the longest ORF in the reading frame indicated by the blastx analysis was determined, the resulting CDS was then extracted, and the UTR regions were removed. Could anybody please let me know how to determine the longest ORF using blastx results, and how to identify the CDS and UTRs in the sequences?

Thanks

What is the reference for that paper?

You can find it here

http://proteomics.ysu.edu/tools/OrfPredictor.html

will do the work for you

I don't think it's possible to detect the longest possible ORF from blastx output alone, only the longest aligned region (although in most cases the latter is probably part of the former). The command below (tabular output assumed) 1) sorts by query ID and, within each query, by alignment length in descending order (column 4 in the default tabular format); 2) keeps only the first, i.e. longest, hit per query. Note that only the longest hit per contig is considered, so this strategy is not sensible for all data (e.g. contigs that are expected to include introns and/or intergenic regions between CDSs). If you're fine with this, you can output the translated aligned region as an extra column (check blastx -help) and then parse it from there.

export LC_ALL=C; sort -k1,1 -k4,4gr tabularBlastxOutput | sort -u -k1,1 --merge > longestAlignedRegions
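
If you do want the longest ORF in the frame that blastx reports (what the paper in the question seems to describe), a rough Python sketch might look like the following. This is only an illustration under several assumptions, not the paper's method: the function names are made up, the frame is taken from an extra `qframe` column in the tabular output, frames follow the usual BLAST convention (1 to 3 on the given strand, -1 to -3 on the reverse complement), and only complete ORFs (ATG to the first in-frame stop, standard genetic code) are considered. Everything upstream of the ORF is treated as 5' UTR and everything downstream as 3' UTR.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse-complement a DNA string (other symbols such as N pass through)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def longest_orf_in_frame(seq, frame):
    """Return (utr5, cds, utr3) for the longest complete ORF in one frame.

    frame is assumed to be a blastx-style frame: 1, 2, 3 for the given
    strand, -1, -2, -3 for the reverse complement. Returns None when the
    frame contains no ATG followed by an in-frame stop codon.
    """
    if frame not in (1, 2, 3, -1, -2, -3):
        raise ValueError("frame must be one of 1, 2, 3, -1, -2, -3")
    seq = seq.upper()
    # Work on the strand that is actually translated.
    strand = seq if frame > 0 else reverse_complement(seq)
    offset = abs(frame) - 1

    best_start, best_end = 0, 0   # half-open interval of the best ORF so far
    orf_start = None              # start of the ORF currently being scanned
    for i in range(offset, len(strand) - 2, 3):
        codon = strand[i:i + 3]
        if orf_start is None:
            if codon == "ATG":
                orf_start = i
        elif codon in STOP_CODONS:
            # ORF closed: keep it if it is longer than the best so far.
            if i + 3 - orf_start > best_end - best_start:
                best_start, best_end = orf_start, i + 3
            orf_start = None

    if best_end == 0:
        return None
    # Split the translated strand into 5' UTR, CDS (stop included), 3' UTR.
    return strand[:best_start], strand[best_start:best_end], strand[best_end:]


if __name__ == "__main__":
    # Toy contig: 2 nt of 5' UTR, a 15 nt ORF, 4 nt of 3' UTR.
    print(longest_orf_in_frame("CCATGAAATTTGGGTAAGGCC", 3))
```

For each contig you would call it with the contig sequence and the frame of its best hit. Contigs that are truncated at either end may have no complete ORF in that frame, so you may want to relax the ATG or stop requirement for those cases.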