Question: Please help me understand this blastx output
3.8 years ago, seta (1.1k) wrote:

Dear all,

I'm getting confused about a very basic matter; please help me understand what happened. I did a de novo transcriptome assembly for a non-model organism, then ran blastx. I summarized part of the blastx output using these commands:

cut -f1 blast_output.txt | sort -u | wc -l

(which shows how many query sequences got a hit) and

cut -f2 blast_output.txt | sort -u | wc -l

(which shows how many distinct subjects my query sequences hit). These numbers were 36725 and 16542, respectively, for one of my assemblies, which contains 57210 sequences. Is this usual? Please be patient with me and tell me how to present the results: can I say that 36725 of 57210 contigs have been annotated? Also, please explain the source of the difference between the two numbers (36725 and 16542). One reason is that more than one contig hit the same subject; am I right? I'm quite concerned about this, so please post whatever you know, even if the issue seems simple or stupid to you.
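For reporting, the two counts and the fraction of contigs with a hit can be computed together. This is only a minimal sketch: it assumes tab-separated blastx output with the query ID in column 1 and the subject ID in column 2, and the temporary file, its IDs, and the total of 5 contigs are made-up toy values standing in for the real data.

```shell
# Toy tabular BLAST output (hypothetical IDs): query in column 1, subject in column 2
demo=$(mktemp)
printf 'contig1\tP1\ncontig1\tP2\ncontig2\tP1\ncontig3\tP1\n' > "$demo"
total_contigs=5   # assumed assembly size for the toy data (57210 in the post)

# Distinct queries with at least one hit, and distinct subjects hit
queries_with_hit=$(cut -f1 "$demo" | sort -u | wc -l | tr -d ' ')
unique_subjects=$(cut -f2 "$demo" | sort -u | wc -l | tr -d ' ')

# Percentage of contigs that have at least one hit
pct_hit=$(awk -v h="$queries_with_hit" -v t="$total_contigs" \
  'BEGIN { printf "%.1f", 100 * h / t }')

echo "queries with hit: $queries_with_hit"
echo "unique subjects:  $unique_subjects"
echo "percent with hit: $pct_hit"
rm -f "$demo"
```

With the numbers from the post, 36725 of 57210 contigs is about 64.2% with at least one hit.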

Many thanks

modified 3.8 years ago by 5heikki (8.4k) • written 3.8 years ago by seta (1.1k)

Hi 5heikki, although I got your reply by email, it does not appear here. Yes, it is a transcriptome assembly; is this usual? About the unique ORFs within contigs that you mentioned: I assume that every unique contig carries its own unique ORF, and that otherwise the contig is chimeric. Am I right or wrong? Please share your thoughts.

modified 3.8 years ago • written 3.8 years ago by seta (1.1k)

So 36,725 out of 57,210 putative assembled mRNAs hit a total of 16,542 unique sequences in some database? What was the database? What were the thresholds? Since your organism is diploid, we can expect that all its expressed proteins are transcribed from at least two loci as nearly or exactly identical mRNAs, yes? How did you assemble the transcriptome? Was there any actual assembly, or did you just, e.g., merge pairs? If you cluster your transcriptome at, e.g., 99% identity, how many clusters are there?
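On the threshold question, one way to make the counts easier to interpret is to apply an explicit e-value cutoff and keep a single best hit per query before counting subjects. The following is a sketch under stated assumptions: it assumes the standard 12-column tabular format (`-outfmt 6`), where column 11 is the e-value and column 12 the bit score. The cutoff of 1e-5 and the toy rows are illustrative choices, not values from this thread.

```shell
demo=$(mktemp)
# Columns: qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
printf 'c1\tP1\t90\t100\t10\t0\t1\t300\t1\t100\t1e-50\t200\n' >  "$demo"
printf 'c1\tP2\t60\t100\t40\t0\t1\t300\t1\t100\t1e-10\t80\n'  >> "$demo"
printf 'c2\tP1\t85\t100\t15\t0\t1\t300\t1\t100\t1e-40\t180\n' >> "$demo"
printf 'c3\tP3\t30\t50\t35\t0\t1\t150\t1\t50\t0.5\t25\n'      >> "$demo"

# Keep hits with e-value <= 1e-5, then only the highest-bit-score hit per query
best_hits=$(awk -F'\t' '
  $11 + 0 <= 1e-5 && (!($1 in best) || $12 + 0 > best[$1]) {
    best[$1] = $12 + 0   # best bit score seen so far for this query
    subj[$1] = $2        # subject that produced that score
  }
  END { for (q in subj) print q "\t" subj[q] }
' "$demo" | sort)

# Recount queries and subjects on the filtered best-hit table
kept_queries=$(printf '%s\n' "$best_hits" | cut -f1 | sort -u | wc -l | tr -d ' ')
kept_subjects=$(printf '%s\n' "$best_hits" | cut -f2 | sort -u | wc -l | tr -d ' ')
echo "queries kept: $kept_queries, subjects among best hits: $kept_subjects"
rm -f "$demo"
```

Counting subjects over best hits only avoids inflating the subject count with weaker secondary hits of the same contig.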

modified 3.8 years ago • written 3.8 years ago by 5heikki (8.4k)

Thanks for following the post. This assembly was done with the CLC genomics software (k=64) after read trimming, and searched with blastx against the UniProt database (Viridiplantae). We can assume transcription from at least two nearly or exactly identical mRNAs; it may also result from alternative splice forms that produce members of one protein family. However, I'm not sure about this; what do you think? In addition, I did another assembly with Trinity, merged it with the CLC assembly, and then ran cd-hit-est to remove redundancy (threshold 1), which generated 182968 clusters from 204397 input sequences. I ran blastx on this merged assembly against only the Arabidopsis proteome (for a fast evaluation), and although 80% of contigs got a hit, only 28% of hits were unique. These results are driving me crazy, as I don't know whether they are usual. Which strategy is right? What is wrong, and how can I solve or improve it? Please share your opinion on the issue.

Many thanks for reading and helping me out.

modified 3.8 years ago • written 3.8 years ago by seta (1.1k)
3.8 years ago, Damian Kao (15k) wrote:

Yeah, it means that multiple queries are hitting the same subjects. This could mean that you are picking up members of the same gene family, multiple isoforms of the same gene, paralogous genes, etc.
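To see which subjects attract many contigs, the query/subject pairs can be tallied per subject. A sketch on made-up data, assuming the same layout as the `cut` commands in the question (query ID in column 1, subject ID in column 2):

```shell
demo=$(mktemp)
# Hypothetical hits; c1 appears twice against P1, as happens when one
# query/subject pair is reported on several lines (one line per HSP)
printf 'c1\tP1\nc1\tP1\nc2\tP1\nc3\tP1\nc4\tP2\n' > "$demo"

# De-duplicate query/subject pairs, then count distinct queries per subject,
# listing the most frequently hit subjects first as "subject:count"
per_subject=$(cut -f1,2 "$demo" | sort -u | cut -f2 | sort | uniq -c \
  | sort -k1,1nr | awk '{ print $2 ":" $1 }')
echo "$per_subject"
rm -f "$demo"
```

Subjects with high counts are candidates for gene families, isoforms, or redundant contigs, and are worth inspecting by hand.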

written 3.8 years ago by Damian Kao (15k)

Thanks. Can I say that 36725 of 57210 contigs have been annotated? (Details are explained in the post.) Since my organism is a diploid, dioecious plant with high heterozygosity, is this result reasonable, or is something wrong with the assembly? Could you please help me deal with this issue correctly?

written 3.8 years ago by seta (1.1k)