Question

Transcriptome blasting doubts

0

Entering edit mode

7.0 years ago

pablo61991 ▴ 90

Hello community,

I'm biology trying to analyze data from a RNA-seq experiment. My experience in this field is zero and I'm learning as much as I can.

First let me contextualize my problem: I have a de novo assembly transcriptome obtained from a mollusk conformed by 125k transcripts/contigs.

My first attempt to annotate that was to blastx against H.sapiens proteome and I found a reasonable number of 35k unique hits. However, I think human proteome could be too much stringent to find matches so I construct a database composed by 9 proteomes (octopus, oyster, mytilus, sea urchin, drosophila, amphioxus, lamprey, zebrafish, human) to do that I have use blastdb_aliastool. Once I had my database I blasted it using blastx (I only change e-value threshold to 1E-3 and -outfmt 5) but I obtained a lot of tophits with the description "unchraracterized". In this case I had the 87% of my transcripts with one hit.

In order to fix that, I repeat the annotation but including the option -max_target_seqs 5 -max_hsps 1 and then I ran the next commands to filter out the "uncharacterized" hits, I reordered all the hits for the same query (max. 5) based on the score and e-value and I recovered only the top hit (this time witouth "uncharacterized" hits):

 grep -v -i "uncharacterized" FILE_A.xls > FILE_B.xls

 sort -k1,1 -k12,12nr -k13,13n FILE_B.xls | sort -u -k1,1 --merge > FILE_C.xls

This is the order of my fields in my output:

Query Name | Query Length | Subject Name | Subject Length | Alignment Length | Query Start | Query End | Subject Start | Subject End | Query Sequence | Subject Sequence | Hsp Score | Hsp Expect | Hsp Identities | Percent Match | Number_of_gaps

Finally I obtained more or less the same number of hits than the first time (blastx only vs human proteome), again more or less 35k and my annotation was reduced to 25%. I have checked and it looks fine but I come here to know your opinion (not only about the code, also about the procedure). In my case the annotation it's fundamental because my DGE analysis follows the pipeline: tximport (to summarize to gene level) > DESeq2; so it depends totally TranscriptID==Gene.

It's the approach/code right? It' s common to find more or less the same results vs only human than vs a more complete database (when I take out the uncharacterized proteins).

Thank you for your attention

RNA-Seq blast annotation • 2.8k views

ADD COMMENT • link updated 7.0 years ago by GenoMax 141k • written 7.0 years ago by pablo61991 ▴ 90

0

Entering edit mode

I would do ORF-prediction and then translate and blastp the candidate ORFs against your proteome database. I do not completely understand why you check against human. Is that because your organism and its clade are poorly studied/annotated? With insects and plants I use a multi-step approach (similar to the one described in https://plantmethods.biomedcentral.com/articles/10.1186/s13007-015-0066-6 ) against different databases.

ADD REPLY • link updated 7.0 years ago by h.mon 35k • written 7.0 years ago by cschu181 ★ 2.8k

0

Entering edit mode

Yes that is the main reason to use Human proteome. I'll take a view of that link.

I have a question related with that approach, I think ORF prediction works better when you have a very good de novo assembly. In my case (invertebrate) my assembly has a lot of relative short transcripts and for that reason I though in to blastx instead to do a blastp based on ORFs. Is that approach wrong?

Thank you cschu1981.

ADD REPLY • link 7.0 years ago by pablo61991 ▴ 90

0

Entering edit mode

No, blastx is not wrong. I use it complementarily to ORF-blastp. It just takes longer and it might align STOPs (*) in your query to an amino acid in the subject (recently observed this with blast-2.6.0, some insect transcriptome against ncbi nr), which could give you false results. It doesn't hurt to try both I'd say.

ADD REPLY • link 7.0 years ago by cschu181 ★ 2.8k

0

Entering edit mode

Thank you for your tips cschu181 I didn't know about that (I'm quite newbie in this "genomic" point of view of the data). I'll try to apply this in my analysis.

Can I ask you (because I see on you a high level of expertise in this field) how do you deal with "uncharacterized" hits?

Thank you again and if I sound rude please don't feel like I'm angry cause of your tips. My English is just no so good (sorry :()

ADD REPLY • link 7.0 years ago by pablo61991 ▴ 90

0

Entering edit mode

No worries! I wouldn't call myself an expert, but I am trying to get there ;)

Currently I deal with uncharacterized hits by running hmmscan against Pfam-A on potential ORFs (translated) of these sequences. This at least might give me some idea of whether transcripts potentially contain some domain information.

Transcripts without any hits, neither against Pfam nor against your Blast databases and without any ORF you could investigate for RNA structure/ncRNA sequence signatures, and then deal with them in a way depending on your research goal.

Transcripts that have ORFs without any hits, I would annotate as "uncharacterized" (or something similar). If they don't show up in your DE analysis, you may as well put them into the corner somewhere. If they do, then get some experimental collaboration going :)

One thing - it never hurts to blast such uncharacterised sequences against the broadest database available (e.g. NCBI nr ) and not limit yourself to your kingdom. Hits will likely be some bogus but they also might yield something interesting.

Hope that helps.

ADD REPLY • link 7.0 years ago by cschu181 ★ 2.8k

0

Entering edit mode

Thank you for your help cschu181 (apologize my cause of the delay in the anwser, I was very ill during the last 3 weeks).

I have another question, when I use blast I find outfmt 6 very comfortable to work with it cause it doesn't need to be parsed but unfortunately I can't obtain something like a description field. In contrast, if i use outfmt 5 I obtain a field which include a long description of the hit and not only the accession code. I really need this description to take out the uncharacterized hits before merge the result.

And here I come my question, in outfmt 6 there are 2 fields, bit score and e-value which are the common way to decide if a hit is good enough but when I use outfmt 5 I obtained hsp score and hsp expect. Are these two fields equivalent? Its possible to select the top hit by highest hsp score and lowest hsp expect as I shpuld do with bit score and e-value?

Sorry if my question isn't clear enough, I have tried to explain my self as good as I can.

Thank you for your attention.

ADD REPLY • link 6.9 years ago by pablo61991 ▴ 90

0

Entering edit mode

Sorry to hear you were sick, hope you're doing better now!

For your long description, you might try to include the salltitles column in your blast output

-outfmt "6 std salltitles"

Or look at whatever else is offered in the blast help (e.g. blastn -help) and include that in your output.

Concerning bit score and e-value: I think those are the same as hsp score/hsp expect, but maybe try to compare the values between outfmt 5 and 6?

Edit: grammar (omg!)

ADD REPLY • link 6.9 years ago by cschu181 ★ 2.8k

0

Entering edit mode

Oh thank I will try with the configuration of outfmt 6. I find posts of people claiming about the impossibility to add this field to outfmt 6 but I should tried before ask (maybe outdated information).

Yep, I think they have at least a similar meaning, but in a different scale (I think bit score is hsp score in a logarithmic scale).

I will test your tip. Thank you again.

PD: If my English sounds like strange, sorry I'm doing my best :(

ADD REPLY • link 6.9 years ago by pablo61991 ▴ 90