Question

RNA de novo assembly - blasts - KEGG - GO

0

Entering edit mode

4.1 years ago

dimitrischat ▴ 210

Hello,

I am a phd candidate to bioninformatics and with (almost) 0 guidance. Seeking help here.. I was asked to do a de novo RNA transcriptome assembly from a total RNA sequencing. After fastqc i trimmed my original fastq and then ran trinity. So i got my trinity_trimmed.fasta. So, some of the things i was asked to do are:

1) fill out a table like this one :

        |  total number  |  total length(nt)  |  mean length(nt)  |  N50  |  total consensus sequences  |  Distinct Clusters |  Distinct Singletons

Contig

Unigene

I used TrinityStats.pl and got this :

Counts of transcripts, etc.

Total trinity 'genes': 87177 Total trinity transcripts: 169974 Percent GC: 40.18

Stats based on ALL transcript contigs:

Contig N10: 3290
Contig N20: 2503
Contig N30: 2049
Contig N40: 1713
Contig N50: 1413

Median contig length: 529
Average contig: 869.67
Total assembled bases: 147821426

Stats based on ONLY LONGEST ISOFORM per 'GENE':

Contig N10: 3087
Contig N20: 2301
Contig N30: 1816
Contig N40: 1414
Contig N50: 1029

Median contig length: 348
Average contig: 632.11
Total assembled bases: 55105774

My question has 2 parts : a) can i fill out this table with this information? b) Some people use cap3 assembly tool. I have already done that too in case i need it. Is that the way to go ? I need to check the quality of trinity_trimmed.fasta ?

for cap3 i also used TrinityStats.pl and got this :

for contigs:

Total trinity 'genes': 23017 Total trinity transcripts: 23017 Percent GC: 40.42

Stats based on ALL transcript contigs:

Contig N10: 3885
Contig N20: 3082
Contig N30: 2598
Contig N40: 2254
Contig N50: 1971

Median contig length: 1318
Average contig: 1522.23
Total assembled bases: 35037102

note: not reporting gene-based longest isoform info since couldn't parse Trinity accession info.

for singletons:

Counts of transcripts, etc.

Total trinity 'genes': 67695 Total trinity transcripts: 81478 Percent GC: 38.77

Stats based on ALL transcript contigs:

Contig N10: 1906
Contig N20: 1347
Contig N30: 1007
Contig N40: 751
Contig N50: 572

Median contig length: 333
Average contig: 490.70
Total assembled bases: 39981353

Stats based on ONLY LONGEST ISOFORM per 'GENE':

Contig N10: 1853
Contig N20: 1284
Contig N30: 917
Contig N40: 671
Contig N50: 508

Median contig length: 317
Average contig: 461.01
Total assembled bases: 31207973

2) blastp/blastx in excel files.

i should use -outfmt 16 ?

( also hmmscan/pfam is needed for KEGG / GO terms ? )

3) Do a KEGG and GO analysis. I should annotate the assembly ( but which one the trinity_trimmed.fasta or the cap3 one ? ) using Trinotate and then go with GOseq for GO? Or i could use blast2go, using the blastx/blatp files with -outfmt 16? (7 days trial version ) . Kegg also in blast2go or i could something llike this : https://www.kegg.jp/blastkoala/ ?

i know i was long, sorry about that.

RNA-Seq Assembly • 1.8k views

ADD COMMENT • link updated 4.1 years ago by gayachit ▴ 200 • written 4.1 years ago by dimitrischat ▴ 210

1

Entering edit mode

Please use the formatting bar (especially the code option) to present your post better.
code_formatting

As you can see above biostars parser does not understand standard HTML code.

ADD REPLY • link 4.1 years ago by GenoMax 141k

0

Entering edit mode

Yes that's the one. What kind of sequences do you want to extract? Any specific sequences or based on ID?

ADD REPLY • link 4.1 years ago by gayachit ▴ 200

0

Entering edit mode

Please use ADD COMMENT/ADD REPLY when responding to existing threads to keep them logically organized. SUBMIT ANSWER is only for new answers to original question.

ADD REPLY • link 4.1 years ago by GenoMax 141k

0

Entering edit mode

Which organism are you analyzing?

ADD REPLY • link 4.0 years ago by Kristoffer Vitting-Seerup ★ 4.0k

score 1 · Answer 1 · 2020-03-18

1

Entering edit mode

4.1 years ago

gayachit ▴ 200

Hi

There are parameters to check whether a transcriptome assembly is good or not. For example you can see the N50 value which better for your trinity assembly than your cap3 one. blast2go is a good option if you have a license and i think it needs xml as input. There is also an online tool called Functionannotator for transcriptome annotation. Check that

ADD COMMENT • link 4.1 years ago by gayachit ▴ 200

0

Entering edit mode

Appreciate your help!! So better keep the trinity assembly? How can i fill that table though?? I ll check that online tool asap. Thanks

ADD REPLY • link 4.1 years ago by dimitrischat ▴ 210

score 0 · Answer 2 · 2020-03-18

0

Entering edit mode

4.1 years ago

gayachit ▴ 200

So for the table I think the data will be like

total number = total trnity genes
total length(nt) = total bases used for assembly
mean length(nt)  = average contig length
N50 = N50
total consensus sequences  = you can get number of consensus sequence if you give the --long-reads option in trinity
Distinct Clusters = clusters i believe you do a CD-HIT of the assembly and get
Distinct Singletons = Same as above

ADD COMMENT • link 4.1 years ago by gayachit ▴ 200

0

Entering edit mode

with this : https://github.com/weizhongli/cdhit

edit: do you know which one of all cd-hit programs i use? and an example of a command ?

Also do you know how can i extract the contigs from the trinity.fasta files? to an other fasta file maybe?

ADD REPLY • link 4.1 years ago by dimitrischat ▴ 210

0

Entering edit mode

i read here : https://github.com/trinityrnaseq/trinityrnaseq/wiki/There-are-too-many-transcripts!-What-do-I-do%3F

" CD-HIT can be used to cluster highly similar sequences and retain a single representative sequence per group. Stringent clustering can be done like so: 'cd-hit-est -o cdhit -c 0.98 -i Trinity.fasta -p 1 -d 0 -b 3 -T 10' "

i did that, and now the stats are like this :

Counts of transcripts, etc.

Total trinity 'genes': 87036 Total trinity transcripts: 126484 Percent GC: 39.72

Stats based on ALL transcript contigs:

Contig N10: 3012
Contig N20: 2260
Contig N30: 1807
Contig N40: 1444
Contig N50: 1128

Median contig length: 401
Average contig: 701.88
Total assembled bases: 88776032

Stats based on ONLY LONGEST ISOFORM per 'GENE':

Contig N10: 3087
Contig N20: 2302
Contig N30: 1817
Contig N40: 1415
Contig N50: 1030

Median contig length: 348
Average contig: 632.73
Total assembled bases: 55070125

ADD REPLY • link 4.1 years ago by dimitrischat ▴ 210

0

Entering edit mode

I dont think it should matter much that there are so many transcripts. Although what you can do is check your assembly using BUSCO. There is another post similar to this. The link is https://bioinformatics.stackexchange.com/questions/4589/reduce-number-of-transcripts-in-a-highly-variable-de-novo-transcriptome-assembly

ADD REPLY • link 4.1 years ago by gayachit ▴ 200

0

Entering edit mode

i am just trying to find Distinct Clusters and Distinct Singletons, how do i do that? Can you give me a command-example ? Using CD-hit as you mentioned?

ADD REPLY • link 4.1 years ago by dimitrischat ▴ 210

1

Entering edit mode

The cdhit-est command that you gave is fine. It should have generated a cluster file as given in the manual for cdhit. You can count the clusters from that. I think it also generates a stats file. Atleast the online version does.

ADD REPLY • link 4.1 years ago by gayachit ▴ 200

0

Entering edit mode

It produced 2 txt files. One almost 100mb which has the -o (output name i gave) and one 10mb with .clstr ending. I counted 126.483 clusters. So this is the number for Distinct cluster? Also i did a grep ">" for the big file and the number is 126.484. If 126.483 is the number of distinct clusters, which is for the distinct singletons ? Thanks again

ADD REPLY • link 4.1 years ago by dimitrischat ▴ 210

1

Entering edit mode

Yes. And sorry about the singletons, inchworm in trinity actually does not include kmer singletons in the initial contig step. You can get the number by running trimmomatic for raw reads. Singletons are nothing but the reads which don't have a corresponding pair. It generates left and right files for Paired and left and right for Unpaired which are your singletons

ADD REPLY • link 4.1 years ago by gayachit ▴ 200

0

Entering edit mode

Thanks!! Could you elaborate a bit more on which command to use with trimmomatic? And on which file, the trinity.fasta?

ADD REPLY • link 4.1 years ago by dimitrischat ▴ 210

0

Entering edit mode

No not trinity fasta. You do a trimmomatic before you assemble. This is a pre-processing step for raw reads where you can remove adaptors and low quality sequences. Here's the manual which has the command as well http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/TrimmomaticManual_V0.32.pdf

This will create four output files. Count the reads in unpaired files for singletons

ADD REPLY • link 4.1 years ago by gayachit ▴ 200

0

Entering edit mode

but this route wont give me the singletons of the assembly. Right ? I am looking to get the singletons of my assembly

ADD REPLY • link 4.1 years ago by dimitrischat ▴ 210

0

Entering edit mode

What does singleton of assembly mean? I don't exactly know how to do that because by definition singletons mean the reads having no pair. But here's an idea, you identify the singleton reads, map the reads on the assembly using bowtie2 or similar alignment tool and identify where the singleton reads have mapped. Just a thought.