De novo transcriptome assembly workflow, paired-end reads
6.8 years ago

Hello! I am new to RNA-seq. I am trying to assemble a transcriptome de novo from 100 bp paired-end reads. I am having a hard time finding a good guide to paired-end de novo assembly. I have been taking bits and pieces from different guides, but I am starting to get confused. I would like to explain the steps I have done so far; any advice on things I have done wrong, or still need to do, would be greatly appreciated!

1. I used the FASTX-Toolkit quality trimmer to trim reads using a Phred score cutoff of 30.

2. I know I am supposed to clip adapter sequences, but I do not think I have any on my reads. Is this possible? Is there a way to check?

General question: I have paired-end reads. Am I supposed to quality trim them separately? By separately, I mean quality trim the left reads and then quality trim the right reads. No matter how much reading I do on PE reads, I am still confused about this.

3. I used a left read file and its corresponding right read file to assemble transcripts using Trinity. Does anyone know how to tell whether I needed to set a strand-specific parameter, such as RF or FR, and how I would figure this out if need be?

I know this is many MANY questions. Any help would be so so so much appreciated!

Thank you,
Nikelle


It appears that you have chosen to ignore the advice given on a previous question that you posted here: Generating Read Length?

You should be scanning/trimming your reads in pairs so as not to lose their order in their respective files.


I am not sure what you mean by "in pairs". Would I need to combine each left and right read file together somehow and then trim?


Paired-end-aware trimming programs (Trimmomatic, cutadapt and BBDuk) will accept a file pair for a sample (R1/R2) and then process the files together.

If a read gets trimmed and becomes a candidate for elimination based on criteria that you set (e.g. shorter than a certain length), then the program will also remove the corresponding read from the OTHER sequence file (even though that read may still meet the passing criteria). This keeps the reads in the two files in sync. Most aligners expect to find the pairs of reads for a fragment at corresponding positions in the two files. If the files are not in sync, the reads may still be used in alignment but may be reported as aligning discordantly or not aligning at all.
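To make the "process the files together" idea concrete, here is a minimal Python sketch of pair-aware filtering. It is not how Trimmomatic, cutadapt or BBDuk are implemented; the function names, the Phred+33 quality encoding and the thresholds are all assumptions chosen for illustration.

```python
def trim_3prime(seq, qual, min_q=20, offset=33):
    """Cut low-quality bases off the 3' end (Phred+33 encoding assumed)."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


def trim_pairs(left, right, min_q=20, min_len=3):
    """Trim both mates; keep a pair only if BOTH mates stay long enough.

    `left` and `right` are lists of (header, sequence, qualities) tuples
    in the same order. The thresholds are illustrative, not recommendations.
    """
    kept_left, kept_right = [], []
    for (h1, s1, q1), (h2, s2, q2) in zip(left, right):
        s1, q1 = trim_3prime(s1, q1, min_q)
        s2, q2 = trim_3prime(s2, q2, min_q)
        # If either mate fails, drop the whole pair so the files stay in sync.
        if len(s1) >= min_len and len(s2) >= min_len:
            kept_left.append((h1, s1, q1))
            kept_right.append((h2, s2, q2))
    return kept_left, kept_right


# Toy data: the left mate of r2 is mostly low quality ('#' is Phred 2).
left = [("r1/1", "ACGTAC", "IIIIII"), ("r2/1", "ACGTAC", "II####"),
        ("r3/1", "ACGTAC", "IIIII#")]
right = [("r1/2", "TTGCAA", "IIIIII"), ("r2/2", "TTGCAA", "IIIIII"),
         ("r3/2", "TTGCAA", "IIIIII")]
kept_left, kept_right = trim_pairs(left, right)
print([rec[0] for rec in kept_left])   # ['r1/1', 'r3/1']
print([rec[0] for rec in kept_right])  # ['r1/2', 'r3/2']
```

Note that r2/2 is discarded even though it is a perfectly good read, which is exactly the behaviour described above.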


Thank you very much, this was informative.

6.8 years ago
pld 5.0k
1. You can trim the files separately and then sort them. You'll want to do this anyway to make sure you remove singletons. A singleton is a read whose mate was discarded during quality filtering/trimming. For example, if the left read was filtered out due to quality, the right read is no longer part of a pair, so the right read is the singleton.

You can use an interleaved file (where the two ends of a pair are stored as alternating records), but it seems that most programs expect the left and right reads to be in separate files. I would stick with separate files.

2. You can use a program such as cutadapt to remove any adapters.

3. Ask the person or people who generated the cDNA library whether they used a strand-specific kit. If a strand-specific kit was not used, then your data isn't strand specific. In the context of Trinity, this means you don't specify anything with the --SS_lib_type flag.

I would QC the reads and trim adapters (as individual files), then sort the reads (and remove singletons), then start Trinity. Trinity assumes that the two mates of a pair are at the same position in each file. So, for example, the nth left read and the nth right read should each be the nth record of their corresponding file.
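On the question of how to check whether adapters are present at all before running cutadapt: QC tools such as FastQC report adapter content, and a rough check can also be sketched in a few lines of Python. The sketch below simply counts reads that contain the start of a common Illumina adapter. The adapter prefix used here is an assumption (your library kit may use a different one), and `adapter_fraction` and `fastq_sequences` are hypothetical helper names.

```python
# Assumption: the library uses an Illumina TruSeq-style adapter, whose
# sequence starts with this prefix. Check your kit documentation.
ADAPTER_PREFIX = "AGATCGGAAGAGC"


def fastq_sequences(path, limit=100000):
    """Yield up to `limit` read sequences from an uncompressed FASTQ file."""
    with open(path) as handle:
        for i, line in enumerate(handle):
            if i // 4 >= limit:
                break
            if i % 4 == 1:  # second line of each 4-line record
                yield line.strip()


def adapter_fraction(sequences, adapter=ADAPTER_PREFIX):
    """Return the fraction of reads that contain the adapter prefix."""
    total = hits = 0
    for seq in sequences:
        total += 1
        if adapter in seq.upper():
            hits += 1
    return hits / total if total else 0.0
```

For example, `adapter_fraction(fastq_sequences("left.fastq"))` would give the fraction for the first 100,000 reads of a file named left.fastq. A value near zero suggests the reads were already clipped or that the fragments were longer than the read length; it does not rule out partial adapter matches at read ends, which this exact-match check cannot see.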


Thanks so much!

In "1." when you say "then sort them", what do you mean by sort?

In "3." what do you mean by "in the context of Trinity, this means you don't specify anything with the --SS_lib_type flag"?

Nikelle


If you QC them separately, pairs may be out of order or broken (the left read of a pair was thrown out, but the right passed QC). To fix this you need to remove singletons and then ensure the reads are in the same order. The header of the nth FASTQ entry in the left file should match the header of the nth FASTQ entry in the right file (apart from the part of the header that marks a read as the left or right mate). Every header in the left file should show up in the right file.

It means you leave that flag alone; by default, Trinity treats paired-end data as unstranded.
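The singleton removal and re-ordering described in this reply can be sketched roughly as follows. This is not the script mentioned in this thread, just a minimal illustration. It assumes headers look like `@id/1` and `@id/2`, or `@id 1:N:0:...` and `@id 2:N:0:...`, and it holds the whole right file in memory, so it suits small inputs only. Dedicated tools exist for real data (for example, BBTools ships a repair.sh script for this purpose).

```python
def pair_key(header):
    """Strip the mate marker so both mates of a pair share one key.

    Assumes '@id/1' / '@id/2' or '@id 1:...' / '@id 2:...' style headers.
    """
    name = header.split()[0]
    if name.endswith("/1") or name.endswith("/2"):
        name = name[:-2]
    return name


def repair(left, right):
    """Drop singletons and put right reads in the order of the left reads.

    `left` and `right` are lists of FASTQ records, each a tuple whose
    first element is the header line.
    """
    right_by_key = {pair_key(rec[0]): rec for rec in right}
    out_left, out_right = [], []
    for rec in left:
        mate = right_by_key.get(pair_key(rec[0]))
        if mate is not None:  # keep only reads whose mate survived QC
            out_left.append(rec)
            out_right.append(mate)
    return out_left, out_right
```

After `repair`, both lists have the same length and the nth record of one is the mate of the nth record of the other.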


Hi Joe,

Thank you. What is the harm in pairs being out of order?

Nikelle


As I said above, Trinity expects each end of a pair to be at the same position in the corresponding files. If the left read of a pair is the 34th record of left.fastq, then the right read must be the 34th record of right.fastq.

When parsing the input, Trinity will grab the next record from left.fastq and from right.fastq, assuming that they're part of the same pair. If they aren't, Trinity will throw an error. Think about it: paired-end sequencing doesn't really work if you randomly choose (i.e. have unsorted) left and right reads.

I guess you could think of it like this: Imagine you have two full decks of cards, a left deck and a right deck. A pair is when you have the same card from each deck, e.g. the "left" ace of spades and the "right" ace of spades. If both decks aren't in the same order, the card drawn from the left deck won't match the card drawn from the right deck. Which means you don't have your pair of reads.
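In code terms, "drawing one card from each deck and checking that they match" looks something like the sketch below. It walks two lists of FASTQ header lines in step and reports the first position where the mates do not match. The header convention (`/1` and `/2` suffixes, or a space followed by the mate number) is an assumption, and `first_mismatch` is a hypothetical helper, not something Trinity provides.

```python
from itertools import zip_longest


def read_name(header):
    """Return the read name without its mate marker ('/1' or '/2')."""
    name = header.split()[0]
    return name[:-2] if name.endswith(("/1", "/2")) else name


def first_mismatch(left_headers, right_headers):
    """Return the 1-based record number of the first out-of-sync pair.

    Returns None when every left header matches the right header at the
    same position and both inputs have the same number of records.
    """
    pairs = zip_longest(left_headers, right_headers)
    for n, (h1, h2) in enumerate(pairs, start=1):
        # A None on either side means one file ran out of reads early.
        if h1 is None or h2 is None or read_name(h1) != read_name(h2):
            return n
    return None
```

With real files, the header lines are every fourth line starting from the first line of each FASTQ file.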


Does that mean the number of reads in both RF and FR should be the same after quality filtering and trimming?


When you use a paired-end-aware trimming program, the answer is yes. Not only will the numbers be the same, but the order of reads will also be retained in the result files (which is important for alignment programs).


Sorry, I should have clarified what I meant by sorting. QC can (will) create instances where one read in a pair was thrown out while the other was kept. The read that passed QC is called a singleton.

Like genomax2 said, the numbers should be the same. Prior to sorting, singletons should be removed from the FASTQ files. After removing the singletons, the FASTQ files should be sorted. The script I have for this does both steps at once, so in my head removing singletons and sorting reads are a single step.

6.8 years ago
Mehmet ▴ 780

Hi,

For de novo transcriptome assembly:

1. Do quality control.
3. Start de novo assembly using an assembly tool.

Which sequencing platform did you use?


I used Illumina for sequencing.