Question: de novo transcriptome assembly work flow, paired end reads
3.6 years ago by
Providence College, Providence, RI
nikelle.petrillo100 wrote:

Hello! I am new to RNA seq. I am trying to assemble a transcriptome, de novo. I have 100 bp paired end reads. I am having a hard time finding a good guide to paired end de novo assembly; I have been taking bits and pieces from different guides, however, I am starting to become confused. I would like to explain the steps I have done so far, and any advice on things I have done wrong, or need to do, would be greatly appreciated! 

  1. I used the FASTX-Toolkit quality trimmer to trim bases with a Phred score below 30. 
  2. I know I am supposed to clip adapter sequences; however, I do not think I have any on my reads. Is this possible? Is there a way to check?

General question: I have paired end reads... am I supposed to quality trim them separately? By separately, I mean quality trim the left reads and then quality trim the right? No matter how much reading I do on PE reads, I am still confused about this. 

  3. I used a left read file and its corresponding right read file to assemble transcripts using Trinity. Does anyone know how to tell if I needed to set a strand-specific parameter (such as RF or FR), and how I figure this out if need be? 

 

I know this is many MANY questions. Any help would be so so so much appreciated! 

 

Thank you, 

Nikelle 

ADD COMMENTlink modified 3.6 years ago by pld4.8k • written 3.6 years ago by nikelle.petrillo100

Appears that you have chosen to ignore the advice from a previous question that you had posted here: Generating Read Length?

You should be scanning/trimming your reads in pairs so as not to lose their order in respective files.

ADD REPLYlink written 3.6 years ago by genomax71k

I am not sure what you mean by "in pairs." Would I need to combine each left and right read file together somehow and then trim? 

ADD REPLYlink written 3.6 years ago by nikelle.petrillo100

Paired-end-aware trimming programs (Trimmomatic, cutadapt and BBDuk) will accept a file pair for a sample (R1/R2) and then process the files together.

If a read gets trimmed and becomes a candidate for elimination based on criteria that you set (e.g. less than a certain length), then the program will remove the corresponding read from the OTHER sequence file (even though that read may still meet the passing criteria). This helps keep the reads in the two files in sync. Most aligners expect to find the pairs of reads for a fragment at corresponding positions in the two files. If the files are not in sync, the reads may still be used in alignment but may be reported as aligning discordantly or not aligning at all.
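To make the "remove the mate from the OTHER file" behaviour concrete, here is a minimal Python sketch. It is not how Trimmomatic, cutadapt or BBDuk are implemented; the record layout, the function names and the thresholds (min_qual=30, min_len=36, Phred+33 encoding) are all assumptions chosen for illustration.

```python
from typing import Iterable, Iterator, Tuple

# A FASTQ record as (header, sequence, qualities); hypothetical in-memory form.
Record = Tuple[str, str, str]


def trim_3prime(record: Record, min_qual: int = 30, offset: int = 33) -> Record:
    """Trim low-quality bases from the 3' end (Phred+33 encoding assumed)."""
    header, seq, qual = record
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_qual:
        end -= 1
    return header, seq[:end], qual[:end]


def filter_pairs(
    pairs: Iterable[Tuple[Record, Record]], min_len: int = 36
) -> Iterator[Tuple[Record, Record]]:
    """Yield trimmed pairs, dropping BOTH mates if either falls below min_len."""
    for left, right in pairs:
        left, right = trim_3prime(left), trim_3prime(right)
        if len(left[1]) >= min_len and len(right[1]) >= min_len:
            yield left, right
```

Because a pair is only ever kept or dropped as a unit, the two output files end up with the same number of reads in the same order.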

ADD REPLYlink modified 3.6 years ago • written 3.6 years ago by genomax71k

Thank you very much, this was informative. 

ADD REPLYlink written 3.6 years ago by nikelle.petrillo100
3.6 years ago by
pld4.8k
United States
pld4.8k wrote:

1. You can trim the files separately and then sort them. You'll want to do this anyway to make sure you remove singletons. Singletons are what is left when one read in a pair is discarded during quality filtering/trimming. For example, if the left read was filtered out due to quality, the right read is no longer part of a pair. The right read would be the singleton.

You can use an interleaved file (where the two ends of a pair are stored as alternating records), but it seems that most programs expect the left and right reads to be in separate files. I would stick with separate files.

2. You can use a program such as cutadapt to remove any adapters.

3. Ask the person or people who generated the cDNA library if they used a strand specific kit. If a strand specific kit was not used, then your data isn't strand specific. In the context of trinity, this means you don't specify anything with the --SS_lib_type flag.

I would QC the reads and trim adapters (as individual files), then sort the reads (and remove singletons), then start Trinity. Trinity assumes that pairs are on the same line in each file. So, for example, the nth left read and the nth right read should each be on the nth line of their corresponding file. 
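On the original question of whether there is a way to check for adapters before running something like cutadapt: one rough check is to count how many reads contain the start of a known adapter. The sketch below assumes the common Illumina adapter prefix AGATCGGAAGAGC; that is an assumption about the library kit, so swap in the sequence your kit actually used. The function name is hypothetical, and only exact matches are counted.

```python
from typing import Iterable

# First 13 bases of the common Illumina adapter; an assumption about the
# library kit, so substitute the sequence from your own kit's documentation.
ILLUMINA_ADAPTER_PREFIX = "AGATCGGAAGAGC"


def adapter_fraction(
    sequences: Iterable[str], adapter: str = ILLUMINA_ADAPTER_PREFIX
) -> float:
    """Fraction of reads containing an exact copy of the adapter prefix."""
    total = hits = 0
    for seq in sequences:
        total += 1
        if adapter in seq.upper():
            hits += 1
    return hits / total if total else 0.0
```

A fraction near zero on a sample of reads suggests there is little adapter to clip; partial adapters at the very end of a read will be missed by this exact-match check, which is one reason to still run a dedicated trimmer.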

ADD COMMENTlink modified 3.6 years ago • written 3.6 years ago by pld4.8k

thanks so much! 

In "1."  when you say "then sort them," what do you mean by sort?
In "3." what do you mean by "In the context of trinity, this means you don't specify anything with the --SS_lib_type flag"?

Your response was very helpful!

Nikelle  


 

 

ADD REPLYlink written 3.6 years ago by nikelle.petrillo100

If you QC them separately, pairs may be out of order or broken (the left read of a pair was thrown out, but the right passed QC). To fix this you need to remove singletons and then ensure the reads are in the same order. The header of the nth fastq entry in the left file should be the same as the header of the nth fastq entry in the right file. Every header in the left file should show up in the right file.
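The singleton removal and re-ordering described above can be sketched in Python as follows. This is not the script mentioned in this thread; the function names are hypothetical, and it assumes 4-line FASTQ records whose mates share a read name (optionally ending in /1 and /2). It also holds the whole right file in memory, so it is only a sketch for small inputs.

```python
from typing import Dict, Iterator, List, TextIO, Tuple

Record = Tuple[str, str, str, str]  # the four lines of one FASTQ entry


def read_fastq(handle: TextIO) -> Iterator[Record]:
    """Yield 4-line FASTQ records (assumes no wrapped sequence lines)."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        yield header, seq, plus, qual


def pair_key(header: str) -> str:
    """Read name shared by both mates: drop '@', any comment, and a /1 or /2 suffix."""
    name = header[1:].split()[0]
    if name.endswith("/1") or name.endswith("/2"):
        name = name[:-2]
    return name


def resync(left: TextIO, right: TextIO) -> Tuple[List[Record], List[Record]]:
    """Drop singletons and return both files' records in the left file's order."""
    right_by_key: Dict[str, Record] = {pair_key(r[0]): r for r in read_fastq(right)}
    kept_left: List[Record] = []
    kept_right: List[Record] = []
    for rec in read_fastq(left):
        mate = right_by_key.get(pair_key(rec[0]))
        if mate is not None:  # left reads without a mate are singletons
            kept_left.append(rec)
            kept_right.append(mate)
    return kept_left, kept_right
```

Right reads whose left mate was discarded are never looked up, so they are dropped as well; writing the two returned lists back out gives files where the nth entry on each side belongs to the same pair.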

It means you leave that flag alone; the default for Trinity is to treat paired-end data as unstranded.
 

ADD REPLYlink written 3.6 years ago by pld4.8k

Hi Joe, 

Thank you. What is the harm in pairs being out of order? 

Nikelle 

 

ADD REPLYlink written 3.6 years ago by nikelle.petrillo100

As I said above, Trinity expects each end of a pair to be on the same line in their corresponding files. If the left read of a pair is on line 34 of left.fastq, then the right read must be on line 34 of right.fastq.

When parsing the input, Trinity will grab the next line of left.fastq and right.fastq, assuming that they're part of the same pair. If they aren't, Trinity will throw an error. Think about it: paired-end sequencing doesn't really work if you randomly choose (i.e. have unsorted) left and right reads.

I guess you could think of it like this: Imagine you have two full decks of cards, a left deck and a right deck. A pair is when you have the same card from each deck, e.g. the "left" ace of spades and the "right" ace of spades. If both decks aren't in the same order, the card drawn from the left deck won't match the card drawn from the right deck. Which means you don't have your pair of reads.
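In the card analogy, checking that both decks are in the same order means drawing one card from each and comparing them. A minimal Python sketch of that check for two FASTQ files is below; the function names are hypothetical, and it assumes 4-line records with mate names that match apart from an optional /1 or /2 suffix.

```python
from itertools import zip_longest
from typing import Iterable, Optional


def read_name(header: str) -> str:
    """Read name without '@', trailing comment, or a /1 or /2 mate suffix."""
    fields = header.strip().lstrip("@").split()
    name = fields[0] if fields else ""
    return name[:-2] if name.endswith(("/1", "/2")) else name


def first_mismatch(left: Iterable[str], right: Iterable[str]) -> Optional[int]:
    """Return the 1-based index of the first out-of-sync record, or None if in sync."""
    for line_no, (l_line, r_line) in enumerate(zip_longest(left, right)):
        if line_no % 4 != 0:
            continue  # only header lines carry the read name
        if l_line is None or r_line is None:
            return line_no // 4 + 1  # one file ran out of reads first
        if read_name(l_line) != read_name(r_line):
            return line_no // 4 + 1
    return None
```

Both arguments can be open file handles (for example, open("left.fastq") and open("right.fastq")), since iterating over a text file yields its lines one at a time.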

ADD REPLYlink modified 3.6 years ago • written 3.6 years ago by pld4.8k

Does that mean the number of reads in both RF and FR should be the same after quality filtering and trimming?

ADD REPLYlink written 3.5 years ago by Rahul30

When you use a paired-end-aware trimming program, the answer is yes. Not only would the numbers be the same, but the order of reads would also be retained in the result files (which is important for alignment programs).

ADD REPLYlink modified 3.5 years ago • written 3.5 years ago by genomax71k

Sorry, I should have clarified what I meant by sorting. QC can (will) create instances where one read in a pair was thrown out while the other was kept. The read that passed QC is called a singleton.

Like genomax2 said, the numbers should be the same. Prior to sorting, singletons should be removed from any fastq files. After removing the singletons, the fastq files should be sorted. The script I have for this does both steps at once, so in my head removing singletons and sorting reads are a single step.

ADD REPLYlink written 3.5 years ago by pld4.8k
3.6 years ago by
Mehmet490
Japan
Mehmet490 wrote:

Hi:

For de novo transcriptome assembly:

1. do quality control

2. trim (remove) adaptors

3. start de novo assembly by using an assembly tool.

Which sequencing platform did you use?

ADD COMMENTlink written 3.6 years ago by Mehmet490

thanks for your help! 

I used Illumina for sequencing 

 

ADD REPLYlink modified 3.6 years ago • written 3.6 years ago by nikelle.petrillo100