Question

Single End Reads


11.3 years ago

rjames.biotech ▴ 10

Hi! I am a novice in this field and have only just started. I will be working on an Ion Torrent PGM within a month, and I have a very basic question. What are single-end reads, and why are they not preferred for genome assembly? Is there any suite for checking the quality of a sample in NGS? Is there any site or material where I can get extensive details about the above questions?

Thanks. RJames


updated 11.3 years ago by KCC ★ 4.1k • written 11.3 years ago by rjames.biotech ▴ 10

score 7 · Answer 1 · 2013-07-16

During the sequencing process, what you are actually sequencing are DNA fragments. By 'fragment', I literally mean a string of nucleotides produced during your sample preparation. Usually, the fragment is much longer than your read length. However, due mostly to cost and technology, we can usually sequence only the first stretch of nucleotides from one end of a fragment. A read produced this way is called a single-end read. For instance, you might have a sample where the average fragment is 350 bp, but the data you get back is the first 100 bp of each fragment.
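To make the 350 bp versus 100 bp example concrete, here is a minimal Python sketch. The function name `single_end_read` and the random fragment are made up for illustration; a real sequencer also introduces errors and variable read lengths, which this ignores.

```python
import random

def single_end_read(fragment: str, read_length: int = 100) -> str:
    """Return only the first read_length bases of a fragment (one end)."""
    return fragment[:read_length]

# Hypothetical 350 bp fragment of random sequence.
random.seed(0)
fragment = "".join(random.choice("ACGT") for _ in range(350))

read = single_end_read(fragment)
unread = len(fragment) - len(read)

# The read covers the start of the fragment; the rest is never observed.
print(f"read length: {len(read)} bp, unread: {unread} bp")
```

Note that nothing in `read` tells you the fragment was 350 bp long, which is the point made in the next paragraph.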

You might ask what happens to the other nucleotides we can't read. They remain unknown to us. For single-end reads, the size of each fragment we are sequencing also remains unknown to us. So, the only information we have for figuring out where a read came from is the sequence that we get back.

Paired-end sequencing is another approach. In this case, whenever we sequence one end of the fragment, we sequence the other end as well. So, now for each fragment we have two reads. In the case of the sample where the average fragment length is 350 bp, we would have two reads of 100 bp each, which are labeled in our data as being a pair, meaning that they come from the same fragment. Now, when we try to place this pair of reads in the genome, we know that we need to find an area where they both match uniquely and are relatively close to each other. This is extra information we didn't have before.
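As a rough illustration of why the pairing helps, the toy Python sketch below places a read that falls in a repeated sequence. The genome, the helper names (`paired_end_reads`, `place_pair`) and the `max_span` cutoff are all invented for this example, and it assumes exact matches on the forward strand only; real aligners such as bwa or bowtie 2 are far more sophisticated.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]

def paired_end_reads(fragment: str, read_length: int):
    """Read 1 is the start of the fragment; read 2 is read from the
    opposite end, so it is the reverse complement of the fragment's tail."""
    read1 = fragment[:read_length]
    read2 = reverse_complement(fragment[-read_length:])
    return read1, read2

def find_all(genome: str, seq: str):
    """All start positions where seq matches the genome exactly."""
    positions, start = [], genome.find(seq)
    while start != -1:
        positions.append(start)
        start = genome.find(seq, start + 1)
    return positions

def place_pair(genome: str, read1: str, read2: str, max_span: int):
    """Keep only placements where both mates fit within max_span bases."""
    mate = reverse_complement(read2)  # read 2 as it appears on the forward strand
    placements = []
    for p1 in find_all(genome, read1):
        for p2 in find_all(genome, mate):
            end = p2 + len(mate)
            if p2 >= p1 and end - p1 <= max_span:
                placements.append((p1, end))
    return placements

# Toy genome: the same 8 bp repeat occurs twice; the tail is unique.
repeat, unique = "ACGTTGCA", "GGCTCAGG"
genome = repeat + "T" * 30 + repeat + "A" * 10 + unique
fragment = genome[38:]  # second repeat copy through the unique tail

read1, read2 = paired_end_reads(fragment, read_length=8)
single_hits = find_all(genome, read1)                       # ambiguous alone
placements = place_pair(genome, read1, read2, max_span=40)  # resolved by the mate
print(single_hits, placements)
```

On its own, `read1` matches both copies of the repeat. Once the mate and a plausible fragment size are taken into account, only one placement survives.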

In the case of genome assembly, the extra information about which reads are close to which other reads helps tremendously. This is why paired-end sequencing is preferred.

For sequencing in general, the biological methods for looking at the quality of the sample are extremely helpful. These include the use of a Bioanalyzer and qPCR. By the time the bioinformatics analyst is looking at the data, the damage is already done, so to speak. After the sequencing, there are many tools I have found useful:

fastqc gives you initial information about your reads
bwa or bowtie 2, together with samtools, give you a sense of how much of your sequence mapped uniquely, and an unusually low fraction can be a warning
For looking at the quality of the matches, the CIGAR string in the SAM file has been useful to me
Visualizing your data in IGV gives you a lot of information on whether your data looks the way you expect
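As a rough sketch of the mapping checks mentioned above, the Python below reads SAM records, counts how many reads mapped, and inspects their CIGAR strings. The function names, the example records and the `min_mapq=20` cutoff are assumptions for illustration. In particular, treating a MAPQ threshold as a stand-in for "uniquely mapped" is only a convention, and MAPQ values are not directly comparable between bwa and bowtie 2.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def parse_cigar(cigar: str):
    """Split a CIGAR string such as '3S7M' into (length, operation) pairs."""
    return [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]

def summarize_sam(lines, min_mapq: int = 20):
    """Count total, mapped, confidently mapped and clipped reads."""
    total = mapped = confident = clipped = 0
    for line in lines:
        if line.startswith("@"):          # header line
            continue
        fields = line.rstrip("\n").split("\t")
        flag, mapq, cigar = int(fields[1]), int(fields[4]), fields[5]
        if flag & 0x900:                  # secondary or supplementary alignment
            continue
        total += 1
        if flag & 0x4:                    # read is unmapped
            continue
        mapped += 1
        if mapq >= min_mapq:
            confident += 1
        if any(op in "SH" for _, op in parse_cigar(cigar)):
            clipped += 1                  # soft- or hard-clipped alignment
    return {"total": total, "mapped": mapped,
            "confident": confident, "clipped": clipped}

def sam_line(*fields):
    return "\t".join(str(f) for f in fields)

# Made-up records: QNAME FLAG RNAME POS MAPQ CIGAR RNEXT PNEXT TLEN SEQ QUAL
example = [
    "@SQ\tSN:chr1\tLN:1000",
    sam_line("r1", 0, "chr1", 101, 60, "10M", "*", 0, 0, "ACGTACGTAC", "IIIIIIIIII"),
    sam_line("r2", 0, "chr1", 201, 0, "10M", "*", 0, 0, "ACGTACGTAC", "IIIIIIIIII"),
    sam_line("r3", 4, "*", 0, 0, "*", "*", 0, 0, "ACGTACGTAC", "IIIIIIIIII"),
    sam_line("r4", 16, "chr1", 301, 42, "3S7M", "*", 0, 0, "ACGTACGTAC", "IIIIIIIIII"),
]

summary = summarize_sam(example)
print(summary)
```

For real data, `samtools flagstat` reports similar counts directly; the sketch is only meant to show where the numbers come from in the SAM fields.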

I have found that figuring out how good your sequencing results are is usually a lot of detective work, and there is no fixed formula, because every sequencing sample tends to be unique. The many, many procedures used to prepare the sample often provide lots of opportunity to introduce variation. The distribution of fragments is unique. The effect of the PCR amplification is unique.

score 0 · Answer 2 · 2013-07-16


11.3 years ago

vijay ★ 1.6k

I don't have a straight answer for your first question. However, I feel the basic reason we prefer a paired-end or bidirectional sequencing approach is to obtain good coverage of the targeted region. Many tools can help you assess the quality of your dataset, e.g. fastqc, NGSQC, the fastx toolkit, Galaxy, etc. I think you can consult the papers describing these tools to obtain more information. It is also a good idea to keep an eye on updates as they come out.
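To give a feel for the kind of per-read quality summary these tools report, here is a minimal Python sketch that decodes FASTQ quality strings. It assumes Phred+33 encoding and simple four-line records; the helper names and the two example reads are made up, and this is not how fastqc itself is implemented.

```python
def phred_scores(quality_line: str, offset: int = 33):
    """Convert a FASTQ quality string to Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_line]

def read_fastq(lines):
    """Yield (name, sequence, quality) from four-line FASTQ records."""
    records = [line.strip() for line in lines if line.strip()]
    for i in range(0, len(records) - 3, 4):
        header, seq, _plus, qual = records[i:i + 4]
        yield header[1:], seq, qual

# Two made-up reads: one uniformly high quality, one degrading toward the end.
fastq = ["@read1", "ACGT", "+", "IIII",
         "@read2", "ACGT", "+", "I5+!"]

means = {}
for name, seq, qual in read_fastq(fastq):
    scores = phred_scores(qual)
    means[name] = sum(scores) / len(scores)

print(means)
```

A Phred score Q corresponds to an estimated error probability of 10^(-Q/10), so a mean around 40 is very good while a mean below 20 usually deserves a closer look.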