Question

Paired-end Illumina reads

0

8.5 years ago

midox ▴ 290

Hello,
In paired-end sequencing we get two files, R1.fq and R2.fq.
Is R1.fq the first strand of the DNA and R2.fq the second strand?
And how do the reads overlap? Must we merge each read in R1.fq with its mate in R2.fq into a single read?

Thank you

Assembly paired end • 14k views

ADD COMMENT • link updated 21 months ago by Ram 43k • written 8.5 years ago by midox ▴ 290

0

Hello,

I have some more questions, please. How can we know the "known distance" between paired-end reads? And if we have two files, forward and reverse, does the reverse file always come from the reverse-complement strand? Can reverse-complement reads never appear in the forward file?

Thanks

ADD REPLY • link updated 21 months ago by Ram 43k • written 8.2 years ago by midox ▴ 290

1

Known distance is usually a best guess based on how long the target sequences are supposed to be: library insert size or length from timed PCR.

Yes, reverse is from the reverse complement. Here's a video that describes Illumina paired end sequencing:

ADD REPLY • link updated 21 months ago by Ram 43k • written 8.2 years ago by anp375 ▴ 180

0

How can I know the distance? I only have the paired-end files.

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 8.2 years ago by midox ▴ 290

0

If you search this forum, you can find the answer.

To know the distance, you need to map the reads to a reference and get the SAM/BAM mapping file.

There are tools that estimate the distance between read mates from that mapping.
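As a rough illustration of what such tools do, the sketch below reads the template length (TLEN, column 9 of a SAM record) from properly paired reads and summarizes it. The function name `template_lengths` and the example records are made up for illustration; a real analysis would read the records from an actual SAM file.

```python
import statistics

def template_lengths(sam_lines):
    """Collect positive TLEN values (SAM column 9) from properly paired reads."""
    lengths = []
    for line in sam_lines:
        if line.startswith("@"):          # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])             # column 2: bitwise FLAG
        tlen = int(fields[8])             # column 9: observed template length
        # 0x2 means "properly paired"; keeping only tlen > 0 counts each
        # pair once, because the leftmost mate carries the positive value.
        if flag & 0x2 and tlen > 0:
            lengths.append(tlen)
    return lengths

# Hypothetical SAM records (sequence and quality replaced by '*').
example_sam = [
    "@SQ\tSN:chr1\tLN:5000",
    "\t".join(["r1", "99", "chr1", "100", "60", "100M", "=", "400", "400", "*", "*"]),
    "\t".join(["r1", "147", "chr1", "400", "60", "100M", "=", "100", "-400", "*", "*"]),
    "\t".join(["r2", "99", "chr1", "1000", "60", "100M", "=", "1350", "450", "*", "*"]),
    "\t".join(["r2", "147", "chr1", "1350", "60", "100M", "=", "1000", "-450", "*", "*"]),
    "\t".join(["r3", "99", "chr1", "2000", "60", "100M", "=", "2250", "350", "*", "*"]),
    "\t".join(["r3", "147", "chr1", "2250", "60", "100M", "=", "2000", "-350", "*", "*"]),
    # Paired but not "properly paired": ignored by the filter above.
    "\t".join(["r4", "65", "chr1", "3000", "60", "100M", "=", "4900", "2000", "*", "*"]),
]

lengths = template_lengths(example_sam)
print("median template length:", statistics.median(lengths))
```

Note that TLEN measures the whole fragment, from the leftmost to the rightmost mapped base, so the unsequenced gap between the mates is TLEN minus the two read lengths.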

ADD REPLY • link 8.2 years ago by Antonio R. Franco ★ 5.1k

0

Do you have any references, please?

ADD REPLY • link 8.2 years ago by midox ▴ 290

0

http://lmgtfy.com/?q=figure+out+internal+distance+in+paired+end+reads

ADD REPLY • link 8.2 years ago by Antonio R. Franco ★ 5.1k

Ram · Answer 1 · 2015-10-21

1

8.5 years ago

andynkili ▴ 10

You should specify what kind of paired reads you have (SOLiD, Illumina, ...).

About merging: it depends on what you are going to do with the reads (what kind of analysis). In my view, read mapping and assembly are more accurate with paired-end reads than with merged reads, but again it depends on what you want to do.
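To make "merging" concrete, here is a minimal sketch of how two mates can be merged when the fragment is shorter than the two read lengths combined, so that the ends overlap. It assumes exact matches and error-free reads, which real read mergers do not; the names `merge_pair` and `min_overlap` are illustrative only.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]

def merge_pair(r1, r2, min_overlap=4):
    """Merge mate r1 with the reverse complement of mate r2.

    Returns the merged fragment, or None if the end of r1 does not
    match the start of reverse_complement(r2) over at least
    min_overlap bases.
    """
    r2_rc = reverse_complement(r2)
    for k in range(min(len(r1), len(r2_rc)), min_overlap - 1, -1):
        if r1[-k:] == r2_rc[:k]:
            return r1 + r2_rc[k:]
    return None

# Hypothetical 16 bp fragment sequenced with 10 bp reads from each end.
fragment = "ACGTTGCAAGGCTTAC"
r1 = fragment[:10]                        # forward read, as stored in R1
r2 = reverse_complement(fragment[-10:])   # reverse read, as stored in R2
print(merge_pair(r1, r2))
```

When the fragment is longer than the two reads together, there is nothing to merge and the function returns None; the pair then carries only distance and orientation information.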

ADD COMMENT • link 8.5 years ago by andynkili ▴ 10

0

I want to do an assembly, but first I have to understand the reads in the two files.

For example, I have two files, R1.fq and R2.fq:

r1.1   ---->........................<----  r1.2                                                          
                           r2.1    ---->........................<----   r2.2                                    
  r3.1  ---->........................<----   r3.2                                                               
                                                            rX.1      ---->........................<---- rX.2

How do I assemble in this case?

Thank you

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 8.5 years ago by midox ▴ 290

0

All your rX.1 reads are the forward reads, stored in one file, and the rX.2 reads are the reverse ones, stored in the other. You can use an assembler (SPAdes, Ray, ...) and specify on the command line which file contains the forward reads and which the reverse reads. The choice of assembler depends on the type of organism you are studying: some assembly tools perform better on viruses, others on bacteria. Before going into any assembly step, did you preprocess your reads? Any filtering? Any trimming?

ADD REPLY • link 8.5 years ago by andynkili ▴ 10

0

Thanks for your help.

Yes, I know I can pass them to an assembler.

But I want to understand how assembly works with paired-end reads (two files).

For example, if I want to find overlapping reads, do I take the forward and the reverse reads as they are and look for overlaps?

Or do I first take the reverse complement of R2.fq?

Thanks

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 8.5 years ago by midox ▴ 290

0

Just give your reads to an assembler. Don't preprocess them unless the assembler requires it. If you want to understand how paired-end assembly works, read the paper for the assembler you use. Different assemblers work in different ways. Choose your assembler first, and then ask questions.

ADD REPLY • link 8.5 years ago by lh3 33k

0

Thanks for your help.

I'm not asking about existing assemblers.

I have used assemblers, but if I want to write one myself I need to understand the basics, and especially the paired-end read files.

I am confused about paired-end reads.

In the case of a single file, I can find the overlaps between the reads of that file.

But in the case of paired-end reads (two files), do I have to find the overlaps between the reads of the two files?

Thank you

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 8.5 years ago by midox ▴ 290

Ram · Answer 2 · 2015-10-21

1

8.5 years ago

Kamil ★ 2.3k

You might consider learning more about RNA-seq before you begin any analysis.

Try starting with RNA-seqlopedia, a comprehensive reference that covers every step of RNA-seq:

Experimental Design
RNA Preparation
Library Preparation
Sequencing
Analysis
References

To answer your particular question about paired reads, try the section on paired-end sequencing.

ADD COMMENT • link updated 4.4 years ago by Ram 43k • written 8.5 years ago by Kamil ★ 2.3k

0

Thank you for your help.
I'm not asking about RNA. I have DNA data and I want to do an assembly, but first I need to understand the R1.fq and R2.fq files and how to find the overlaps.

Thanks

ADD REPLY • link 8.5 years ago by midox ▴ 290

Ram · Answer 3 · 2015-10-21

1

8.5 years ago

Antonio R. Franco ★ 5.1k

With paired-end sequencing, you sequence both ends of the same shotgun fragment. It is just as you have drawn it. The middle part, shown as dots in your diagram, remains unknown.

The principles of assembly do not differ much from those of single-end assembly (where only one end of the shotgun fragment is sequenced), but the assembler takes the information from both ends into account at the same time.

This is highly advantageous. By sequencing both ends of each shotgun fragment, you double the amount of information available for the assembly. In addition, the fact that both reads come from the same fragment and are separated by a known distance imposes constraints on the assembler and allows a better placement of the sequences in the final draft of the genome.

ADD COMMENT • link 8.5 years ago by Antonio R. Franco ★ 5.1k

0

Thank you for your explanations.

So, with both the R1.fq and R2.fq files, I just find the overlaps between the reads in the two files without taking the reverse complement of R2.fq? I use both files as they are?

Thank you

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 8.5 years ago by midox ▴ 290

1

Exactly. The assembler will take care of it. But you need to tell the program that this is paired data. Most of the time you do this simply by giving one file followed by the other on the command line.
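As one example of how the two files are passed, SPAdes takes the forward file with `-1` and the reverse file with `-2`, plus an output directory with `-o`. The sketch below only builds that command; the file names and the helper name `spades_command` are placeholders, and other assemblers use different options.

```python
def spades_command(r1_path, r2_path, out_dir):
    """Build a SPAdes command line for one paired-end library."""
    return ["spades.py", "-1", r1_path, "-2", r2_path, "-o", out_dir]

# Placeholder file names, not taken from this thread's data.
cmd = spades_command("R1.fq", "R2.fq", "assembly_out")
print(" ".join(cmd))
# To actually run it (requires SPAdes to be installed):
#   import subprocess; subprocess.run(cmd, check=True)
```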

ADD REPLY • link 8.5 years ago by Antonio R. Franco ★ 5.1k

0

Thanks for your help. I'm not asking about how to run assemblers.

I know that I can tell the assembler what type of files I have.

But suppose, for example, that I want to write an assembler myself.

I want to understand paired-end reads and how they work.

Tell me if I'm wrong: to assemble paired reads, I must take both the R1.fq and R2.fq files and find the overlaps between the reads of the two files?

I'm no expert, but I want to understand the basics. I have read many articles, but none of them addresses these questions.

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 8.5 years ago by midox ▴ 290

2

When you sequence with NGS (Next Generation Sequencing) systems, you break the DNA or cDNA into pieces (the shotgun fragments) and run a step that selects fragments of a given size (say 600 bases, with a standard deviation of about 50 bases).

Then, by sequencing 100 bases from both ends, you know that there is a central part of the fragment, approximately 400 +/- 50 bases long, whose sequence remains unknown. This is not a big deal, because that region will be covered by reads from other shotgun fragments, as long as the DNA was broken at random. Simply by chance, other sequenced ends will cover that space.
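The arithmetic in this example can be written out as a small sketch (the function name `unsequenced_gap` is just for illustration):

```python
def unsequenced_gap(fragment_len, read_len):
    """Bases in the middle of a fragment not covered by either mate.

    Returns 0 when the fragment is so short that the two reads
    meet or overlap.
    """
    return max(fragment_len - 2 * read_len, 0)

# Fragments of 600 +/- 50 bases, sequenced 100 bases from each end.
for fragment_len in (550, 600, 650):
    print(fragment_len, "->", unsequenced_gap(fragment_len, 100))
```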

You only have to change your thinking a little. The assembler tries to fit everything together by looking for overlaps between these pieces. If only one end is sequenced, I know this is easy for you to understand.

Now consider that you actually have a pair of sequences from the same DNA fragment, separated by a certain distance, and that pair has to play the same game.

Each end can be saved in a different file. The name of each read allows the assembler to recognize its mate in the other file, if present. That is why you can have two separate files.
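A minimal sketch of matching mates by name across two FASTQ files is shown below. It assumes the older naming style in which mates share a name and differ only in a trailing /1 or /2, and it uses small in-memory files with made-up reads; `read_fastq`, `base_name` and `pair_mates` are illustrative names.

```python
import io

def read_fastq(handle):
    """Yield (read_id, sequence, quality) from a 4-line-per-record FASTQ."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            return
        seq = handle.readline().rstrip()
        handle.readline()                 # '+' separator line
        qual = handle.readline().rstrip()
        yield header[1:], seq, qual       # drop the leading '@'

def base_name(read_id):
    """Strip any description and a trailing /1 or /2 from a read ID."""
    name = read_id.split()[0]
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name

def pair_mates(r1_handle, r2_handle):
    """Return (name, r1_seq, r2_seq) for reads present in both files."""
    mates = {base_name(n): s for n, s, _ in read_fastq(r2_handle)}
    pairs = []
    for name, seq, _ in read_fastq(r1_handle):
        key = base_name(name)
        if key in mates:
            pairs.append((key, seq, mates[key]))
    return pairs

# Made-up records; frag3 has no mate in the first file.
r1_text = "@frag1/1\nACGT\n+\nIIII\n@frag2/1\nGGCA\n+\nIIII\n"
r2_text = "@frag2/2\nCCAT\n+\nIIII\n@frag1/2\nTTGA\n+\nIIII\n@frag3/2\nAAAA\n+\nIIII\n"
pairs = pair_mates(io.StringIO(r1_text), io.StringIO(r2_text))
print(pairs)
```

In practice the two files usually list the mates in the same order, so tools can simply read them in step; the dictionary lookup here only makes the name matching explicit.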

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 8.5 years ago by Antonio R. Franco ★ 5.1k

0

Thank you for your explanations.

So if we have:

r1.1 ----> ........................ <---- R1.2
                               r2.1 ----> ........................ <---- R2.2
   r3.1 ----> ........................ <---- R3.2
                                                   rX.1 ----> ........................ <---- rX.2

Here there is an overlap between r1.1 and r3.1, so we can assemble them.

There is also an overlap between r1.2 and r2.1.

In that case, the overlap is between a read from R1.fq and a read from R2.fq. Do we assemble them in the same way as r1.1 and r3.1 (without taking the reverse complement of r1.2, using the sequence as it is in the file)?

Thank you

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 8.5 years ago by midox ▴ 290

1

I don't see in your example that r1.2 overlaps with r2.1.

I see that r2.1 overlaps with the reverse of r3.2, and the latter with the reverse of r1.2. So a contig can be formed with r2.1, the reverse of r3.2 and the reverse of r1.2, in this order.
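The point about reversing can be sketched in code. Here "the reverse" is taken to mean the reverse complement, and only exact suffix-prefix matches are considered, which is a simplification of what real assemblers do; the sequences and function names are made up for illustration.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]

def suffix_prefix_overlap(a, b, min_len=3):
    """Length of the longest suffix of a equal to a prefix of b.

    Returns 0 if no overlap of at least min_len bases exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

# Made-up genome "ACGTTGCAAGGC": a forward read covers bases 0-7 and a
# reverse read covers bases 4-11, so it is stored reverse-complemented.
forward_read = "ACGTTGCA"
reverse_read = reverse_complement("TGCAAGGC")   # as it appears in the file

print(suffix_prefix_overlap(forward_read, reverse_read))
print(suffix_prefix_overlap(forward_read, reverse_complement(reverse_read)))
```

As stored in the file, the reverse read shows no overlap with the forward read; after reverse-complementing it, the 4-base overlap becomes visible. An overlap-based assembler therefore has to consider every read in both orientations.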

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 8.5 years ago by Antonio R. Franco ★ 5.1k

0

Okay.

So to make an assembly from paired-end reads, we must take the reverse of the rX.2 reads (the second file) in order to find the overlaps between all the reads of the two files?

Is it necessary to have all the reads in the same direction?

r1.1   ---->........................---->  r1.2                                                          
                           r2.1    ---->........................---->   r2.2                                    
  r3.1  ---->........................---->   r3.2                                                               
                                                    rX.1      ---->........................----> rX.2

Am I right?

Thanks

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 8.5 years ago by midox ▴ 290

0

Any response to this, please?

Thanks

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 8.5 years ago by midox ▴ 290

0

Any response, please?

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 8.5 years ago by midox ▴ 290

Ram · Answer 4 · 2015-10-21

One more thing. Some assemblers require you to keep your paired reads in a single FASTQ file. Most often, though, they come as separate files.

If you look at this Wiki page on FASTQ, you will see that every read name encodes the location of the read on the sequencer (which makes it possible to recognize which ends are paired), the barcode if there is one, and whether the sequence is the first read or its mate, marked with /1 and /2 respectively.
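As a sketch, the older Illumina identifier style described there (instrument:lane:tile:x:y#index/member) can be split as follows. The read name is the example used in the Wikipedia FASTQ article; newer Illumina software writes a different header layout, which this illustrative parser does not handle.

```python
def parse_illumina_id(read_id):
    """Split an old-style Illumina read ID into its parts."""
    location, _, tail = read_id.partition("#")
    index, _, member = tail.partition("/")
    instrument, lane, tile, x, y = location.split(":")
    return {
        "instrument": instrument,
        "lane": int(lane),
        "tile": int(tile),
        "x": int(x),                # cluster x coordinate within the tile
        "y": int(y),                # cluster y coordinate within the tile
        "index": index,             # barcode / index field
        "member": int(member),      # 1 = first read, 2 = its mate
    }

def same_fragment(id_a, id_b):
    """True if two read IDs differ only in the /1 or /2 member field."""
    a, b = parse_illumina_id(id_a), parse_illumina_id(id_b)
    a.pop("member")
    b.pop("member")
    return a == b

first = "HWUSI-EAS100R:6:73:941:1973#0/1"
mate = "HWUSI-EAS100R:6:73:941:1973#0/2"
print(parse_illumina_id(first))
print(same_fragment(first, mate))
```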

Ram · Answer 5 · 2016-02-22

1

8.2 years ago

jomo018 ▴ 720

As for your original questions, from my own experience:
1) R1 holds reads from both strands, and so does R2. The only thing you know is that read x in R1 corresponds to read x in R2, and that the two come from opposite strands.
2) You can align each read against a reference genome. In that case, you need to reverse-complement half the data. You have no a priori information about which reads to reverse-complement, so you need to align both ways.
3) You can merge first and then align. You would still need to check both alignment directions.
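A toy version of "align both ways" is sketched below. It uses exact string matching against a short made-up reference, whereas real aligners tolerate mismatches and gaps; `locate` is an illustrative name.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]

def locate(read, reference):
    """Find a read on either strand of the reference.

    Returns (position, strand) with strand '+' or '-', or None if the
    read matches neither as given nor reverse-complemented.
    """
    pos = reference.find(read)
    if pos != -1:
        return pos, "+"
    pos = reference.find(reverse_complement(read))
    if pos != -1:
        return pos, "-"
    return None

reference = "ACGTTGCAAGGCTTAC"          # made-up reference sequence
print(locate("GTTGCA", reference))      # read from the forward strand
print(locate("AAGCCT", reference))      # read from the reverse strand
```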

ADD COMMENT • link 8.2 years ago by jomo018 ▴ 720

0

In this case, how can we use the relationship between the paired-end reads to find the relationship between the contigs after assembly, in order to build scaffolds?

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 8.2 years ago by midox ▴ 290

1

Assemblers use the paired-end information when forming contigs.

Gaps and repeated sequences are responsible for the assembly breaking into many contigs.

But assemblers cannot join or organize contigs into scaffolds once the assembly is done. They simply cannot do it, and this is why you end up with many different contigs.

To organize contigs into scaffolds you need a different strategy, such as mate-pair reads, long reads (PacBio, Illumina long reads, Nanopore) or a comparison with a trusted genome.

ADD REPLY • link 8.2 years ago by Antonio R. Franco ★ 5.1k

0

How can I get mate-pair reads?

Can we not do the scaffolding with only the paired-end reads?

Thanks

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 8.2 years ago by midox ▴ 290

1

Look in this forum and on the Illumina web pages for information about mate pairs, a different kind of paired sequences in which the two reads from the same fragment retain long-distance genomic information (usually several kb).

Assemblers cannot go beyond forming contigs because of the limitations of short shotgun reads and the presence of repeated sequences (hard to manage) and gaps. If an assembler using paired-end reads gives you separate contigs, it is because it cannot go beyond that. Assembly from second-generation sequencers that use short shotgun reads is far from perfect and very limited. An assembler working on a simple 4.5 Mb E. coli genome at 100X coverage of Illumina reads can give you between 150 and 250 different contigs.

This is why you need to overcome these limitations with mate pairs, long reads or, if one is available, comparison and ordering of the contigs against a trusted reference genome.
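As a rough sketch of how pairs spanning a long distance can help order contigs: once each mate has been mapped to a contig, pairs whose two mates land on different contigs suggest that those contigs are neighbours. The input format, the `min_pairs` threshold and the function name below are assumptions made for illustration, not the method of any particular scaffolder.

```python
from collections import Counter

def contig_links(mate_hits, min_pairs=2):
    """Count pairs whose mates map to two different contigs.

    mate_hits: iterable of (contig_of_mate1, contig_of_mate2).
    Returns {(contig_a, contig_b): pair_count} for links supported
    by at least min_pairs pairs.
    """
    counts = Counter()
    for c1, c2 in mate_hits:
        if c1 != c2:                              # same-contig pairs link nothing
            counts[tuple(sorted((c1, c2)))] += 1
    return {pair: n for pair, n in counts.items() if n >= min_pairs}

# Made-up mapping results for five pairs.
hits = [("c1", "c1"), ("c1", "c2"), ("c2", "c1"), ("c2", "c3"), ("c1", "c2")]
print(contig_links(hits))
```

Requiring several supporting pairs is a simple guard against chimeric pairs and mis-mapped reads; a single pair between two contigs is weak evidence.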

ADD REPLY • link 8.2 years ago by Antonio R. Franco ★ 5.1k

0

Thank you for your reply.

So along with the paired-end reads, I have to download the mate pairs?

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 8.2 years ago by midox ▴ 290

2

If you download paired-end reads, you are downloading both mates of each fragment, that is, the sequences from both ends of the same fragment. You can also be given single-end data, in which only one of the two ends has been sequenced.

But don't get confused. The two reads of a paired-end fragment are called mates because they are related to each other: they are the two ends of the same fragment, separated by less than 300 to 500 bp.

A mate pair, however, is different. It is obtained with a completely different protocol and corresponds to sequences from a single fragment that is several kilobases long. So the mates of a mate-pair fragment are separated by several kb.

Check the Illumina pages for the protocols used to obtain paired-end and mate-pair sequences.

If you find these answers useful, acknowledge them by voting. If you consider the subject closed, do the same, so people are alerted that this thread contains useful information. No votes, no interest in reading this.

ADD REPLY • link updated 21 months ago by Ram 43k • written 8.2 years ago by Antonio R. Franco ★ 5.1k

0

I downloaded paired-end reads of length 300 bp, but I have no mate pairs.

What do I do in this case?

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 8.2 years ago by midox ▴ 290

1

You don't have many choices. You can only assemble with these paired-end data. Use different assemblers, such as MIRA (OLC method) and Velvet or SPAdes (de Bruijn graph), and compare them by measuring assembly statistics such as N50.
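For reference, N50 is the contig length such that contigs of that length or longer contain at least half of the assembled bases. A minimal sketch, with made-up contig lengths:

```python
def n50(lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:      # reached half of the assembled bases
            return length
    return 0

contig_lengths = [80, 70, 50, 40, 30, 20]   # hypothetical assembly
print(n50(contig_lengths))
```

N50 alone says nothing about correctness: a misassembly that wrongly joins two contigs raises it, so it is best compared alongside other statistics.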

ADD REPLY • link 8.2 years ago by Antonio R. Franco ★ 5.1k

0

But what if I want to do scaffolding?

What do I do in that case? Do I have to download a mate-pair file?

Can I get a link to download mate pairs for E. coli?

Thanks

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 8.2 years ago by midox ▴ 290

1

You need to prepare your mate pairs from the same genome you are sequencing as paired-end.

If you want to improve the assembly, you have a hard task. The only thing I believe you can do is to use different assemblers in the hope that one is better than another. Use MIRA, Velvet, SPAdes, etc.

ADD REPLY • link 8.2 years ago by Antonio R. Franco ★ 5.1k

0

No, I want to do scaffolding. I built contigs with an assembler, but the scaffolding still has to be done, and I do not want to use a scaffolding tool. I want to do it myself, but I do not know how!

You advised me to use mate pairs, but I do not know how to get them. This is my problem.

You need to prepare your mate pairs from the same genome you are sequencing as paired-end

I did not prepare the paired-end reads myself; I downloaded the files. What do I do about mate pairs?

Thanks

ADD REPLY • link updated 21 months ago by Ram 43k • written 8.2 years ago by midox ▴ 290

1

You need to prepare your mate-pair library at the time you are preparing your paired-end fragments. If you check the Illumina information, you will notice that the two follow different protocols.

If you don't have mate-pair sequences, you can't do anything. You need to prepare the mate pairs at the same time, from the same genome, and sequence everything together.

You don't mention what your genome is. Maybe there is a trusted reference genome you can compare against, with programs like Mauve that allow you to organize your contigs into scaffolds.

ADD REPLY • link updated 21 months ago by Ram 43k • written 8.2 years ago by Antonio R. Franco ★ 5.1k

0

My genomes are E. coli and S. cerevisiae W303.

ADD REPLY • link 8.2 years ago by midox ▴ 290

1

These two organisms have good, trusted genomes you can use as references.

Download and install Mauve, read its instructions, and use its tool for ordering the contigs from your assembly by comparison with these reference genomes.