Question

Illumina paired-end reads R1 and R2 mixed together?

1

7.1 years ago

lvogel ▴ 30

Hi, I have received fastq files containing the reads from Illumina MiSeq. Since they are paired-end, there is an R1 and an R2 file for each sample. So I expected to find reads beginning with our forward primer in the R1 files, and reads beginning with our reverse primer in the R2 (or vice versa). However, I find both in both; i.e. about half of the reads in the R1 files begin with the forward primer, and half with the reverse primer; and same with the R2s. I tried merging them, but this results in about half of the reads being reverse complemented, and this makes things more complicated downstream, so I would like them to all go in the same direction. I thought to grep for each of the primers, but because of ambiguities and some still having short tags on the beginning, I don't think it's going to work--plus I thought they weren't supposed to be mixed anyway...??? Maybe I don't understand this as well as I thought. Any ideas? Thanks.

Illumina metabarcoding • 14k views

ADD COMMENT • link updated 6.1 years ago by gb ★ 2.2k • written 7.1 years ago by lvogel ▴ 30

1

Could you elaborate more on how the library was prepared?

ADD REPLY • link 7.1 years ago by WouterDeCoster 47k

1

It sounds like your amplicon library was constructed by the standard Illumina method (i.e., adaptor ligation) and sequenced with standard Illumina (adaptor) primers. If so, then you'd expect a 50/50 mix of amplicon orientations. But @WouterDeCoster is correct, we'll need more details about library prep (e.g., what are the short tags to which you refer) to help you parse the data.

ADD REPLY • link 7.1 years ago by harold.smith.tarheel ★ 4.9k

2

The sequencing primers anneal to the adaptors that you ligate to your fragmented DNA (or cDNA).

The ligation of these adaptors to the DNA fragments is fully random (it does not take direction into account), except when you are using a stranded transcriptomic protocol.

For genomic sequences, though, I am not aware of a protocol that will give you directional libraries.

ADD REPLY • link 7.1 years ago by Antonio R. Franco ★ 5.1k

0

Just plain (multiplex) PCR based enrichment & library prep can be directional.

ADD REPLY • link 7.1 years ago by WouterDeCoster 47k

0

So what are the other methods of amplicon library prep? What if I do not want the 50/50 mix of amplicon orientations?

ADD REPLY • link 4.4 years ago by ying.eddi2008 • 0

0

PCR-based methods (as opposed to ligation) will produce directional libraries. You can either incorporate the Illumina adapter sequences into your amplicon primers, or add them via two rounds of PCR (first round with amplicon primers, second round with Illumina adapters + amplicon overhang).

ADD REPLY • link 4.4 years ago by harold.smith.tarheel ★ 4.9k

0

Have you tried scanning the data with a trimming program? I suggest bbduk.sh from the BBMap suite. You may have inserts that are shorter than the read length. While you are at it, you could also use bbmerge.sh from the same suite to see what you get in terms of merging of R1/R2 reads.

ADD REPLY • link 7.1 years ago by GenoMax 141k

0

6.1 years ago

gb ★ 2.2k

I use FLASH to merge the reads, really easy to use.

Other options are:

PEAR

vsearch/usearch -fastq_mergepairs

On this page you can find a comparison: https://www.researchgate.net/publication/303288211_Evaluating_Paired-End_Read_Mergers

It depends on the sequence length, but you can first merge the reads and after that trim the primers

ADD COMMENT • link 6.1 years ago by gb ★ 2.2k

0

Ah, I took a quick look at it, and they didn't compare VSEARCH, which is what I'm using these days. I've recently found an interesting solution to my original question. I wanted to put all the merged reads in the same orientation for the next steps of the pipeline, e.g. dereplication and BLASTing. So I use the fastx_revcomp command of VSEARCH to flip all of them around, combine the original and reverse-complemented reads into one file, and then run cutadapt on that file with the --discard-untrimmed option to remove the primers. Everything that didn't have the primer, which is mostly the reads in the unwanted direction, gets deleted.
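For small files, the idea described above (look at each read and its reverse complement, keep whichever one starts with the forward primer, and discard reads with no primer) can be sketched in plain Python. This is only an illustration, not how VSEARCH or cutadapt work internally. The primer in the usage lines is a made-up placeholder, and `max_offset` is an assumed allowance for short leading tags like the ones mentioned earlier in this thread.

```python
import re

# IUPAC nucleotide codes mapped to regex character classes
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def revcomp(seq):
    """Reverse-complement a sequence, including IUPAC ambiguity codes."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def primer_regex(primer, max_offset=4):
    """Compile a regex that finds the primer within max_offset bases of the read start."""
    body = "".join(IUPAC[base] for base in primer.upper())
    return re.compile(r"^[ACGTN]{0,%d}(%s)" % (max_offset, body))


def orient(seq, fwd_regex):
    """Return the read in forward orientation with the leading tag and primer removed.

    Returns None when neither the read nor its reverse complement starts
    with the primer (comparable in spirit to discarding untrimmed reads).
    """
    for candidate in (seq.upper(), revcomp(seq)):
        match = fwd_regex.match(candidate)
        if match:
            return candidate[match.end():]
    return None


# Hypothetical degenerate forward primer, for illustration only
fwd = primer_regex("GGWACW")
print(orient("TCATGGTACATTTTCCCC", fwd))          # read already in forward orientation
print(orient(revcomp("GGAACTAAAACCCC"), fwd))     # read that needs flipping
```

Exact regex matching does not tolerate sequencing errors inside the primer, so a dedicated trimmer remains the safer choice for real data.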

ADD REPLY • link 6.1 years ago by lvogel ▴ 30

score 3 · Accepted Answer · 2017-03-30

3

7.1 years ago

jomo018 ▴ 720

Actually the reads are always mixed just the way you describe them. R1 may be forward or reverse. R2 may also be forward or reverse. You are only guaranteed that the pairs are complementary. Depending on your requirements, you may indeed need to check which is which down the pipeline. Standard alignment utilities do that automatically.
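A toy simulation may make the point concrete. It assumes a simplified model in which the read 1 adapter can end up on either end of the amplicon; `simulate_pair` and the short sequence are hypothetical and for illustration only.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse-complement an unambiguous DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def simulate_pair(amplicon, read_len, flipped):
    """Return (R1, R2) for one amplicon.

    flipped=True models the read 1 adapter sitting on the other end of
    the amplicon, so R1 starts from the reverse-primer side.
    """
    template = revcomp(amplicon) if flipped else amplicon
    r1 = template[:read_len]
    r2 = revcomp(template)[:read_len]
    return r1, r2


amplicon = "AACCGTTTGA"  # toy sequence, not a real primer or barcode
print(simulate_pair(amplicon, 4, flipped=False))
print(simulate_pair(amplicon, 4, flipped=True))
```

In both cases the two reads come from opposite strands of the same fragment; only the assignment to R1 and R2 is swapped.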

ADD COMMENT • link 4.4 years ago by jomo018 ▴ 720

0

Hi all, thanks for the comments and answer.

jomo018, since it's barcoding, I don't think I'm using any of the standard alignment utilities you're referring to--could you give some examples?

Also, I'll add the following details which were asked for, in case someone finds them useful:

Nextera Indices Kit was used, with i5 and i7, for multiplexing.
The primers we use also have their own old barcodes, apparently.
By "tags"--maybe these aren't tags per se, but there is sometimes TCAT occurring before the forward primer, and GGAG occurring before the reverse primer.

Here is what I got from BBDuk:

Input is being processed as paired

Input: 178704 reads 51661997 bases.
KTrimmed: 23191 reads (12.98%) 605133 bases (1.17%)
Total Removed: 2 reads (0.00%) 605133 bases (1.17%)
Result: 178702 reads (100.00%) 51056864 bases (98.83%)

Here is from BBMerge:

Pairs: 89351
Joined: 23783 26.617%
Ambiguous: 28013 31.352%
No Solution: 37555 42.031%
Too Short: 0 0.000%

Avg Insert: 357.7
Standard Deviation: 12.2
Mode: 365

Insert range: 52 - 425
90th percentile: 365
75th percentile: 365
50th percentile: 365
25th percentile: 339
10th percentile: 339

ADD REPLY • link 7.1 years ago by lvogel ▴ 30

1

Are you saying that there are two types of indexes in this experiment (at level 1 - Illumina nextera and once samples are demultiplexed into those pools, there are "inline" barcodes that further split nextera pools into individual samples)?

Your inserts are of a good size and there are no primer dimers so the data looks good at that level.

ADD REPLY • link 7.1 years ago by GenoMax 141k

0

genomax2, unfortunately I'm not sure. Could it be that the short things I thought were tags (TCAT and GGAG) are "inline" barcodes, since they are so short? Because since there is only one forward and one reverse, they aren't serving any purpose (no further splitting down of the pools). They are sometimes there and sometimes not, which just makes things more complicated for me.

ADD REPLY • link 7.1 years ago by lvogel ▴ 30

1

Do you perhaps have primer sequences which were used to amplify targets?

Did I understand correctly that in a first PCR the targets are selectively amplified using tagged primers, followed by a universal PCR to add barcodes and Illumina adapters?

ADD REPLY • link 7.1 years ago by WouterDeCoster 47k

0

Yes! :) I should have mentioned that too. (I only added a tag that said "metabarcoding") The target is a segment of the CO1 gene. So since I'm not doing genome assembly, the suggestion that standard assembly utilities will specify which of my reads are in the RC direction might not help. Although I've already accepted an answer, I kind of asked two questions in this post. I understand now that the mixture of directions is normal. The question still remains about how to get all my reads to go in the same direction, to make things easier downstream. I'll post it as a new question if necessary.

ADD REPLY • link 7.1 years ago by lvogel ▴ 30

0

You can use reformat.sh from BBMap to reverse-complement the reads. The two options you are looking for are:

rcomp=f                 (rc) Reverse-complement reads.
rcompmate=f             (rcm) Reverse-complement read 2 only.

ADD REPLY • link 7.1 years ago by GenoMax 141k

0

Thanks, I'll try it tomorrow when I can access my data & upvote you if it does what I need.

ADD REPLY • link 7.1 years ago by lvogel ▴ 30

0

Either I don't understand it correctly, or it doesn't do what I want. I tried like this:

bash reformat.sh in=merged.fq out=mergedr.fq rcomp

and variations of rcomp and rcompmate, but it either just reverse complements all of them or none of them. Apparently, reverse complementing only the reads that are in a different direction than the other reads is not commonly done, based on some of the responses here.

ADD REPLY • link 7.1 years ago by lvogel ▴ 30

1

reverse complementing only the reads that are in a different direction than the other reads is not commonly done

That is correct. You may need to identify the reads that map to one strand or the other (forward strand or reverse strand), isolate them (you could do that using filterbyname.sh), and then reverse-complement them using reformat.sh.
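As a rough Python stand-in for the isolate-then-flip step, the sketch below reverse-complements only the records whose names appear in a given set. How that set is obtained (mapping, a primer check, or something else) is left open, and the helper names `read_fastq` and `flip_selected` are made up for this example; they are not part of BBMap.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def read_fastq(lines):
    """Yield (name, seq, qual) from an iterable of FASTQ lines (4 lines per record)."""
    it = iter(lines)
    for header in it:
        if not header.strip():
            continue  # tolerate blank lines between records
        seq = next(it).strip()
        next(it)  # skip the '+' separator line
        qual = next(it).strip()
        # The name is the first word of the header, without the leading '@'
        yield header.strip()[1:].split()[0], seq, qual


def flip_selected(records, names_to_flip):
    """Reverse-complement only the records whose name is in names_to_flip."""
    for name, seq, qual in records:
        if name in names_to_flip:
            seq = seq.translate(COMPLEMENT)[::-1]
            qual = qual[::-1]  # qualities must be reversed along with the bases
        yield name, seq, qual


fastq = ["@read1 extra", "AACG", "+", "IIH#",
         "@read2", "AACG", "+", "IIH#"]
for name, seq, qual in flip_selected(read_fastq(fastq), {"read2"}):
    print("@%s\n%s\n+\n%s" % (name, seq, qual))
```

Reversing the quality string together with the sequence is the detail that is easy to forget when doing this by hand.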

From your comment below:

All of my sequences should be approximately the same length

Curious why that is a requirement.

ADD REPLY • link 7.1 years ago by GenoMax 141k

0

OK, thanks. Now I see how it would need to be done. And without SAM/BAM files, it might be more work than it's worth.

Curious why that is a requirement.

We use primers that amplify a 313-bp coding region of a gene, so this region really shouldn't vary in length by more than a few bp. For clustering into OTUs, fragments should first be trimmed to the same ~313-bp region. I suppose this is also why I don't have BAM files, just gzipped FASTQs, since the data isn't as big in barcoding as in genomics.
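A quick way to check that expectation on primer-trimmed reads is to tabulate their lengths and keep only those near the target. The sketch below is a minimal illustration; the default `tolerance` of 3 bp is an assumed reading of "a few bp", not a value from our protocol.

```python
from collections import Counter


def length_profile(seqs):
    """Count how many sequences have each length."""
    return Counter(len(s) for s in seqs)


def keep_expected_length(seqs, target=313, tolerance=3):
    """Keep sequences whose length is within `tolerance` bp of `target`."""
    return [s for s in seqs if abs(len(s) - target) <= tolerance]


# Dummy sequences of different lengths, for illustration only
seqs = ["A" * 313, "A" * 310, "A" * 320, "A" * 100]
print(sorted(length_profile(seqs).items()))
print([len(s) for s in keep_expected_length(seqs)])
```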

ADD REPLY • link 7.1 years ago by lvogel ▴ 30

1

Wouldn't you rather first map the data, then slice the data by expected positions to obtain equal read lengths?

ADD REPLY • link 7.1 years ago by WouterDeCoster 47k

0

That's a good point. I've just never thought to map my sequences to anything, because they are environmental samples, so many species from multiple orders or even classes will have been amplified. So I don't know what I would use for a reference. From BLASTing, I know that most of the matches to the database have relatively low percent identity. ...But still I'm curious to look into this method now.

ADD REPLY • link 7.1 years ago by lvogel ▴ 30

0

I don't know why you would need them in the same direction. If you are just doing standard mapping and variant calling there is no issue. Although I don't really know what the downstream analysis is.

ADD REPLY • link 7.1 years ago by WouterDeCoster 47k

0

All of my sequences should be approximately the same length, which sometimes necessitates trimming bases from the left and/or right, and this is easier to do correctly when they are all in the same direction. And useful for graphical visualization. And it makes clustering into OTUs slightly more accurate.

ADD REPLY • link 7.1 years ago by lvogel ▴ 30