Question: Illumina paired-end reads R1 and R2 mixed together?
1
9 months ago by
lvogel20
Western Europe
lvogel20 wrote:

Hi, I have received fastq files containing the reads from Illumina MiSeq. Since they are paired-end, there is an R1 and an R2 file for each sample. So I expected to find reads beginning with our forward primer in the R1 files, and reads beginning with our reverse primer in the R2 (or vice versa). However, I find both in both; i.e. about half of the reads in the R1 files begin with the forward primer, and half with the reverse primer; and same with the R2s. I tried merging them, but this results in about half of the reads being reverse complemented, and this makes things more complicated downstream, so I would like them to all go in the same direction. I thought to grep for each of the primers, but because of ambiguities and some still having short tags on the beginning, I don't think it's going to work--plus I thought they weren't supposed to be mixed anyway...??? Maybe I don't understand this as well as I thought. Any ideas? Thanks.

illumina metabarcoding • 861 views
ADD COMMENTlink modified 9 months ago by jomo018180 • written 9 months ago by lvogel20
1

Could you elaborate more on how the library was prepared?

ADD REPLYlink written 9 months ago by WouterDeCoster24k
1

It sounds like your amplicon library was constructed by the standard Illumina method (i.e., adaptor ligation) and sequenced with standard Illumina (adaptor) primers. If so, then you'd expect a 50/50 mix of amplicon orientations. But @WouterDeCoster is correct, we'll need more details about library prep (e.g., what are the short tags to which you refer) to help you parse the data.

ADD REPLYlink written 9 months ago by harold.smith.tarheel4.0k
2

The primers used in the sequencing step anneal to the adaptors you ligate to your fragmented DNA or cDNA.

The joining of these adaptors to the DNA fragments is fully random (it does not take direction into account), except when you are using a stranded transcriptomic protocol.

If using genomic sequences, I am not aware of a protocol that will allow you to get directional libraries, though

ADD REPLYlink written 9 months ago by Antonio R. Franco3.5k

Just plain (multiplex) PCR based enrichment & library prep can be directional.

ADD REPLYlink written 9 months ago by WouterDeCoster24k

Have you tried to scan the data with a trimming program? I suggest bbduk.sh from the BBMap suite. You may have inserts that are shorter than the read length. While you are at it, you could also use bbmerge.sh from the same suite to see what you get in terms of merging the R1/R2 reads.

ADD REPLYlink modified 9 months ago • written 9 months ago by genomax40k
3
9 months ago by
jomo018180
jomo018180 wrote:

Actually, the reads are always mixed just the way you describe. R1 may be forward or reverse, and R2 may also be forward or reverse. You are only guaranteed that the two reads of a pair are complementary (they read the same fragment from opposite ends, on opposite strands). Depending on your requirements, you may indeed need to check which is which further down the pipeline. Standard alignment utilities do that automatically.
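
If no aligner is involved, one way to "check which is which" is to test the start of each read against both primers. The following is only a minimal sketch in Python: the two primer sequences are made-up placeholders (the actual primers are not given in this thread), and the allowance of up to 4 leading bases is an assumption meant to tolerate short prefixes before the primer. IUPAC ambiguity codes in the primers are expanded into regular-expression character classes.

```python
import re

# IUPAC nucleotide codes expanded to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

# Placeholder primers (hypothetical, NOT taken from the thread); use your own.
FWD_PRIMER = "GGWACWGGWTGAAC"
REV_PRIMER = "TAAACYTCWGGRTG"
MAX_TAG = 4  # assumed maximum number of extra bases before the primer


def primer_regex(primer, max_tag=MAX_TAG):
    """Compile a regex matching the primer at the read start, after 0..max_tag bases."""
    body = "".join(IUPAC[base] for base in primer.upper())
    return re.compile(r"^[ACGTN]{0,%d}%s" % (max_tag, body))


FWD_RE = primer_regex(FWD_PRIMER)
REV_RE = primer_regex(REV_PRIMER)


def orientation(seq):
    """Return 'forward', 'reverse' or 'unknown' based on the primer at the read start."""
    seq = seq.upper()
    if FWD_RE.match(seq):
        return "forward"
    if REV_RE.match(seq):
        return "reverse"
    return "unknown"
```

This does exact matching only (apart from the ambiguity codes), so reads with a sequencing error inside the primer end up as "unknown"; a dedicated primer-trimming tool that allows mismatches would be more forgiving.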

ADD COMMENTlink written 9 months ago by jomo018180

Hi all, thanks for the comments and answer.

jomo018, since it's barcoding, I don't think I'm using any of the standard alignment utilities you're referring to--could you give some examples?

Also, I'll add the following details which were asked for, in case someone finds them useful:

  • Nextera Indices Kit was used, with i5 and i7, for multiplexing.
  • The primers we use also have their own old barcodes, apparently.
  • By "tags"--maybe these aren't tags per se, but there is sometimes TCAT occurring before the forward primer, and GGAG occurring before the reverse primer.

Here is what I got from BBDuk:

Input is being processed as paired

Input:          178704 reads              51661997 bases.
KTrimmed:       23191 reads (12.98%)      605133 bases (1.17%)
Total Removed:  2 reads (0.00%)           605133 bases (1.17%)
Result:         178702 reads (100.00%)    51056864 bases (98.83%)


Here is from BBMerge:

Pairs:        89351
Joined:       23783    26.617%
Ambiguous:    28013    31.352%
No Solution:  37555    42.031%
Too Short:    0        0.000%

Avg Insert:          357.7
Standard Deviation:  12.2
Mode:                365

Insert range:     52 - 425
90th percentile:  365
75th percentile:  365
50th percentile:  365
25th percentile:  339
10th percentile:  339
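
As a quick sanity check, the numbers in the two reports are consistent with each other. The short Python sketch below only recomputes figures that appear in the output above; the variable names are illustrative.

```python
# Read counts reported by BBDuk.
reads_in = 178704
reads_removed = 2
reads_out = reads_in - reads_removed

# Pair count and merge outcomes reported by BBMerge.
pairs = 89351
outcomes = {"Joined": 23783, "Ambiguous": 28013, "No Solution": 37555, "Too Short": 0}

# Every read surviving BBDuk should belong to exactly one pair.
pairs_expected = reads_out // 2

# Share of pairs in each outcome, as a percentage.
percent = {name: 100 * count / pairs for name, count in outcomes.items()}

for name, value in percent.items():
    print(f"{name}: {value:.3f}%")
```

Only about a quarter of the pairs were joined, so any downstream step that works on merged reads alone would be using a minority of the data.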

ADD REPLYlink modified 9 months ago • written 9 months ago by lvogel20
1

Are you saying that there are two levels of indexing in this experiment: Illumina Nextera indexes at the first level and, once the samples are demultiplexed into those pools, "inline" barcodes that further split the Nextera pools into individual samples?

Your inserts are of a good size and there are no primer dimers so the data looks good at that level.

ADD REPLYlink modified 9 months ago • written 9 months ago by genomax40k

genomax2, unfortunately I'm not sure. Could it be that the short things I thought were tags (TCAT and GGAG) are "inline" barcodes, since they are so short? Because since there is only one forward and one reverse, they aren't serving any purpose (no further splitting down of the pools). They are sometimes there and sometimes not, which just makes things more complicated for me.

ADD REPLYlink written 9 months ago by lvogel20
1

Do you perhaps have primer sequences which were used to amplify targets?

Did I understand correctly that in a first PCR the targets are selectively amplified using tagged primers, followed by a universal PCR to add barcodes and Illumina adapters?

ADD REPLYlink written 9 months ago by WouterDeCoster24k

Yes! :) I should have mentioned that too. (I only added a tag that said "metabarcoding".) The target is a segment of the CO1 gene. So since I'm not doing genome assembly, the suggestion that standard alignment utilities will specify which of my reads are in the RC direction might not help. Although I've already accepted an answer, I kind of asked two questions in this post. I understand now that the mixture of directions is normal. The question still remains about how to get all my reads to go in the same direction, to make things easier downstream. I'll post it as a new question if necessary.

ADD REPLYlink modified 9 months ago • written 9 months ago by lvogel20

You can use reformat.sh from BBMap to reverse-complement the reads. The two options you are looking for are:

rcomp=f                 (rc) Reverse-complement reads.
rcompmate=f             (rcm) Reverse-complement read 2 only.
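
For reference, here is roughly what reverse-complementing a FASTQ record involves, as a small Python sketch (this illustrates the operation itself and is not BBMap's code; the function names are made up). The bases are complemented and reversed, and the quality string is reversed so that each score stays with its base.

```python
# Complement table for A/C/G/T/N plus the IUPAC ambiguity codes, both cases.
COMPLEMENT = str.maketrans(
    "ACGTRYSWKMBDHVNacgtryswkmbdhvn",
    "TGCAYRSWMKVHDBNtgcayrswmkvhdbn",
)


def reverse_complement(seq):
    """Complement every base, then reverse the sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def revcomp_record(header, seq, qual):
    """Reverse-complement one FASTQ record; qualities are reversed, not complemented."""
    return header, reverse_complement(seq), qual[::-1]
```

Applied to a whole file, this flips every read regardless of its current direction, which is why a per-read decision is needed when only some of the reads should be flipped.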
ADD REPLYlink modified 9 months ago • written 9 months ago by genomax40k

Thanks, I'll try it tomorrow when I can access my data & upvote you if it does what I need.

ADD REPLYlink written 9 months ago by lvogel20

Either I don't understand it correctly, or it doesn't do what I want. I tried like this:

bash reformat.sh in=merged.fq out=mergedr.fq rcomp

and variations of rcomp and rcompmate, but it either just reverse complements all of them or none of them. Apparently, reverse complementing only the reads that are in a different direction than the other reads is not commonly done, based on some of the responses here.

ADD REPLYlink written 9 months ago by lvogel20
1

"reverse complementing only the reads that are in a different direction than the other reads is not commonly done"

That is correct. You may need to identify the reads that map to one strand or the other (see "Forward Strand or Reverse Strand"), isolate them (you could do that using filterbyname.sh), and then reverse-complement them using reformat.sh.
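
Without a reference to map against, the same identify, isolate and reverse-complement idea can be sketched in one pass over a merged FASTQ file, using the primer at the start of each read in place of a mapping strand. This is a hedged Python sketch, not a tested pipeline: the two regular expressions encode made-up placeholder primers (not the ones used in this thread), up to 4 leading bases are tolerated as an assumption, and the complement table assumes the reads contain only A, C, G, T and N.

```python
import re

# Placeholder primer patterns (hypothetical); up to 4 leading bases are allowed.
FWD_START = re.compile(r"^[ACGTN]{0,4}GG[AT]AC[AT]GG[AT]TGAAC")
REV_START = re.compile(r"^[ACGTN]{0,4}TAAAC[CT]TC[AT]GG[AG]TG")
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def read_fastq(handle):
    """Yield (header, sequence, quality) from a 4-line-per-record FASTQ handle."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual


def orient_fastq(src, dst, unknown):
    """Write forward-oriented reads to dst; reads matching neither primer go to unknown."""
    counts = {"kept": 0, "flipped": 0, "unknown": 0}
    for header, seq, qual in read_fastq(src):
        out = dst
        if FWD_START.match(seq):
            counts["kept"] += 1
        elif REV_START.match(seq):
            # Read starts with the reverse primer: flip it to the forward direction.
            seq = seq.translate(COMPLEMENT)[::-1]
            qual = qual[::-1]
            counts["flipped"] += 1
        else:
            counts["unknown"] += 1
            out = unknown
        out.write(f"{header}\n{seq}\n+\n{qual}\n")
    return counts
```

For gzipped input, the handles could be opened with `gzip.open(path, "rt")`. Keeping the unmatched reads in a separate file makes it easy to see how many reads the primer test fails to classify.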

From your comment below:

"All of my sequences should be approximately the same length"

Curious why that is a requirement.

ADD REPLYlink modified 9 months ago • written 9 months ago by genomax40k

OK, thanks. Now I see how it would need to be done. And without SAM/BAM files, it might be more work than it's worth.

"Curious why that is a requirement."

We use primers that amplify a 313-bp coding region of a gene, so this region really shouldn't vary in length by more than a few bp. For clustering into OTUs, fragments should first be trimmed to the same ~313-bp region. I suppose this is also why I don't have BAM files, just gzipped FASTQs, since the data isn't as big in barcoding as in genomics.
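
The length requirement can be expressed as a simple filter, sketched here in Python. The 313-bp target comes from the description above; the tolerance of 3 bp is only an assumed reading of "a few bp", and the function names are illustrative. The check is meant for sequences from which primers and any leading tags have already been removed (the 365-bp mode reported by BBMerge presumably still includes them).

```python
from collections import Counter

EXPECTED_LEN = 313  # amplicon length stated above
TOLERANCE = 3       # assumed value for "a few bp"; adjust as needed


def length_ok(seq, expected=EXPECTED_LEN, tolerance=TOLERANCE):
    """True if the sequence length is within the tolerance of the expected length."""
    return abs(len(seq) - expected) <= tolerance


def length_histogram(seqs):
    """Count how many sequences have each length, to inspect before filtering."""
    return Counter(len(seq) for seq in seqs)
```

Looking at the histogram first shows whether a strict cutoff would discard a meaningful share of the reads.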

ADD REPLYlink modified 9 months ago • written 9 months ago by lvogel20
1

Wouldn't you rather first map the data, then slice the data by expected positions to obtain equal read lengths?

ADD REPLYlink written 9 months ago by WouterDeCoster24k

That's a good point. It's just that I've never thought to map my sequences to anything: since they are environmental samples, many species from multiple orders or even classes will have been amplified, so I don't know what I would use as a reference. From BLASTing, I know that most of the matches to the database have relatively low percent identity. ...But still, I'm curious to look into this method now.

ADD REPLYlink written 9 months ago by lvogel20

I don't know why you would need them in the same direction. If you are just doing standard mapping and variant calling there is no issue. Although I don't really know what the downstream analysis is.

ADD REPLYlink written 9 months ago by WouterDeCoster24k

All of my sequences should be approximately the same length, which sometimes necessitates trimming bases from the left and/or right, and this is easier to do correctly when they are all in the same direction. A consistent direction is also useful for graphical visualization, and it makes clustering into OTUs slightly more accurate.

ADD REPLYlink written 9 months ago by lvogel20