Question: fastq reads with primers and truseq indexed adapter
jomo018480 wrote:

I received 150 bp single-end NextSeq FASTQ files from two similar experiments. Reads from the first experiment are structured: 5' primer -- DNA -- 3' primer -- indexed adapter (partial, no index). Reads from the second experiment are structured: DNA -- 3' primer -- indexed adapter (almost complete, index included).

The reads of the first experiment are useless as they do not contain the index within the adapter. What controls the location of the "limited view window" within the sequence?

sequencing sequence next-gen • 1.2k views
ADD COMMENTlink modified 2.3 years ago • written 2.3 years ago by jomo018480

Index reads (1D/2D) in Illumina technology are read independently of the main reads and should never be part of the actual sequence (unless you are using in-line barcodes). If you have indexed samples, then the run has to be set up as such.

ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by genomax69k

I am referring to TruSeq indexed adapters, which are 63 bases long, with 6 bases in the middle acting as the index.

ADD REPLYlink written 2.3 years ago by jomo018480

If you start seeing adapter on the 3'-end of a read, that means your inserts are shorter than the length of sequencing being done (http://nextgen.mgh.harvard.edu/IlluminaChemistry.html).

ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by genomax69k
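
The read-through described above can be checked directly: if the start of the adapter occurs inside a read, everything before it is the insert, and its position gives the insert length. A minimal Python sketch, assuming exact matching on the first 12 bases of the adapter quoted later in this thread. The function name find_adapter_start and the toy read are hypothetical, and real trimming tools also tolerate mismatches, which this sketch does not.

```python
# First 12 bases of the adapter quoted in this thread (assumed sufficient for a match).
ADAPTER_PREFIX = "GATCGGAAGAGC"

def find_adapter_start(read, adapter_prefix=ADAPTER_PREFIX):
    """Return the 0-based position where the adapter begins, or None if absent."""
    pos = read.find(adapter_prefix)
    return pos if pos != -1 else None

# Toy read (made up): a 20-base insert followed by the start of the adapter.
insert = "ACGTACGTACGTACGTACGT"
toy_read = insert + "GATCGGAAGAGCACACGTCTGAAC"

start = find_adapter_start(toy_read)
if start is None:
    print("no adapter found: the insert is at least as long as the read")
else:
    print(f"adapter starts at base {start}: the insert is {start} bases long")
```

A read with no match simply means the insert filled the whole read, so no adapter (and no index embedded in it) can appear.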

150 bases are indeed shorter than the complete sequence. Using the colors in the bottom picture of the link you sent: in the second experiment I see black - purple - blue - yellow (partial), which is OK because I don't need the red, and the blue (de-multiplexing index) is included. In the first experiment I see red - black - purple (partial), which is no good because the blue is missing. So how can I make sure the next experiment will indeed include the blue index?

ADD REPLYlink written 2.3 years ago by jomo018480

To get the "index" the sequencing run has to be set up as "multiplexed". So you will specify 150 bp x N bp (N would be the length of the index you would want) as run-requirement (if you only want single-end reads). Was this not specified the first time around?

ADD REPLYlink written 2.3 years ago by genomax69k

I don't know. I have access to the bcl run folder. Can this information be seen in one of the files e.g. RunParameters.xml or RTAConfiguration.xml ?

ADD REPLYlink written 2.3 years ago by jomo018480

Look in the RunInfo.xml file (should be in the top level FC folder) to see if you can find a block like this

<Reads>
      <Read Number="1" NumCycles="50" IsIndexedRead="N" />
      <Read Number="2" NumCycles="7" IsIndexedRead="Y" />   
 </Reads>
ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by genomax69k

Both experiments have the same block:

<Reads>
     <Read Number="1" NumCycles="150" IsIndexedRead="N" />
     <Read Number="2" NumCycles="8" IsIndexedRead="Y" />
     <Read Number="3" NumCycles="8" IsIndexedRead="Y" />
</Reads>
ADD REPLYlink modified 2.3 years ago by genomax69k • written 2.3 years ago by jomo018480
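
For anyone who wants to pull the read layout out of RunInfo.xml programmatically, here is a minimal Python sketch using only the standard library. It parses the block posted above, copied into a string; the helper name summarize_reads is hypothetical.

```python
import xml.etree.ElementTree as ET

# The <Reads> block posted above, copied into a string for illustration.
READS_XML = """
<Reads>
  <Read Number="1" NumCycles="150" IsIndexedRead="N" />
  <Read Number="2" NumCycles="8" IsIndexedRead="Y" />
  <Read Number="3" NumCycles="8" IsIndexedRead="Y" />
</Reads>
"""

def summarize_reads(root):
    """Return (number, cycles, is_index) for every <Read> element under root."""
    return [
        (int(r.get("Number")), int(r.get("NumCycles")), r.get("IsIndexedRead") == "Y")
        for r in root.iter("Read")
    ]

layout = summarize_reads(ET.fromstring(READS_XML))
for number, cycles, is_index in layout:
    kind = "index" if is_index else "sequence"
    print(f"Read {number}: {cycles} cycles ({kind})")
```

On a real run folder, passing ET.parse("RunInfo.xml").getroot() to the same function should work, since iter() searches the whole tree for Read elements.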

That shows that both these are dual-indexed (2D) runs and should have the same kind of reads. The reads go in the order Read 1 --> Index 1 --> Index 2 --> Read 2 (which you don't have).

So you have one sequence file per sample? Can you post the first few lines of the file, using (z)cat your_R1_fastq(.gz) | head -8?

ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by genomax69k
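
Where zcat is not available, the same peek at the first two records (8 lines) can be done in Python. This is a minimal sketch: first_records is a hypothetical helper, and the demo file and its two records are made up so the example runs on its own.

```python
import gzip
import os
import tempfile
from itertools import islice

def first_records(path, n_records=2):
    """Return the first n_records FASTQ records as lists of 4 lines each."""
    # Pick the opener from the file extension: gzip for .gz, plain text otherwise.
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        lines = [line.rstrip("\n") for line in islice(handle, 4 * n_records)]
    return [lines[i:i + 4] for i in range(0, len(lines), 4)]

# Demo on a throwaway gzipped file holding two made-up records.
demo = "@r1\nACGT\n+\nEEEE\n@r2\nTTGA\n+\nAAAA\n"
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_R1.fastq.gz")
    with gzip.open(demo_path, "wt") as out:
        out.write(demo)
    records = first_records(demo_path)

for header, seq, _, qual in records:
    print(header, len(seq), "bases")
```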

Fastq files were extracted with bcl2fastq without SampleSheet. I guess I am not losing any information.

The run including index

 @NB551014:49:HHGGCAFXX:1:11101:9463:1046 1:N:0:0
 NTTTGGGGATTTGATTTAGTCGTAGTTTTTGTGAATTAATATTTGTGCGGTTTATATTTGGTGGAAGTTTTTTATTTAGTGTGCGGGGAACGAGGTTTTTTTTATATATTTAAGATTCGTCGGGAGGTAGAGGATTTGTAGGGTGAGTGA
 +
 #AAAAEEEEEEEEAEEEEEEEEEEEE/EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEAEEEEEEEEEEEEEEEEEAEAEEEEEEAE<EAEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEAEEEEEEEEEEAEEEEEEAEEE
 @NB551014:49:HHGGCAFXX:1:11101:16334:1046 1:N:0:0
 NCTGTCTCCTGCATCCAATCCATTAAACTGACCTCCCCGTGCAGAGGCGGGGATACAACCATAAGACGAGAAGACCCTATGGAGCTTTAAACTAAAGGCAACTGCCAACTTCAACCTAACCCATAAGGAAATAACAATTAAACAAGCAGA
 +
 #AAAAEEEEAEEEEEAAEEEEEEEEEEEEEEAAEEEEEEEEEEEEAEEEEEEEEEE/EEEEAEEEEAEA/EEA/EEEEEEAAEEE<EEEEEEEEEEEEEEEEE/EEE<6AE6//EAEE/<E/E/<EE/EEEEEEEEE/AEEEEEE//E/E
  

The run with two primers and no index

 @NB501025:135:HJY7VAFXX:1:11101:16969:1049 1:N:0:0
  TTAGANAAGTAAAATGATGGATAATAACGTACGGTGAAACGTAGTGTTGGGAATCGTAGATGGAAGTCGAGTATTTTTTTTATTTGTGGGGATCGGAAGAGCACACGTCTGAACTCCAGTCCATTCCTATCTCGTATGCCGTCTTCTGCT
  +
  AAAAA#EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEAEEEEEEEEEEEEEEE<AE<EAEE/EEE<<EEEEEAEAEE/EE/EEAAAAEEEA/<<A<A<AAAA//6AEE<A/A<<A<</EAA/AA
  @NB501025:135:HJY7VAFXX:1:11101:13840:1049 1:N:0:0
 ATCGTNACGTTTATGGTGAACAGTGGATTAAGTTCATGAAGGATGGTGTTAATGCCACTCCTCTCCCGACTGTTAACACTACTGGTTATATTGACCATGCCGCTTTTCTTGGCACGATTAACCCTGATACCATTAAAATCCCTAAGCATT
 +
   AAAAA#/EEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEAEEEEEEEEAEEEEAEEEEEEEEEEE/EEEAEE/E/E/EEAEE</A/A6AE<EE//<EAEAAEAA/<6AEE/EAEA<A/AEE6<</<E/E/AA/AAAAAEA//AA</6
  
ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by jomo018480
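
The header lines in both excerpts end in 1:N:0:0. In the usual Illumina header layout, the part after the space is read number, filter flag, control number, and then either the index sequence (when demultiplexed against a sample sheet) or a sample number. A minimal Python sketch that splits a header into those fields; parse_header and the field names are labels chosen here for illustration, not an official API.

```python
def parse_header(header_line):
    """Split an Illumina-style FASTQ header into named fields."""
    location, comment = header_line.lstrip("@").split(" ", 1)
    instrument, run, flowcell, lane, tile, x, y = location.split(":")
    read, is_filtered, control, index_or_sample = comment.split(":")
    return {
        "instrument": instrument,
        "run": run,
        "flowcell": flowcell,
        "lane": lane,
        "tile": tile,
        "x": x,
        "y": y,
        "read": read,
        "is_filtered": is_filtered,
        "control": control,
        "index_or_sample": index_or_sample,
    }

# First header from the excerpt above.
fields = parse_header("@NB551014:49:HHGGCAFXX:1:11101:9463:1046 1:N:0:0")
print(fields["flowcell"], fields["index_or_sample"])
```

A last field of 0 in both runs is consistent with the files having been produced without a sample sheet, so no per-sample index was recorded in the header.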

"Fastq files were extracted with bcl2fastq without SampleSheet. I guess I am not losing any information"

If you are not interested in separating the samples, then yes. But if you want to split the samples, you need to re-run the demultiplexing using a proper sample sheet.

Otherwise this data is a mix of multiple samples, which you can't tell apart from the sequence you have in hand.

ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by genomax69k

So you are saying that even though the Truseq indexed adapter (with its 6-base index) is missing from the reads as I see them after bcl2fastq-without-SampleSheet, a bcl2fastq with SampleSheet would still be able to do the demultiplexing? Where would it take the index from?

As a side note, I am doing the demultiplexing with a custom script that targets the 3' primer, which is always present and unique.

ADD REPLYlink written 2.3 years ago by jomo018480

If you have a few minutes available, watch this short Illumina video (starting about 2 min in). It will help clarify the order of sequencing I posted above.

The only reason you may see the indexes in your reads (produced without a sample sheet) is that those inserts were shorter than the length of sequencing. Whoever ran this run should be able to help you get a sample sheet in the right format (in fact, it should already be in the raw folder you have, assuming it is a complete copy; it should be called SampleSheet.csv).

There is no point in using custom scripts for demultiplexing data (unless one is using internal non-illumina barcodes). Doing demultiplexing with bcl2fastq is going to make sure that the demux is handled properly.

ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by genomax69k

Thank you for pointing to the video. Could you possibly explain how to include a "63 base Truseq indexed adapter with a six base index embedded" within the SampleSheet? I have found no examples for this case and IEM doesn't help. The adapter is:

GATCGGAAGAGCACACGTCTGAACTCCAGTCACXXXXXXATCTCGTATGCCGTCTTCTGCTTG

ADD REPLYlink written 2.3 years ago by jomo018480

You don't need to do that. Just include the indexes for the samples. You can generate an example sample sheet from IEM to get the general structure and then look at the examples here. The sample sheets for CASAVA and bcl2fastq differ slightly.

Your run data folder may already have a samplesheet (SampleSheet.csv) so look for that first.

ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by genomax69k
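
To illustrate "just include the indexes": a minimal Python sketch that writes a bare-bones sample sheet, assuming the bcl2fastq-style layout of a [Data] section with Sample_ID, index and index2 columns. The sample names and index sequences are placeholders invented for illustration (their 8-base length only mirrors NumCycles="8" in the RunInfo.xml block), so check the result against an IEM-generated sheet or the SampleSheet.csv already in the run folder.

```python
import csv
import io

# Hypothetical samples: names and index sequences are placeholders, not real data.
SAMPLES = [
    {"Sample_ID": "sample_A", "index": "ACGTACGT", "index2": "TGCATGCA"},
    {"Sample_ID": "sample_B", "index": "GTCAGTCA", "index2": "CAGTCAGT"},
]

def build_sample_sheet(samples):
    """Return sample sheet text: a [Data] header followed by one CSV row per sample."""
    buffer = io.StringIO()
    buffer.write("[Data]\n")
    writer = csv.DictWriter(
        buffer,
        fieldnames=["Sample_ID", "index", "index2"],
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(samples)
    return buffer.getvalue()

sheet = build_sample_sheet(SAMPLES)
print(sheet)
```

Note that only the index sequences go into the sheet; the surrounding adapter sequence is not entered per sample.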