Question

8-base tags in Illumina short reads

6.4 years ago

V.S ▴ 20

Hi biostars,

I am working on bacterial short-read data downloaded from NCBI. The description mentions:

Design: Illumina sequencing of library 6053957, constructed from sample accession ERS135592 for study accession ERP001405. This is part of an Illumina multiplexed sequencing run (8727_8). This submission includes reads tagged with the sequence GTACATCT.

The tag is in the reference genome.

I am really confused by this. I thought tags work as identifiers to separate different isolates in one run, and that they should be unique to each isolate (which is not the case here either) and not present in the reference genome.

What I am thinking could be completely wrong. Can anyone help to explain why such tags are used or in which situation to use them? Should I remove them before alignment? Maybe not, since they are in the reference genome?

Many thanks

sequencing next-gen alignment • 1.5k views

updated 6.4 years ago by h.mon 35k • written 6.4 years ago by V.S ▴ 20

Wonder if this is related to this blog post.

6.4 years ago by WouterDeCoster 47k

I will check the blog. Thanks WouterDeCoster!

6.4 years ago by V.S ▴ 20

score 1 · Accepted Answer · 2017-12-13

The paper and its accompanying supplementary material do not make clear whether the multiplexed barcodes are in the adapters or inline within the reads, but a simple FastQC run would tell you: if the barcodes are inline, there would be a huge spike in the "per base sequence content" plot at the positions corresponding to the barcode.
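If you want to see what that FastQC signal amounts to, here is a minimal Python sketch of the same idea. It is not FastQC's implementation; the function names `per_base_content` and `fraction_starting_with` and the toy reads are made up for illustration. It tallies base frequencies at the first few read positions and reports what fraction of reads begin with the tag.

```python
from collections import Counter


def per_base_content(reads, n_positions=12):
    """For each of the first n_positions, return {base: fraction of reads}."""
    counts = [Counter() for _ in range(n_positions)]
    for read in reads:
        for i, base in enumerate(read[:n_positions]):
            counts[i][base] += 1
    fractions = []
    for c in counts:
        total = sum(c.values())
        fractions.append({b: n / total for b, n in c.items()} if total else {})
    return fractions


def fraction_starting_with(reads, tag):
    """Fraction of reads whose first bases are exactly the tag."""
    reads = list(reads)
    if not reads:
        return 0.0
    return sum(r.startswith(tag) for r in reads) / len(reads)


# Toy reads (invented): every read carries the tag inline at its 5' end.
toy_reads = [
    "GTACATCTAAGGCT",
    "GTACATCTTTGACC",
    "GTACATCTCCATGA",
    "GTACATCTGGTCAG",
]
content = per_base_content(toy_reads, n_positions=10)
tag_fraction = fraction_starting_with(toy_reads, "GTACATCT")
```

With inline barcodes, the first eight positions are each dominated by a single base and `tag_fraction` is close to 1. If the barcode was read separately as an index read, the composition at those positions should look like the rest of the read.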

The tag is in the reference genome.

I am not 100% sure about the math, but if I am not mistaken, the chance of not seeing a particular 8-mer on a 2.2Mb genome is:

( 1 - 1 / ( 4^8 ) ) ^ ( 2.2 * 10^6 ) = 2.6 * 10^(-15)

This is practically zero, so any given 8-mer will most likely be found in the Neisseria genome (assuming a probability of 0.25 for each base).
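The arithmetic is easy to reproduce. The sketch below uses the same simplifying assumptions as the formula above: every genome position is treated as an independent trial, and each base has probability 0.25. The function name `prob_kmer_absent` is only for illustration.

```python
import math


def prob_kmer_absent(k, n_positions, p_base=0.25):
    """Probability that a specific k-mer matches at none of n_positions,
    treating positions as independent trials (a simplification)."""
    p_match = p_base ** k
    # log1p keeps precision when p_match is tiny
    return math.exp(n_positions * math.log1p(-p_match))


p_absent = prob_kmer_absent(k=8, n_positions=2.2e6)
print(f"{p_absent:.2e}")
```

This agrees with the roughly 2.6 * 10^(-15) figure above. On average you would expect about 2.2 * 10^6 / 4^8 ≈ 34 occurrences of any given 8-mer on one strand of such a genome.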

And they should be unique to each isolate (which is not the case here either)

They should be unique to each isolate being sequenced on the same lane of a flowcell. Two different isolates can have the same barcode if they are sequenced on different lanes.

Should I remove them before alignment?

If there is a systematic enrichment in the sequencing reads indicating that the barcodes are inline within reads, then you could remove them - this would be essential for genome assembly, for example. For mapping you need not worry too much, because aligners that perform local alignment (like BWA-MEM, or Bowtie2 in --local mode) would soft-clip these bases.
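If the barcode does turn out to be inline, removing it is just a matter of dropping the leading bases and the matching quality values. Here is a minimal sketch, assuming the tag sits at the very start of the read with no mismatches; `trim_inline_tag` is a hypothetical helper, and in practice you would normally use a dedicated trimming tool.

```python
def trim_inline_tag(seq, qual, tag):
    """Strip the tag from the 5' end of a read, keeping sequence and
    quality strings the same length. Reads without the tag are unchanged."""
    if seq.startswith(tag):
        return seq[len(tag):], qual[len(tag):]
    return seq, qual


# Toy read (invented): 8-base tag followed by 4 bases of insert.
trimmed_seq, trimmed_qual = trim_inline_tag("GTACATCTAAGG", "IIIIIIIIFFFF", "GTACATCT")
```

Only trim if the per-base content really shows the tag at a fixed position. Because the 8-mer also occurs by chance in the genome, removing it wherever it appears would delete genuine sequence.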