Question: If the index sequences in my Illumina FASTQ headers aren't the barcodes, then what are they?
2
gravatar for jenn.drummond
2.3 years ago by
United States
jenn.drummond40 wrote:

Hi, everybody. I have FASTQ headers of the form

@FCC3KD2ACXX:6:1101:1545:2184#ATCACGATC/1

The "ATCACGATC" portion of these older-style headers is supposed to be the "index sequence", or the molecular barcode of a multiplexed sample, according to the Wikipedia article. But I know what the barcodes are, and that isn't one of them. All the barcodes for this project are 6-8bp, not 9, and there are "only" about 300 legitimate barcodes in the lane.

Overall, there are over 255,000 different individual ones of these index sequences on various different reads, out of about 178M reads in the file. Some of them even contain Ns, but they're all exactly 9bp long. And this particular sequence (ATCACGATC) is by far the most prevalent -- it's on 90% of the reads, so it can't be succeeding at separating anything out very specifically.

I'm coming in late to a project, and all I have to go on is the FASTQ files, the list of barcodes, and some wet-lab protocol docs that I'm not particularly qualified to interpret. Any idea what this odd extraneous-looking sequence is? If so, thanks in advance!

ADD COMMENTlink modified 2.3 years ago by harold.smith.tarheel4.2k • written 2.3 years ago by jenn.drummond40

Some of them even contain Ns, but they're all exactly 9bp long.

Therefore it's not barcode metadata, it's something that was sequenced. I'd imagine it's the first 9 bases of the sequenced DNA, that has been cut out of the SEQ and pasted into the QNAME, probably as a ghetto method for parsing 9bp barcodes in the past.

EDIT: I wonder if the machine has given different barcodes for pairs of the same read...?

ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by John12k

@John: With an over-clustered flowcell it is possible to get N's in tag reads (one or more and in particularly bad cases all can be N's).

A good sequencing facility will not release this kind of data (unless for investigative purposes). Here OP has been left holding the bag with a mishmash of info and has to make sense of it all.

ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by genomax57k

Right, but what i'm saying is that these 9bp come from the read, not from some metadata added programatically. For example, I assume the machines usually read the first few bases, try to align the barcodes it knows about to 'call' a reads barcode, then add the barcode it thinks it matches to the QNAME. This way even if a base has an N in the sequenced barcode, the barcode in the QNAME will be correct. Or if the first base was omitted from sequencing somehow, it's still matched as being barcode X. But this machine, since it's putting Ns in the QNAME, is probably just cutting the first 9bp out of the read, and sticking it in the header - losing the QUAL score in the process for all 9bp. Real lazy.

ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by John12k
2

Sure. That 9 bp came from a real read. Even though OP has 6-8 barcodes someone chose to run this as a 9 cycle (or 10, one base at end is removed by default) run.

In case of Illumina technology both tag reads are read completely independently of the main R1/R2 reads (sequence goes R1 --> Index 1 --> Index 2 --> R2). During the run the machine has no way of matching barcodes to anything. De-multiplexing is handled by the CASAVA/bcl2fastq software during post-processing of the run.

As an aside: There is an easy way to get the index reads in fastq (with Q-scores) format as separate files (then one ends up with R1, R2, R3, R4 files, corresponding to the order of reads above) with bcl2fastq2 v.2.x with a command line option. With older versions of CASAVA there was some editing of run files required to do that.

Reason one gets N's in the tag reads is the same as the main reads. If the clusters become too fat/are too close, then the basecalling software is no longer able to call a base and will use N. This happens when a flowcell is over-clustered (and there also tag reads are more vulnerable than the main reads). It is possible to get perfectly good sequence on main reads but lose the tag reads, rendering the run useless.

ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by genomax57k

Ah, i didn't know it was a totally separate read. I assumed it was sequenced with everything else and then cut out programatically afterward. A totally separate sequencing step eh? Gosh.

So now it makes sense to me - thank you both genomax2 and igor :)

ADD REPLYlink written 2.3 years ago by John12k
1

The barcode read is a separate read. If you really want, you can get those quality scores.

ADD REPLYlink written 2.3 years ago by igor6.8k

Thanks for the excellent discussion, which I'm just now getting around to reading in detail, but continues to be relevant since I'm dealing with several different flavors of unpleasantly non-"standard" data. Every bit of crowdsourced clue I can get helps! This particular dataset did turn out to respond well to standard demultiplexing, but not all of them do. I had the pleasure of learning informatics in a top-of-the-line organization with excellent standards, where I had access to everyone from the lab techs to the people running the mapping pipeline in case questions arose. In my new situation...not so much. :)

ADD REPLYlink written 2.0 years ago by jenn.drummond40
4
gravatar for harold.smith.tarheel
2.3 years ago by
United States
harold.smith.tarheel4.2k wrote:

Your sequence string 'ATCACGATC' is a perfect match to the TruSeq adapter index 1. The length (9bp) is atypical but not unheard-of. The length of the index read is often index-length-plus-one (i.e., seven cycles for a 6mer), so it appears that the sequencer was programmed for nine index cycles (8mer+1). The last nucleotide is typically trimmed from the data by default (--use-bases-mask = I8n), but the full length can be specified by the user (I9).

As @igor indicates, the amount of data suggests that the file was not demultiplexed. However, given that 90% of your data are a perfect match, it's also unlikely to contain multiple libraries. Best guess is that some other lanes of this flow cell did contain 8mer indices, and CASAVA doesn't allow lane-specific BCL-to-FASTQ conversion (so the data in all lanes have identical formats).

ADD COMMENTlink modified 2.3 years ago • written 2.3 years ago by harold.smith.tarheel4.2k

This is an older run, so I can't speak for CASAVA 1.7. However, 1.8 allows lane-specific BCL-to-FASTQ conversion.

ADD REPLYlink written 2.3 years ago by igor6.8k

This would have been useful to know! Our facility always ran CASAVA multiple times to generate different data outputs. Just for my edification, what's the CASAVA 1.8 syntax to extract (for example) a 6bp index from lane 1 vs a 8bp index from lanes 2-8?

ADD REPLYlink written 2.3 years ago by harold.smith.tarheel4.2k

The latest version of bcl2fastq is 2 where you can specify exact lanes.

With bcl2fastq 1.8, you have to run it separately for each barcode size, so you do have to run it multiple times. However, you don't have to process each lane twice. In your sample sheet, only include lanes you want to process and it will ignore any lanes that are not mentioned in the sample sheet.

ADD REPLYlink written 2.3 years ago by igor6.8k

Okay, that's exactly what we did.

ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by harold.smith.tarheel4.2k
1
gravatar for Brian Bushnell
2.3 years ago by
Walnut Creek, USA
Brian Bushnell16k wrote:

It's possible that they ARE your barcodes, and the metadata is wrong or you're looking at the wrong data.

If your reads are paired, you can determine the actual adapter sequence using BBMerge with the "outadapter" flag; the barcode will be embedded in the adapter sequence (may be reverse-complemented, due to the chemistry). So you might try that, to determine the barcodes empirically.

Also, the number of bases in your barcode and the number of bases sequenced in the barcode phase are not necessarily correlated. Someone may have run 9 cycles even though the barcodes were only 6 bases long. In that case, the last 3 bases would just be part of the adapter sequence and not technically the barcode.

ADD COMMENTlink written 2.3 years ago by Brian Bushnell16k
1
gravatar for genomax
2.3 years ago by
genomax57k
United States
genomax57k wrote:

What is in the sequence file should be absolute data (provided it has not been tampered with). If the barcodes you have do not match the data then you should check on the validity of barcodes first (also check that that list you have is not reverse complement of what is in the file, the barcode match may start 2-3 bp into what is in the reads, this is a common error some experimental people do).

If you are positive that the barcodes are correct then

  1. you may have a wrong data set - if none (or few) of your barcodes match what is there
  2. data you need is mixed with something else - you would need to parse reads that match your barcodes out. Like @Brian said you could only look for the significant bases (6-8) from the beginning of the tag read and ignore the rest.

But before doing any of this you should take a few reads (that match your barcode selections and that predominant barcode) and do quick blast @NCBI to confirm that they are at least from the genome they are supposed to be from. If not, you have a problem on your hands.

ADD COMMENTlink modified 2.3 years ago • written 2.3 years ago by genomax57k

What a horrible thought...and therefore a useful one for the future. I'll remember that as I go through further forensics!

ADD REPLYlink written 2.0 years ago by jenn.drummond40
1
gravatar for igor
2.3 years ago by
igor6.8k
United States
igor6.8k wrote:

I am somewhat concerned your FASTQ headers are not standard. Specifically, there is no instrument ID (see discussion here: Illumina Instrument Type from fastq? ). There may be other modifications that are not obvious that would make troubleshooting more difficult.

We don't know if the samples were demultiplexed. It's possible that they weren't. The FASTQ is just showing you the barcode sequence corresponding to each read. Does the filename contain a barcode? By default, it should.

The files are probably not demultiplexed because you say you have 178M reads in that file, but 300 samples. 178M reads is roughly what you would expect from a full lane of sequencing.

If we assume this is a demultiplexed file, don't expect all of the barcode sequences to match your barcode. When demultiplexing, you can specify the number of mismatches. It's common to allow 1 mismatch (Illumina's own BaseSpace service does that). That will yield a lot of barcodes that are not identical to your barcode sequence, but still correspond to it. What are your barcodes? How do they compare to that sequence your posted? What are the other barcode sequences you see?

Finally, this is really the question for your sequencing facility. We can only speculate here, but they can easily tell you exactly what they did.

ADD COMMENTlink modified 2.3 years ago • written 2.3 years ago by igor6.8k
2

Strictly speaking, there is no standard. There is just a collection of unspoken rules. If there was a standard, there wouldn't be this problem :)

They also wouldn't be a need for every read to contain the same useless prefixes/suffixes. They might as well have made their machines write a QNAME of "Our_Dear_Leader_Illumina#[5 to 10 random digits to ruin compression]#barcode#current_exchange_rate_of_£_to_$/1"

ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by John12k

There is no "standard", but the Illumina sequencers come with default settings that have been fairly consistent for the last few years.

ADD REPLYlink written 2.3 years ago by igor6.8k

Right, a collection of unspoken rules.

ADD REPLYlink written 2.3 years ago by John12k

Spoken rules. Ask Illumina and they will tell you exactly how the headers are generated. It's not a secret.

ADD REPLYlink written 2.3 years ago by igor6.8k
1

I mean they are not spoken in the FASTQ specification. They are additional "rules" for how Illumina wants to encode per-read metadata using a format that was not designed for per-read metadata. Illumina's decision to do that rather than find another/proper way to encode per-read metadata with sequence and quality has ultimately led to the existence of this thread/problem. I mean, i'd give Illumina some credit, they're not a huge company with unlimited resources. It's not like they made $1.57 billion dollars of gross profit in the last year. Oh wait. No. They did.

"Yeah just stuff it in the template ID. They'll figure it out."

ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by John12k
2

I am going to abandon this thread before we get to the phred33 vs phred64 debate. At least the header line doesn't affect most downstream tools.

ADD REPLYlink written 2.3 years ago by igor6.8k

You get to do that when you invent a new technology (and then make everyone else run with it).

ADD REPLYlink written 2.3 years ago by genomax57k

This data looks to be from older versions of CASAVA (v.1.7 and before) which was the "standard" (for what it is worth) then.

ADD REPLYlink written 2.3 years ago by genomax57k

Eh, I don't think it's fair to say Illumina defined the standard for the Sanger Institute's FASTQ format. It may be standard given the prevalence of the Illumina machines, but I think it's wrong to say it is the standard or even a standard.

The reason i'm making a big deal out of it is that I think a lot of confusion in Bioinformatics happens when people think there are some rule going on, but really there are no rules. Assumptions about data is the number 1 cause for mistakes, so even if the header looked like it followed the CASAVA format perfectly, no one should try to interpret it because the header only is supposed to have 1 job - to be unique among reads that come from the same sequenced fragment. CASAVA ruined that with /1 and /2, and all this psudo-metadata about barcodes/etc does is blow up the file size of everything.

ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by John12k

To be explicit, I was referring to the style of the fastq header in the example posted in the OP as standard for CASAVA v.1.7 and before :-)

ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by genomax57k

I haven't thought about that. I haven't worked with CASAVA 1.7 headers, so it wasn't obvious to me.

However, CASAVA 1.8 has been around since 2011. Has the OP been sitting on this data for 5 years?

ADD REPLYlink written 2.3 years ago by igor6.8k

I have worked with every public version of CASAVA :-)

Has the OP been sitting on this data for 5 years?

Someone was. OP inherited this project with minimal information.

ADD REPLYlink written 2.3 years ago by genomax57k

Yep, you nailed it. :)

ADD REPLYlink written 2.0 years ago by jenn.drummond40
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1790 users visited in the last hour