Question: Isn't the FASTQ file header supposed to contain the barcode sequence?
4 weeks ago by
salamandra110 wrote:

I have a ChIP-seq sample with the barcode CGATGT, and I want to remove the barcodes and adapters. But when I look at the FASTQ files, the headers contain 'CTTGTA':

@NS500672:149:HNNH3BGXY:4:23612:26287:20351 1:N:0:CTTGTA

a) Shouldn't it be CGATGT, as the barcode?

Also, when I run FastQC there are over-represented sequences, and it lists the possible source as TruSeq Adapter, Index 12, which has the CTTGTA barcode, as shown on page 22 here:

The adapter content on FastQC is ok.

b) What does this mean?

I am new to ChIP-seq.

chip-seq quality control • 146 views
modified 4 weeks ago by genomax44k • written 4 weeks ago by salamandra110

CTTGTA is the barcode. That is what the sequencer saw/sequenced on this library.

Please see the informative posts about FastQC here.

modified 4 weeks ago • written 4 weeks ago by genomax44k

As genomax said, trust the demultiplexing software (the one that reads out the sequenced barcode) for assigning reads to barcodes. FastQC, as good as it is for quality control, is not specialized in barcode/index recognition. The only information that it can give you is which kind of adapter (TruSeq, Nextera...) is present so that you know what you have to trim for.
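To see which index the demultiplexing software actually recorded, the last field of each header line can be tallied. The following is only a minimal sketch: it assumes the Casava 1.8+ style header shown in the question (`@<read id> <read>:<filtered>:<control>:<index>`), and the function names are made up for illustration.

```python
from collections import Counter


def index_from_header(header):
    """Return the index field of a FASTQ header line.

    Assumes the layout '@<read id> <read>:<filtered>:<control>:<index>',
    as in the header quoted in the question.
    """
    parts = header.strip().split(" ", 1)
    if len(parts) != 2:
        raise ValueError("header has no comment field: " + header)
    fields = parts[1].split(":")
    if len(fields) < 4:
        raise ValueError("comment field has no index: " + header)
    return fields[3]


def count_indexes(lines):
    """Tally index sequences over the header lines of a FASTQ stream.

    Assumes strict 4-line records, so every 4th line (0, 4, 8, ...) is a header.
    """
    counts = Counter()
    for i, line in enumerate(lines):
        if i % 4 == 0:
            counts[index_from_header(line)] += 1
    return counts


header = "@NS500672:149:HNNH3BGXY:4:23612:26287:20351 1:N:0:CTTGTA"
print(index_from_header(header))
```

For a real file, `count_indexes(open("sample.fastq"))` would do the same over every record (`sample.fastq` is a placeholder name; a gzipped file would need `gzip.open(..., "rt")` instead).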

written 4 weeks ago by ATpoint3.2k

I have two other questions that you might know the answer to:

I want to use Trimmomatic to trim the over-represented sequences, which FastQC says are TruSeq Adapter, Index 12, from the data. However, none of the files in Trimmomatic's adapters folder (NexteraPE-PE.fa, TruSeq2-SE.fa, TruSeq3-PE.fa, TruSeq2-PE.fa, TruSeq3-PE-2.fa and TruSeq3-SE.fa) contains the over-represented sequence shown in FastQC. Why?

So, I created a fasta file with each over-represented sequence. The sequencing is single-end, but I do not know which end has the adapter. Inside that file, should I put just the sequence name, or repeat the sequence with name/1 and name/2? I read the manual, but didn't understand whether /1 and /2 should be used with single-end or paired-end sequencing.

written 29 days ago by salamandra110

So, I created a fasta file with each over-represented sequence.

Don't reinvent the wheel. The various Illumina adapters share a common core sequence. Once that core is detected in a read, everything to its right is generally removed. There is no need to modify sequence identifiers if you know the data is single-end. /1 and /2 were read identifiers used in older Illumina data. You can find more information here.
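As a rough illustration of the "find the core, drop everything to its right" idea, here is a small sketch. It is not how Trimmomatic or any other trimmer is implemented: real tools also tolerate mismatches and catch partial adapters at the 3' end. The core sequence used here is an assumption (the stretch commonly cited as shared by Illumina TruSeq adapters), and `trim_adapter` is a hypothetical helper.

```python
# Assumption: the stretch commonly cited as the shared start of
# Illumina TruSeq adapters. Check it against your own adapter file.
ADAPTER_CORE = "AGATCGGAAGAGC"


def trim_adapter(seq, qual, adapter=ADAPTER_CORE):
    """Cut a read (and its quality string) at the first exact adapter hit.

    Everything from the start of the adapter to the end of the read is
    removed. Reads without an exact hit are returned unchanged.
    """
    pos = seq.find(adapter)
    if pos == -1:
        return seq, qual
    return seq[:pos], qual[:pos]


# Toy read: 10 bases of insert followed by adapter read-through.
read = "ACGTACGTAC" + ADAPTER_CORE + "ACACGTCTG"
quals = "I" * len(read)
trimmed_seq, trimmed_qual = trim_adapter(read, quals)
print(trimmed_seq, len(trimmed_qual))
```

Because only the shared core is searched for, the same call trims the adapter whichever index follows it, which is why a per-index fasta entry is not needed.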

modified 29 days ago • written 29 days ago by genomax44k

Well, the core sequence that is common to most of them, 'GATCGGAAGAGCACACGTCTGAACTCCAGTCACATGTCAGAATC', isn't in any of Trimmomatic's adapter files, so I can't see how I can remove the sequences without reinventing the wheel :) What am I missing?

written 29 days ago by salamandra110

That is interesting. I use a tool from the BBMap suite. @Brian makes an adapters.fa file available in the resources directory of the BBMap install; it contains sequences of the most common adapters from commercial kits. You may want to try that out.

Edit: I see the common sequence in the Trimmomatic files.
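One way to check this kind of overlap yourself is to look for the longest stretch an over-represented sequence shares with each adapter entry, instead of searching for the whole FastQC sequence verbatim. A full over-represented sequence includes an index and flanking bases, so it will rarely appear in an adapter file as-is. The sketch below uses Python's `difflib`. The adapter entry is a stand-in typed in by hand (an assumption about what an indexed-adapter entry looks like, so compare it with your own copy of the file), and the function names are hypothetical.

```python
from difflib import SequenceMatcher


def parse_fasta(text):
    """Parse FASTA-formatted text into a {name: sequence} dict."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            records[name] = ""
        elif name is not None:
            records[name] += line.upper()
    return records


def longest_shared(query, adapters):
    """For each adapter, return the longest substring it shares with query."""
    shared = {}
    for name, seq in adapters.items():
        matcher = SequenceMatcher(None, query, seq, autojunk=False)
        m = matcher.find_longest_match(0, len(query), 0, len(seq))
        shared[name] = query[m.a:m.a + m.size]
    return shared


# Assumption: a stand-in for one entry of an adapter file.
adapters = parse_fasta(""">hypothetical_indexed_adapter
AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC
""")

# Over-represented sequence quoted earlier in this thread.
overrep = "GATCGGAAGAGCACACGTCTGAACTCCAGTCACATGTCAGAATC"

for name, common in longest_shared(overrep, adapters).items():
    print(name, len(common), common)
```

A long shared stretch means the adapter entry already covers that over-represented sequence, so the stock adapter file can be used as it is.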

modified 29 days ago • written 29 days ago by genomax44k