Question

Isn't the FASTQ file header supposed to have the barcode sequence?

0

7.4 years ago

salamandra ▴ 550

So, I have a ChIP-seq sample with this barcode: CGATGT, and I want to remove the barcodes and adapters. But when I look at the FASTQ file, the headers contain 'CTTGTA':

@NS500672:149:HNNH3BGXY:4:23612:26287:20351 1:N:0:CTTGTA
TCCGTCTAGTCAAAGCTATGGTTTTTCCAGTAGATCGGAAGAGCACACGTCTGAACTCCAGTCACCTTGTAATCTCCTAT
TATCGGAAGAGCACACGTCTGAACTCCAGTCACCTTGTAATATCGTATGCCGTCGTCTGCGTGAAAAAAAAAAGCGGGGG

a) Shouldn't the barcode be CGATGT?

Also, when running FastQC there are over-represented sequences, and it lists the possible source as TruSeq Adapter, Index 12, which has the CTTGTA barcode, as shown on page 22 here: https://support.illumina.com/content/dam/illumina-support/documents/documentation/chemistry_documentation/experiment-design/illumina-adapter-sequences-1000000002694-05.pdf

The adapter content on FastQC is ok.

b) What does this mean?

I am new to ChIP-seq.

ChIP-Seq Quality control • 3.8k views

ADD COMMENT • link updated 7.2 years ago by Biostar 20 • written 7.4 years ago by salamandra ▴ 550

1

CTTGTA is the barcode. That is what the sequencer saw/sequenced on this library.

Please see the informative posts about FastQC here.

ADD REPLY • link 7.4 years ago by GenoMax 152k

1

As GenoMax said, trust the demultiplexing software (the one that reads out the sequenced barcode) for assigning reads to barcodes. FastQC, as good as it is for quality control, is not specialized in barcode/index recognition. The only information it can give you is which kind of adapter (TruSeq, Nextera...) is present, so that you know what you have to trim.
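
If you want to see which index the demultiplexer recorded, it can be read straight out of the header: in Illumina headers of the style posted above, it is the last colon-separated field. A minimal Python sketch, assuming plain 4-line FASTQ records with that header layout (the function names are made up for illustration, not part of any tool):

```python
from collections import Counter


def index_from_header(header: str) -> str:
    """Return the index (barcode) field of an Illumina-style FASTQ header.

    Assumes the header ends with ':<index>', as in
    '@...:20351 1:N:0:CTTGTA'.
    """
    return header.strip().rsplit(":", 1)[-1]


def tally_indexes(lines) -> Counter:
    """Count index sequences over an iterable of FASTQ lines.

    Assumes strict 4-line records, so every 4th line (starting at 0)
    is a header.
    """
    counts = Counter()
    for line_number, line in enumerate(lines):
        if line_number % 4 == 0 and line.strip():
            counts[index_from_header(line)] += 1
    return counts


if __name__ == "__main__":
    header = "@NS500672:149:HNNH3BGXY:4:23612:26287:20351 1:N:0:CTTGTA"
    print(index_from_header(header))
```

For the header in the question this prints CTTGTA, which is the index the sequencer actually read for that library.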

ADD REPLY • link 7.4 years ago by ATpoint 88k

0

I have two other questions that you might know the answer to:

I wish to trim the over-represented sequences, which FastQC says are TruSeq Adapter, Index 12, from the data using Trimmomatic. However, none of the files in Trimmomatic's adapters folder (NexteraPE-PE.fa, TruSeq2-SE.fa, TruSeq3-PE.fa, TruSeq2-PE.fa, TruSeq3-PE-2.fa and TruSeq3-SE.fa) contains the over-represented sequence shown in FastQC. Why?

So, I created a FASTA file with each over-represented sequence. The sequencing is single-end, but I do not know which end has the adapter. Inside that file, should I put just the sequence name, or repeat the sequence with name/1 and name/2? I read the manual, but didn't understand whether /1 and /2 should be used with single-end or paired-end sequencing.

ADD REPLY • link 7.4 years ago by salamandra ▴ 550

0

So, I created a FASTA file with each over-represented sequence.

Don't reinvent the wheel. The various Illumina adapters share a common core sequence. Once that core is detected in a read, everything to the right of it is generally removed. There is no need to modify sequence identifiers if you know the data to be single-end. /1 and /2 were identifiers used in old Illumina data. You can find more information here.
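
To make the idea concrete, here is a rough Python sketch of that "find the core, drop everything to its right" logic. It only does exact matching, whereas real trimmers (Trimmomatic, bbduk.sh) also tolerate mismatches and partial adapters at the end of a read, so treat it as an illustration rather than a replacement. The adapter string is the TruSeq core found in Trimmomatic's TruSeq3-SE.fa; `trim_adapter` is a hypothetical helper name.

```python
# TruSeq core adapter as listed in Trimmomatic's TruSeq3-SE.fa
ADAPTER = "AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC"

# First read from the question
READ = (
    "TCCGTCTAGTCAAAGCTATGGTTTTTCCAGTAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC"
    "CTTGTAATCTCCTAT"
)


def trim_adapter(read: str, adapter: str = ADAPTER) -> str:
    """Cut the read at the first exact occurrence of the adapter.

    Returns the read unchanged if the adapter is not found.
    """
    position = read.find(adapter)
    if position == -1:
        return read
    return read[:position]


if __name__ == "__main__":
    kept = trim_adapter(READ)
    after_adapter = READ[len(kept) + len(ADAPTER):]
    print("kept insert :", kept)
    print("after core  :", after_adapter[:6])
```

For the read in the question, the six bases immediately after the core are CTTGTA, i.e. the index itself. That is why FastQC reports the whole stretch as "TruSeq Adapter, Index 12" while the adapter files only need the shared core.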

ADD REPLY • link 7.4 years ago by GenoMax 152k

0

Well, the core sequence that is common to most of them, 'GATCGGAAGAGCACACGTCTGAACTCCAGTCACATGTCAGAATC', isn't in any of Trimmomatic's adapter files, so I can't see how I can remove the sequences without reinventing the wheel :) What am I missing?

ADD REPLY • link 7.4 years ago by salamandra ▴ 550

1

That is interesting. I use bbduk.sh from the BBMap suite. @Brian makes an adapters.fa file available in the resources directory of the BBMap install that contains sequences of most common adapters from commercial kits. You may want to try that out.

Edit: I do see the common sequence in the Trimmomatic files:

Trimmomatic-0.36/adapters/TruSeq3-PE-2.fa:AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC
Trimmomatic-0.36/adapters/TruSeq3-SE.fa:AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC
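
The string you searched for starts with this core and then continues with more bases, so an exact search for the full string in the adapter files fails even though the core is there. A quick way to check this in Python, using difflib from the standard library (the variable and function names are just for illustration):

```python
from difflib import SequenceMatcher

# Sequence quoted from the FastQC over-represented list
OVERREPRESENTED = "GATCGGAAGAGCACACGTCTGAACTCCAGTCACATGTCAGAATC"

# Entry found in Trimmomatic-0.36/adapters/TruSeq3-SE.fa
ADAPTER = "AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC"


def longest_shared(a: str, b: str) -> str:
    """Return the longest substring that occurs in both a and b."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[match.a:match.a + match.size]


if __name__ == "__main__":
    shared = longest_shared(OVERREPRESENTED, ADAPTER)
    print(shared, len(shared))
```

The shared stretch covers the whole Trimmomatic entry except its leading A, so the stock adapter file should already match these reads.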

ADD REPLY • link 7.4 years ago by GenoMax 152k