Question: Interpreting FastQC Results for ChIP-seq Data
5.8 years ago by
David730
David730 wrote:

Hello,

I am processing a ChIP-seq experiment that I downloaded from GEO (link). The SRA files are large (39M sequences), so processing them took a while. Briefly, I converted the SRA files to FASTQ with fastq-dump, concatenated the two FASTQ files with cat, and ran FastQC a first time. I discovered that the reads contained an adapter.

I ran Trim Galore! to remove the adapter.

Adapter 'GATCGGAAGAGCACACGTCTGAACTCCAGTCACTGTCGGATATCTCGTAT', length 50, was trimmed 13590580 times.

Then I ran FastQC on the trimmed FASTQ file and obtained the following results.

As you can see, the results are not ideal.

The "per base sequence content" and "per base GC content" plots look odd. I am thinking of trimming the ends of the reads. Any comments on that are welcome.

The per sequence GC content shows a mixed distribution. Has anybody encountered that? Could it come from the two FASTQ files I concatenated (a kind of batch effect)?

Lastly, because these are ChIP-seq reads, they are not completely random and contain over-represented motifs. So I assume the warnings for "sequence duplication levels" and "Kmer content" are not relevant. Is this assumption correct?

ADD COMMENTlink modified 5.8 years ago by Ido Tamir4.9k • written 5.8 years ago by David730
5.8 years ago by
Cambridge, UK
Jelena Aleksic900 wrote:

I agree with your conclusions. I would trim the last 10-20 or so base pairs: it looks a bit as if the end of an adapter is still there and hasn't been trimmed. These are pretty long reads, so you should still have plenty of data left afterwards. I'd also perform quality trimming to get rid of some of the lower-quality sequence (Trim Galore can do this).
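
For a quick test of that end trimming, a minimal sketch is below. It assumes plain 4-line FASTQ records on standard input; the function name `clip_3prime` and the file names in the usage line are made up for illustration. Trim Galore's documentation also lists a `--three_prime_clip_R1` option for this, if your version has it.

```shell
# Hypothetical helper: remove the last N bases (and the matching quality
# values) from every read. Assumes strict 4-line FASTQ records on stdin.
# Reads that are N bases or shorter end up with an empty sequence line.
clip_3prime() {
    awk -v n="$1" '
        NR % 4 == 2 || NR % 4 == 0 {   # sequence line and quality line
            print substr($0, 1, length($0) - n)
            next
        }
        { print }                      # header and "+" lines pass through
    '
}

# Usage (file names are placeholders):
#   clip_3prime 15 < trimmed.fq > clipped.fq
```

Re-running FastQC on the clipped file should show whether the odd pattern at the read ends is gone.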

The sequence duplication levels actually look completely fine to me, especially for ChIP-seq data. Some people do remove duplicates, though, so you could consider that if you're worried about it. I don't know whether it's bad etiquette to point people to a different help forum, but there was a long, in-depth discussion of duplicate removal in this thread at SEQanswers.

ADD COMMENTlink written 5.8 years ago by Jelena Aleksic900

Thanks for the hints. I will look into quality trimming. In our pipeline we keep duplicate reads but allow only unique matching at the alignment step.

ADD REPLYlink written 5.8 years ago by David730

Mmm, sounds good - that's actually my preferred solution as well.

ADD REPLYlink written 5.8 years ago by Jelena Aleksic900

I found out that the quality trimming is performed by default. I am re-running it with a higher threshold.

ADD REPLYlink written 5.8 years ago by David730
5.8 years ago by
Ido Tamir4.9k
Austria
Ido Tamir4.9k wrote:

It seems you pasted the two read files together rather than using "cat" as you said, or the SRA conversion messed something up. Otherwise you would not get reads with a length of 250. You can also see the drop and then rise in qualities at around position 130 (125).

I would go back to square one and start again.
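
One way to check where the long reads come from is to tabulate read lengths for each FASTQ file separately and then for the merged file. The sketch below assumes 4-line FASTQ records; the function name `read_lengths` and the file names are placeholders, not something from the original pipeline.

```shell
# Hypothetical helper: print "length count" pairs for a FASTQ stream,
# sorted by read length. Assumes strict 4-line records on stdin.
read_lengths() {
    awk 'NR % 4 == 2 { n[length($0)]++ }
         END { for (l in n) print l, n[l] }' | sort -n
}

# Usage (file names are placeholders):
#   read_lengths < file1.fq
#   read_lengths < file2.fq
#   read_lengths < file.fq
```

If the individual files top out at about half the length seen in the merged file, the merge step is the likely culprit; if a single file already contains the long reads, the SRA conversion deserves a closer look.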

ADD COMMENTlink written 5.8 years ago by Ido Tamir4.9k

Yes, the GEO page says that they used an Illumina Genome Analyzer, so the reads shouldn't be 250 bp.

ADD REPLYlink written 5.8 years ago by Mikael Huss4.6k

Thanks guys, I will look into that. How can running

cat file1.fq file2.fq > file.fq

paste reads together?

ADD REPLYlink written 5.8 years ago by David730

Maybe it was

  paste file1.fq file2.fq > file.fq

The read names would also be concatenated in this case.
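
To make the difference concrete, here is a toy sketch with two made-up one-read files (the function name `merge_demo`, the read names and the sequences are invented for illustration). Note that plain `paste` puts a tab between the joined lines, so this only illustrates the mechanism and does not show what FastQC would report for such a file.

```shell
# Toy demonstration of cat versus paste; file names and reads are made up.
merge_demo() {
    dir=$1
    printf '@r1\nACGT\n+\nIIII\n' > "$dir/file1.fq"
    printf '@r2\nTTGG\n+\nHHHH\n' > "$dir/file2.fq"

    # cat writes file2 after file1: 8 lines, two separate 4 bp reads.
    cat "$dir/file1.fq" "$dir/file2.fq" > "$dir/cat.fq"

    # paste joins line k of file1 to line k of file2 (tab-separated by
    # default): 4 lines, one record holding both names and both sequences.
    paste "$dir/file1.fq" "$dir/file2.fq" > "$dir/paste.fq"

    echo "cat.fq lines:   $(awk 'END { print NR }' "$dir/cat.fq")"
    echo "paste.fq lines: $(awk 'END { print NR }' "$dir/paste.fq")"
    echo "paste.fq record:"
    cat "$dir/paste.fq"
}

# Run the demo in a throwaway directory.
demo_dir=$(mktemp -d)
merge_demo "$demo_dir"
rm -rf "$demo_dir"
```

A quick sanity check on real data along the same lines: after `cat`, the line count of the merged file should equal the sum of the line counts of the two inputs.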

ADD REPLYlink written 5.8 years ago by Ido Tamir4.9k