Question: How to remove kmer profiles?
kirannbishwa01 (United States) wrote 4.5 years ago:

I checked my fastq files with FastQC, which revealed several problems: 1) per base GC content and per base sequence content at the initial part of the 100 bp paired-end reads, and 2) several overrepresented sequences and kmer profiles. I then used Trimmomatic to remove the first 10 base pairs (HEADCROP:10), since they showed some problems in the reads (is that the right thing to do?), and also supplied Illumina adapters via ILLUMINACLIP to remove the overrepresented sequences and kmer profiles. The overrepresented sequences report is now good, but the kmer profiles are still there.

How should I remove those kmer profiles? Is it fine to go ahead and do the alignment to the reference genome without correcting for the kmers?

Thank you in advance!

I wanted to share the pics/html files I have, but I cannot find any option to share them on this forum. I am not sure why that is. Are attachments not allowed on the Biostars forum?

- Bishwa K.

modified 4.2 years ago by Biostar • written 4.5 years ago by kirannbishwa01

Please upload things somewhere and link to them. Also, what kind of experiment was this (e.g., RNAseq)?
 

written 4.5 years ago by Devon Ryan

Hi Devon,

I have shared the links using Google Drive sharing. I think the files will open in the browser after you download them. The data are genomic resequencing data.

Thanks,

written 4.5 years ago by kirannbishwa01
Istvan Albert (University Park, USA) wrote 4.5 years ago:

Don't worry about the kmers - in the vast majority of cases they provide useless information. 

I would also not head crop the data; that rarely helps. Many aligners (like bwa) will tolerate leading and trailing errors in reads.

written 4.5 years ago by Istvan Albert
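
One rough way to judge whether a flagged kmer matters is to count what fraction of reads actually contain it. The following is a minimal sketch, not part of FastQC: the function name and the toy reads are made up for illustration, and the example kmer AGATCGG is simply the start of the common Illumina adapter sequence AGATCGGAAGAGC.

```python
def fraction_of_reads_with_kmer(reads, kmer):
    """Return the fraction of reads that contain `kmer` at least once."""
    reads = list(reads)
    if not reads:
        return 0.0
    hits = sum(1 for seq in reads if kmer in seq)
    return hits / len(reads)


# Toy reads, invented for illustration (not taken from the thread's data).
toy_reads = [
    "AGATCGGAAGAGCACACGTC",
    "TTGACCGTAGGCTAACGTTA",
    "CCGTAGATCGGAAGTTTACG",
    "GGCATTACGGATCCATGCAA",
]

fraction = fraction_of_reads_with_kmer(toy_reads, "AGATCGG")
print(f"{fraction:.2%} of reads contain the kmer")
```

On real data the reads would come from a FASTQ parser rather than a hard-coded list; if the fraction is tiny relative to the total read count, the kmer warning is unlikely to affect alignment.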

As Istvan said, there doesn't seem to be anything terribly wrong with your data. In my experience the kmers are often low in number relative to total reads, and are often caused by adapters. The nucleotide usage at the beginning of the reads always looks odd (not flat) due to random hexamers used by Illumina not being truly random. The important thing is to remove adapters and use the sliding window for quality trimming.
 

written 4.5 years ago by Ian
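
To make the sliding window idea concrete, here is a simplified sketch of sliding-window quality trimming. It is in the spirit of Trimmomatic's SLIDINGWINDOW step but is not its exact implementation; the function name, the default window of 4 and the threshold of 15 are assumptions chosen for illustration.

```python
def sliding_window_trim(seq, quals, window=4, min_mean_q=15):
    """Scan from the 5' end and cut the read at the start of the first
    window whose mean quality is below `min_mean_q`.

    Returns the kept (sequence, qualities). Reads shorter than the
    window are returned unchanged in this simplified version.
    """
    if len(seq) != len(quals):
        raise ValueError("sequence and qualities must have equal length")
    for start in range(len(quals) - window + 1):
        win = quals[start:start + window]
        if sum(win) / window < min_mean_q:
            return seq[:start], quals[:start]
    return seq, quals


# Toy read with quality dropping off toward the 3' end.
seq = "ACGTACGTAC"
quals = [30, 30, 30, 30, 30, 30, 10, 8, 5, 2]
trimmed_seq, trimmed_quals = sliding_window_trim(seq, quals)
print(trimmed_seq, trimmed_quals)
```

Because the cut happens where the windowed average drops, a single poor base inside an otherwise good stretch does not truncate the read, which is why this is usually gentler than cropping a fixed number of bases.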

Thanks for the update !!!

written 4.5 years ago by kirannbishwa01
kirannbishwa01 (United States) wrote 4.5 years ago:

I am attaching the link to the files that are available in html format. I think it will open on the browser after downloading.

This is the FastQC report for the raw files (genomic resequencing data, paired-end reads). It shows several problems: 1) per base GC content and per base sequence content at the initial part of the 100 bp paired-end reads, and 2) several overrepresented sequences and kmer profiles.

https://drive.google.com/file/d/0B9YUBnYGAr1AMWpFSjdjN3UyaFE/view?usp=sharing

https://drive.google.com/file/d/0B9YUBnYGAr1AXy0tSXZRQVhpRkE/view?usp=sharing

 

I then head cropped (10 bases) and removed adapters using Trimmomatic:

adapters: https://drive.google.com/file/d/0B9YUBnYGAr1AS0hrc2lMbE43ZUU/view?usp=sharing

https://drive.google.com/file/d/0B9YUBnYGAr1ANEFZc3FleDRob3M/view?usp=sharing

 

Adapter trimming alone improved the kmer profiles, but not most of the per base sequence content and GC content at the first 10 bp of the reads.

The new fastqc 0.32 reports kmer profiles for the fasta files that were not reported by the fastqc version available on iPlant.

 

Also, the RNAseq data have the following FastQC report: no kmer or adapter contamination, but the GC and base content show more variation at the first 10 bp.

https://drive.google.com/file/d/0B9YUBnYGAr1Adk5MY1NoRVMzZFE/view?usp=sharing

https://drive.google.com/file/d/0B9YUBnYGAr1AR09wZnUzeXAwRTA/view?usp=sharing

 

I am thinking of proceeding with adapter trimming but no head crop, but I would like to know why there is such variation at the first 10 base pairs of the reads (for both the RNAseq and genomic reseq data; they were sequenced at different facilities).

 

Thanks,
 

modified 4.5 years ago • written 4.5 years ago by kirannbishwa01
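
For the "adapter trimming but no head crop" plan, a Trimmomatic paired-end call could be assembled roughly as below. This is only a sketch: the jar path, input names, output prefix and adapter file are hypothetical placeholders, and the step parameters (ILLUMINACLIP seed mismatches 2, palindrome threshold 30, simple threshold 10; SLIDINGWINDOW:4:15; MINLEN:36) are the values from the Trimmomatic manual's example, not settings recommended in this thread.

```python
def build_trimmomatic_pe_command(jar, fwd, rev, out_prefix, adapters):
    """Assemble a Trimmomatic paired-end command with adapter clipping
    and sliding-window quality trimming, and no HEADCROP step."""
    outputs = [
        f"{out_prefix}_1P.fastq.gz",  # forward reads whose mate survived
        f"{out_prefix}_1U.fastq.gz",  # forward reads whose mate was dropped
        f"{out_prefix}_2P.fastq.gz",  # reverse reads whose mate survived
        f"{out_prefix}_2U.fastq.gz",  # reverse reads whose mate was dropped
    ]
    steps = [
        f"ILLUMINACLIP:{adapters}:2:30:10",  # adapter removal
        "SLIDINGWINDOW:4:15",                # quality trimming
        "MINLEN:36",                         # drop reads that end up too short
    ]
    return ["java", "-jar", jar, "PE", fwd, rev, *outputs, *steps]


# All file names below are placeholders for illustration.
cmd = build_trimmomatic_pe_command(
    jar="trimmomatic.jar",
    fwd="sample_R1.fastq.gz",
    rev="sample_R2.fastq.gz",
    out_prefix="sample_trimmed",
    adapters="adapters.fa",
)
print(" ".join(cmd))
```

The printed line can be pasted into a shell, or the list passed to `subprocess.run(cmd, check=True)`. Steps are applied in the order given, so adapter clipping happens before quality trimming here.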

Can someone comment on my report?

Thanks,

modified 4.5 years ago • written 4.5 years ago by kirannbishwa01