Question: CLC GW vs. tophat2
3.1 years ago, Assa Yeroslaviz (Munich) wrote:

Hi,

we are having a discussion with our genomic centre about the mapping results of the samples they provided for us.

I have analysed the data with both tophat2 and STAR.

We have done a quality check using fastqc. The results we got back were not very promising. I have added one image below.

These were the commands I used to run the analysis:

tophat2 -p 10 -g 20 --read-edit-dist 5 --report-secondary-alignments -N 5 --transcriptome-index=transcriptome_index/genes -o $NEW_FILE.out genome $file

STAR --runMode alignReads --runThreadN 10 --genomeDir /home/yeroslaviz/genomes/Mus_musculus/STARIndex/ --readFilesCommand zcat --readFilesIn $file --sjdbGTFfile /home/yeroslaviz/genomes/Mus_musculus/Ensembl/NCBIM37/Annotation/Genes/genes.gtf --sjdbFileChrStartEnd ~/genomes/Mus_musculus/STARIndex/sjdbList.out.tab --sjdbInsertSave All --outFilterMultimapNmax 20 --outFileNamePrefix $NEW_FILE --outSAMprimaryFlag AllBestScore --outSAMtype BAM SortedByCoordinate --quantMode TranscriptomeSAM --twopassMode Basic --limitGenomeGenerateRAM 50000000000 --alignSJDBoverhangMin 1

We got very low mapping rates (only around 30-60% of the reads were mapped).

When we asked the sequencing centre whether they could explain the problem(s), we were told that they can't reproduce it.

They sent us a list of their mapping results, which range between 75% and 95%.

It turns out they are using the CLC Genomics Workbench to map the reads with the following parameters:

Mismatch cost: 2; Insertion cost: 3; Deletion cost: 3; Length fraction: 0.5; Similarity fraction: 0.8

I was wondering whether it even makes sense to map a data set with such parameters. IMHO, the length fraction and the similarity fraction together allow for a very high error rate: at least 50% of the read must align, and within that 50% only 80% similarity is required. For our 100-base reads, this means as few as 40 bases need to match, so up to 60 bases can be incorrect or unaligned.
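To make the arithmetic above explicit, here is a minimal sketch. It assumes the reading of the two parameters given in this post (length fraction = minimum share of the read that must align, similarity fraction = minimum identity within the aligned part); how CLC applies them internally is not confirmed here, and the function name is made up for illustration.

```python
import math
from fractions import Fraction


def min_matching_bases(read_length, length_fraction, similarity_fraction):
    """Return (min_aligned, min_matching) bases for one read.

    Assumption: the thresholds combine multiplicatively, as described
    in the post. Fractions are used to avoid floating-point rounding.
    """
    lf = Fraction(str(length_fraction))
    sf = Fraction(str(similarity_fraction))
    min_aligned = math.ceil(read_length * lf)
    min_matching = math.ceil(min_aligned * sf)
    return min_aligned, min_matching


aligned, matching = min_matching_bases(100, 0.5, 0.8)
print("minimum aligned bases: ", aligned)
print("minimum matching bases:", matching)
print("bases that may mismatch or stay unaligned:", 100 - matching)
```

With the parameters from the sequencing centre, a 100-base read can be reported as mapped with only 40 matching bases, which is where the "60 bases not correct" figure comes from.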

I have tried to search for papers or more information from other users who have worked with the CLC GW before, but couldn't find much.

Do you think the CLC way of analysing the data is still good enough? Is the error rate not too high?

 

thanks,

Assa

[Image: FastQC per-base quality plot (per_base_quality)]

Tags: fastqc, mapping, clc, tophat

Also, the plot above shows data of very low quality. I would be highly suspicious of any tool (or settings) that produces high alignment rates on it.

Reply written 3.1 years ago by Istvan Albert

3.1 years ago, michael.ante (Austria/Vienna) wrote:

I would start with low-quality tail trimming and also check for adapter contamination (which is also part of the FastQC report). You can use, for instance, bbduk (from the BBMap suite) or fastq_quality_trimmer from the FASTX toolkit.
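For readers unfamiliar with the idea, low-quality tail trimming can be sketched in a few lines. This is only an illustration of the concept, not the algorithm implemented by bbduk or fastq_quality_trimmer; the function name, the threshold of 20 and the example read are all made up.

```python
def trim_low_quality_tail(seq, qual, min_q=20, phred_offset=33):
    """Drop bases from the 3' end until one reaches min_q.

    seq  -- read sequence
    qual -- FASTQ quality string (Phred+33 assumed by default)
    """
    end = len(seq)
    # Walk backwards while the last remaining base is below the threshold.
    while end > 0 and ord(qual[end - 1]) - phred_offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


# Made-up read: 'I' encodes Q40 and '#' encodes Q2 in Phred+33.
trimmed_seq, trimmed_qual = trim_low_quality_tail("ACGTACGT", "IIIII###")
print(trimmed_seq, trimmed_qual)
```

Real trimmers typically add a minimum-length filter after trimming so that very short leftovers are discarded before mapping.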

Subsequently, you might check for over-represented sequences in the trimmed data. Maybe you have some other contamination as well.
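A quick way to look for over-represented sequences outside of FastQC is to count identical reads. The sketch below assumes a plain four-line-per-record FASTQ; the function name, the threshold and the three example reads are hypothetical.

```python
from collections import Counter


def overrepresented(fastq_lines, min_fraction=0.001):
    """Return (sequence, count) pairs that make up at least
    min_fraction of all reads, most frequent first."""
    # In four-line FASTQ, the sequence is the second line of each record.
    seqs = [line.strip() for i, line in enumerate(fastq_lines) if i % 4 == 1]
    total = len(seqs)
    counts = Counter(seqs)
    return [(s, n) for s, n in counts.most_common()
            if total and n / total >= min_fraction]


# Made-up example records, for illustration only.
example = [
    "@r1", "AAAA", "+", "IIII",
    "@r2", "AAAA", "+", "IIII",
    "@r3", "CCCC", "+", "IIII",
]
print(overrepresented(example, min_fraction=0.5))
```

Any sequence that comes up can then be searched against adapter lists or rRNA sequences to identify the contamination.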

After these steps, you could still compare TopHat2 and CLC GW.


Hi,

I have already done this. I did all the trimming, cutting and filtering I can think of. It didn't really increase the mapping rate by much. My question here is not really about how to make my data set better, but about whether the results from CLC are trustworthy enough and, if so, why they differ so much from the tophat2 run.

Reply written 3.1 years ago by Assa Yeroslaviz

3.1 years ago, Burnedthumb (Netherlands) wrote:

Both TopHat2 and STAR are splice-aware aligners. If I recall correctly, the default alignment program of CLC bio is not. Maybe you can verify which aligner they used; perhaps they used the RNA-seq pipeline, which (should) work differently.

A couple of months back I ran some tests with CLC bio vs. Bowtie2 vs. HISAT, followed by a SNP-calling program. The result was that CLC bio gave more (false-positive) SNPs than the other two. My guess is that this is due to CLC's oddly liberal alignments at intron/exon boundaries (however, I need to do more testing on that).

 

 


Did you use the default parameters for the CLC run?

I still think that taking a length fraction of 0.5 and then a similarity of 0.8 on top of that allows for quite a high error rate. Any experience with that?

Reply written 3.1 years ago by Assa Yeroslaviz

3.1 years ago, Istvan Albert (University Park, USA) wrote:

By default, bowtie2 is tuned for speed and will not be able to handle data with lots of errors. You can greatly increase its sensitivity; for example, see this post:

A: BWA vs Bowtie 2 (Poll)

 


I will try to run bowtie2 instead of tophat2 with the mentioned parameters. Maybe I will play a bit with them as well.

But my main question stays the same: can I trust the CLC results?

Reply written 3.1 years ago by Assa Yeroslaviz

tophat2 already runs bowtie2 as its aligner; it may just need a few extra parameters.

Reply written 3.1 years ago by Istvan Albert

Is it possible to add the bowtie parameters from the link you added above to the tophat2 run?

I have looked into the tophat2 parameters and couldn't find any besides the ones I listed above that would make the search less stringent.

Reply written 3.1 years ago by Assa Yeroslaviz

I think these correspond to:

--very-sensitive

See the Bowtie2-specific settings in the TopHat2 manual. Also, in this case the options may need to be prefixed with --b2- so that TopHat2 knows to pass them down to Bowtie2. For example, -D becomes --b2-D.
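As a rough illustration of that prefix rule, the sketch below rewrites Bowtie2 option names into their TopHat2 form and assembles a command line. The option list is recalled from the "Bowtie2 specific settings" section of the TopHat2 manual and should be checked against your installed version; the helper name, output directory, index name and read file are placeholders, not values from this thread.

```python
import shlex

# Bowtie2 options that TopHat2 accepts with a --b2- prefix
# (assumption: recalled from the TopHat2 manual, verify for your version).
B2_PASSTHROUGH = {
    "--very-fast", "--fast", "--sensitive", "--very-sensitive",
    "-N", "-L", "-i", "--n-ceil", "--gbar",
    "--mp", "--np", "--rdg", "--rfg", "--score-min",
    "-D", "-R",
}


def to_tophat2_option(bowtie2_option):
    """Turn a Bowtie2 option name into its TopHat2 --b2- equivalent."""
    if bowtie2_option not in B2_PASSTHROUGH:
        raise ValueError(f"no --b2- form known for {bowtie2_option!r}")
    return "--b2-" + bowtie2_option.lstrip("-")


# Placeholder paths and names, for illustration only.
cmd = [
    "tophat2", "-p", "10",
    to_tophat2_option("--very-sensitive"),
    "-o", "sample.out",
    "genome", "sample.fastq.gz",
]
print(" ".join(shlex.quote(arg) for arg in cmd))
```

The script only prints the command, so it can be inspected before anything is actually run.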

Reply written 3.1 years ago by Istvan Albert