We are having a discussion with our genomic centre about the mapping results of the samples they provided for us.
I have analysed the data with both tophat2 and STAR.
We ran a quality check with FastQC, and the results we got back were not very promising. I have added one image below.
These are the commands I used to run the analysis:

tophat2 -p 10 -g 20 --read-edit-dist 5 --report-secondary-alignments -N 5 --transcriptome-index=transcriptome_index/genes -o $NEW_FILE.out genome $file

STAR --runMode alignReads --runThreadN 10 --genomeDir /home/yeroslaviz/genomes/Mus_musculus/STARIndex/ --readFilesCommand zcat --readFilesIn $file --sjdbGTFfile /home/yeroslaviz/genomes/Mus_musculus/Ensembl/NCBIM37/Annotation/Genes/genes.gtf --sjdbFileChrStartEnd ~/genomes/Mus_musculus/STARIndex/sjdbList.out.tab --sjdbInsertSave All --outFilterMultimapNmax 20 --outFileNamePrefix $NEW_FILE --outSAMprimaryFlag AllBestScore --outSAMtype BAM SortedByCoordinate --quantMode TranscriptomeSAM --twopassMode Basic --limitGenomeGenerateRAM 50000000000 --alignSJDBoverhangMin 1
We got very low mapping rates (only around 30-60% of the reads were mapped).
When we asked the sequencing centre whether they could explain the problem(s), we were told that they can't reproduce it.
They sent us a list of their own mapping results, which range between 75% and 95%.
It turns out they are using the CLC Genomics Workbench to map the reads, with the following parameters:
Mismatch cost: 2; Insertion cost: 3; Deletion cost: 3; Length fraction: 0.5; Similarity fraction: 0.8
I was wondering whether it even makes sense to map a data set with such parameters. IMHO the length fraction and the similarity fraction allow for a very high error rate: only 50% of the read has to align, and within that 50% only 80% similarity is required. For our 100-base reads, this means that up to 60 bases of a read may not match the reference.
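To make the arithmetic explicit, here is a minimal Python sketch. It assumes the interpretation described above: the length fraction is the minimum share of the read that must align, and the similarity fraction is the minimum identity within that aligned part. The function name and its rounding are my own illustration, not anything taken from CLC.

```python
import math

def min_matching_bases(read_length, length_fraction, similarity_fraction):
    """Return (aligned, matching, not_required_to_match) base counts
    for a read that only just passes both thresholds."""
    # Smallest aligned stretch that satisfies the length fraction.
    # round() guards against floating-point noise before taking the ceiling.
    aligned = math.ceil(round(read_length * length_fraction, 9))
    # Smallest number of identical bases within that aligned stretch.
    matching = math.ceil(round(aligned * similarity_fraction, 9))
    # Everything else may be mismatched or left unaligned.
    return aligned, matching, read_length - matching

aligned, matching, loose = min_matching_bases(100, 0.5, 0.8)
print(f"aligned >= {aligned}, matching >= {matching}, may differ: {loose}")
# prints: aligned >= 50, matching >= 40, may differ: 60
```

So under these settings a 100-base read can be reported as mapped with as few as 40 bases matching the reference.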
I have tried to search for papers or more information from other users who have worked with the CLC GW before, but couldn't find much.
Do you think the CLC way of analysing the data is still good enough? Is the error rate not too high?
Also, the plot above shows data of very low quality - I would be highly suspicious of any tool (or settings) that produces high alignment rates on it.