Question: Base Recabiliration Filters out 88% of the reads
0
gravatar for Sharon
15 months ago by
Sharon420
Sharon420 wrote:

Hi All

When using this command:

java -jar ${GATK}/GenomeAnalysisTK.jar -T SplitNCigarReads -R ${hg38}.fasta -I ${WHERE}/${CURRENT}.dedupped.bam -o ${WHERE}/${CURRENT}.split.bam \
 -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U ALLOW_N_CIGAR_READS

The output is as follows:

INFO 18:47:27,384 ProgressMeter - done 1.42375677E8
4.4 h 111.0 s 100.0% 4.4 h 0.0 s INFO 18:47:27,384 ProgressMeter - Total runtime 15909.94 secs, 265.17 min, 4.42 hours INFO 18:47:27,394 MicroScheduler - 0 reads were filtered out during the traversal out of approximately 142375677 total reads

(0.00%) INFO 18:47:27,395 MicroScheduler - -> 0 reads (0.00% of

total) failing BadCigarFilter INFO 18:47:27,395 MicroScheduler -

-> 0 reads (0.00% of total) failing MalformedReadFilter INFO 18:47:27,396 MicroScheduler - -> 0 reads (0.00% of total) failing ReassignOneMappingQualityFilter

But when I do base recabliration:

         java -jar ${GATK}/GenomeAnalysisTK.jar \
        -T BaseRecalibrator \
        -R ${hg38}.fasta \
        -I ${WHERE}/${CURRENT}.split.bam \
        -knownSites ${DBSNP} \
        -knownSites ${GOLDINDELS} \
        -o ${WHERE}/${CURRENT}.recal_data.table

I got:

INFO 12:02:24,217 BaseRecalibrator - ...done! INFO 12:02:24,218 BaseRecalibrator - BaseRecalibrator was able to recalibrate 21765617 reads INFO 12:02:24,219 ProgressMeter - done
2.1765706E7 94.9 m 4.4 m 100.0% 94.9 m 0.0 s INFO 12:02:24,220 ProgressMeter - Total runtime 5693.92 secs, 94.90 min, 1.58 hours INFO 12:02:24,220 MicroScheduler - 160165683 reads > were filtered out during the traversal out of approximately 181931389 > total reads (88.04%) INFO 12:02:24,221 MicroScheduler - -> 6632 reads (0.00% of total) failing BadCigarFilter INFO 12:02:24,221 MicroScheduler - -> 72386752 reads (39.79% of total) failing DuplicateReadFilter INFO 12:02:24,223 MicroScheduler - -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter INFO 12:02:24,223 MicroScheduler - -> 0 reads (0.00% of total) failing MalformedReadFilter INFO 12:02:24,223 MicroScheduler - -> 0 reads (0.00% of total) failing MappingQualityUnavailableFilter INFO 12:02:24,223 MicroScheduler - -> 63827082 reads (35.08% of total) failing MappingQualityZeroFilter INFO 12:02:24,224 MicroScheduler - -> 23945217 reads (13.16% of total) failing NotPrimaryAlignmentFilter INFO 12:02:24,225 MicroScheduler - -> 0 reads (0.00% of total) failing UnmappedReadFilter

What makes >88% of reads filtered out, if they were not filtered before? and how should I fix this?

Thanks

rna-seq variant calling • 774 views
ADD COMMENTlink modified 13 months ago by Biostar ♦♦ 20 • written 15 months ago by Sharon420

lookls like SplitNCigarReads produces malformed reads. what's your version of gatk ?

ADD REPLYlink written 15 months ago by Pierre Lindenbaum120k

GATK3. I think we have 0 reads (0.00% of total) failing MalformedReadFilter before and after the SplitNCigarRead.

ADD REPLYlink modified 15 months ago • written 15 months ago by Sharon420

what the major-minor version of gatk please . GATK3.* ?

ADD REPLYlink written 15 months ago by Pierre Lindenbaum120k

I don't know how to check this Pierre Lindenbaum ?

ADD REPLYlink written 15 months ago by Sharon420
1
gravatar for andrew.j.skelton73
15 months ago by
London
andrew.j.skelton735.7k wrote:

This is from an RNA Seq experiment right, and you're attempting to call variants? AFAIK, SplitNCigarReads corrects the CIGAR string when reads span splice junctions, and forces filters for MalformedReadFilter and ReassignOneMappingQualityFilter, of which all of your reads seems to have passed. When you're running BQSR, you're losing a lot of reads from Duplicates (expected as you have a quantitive experiment). The major concern for me here is the MappingQualityUnavailableFilter, how did you perform your alignments?

ADD COMMENTlink written 15 months ago by andrew.j.skelton735.7k

I used STAR for alignment, same as GATK pipeline for variant calling. So does this mean there is too many reads that are mapped but with low quality? But if yes, how can I fix. I think the default parameters of STAR are relaxed enough, making them more stringent will lead to more unmapped. Or I am thinking in a wrong direction? Thanks a lot.

ADD REPLYlink modified 15 months ago • written 15 months ago by Sharon420

The issues are these:

72386752 reads (39.79% of total) failing DuplicateReadFilter

Too many duplicate reads

--------------------------

63827082 reads (35.08% of total) failing MappingQualityZeroFilter

Too many reads with MAPQ of nil / zero. You'll have to see what MAPQ=0 means according to STAR

--------------------------

23945217 reads (13.16% of total) failing NotPrimaryAlignmentFilter

Too many secondary reads

ADD REPLYlink written 15 months ago by Kevin Blighe42k

Thanks alot, but is there a way to fix this? How to fix multimapping, duplicates, and secondary reads. Those are related to the orignal reads right?

ADD REPLYlink modified 15 months ago • written 15 months ago by Sharon420

It may help to explain the nature of the experiment - is it RNA-seq variant calling, like Andrew said?. If it is indeed RNA-seq, then many duplicates are expected.

Are you following step-by-step the pipeline by the GATK: https://gatkforums.broadinstitute.org/gatk/discussion/3892/the-gatk-best-practices-for-variant-calling-on-rnaseq-in-full-detail ?

ADD REPLYlink written 15 months ago by Kevin Blighe42k

Modified:mYes, following it step by step. I think those issues are related to the quality of the original reads? Also the link you posted could be a reason. So my question is it safe to ignore the base recalibration step? and move with the bam file after splitncigar? The base recabilration is discarding 88% of the reads !

ADD REPLYlink modified 15 months ago • written 15 months ago by Sharon420

Do you have stats using FASTQC on your raw FASTQ reads? 5PM here - now going home!

ADD REPLYlink written 15 months ago by Kevin Blighe42k

I do apologize Kevin, I feel I pushed you, really did not mean to, so sorry.
Here is a snapshot of the quality, just in case you get a chance :) It is not terrible to the extent to get 88% discarded.

https://ibb.co/iSTNBS

ADD REPLYlink written 15 months ago by Sharon420
1

Now safe in the comfort of my overpriced home. Nothing out of the ordinary on the FASTQC image. My own opinion is that the base quality recalibration step is not needed at all. Even on the page where the pipeline flow diagram is located, they say:

For one thing, these recommendations are based on high quality RNA-seq data (30 million 75bp paired-end reads produced on Illumina HiSeq). Other types of data might need slightly different processing. In addition, we have currently worked only on data from one tissue from one individual. Once we’ve had the opportunity to get more experience with different types (and larger amounts) of data, we will update these recommendations to be more comprehensive.

For me, that basically means that modifications are expected.

ADD REPLYlink written 15 months ago by Kevin Blighe42k
1

I would also look up the default parameter settings for the base recalibration step in order to see what could have been resulting in the filtering out of so many reads.

ADD REPLYlink written 15 months ago by Kevin Blighe42k

Thanks Kevin so much. Much appreciated :)

ADD REPLYlink written 15 months ago by Sharon420

Thanks a lot andrew.j.skelton73 :)

ADD REPLYlink written 15 months ago by Sharon420
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 651 users visited in the last hour