Question: Approximately following the GATK recommended workflow for DNA sequence data, my mpileup output files are empty
0
gravatar for Joel Wallenius
29 days ago by
Sweden
Joel Wallenius10 wrote:

Hello!

I've googled for hours but haven't found a solution that worked for me. I have some raw human DNA paired-end data in fastq format, then I use these programs in this order:

Trimmomatic (trim the reads) BWA (map the reads to hg19) SAMtools (sorting and indexing) Picard (duplicate marking) GATK (indel realignment) GATK (base recalibration) SAMtools (mpileup) VarScan

Everything looks good until the mpileup step. The output files are empty. I did a flagstat on the bam files that I input to mpileup, and they look mostly like this:

2136672 + 0 in total (QC-passed reads + QC-failed reads)
12002 + 0 secondary
0 + 0 supplementary
1370901 + 0 duplicates
2131283 + 0 mapped (99.75% : N/A)
2124670 + 0 paired in sequencing
1062335 + 0 read1
1062335 + 0 read2
2032962 + 0 properly paired (95.68% : N/A)
2115336 + 0 with itself and mate mapped
3945 + 0 singletons (0.19% : N/A)
5880 + 0 with mate mapped to a different chr
3537 + 0 with mate mapped to a different chr (mapQ>=5)

This looks good right? What am I missing? I tried adding the -A flag to mpileup because that should include improperly mated read pairs (but judging by the flagstat output, I should not have to do this), it changes nothing.

Commands in order:

java -jar ${EBROOTTRIMMOMATIC}/trimmomatic-0.38.jar PE "$f1" "$f2" "TRIMMED/${sample}.fp.gz" "TRIMMED/${sample}.fu.gz" "TRIMMED/${sample}.rp.gz" "TRIMMED/${sample}.ru.gz" "$illuclip"

bwa mem -t 8 -M -R "@RG\tID:${sample}\tSM:${sample}\tPL:ILLUMINA" "$refGenome" "$read1" "$read2" | samtools view -Sb - > "MAPPED/${sample}.bam"

samtools sort "MAPPED/${sample}.bam" > "SORTED/${sample}.bam"

samtools index "SORTED/${sample}.bam" "SORTED/${sample}.bai"

java -jar "${EBROOTPICARD}/picard.jar" MarkDuplicates I="SORTED/${sample}.bam" O="DUPLICATES/MARKED/${sample}.bam" M="QC/PICARD/${sample}.txt"

samtools index "DUPLICATES/MARKED/${sample}.bam" "DUPLICATES/MARKED/${sample}.bai"

java -jar "$EBROOTGATK/GenomeAnalysisTK.jar" -T RealignerTargetCreator -R "$refGenome" "${knownI[@]}" -I "$i" --filter_reads_with_N_cigar -o "$t"

java -jar "$EBROOTGATK/GenomeAnalysisTK.jar" -T IndelRealigner -R "$refGenome" "${knownI[@]}" -I "$i" -targetIntervals "$t" -o "INDELSREALIGNED/${sample}.bam"

samtools index "INDELSREALIGNED/${sample}.bam" "INDELSREALIGNED/${sample}.bai"

java -jar "$EBROOTGATK/GenomeAnalysisTK.jar" -T BaseRecalibrator -R "$refGenome" "${knownS[@]}" -I "$i" -o "$o"

java -jar "$EBROOTGATK/GenomeAnalysisTK.jar" -T BaseRecalibrator -R "$refGenome" "${knownS[@]}" -I "$i" -BQSR "$o" -o "$p"

 java -jar "$EBROOTGATK/GenomeAnalysisTK.jar" -T AnalyzeCovariates -R "$refGenome" -before "$o" -after "$p" -plots "BASERECALIBRATED/${sample}.pdf"

java -jar "$EBROOTGATK/GenomeAnalysisTK.jar" -T PrintReads -R "$refGenome" -I "$i" -BQSR "$o" -o "BASERECALIBRATED/${sample}.bam"    

samtools mpileup -A -B -d 1000 -f "$refGenome" -o "PILEDUP/${sample}.mup" "BASERECALIBRATED/${sample}.bam"

java -Xmx3G -jar "${EBROOTVARSCAN}/VarScan.v2.4.1.jar" mpileup2snp "PILEDUP/${sample}.mup" --min-var-freq 0.001 --strand-filter 0 --p-value 0.05 > "VARCALLS/$sample.snp"

I'm newly graduated so don't have much experience. I just like to program, heh.

Big thanks in advance!!

sequencing snp next-gen genome • 175 views
ADD COMMENTlink modified 28 days ago • written 29 days ago by Joel Wallenius10

Hi Joel Wallenius, show the relevant command please ( Brief Reminder On How To Ask A Good Question )

ADD REPLYlink modified 29 days ago • written 29 days ago by ATpoint15k

Changed the title and added command!

ADD REPLYlink written 29 days ago by Joel Wallenius10

GATK (indel realignment) GATK (base recalibration) SAMtools (mpileup) VarScan

That doesn't look like the (current?) GATK best practices to me.

ADD REPLYlink written 29 days ago by WouterDeCoster38k

Changed the title and added command!

ADD REPLYlink written 29 days ago by Joel Wallenius10

It will be good if you are able to paste all your commands as we can see the parameters used. what are the commands for Varscan ?

ADD REPLYlink written 29 days ago by Nandini780
1

Also, generally, it is not a great idea to mix and match up these programs that were developed by independent groups. If you want to use GATK, then follow their best practices guide and use HaplotypeCaller to call variants. As far as I am aware, the indel realignment step is neither now necessary as a separate step in the GATK 4 release - you should check.

As an example of where this may go wrong, GATK does base recalibration separately, but SAMtools can also do it during the mpileup command. One should not be doing it twice (?).

ADD REPLYlink modified 29 days ago • written 29 days ago by Kevin Blighe41k

The super computer I'm working with does not support the latest GATK so I have to use 3.6. The best practices guide includes the indel realignment and base recalibrations as far as I can see on their web page. Unless SAMtools mpileup performs the recalibration automatically without being asked to, I am surely not doing it twice. Advice noted though, and thank you!

ADD REPLYlink written 29 days ago by Joel Wallenius10

conda is your friend. https://anaconda.org/bioconda/gatk4

Or what do you mean by "does not support"?

ADD REPLYlink written 29 days ago by ATpoint15k

They use a module system, you load modules more or less like you'd activate virtual envs. The latest GATK module is 3.7 or something. I could maybe ask the admins to install the latest one but that would take a while, probably. How much do I stand to gain in practice from an update? The problem I'm having is with mpileup, after all (though I guess maybe mpileup misbehaves because of the output from GATK's baserecalibration?).

ADD REPLYlink written 29 days ago by Joel Wallenius10

Yes, modules are common on HPCs but you have a $HOME folder, no? There you can install whatever you like locally and link it to your $PATH. conda is a package manager to do exactly that.

ADD REPLYlink written 29 days ago by ATpoint15k

Yes true but this script is supposed to be used by the rest of my research group and they aren't very savvy, if you know what I mean. I'd like to make it as easy as possible for them, and installing a bunch of packages as a first step is... well, it's not a priority right now at least. Perhaps in a few weeks if I don't receive new tasks...

ADD REPLYlink written 29 days ago by Joel Wallenius10

To avoid to install tools locally in your home (and also to avoid that your colleagues install themselves tools in their home) you may try to build a singularity image with all the tools you want. Then share the image with your colleagues. I'm sure your HPC has a module to use singularity images.

ADD REPLYlink written 29 days ago by Nicolas Rosewick7.5k

The commands for VarScan do not matter as it's a later step. I will paste the commands tomorrow when I'm at work again!

ADD REPLYlink written 29 days ago by Joel Wallenius10

Added all the commands in order!

ADD REPLYlink written 29 days ago by Joel Wallenius10

How was the library done? Is this amplicon based? I'm asking because you have a very high duplication rate and in case of amplicon based data, you cannot mark duplicates.

ADD REPLYlink written 29 days ago by finswimmer11k

I'm not sure actually, I only know that the input at the start of my script is in .fastq.gz format, that the reads are paired, and produced by an Illumina machine. FWIW, all the reads and all the samples have the same adapter sequences. Does that supply enough information?

ADD REPLYlink written 29 days ago by Joel Wallenius10

So is it WGS or WES or targeted ?

ADD REPLYlink modified 29 days ago • written 29 days ago by Nandini780

Your question gave me an idea. I am using a file with targets in a BED-format, and the chromosome names in that file were like 1,2,3... rather than chr1, chr2, and so on... that's why the output was empty. :[ The shame is real. At least the problem is solved now...

ADD REPLYlink written 29 days ago by Joel Wallenius10
0
gravatar for Joel Wallenius
28 days ago by
Sweden
Joel Wallenius10 wrote:

SOLVED! ( Sorry I couldn't write this earlier, my post limit had been reached )

The problem was that my targets file (BED-format) used chromosome names 1,2,3,4,5,6..., while my alignment files used names like chr1,ch2,chr3...

Big thanks to everyone for you patience. I'll remember your advice!

ADD COMMENTlink written 28 days ago by Joel Wallenius10
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1775 users visited in the last hour