I tried to use BWA MEM to map reads from an interleaved FASTQ.
fastq="all.fastq"
fasta="/share/PI/apps/bcbio/genomes/Hsapiens/GRCh37/seq/GRCh37.fa"
bwa="/share/PI/apps/bcbio/anaconda/bin/bwa"
nThreads="12"
#Run BWA MEM
#IMPORTANT: NEED -p since "$fastq" is an interleaved fastq
readGroup="@RG\tID:CHM1\tSM:CHM1\tPL:Illumina"
sam="CHM1.sam"
"$bwa" mem -R "$readGroup" -t "$nThreads" -p "$fasta" "$fastq" -o "$sam"
(The FASTQs are CHM1; I used prefetch to fetch .sra files from three different runs from NCBI, then used fastq-dump to convert the SRAs to FASTQs, then cated them all together into one FASTQ.)
The SAM is 515Gb but has no obvious problems. samtools quickcheck says it's valid. But when I run GATK4's FixMateInformation or ValidateSamFile, I get output like this
ERROR: Record 1, Read name ######################################################################################################################################################################################################, Zero-length read without FZ, CS or CQ tag
WARNING: Record 1, Read name ######################################################################################################################################################################################################, QUAL field is set to * (unspecified quality scores), this is allowed by the SAM specification but many tools expect reads to include qualities
ERROR: Record 421522661, Read name ######################################################################################################################################################################################################, Zero-length read without FZ, CS or CQ tag
There may be even more errors, but this is what I got after two hours.
I can, in fact, see that the first line in the SAM file is
###################################################################################################################################################################################################### 4 * 0 0 * * 0 0 * * AS:i:0 XS:i:0 RG:Z:CHM1
Is this SAM really invalid? Or is there something I need to do so GATK4 will accept it?
Your sam file is invalid.
How was your
fastq-dumpcommand-line?How large are the concatenated fastqs? What is the output of:
all.fastqis 342G. For each of three.srafiles, I didfastq-dump SRR1514950.sra > SRR1514950.fastqhead all.fastq:tail all.fastqYour fastq is corrupted, it should start with a header (like
@SRR1514950.2 131213_SN711_0409_AC32B3ACXX:6:1101:1191:2090:0 length=202), but it starts with:And the first sequence is missing its header, which should be
@SRR1514950.1 131213_SN711_0409_AC32B3ACXX:6:1101:1206:2058:0 length=202.Did you use
nohupwithfastq-dump?Huh, you're right. I guess I'm somehow using
fastq-dumpwrong. All my FASTQs start like that.head *.fastq:Do you think there's a way to salvage the SAM I have? Or do I need to fix the FASTQs and then realign them?
I would play it safe and fix the fastq and re-aling them. You don't know how the broken headers altered bwa interpretation of the file.