Weird BWA MEM Behavior: Zero-Length Read
1
1
Entering edit mode
5.7 years ago
aostern ▴ 30

I tried to use BWA MEM to map reads from an interleaved FASTQ.

fastq="all.fastq"
fasta="/share/PI/apps/bcbio/genomes/Hsapiens/GRCh37/seq/GRCh37.fa"
bwa="/share/PI/apps/bcbio/anaconda/bin/bwa"
nThreads="12"

#Run BWA MEM
#IMPORTANT: NEED -p since "$fastq" is an interleaved fastq
readGroup="@RG\tID:CHM1\tSM:CHM1\tPL:Illumina"
sam="CHM1.sam"
"$bwa" mem -R "$readGroup" -t "$nThreads" -p "$fasta" "$fastq" -o "$sam"

(The FASTQs are CHM1; I used prefetch to fetch .sra files from three different runs from NCBI, then used fastq-dump to convert the SRAs to FASTQs, then cated them all together into one FASTQ.)

The SAM is 515Gb but has no obvious problems. samtools quickcheck says it's valid. But when I run GATK4's FixMateInformation or ValidateSamFile, I get output like this

ERROR: Record 1, Read name ######################################################################################################################################################################################################, Zero-length read without FZ, CS or CQ tag
WARNING: Record 1, Read name ######################################################################################################################################################################################################, QUAL field is set to * (unspecified quality scores), this is allowed by the SAM specification but many tools expect reads to include qualities 
ERROR: Record 421522661, Read name ######################################################################################################################################################################################################, Zero-length read without FZ, CS or CQ tag

There may be even more errors, but this is what I got after two hours.

I can, in fact, see that the first line in the SAM file is

######################################################################################################################################################################################################  4       *       0       0       *       *       0       0       *       *       AS:i:0  XS:i:0  RG:Z:CHM1

Is this SAM really invalid? Or is there something I need to do so GATK4 will accept it?

alignment genome sequencing software error • 2.2k views
ADD COMMENT
0
Entering edit mode

Your sam file is invalid.

How was your fastq-dump command-line?

The SAM is 515Gb

How large are the concatenated fastqs? What is the output of:

head all.fastq
tail all.fastq
ADD REPLY
0
Entering edit mode

all.fastq is 342G. For each of three .sra files, I did fastq-dump SRR1514950.sra > SRR1514950.fastq

head all.fastq:

Read 216120040 spots for SRR1514950.sra
Written 216120040 spots for SRR1514950.sra
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+SRR1514950.1 131213_SN711_0409_AC32B3ACXX:6:1101:1206:2058:0 length=202
<#4@######################################################################################################################################################################################################
@SRR1514950.2 131213_SN711_0409_AC32B3ACXX:6:1101:1191:2090:0 length=202
TGAGGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+SRR1514950.2 131213_SN711_0409_AC32B3ACXX:6:1101:1191:2090:0 length=202
(89?######################################################################################################################################################################################################
@SRR1514950.3 131213_SN711_0409_AC32B3ACXX:6:1101:1227:2111:0 length=202

tail all.fastq

+SRR1514952.214054906 131213_SN711_0409_AC32B3ACXX:8:2316:21322:100963:1 length=202
#+1AD=DDGFFHGEIJIJIIIIIIIIIJBGG>HEHGHBFHGGIDHHDGHIEHGGEE@=FGIHDHFDFEED@AB'82=:>>@9CB7<99@9@>BBDDD####B<+ADFF>AFHFFHGGIGGJDCB;EEE;;?BB1?;0BD?;BFFG*09?=;(7=(.@E;.=E;EFFFEDEFCD?2@C2=(;A39>:95(:?:44:(:>C4:>
@SRR1514952.214054907 131213_SN711_0409_AC32B3ACXX:8:2316:21296:100963:1 length=202
NAAGTAAAGTATGCTTAGAGGAGGGCCAAGCAGGCAACAGGAGAGATCCAAGTGCCTTGACGAATTATATCATGTCATATTTTTATTCATTAGCATAGTTTAACTGAAAAGAATGCAGATAGACAATTAAACACGTTTGAAAAGTGATGCATGAACAAAGTGAGGTAAAACTATAAAAAAAAAAAAACCTTTCACATAAAAA
+SRR1514952.214054907 131213_SN711_0409_AC32B3ACXX:8:2316:21296:100963:1 length=202
#1:A=ADDDADFFEHGGHHGIGCDEGGEHHIDIFIHGI@?DHIGI?GFEHIGIEF@:CDE@BE:?=DBCCCC>366@>>@DDECBCCD;CDD:?:@@:A:A@@@DFAFFDFD?HFGHEBFIHCHIEGGHHGIEHIE9FDCDHE999BBHHHDGFCGICBGIIIIA@@FE@DGHHJCEHGFCB@BD#################
@SRR1514952.214054908 131213_SN711_0409_AC32B3ACXX:8:2316:21349:100963:0 length=202
NTTGACTTCTTTAGCAAAACAATACAGTAAGAGGGCCCTTTTCTCAATGAATGATAGGCTAACACAACAGAGATCTTTCAGTTAAAGCAGAATCTTAAGTAACTCGTCCTCATGGCACAGTCCCTCATGGCTTCCTTTGGCTAGGGGAGGGAGTAGACTGACCCCTTGCGCTTCCCAGGTGAGGAAACGCCACACTCTGCTT
+SRR1514952.214054908 131213_SN711_0409_AC32B3ACXX:8:2316:21349:100963:0 length=202
#(0((2.;>?88><??:<>?9?;>??@<>;@;?5//==??>8?;>???>???=??=86=?@@>5;>=??@8>73=399===9?=>8(==?###########?@+=BD1?BHD?DF>GB9?<FGHGGC???AEGHIGBF<3:BG38)).7<F;@#################################################
ADD REPLY
0
Entering edit mode

Your fastq is corrupted, it should start with a header (like @SRR1514950.2 131213_SN711_0409_AC32B3ACXX:6:1101:1191:2090:0 length=202), but it starts with:

Read 216120040 spots for SRR1514950.sra

Written 216120040 spots for SRR1514950.sra

And the first sequence is missing its header, which should be @SRR1514950.1 131213_SN711_0409_AC32B3ACXX:6:1101:1206:2058:0 length=202.

Did you use nohup with fastq-dump?

ADD REPLY
0
Entering edit mode

Huh, you're right. I guess I'm somehow using fastq-dump wrong. All my FASTQs start like that.

head *.fastq:

==> SRR1514950.fastq <==
Read 216120040 spots for SRR1514950.sra
Written 216120040 spots for SRR1514950.sra
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+SRR1514950.1 131213_SN711_0409_AC32B3ACXX:6:1101:1206:2058:0 length=202
<#4@######################################################################################################################################################################################################
@SRR1514950.2 131213_SN711_0409_AC32B3ACXX:6:1101:1191:2090:0 length=202
TGAGGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+SRR1514950.2 131213_SN711_0409_AC32B3ACXX:6:1101:1191:2090:0 length=202
(89?######################################################################################################################################################################################################
@SRR1514950.3 131213_SN711_0409_AC32B3ACXX:6:1101:1227:2111:0 length=202

==> SRR1514951.fastq <==
Read 213892051 spots for SRR1514951.sra
Written 213892051 spots for SRR1514951.sra
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+SRR1514951.1 131213_SN711_0409_AC32B3ACXX:7:1101:1242:2066:0 length=202
<<<@######################################################################################################################################################################################################
@SRR1514951.2 131213_SN711_0409_AC32B3ACXX:7:1101:1200:2068:0 length=202
TGTANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+SRR1514951.2 131213_SN711_0409_AC32B3ACXX:7:1101:1200:2068:0 length=202
<<<@######################################################################################################################################################################################################
@SRR1514951.3 131213_SN711_0409_AC32B3ACXX:7:1101:1183:2095:0 length=202

==> SRR1514952.fastq <==
Read 214054908 spots for SRR1514952.sra
Written 214054908 spots for SRR1514952.sra
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+SRR1514952.1 131213_SN711_0409_AC32B3ACXX:8:1101:1190:2056:0 length=202
##########################################################################################################################################################################################################
@SRR1514952.2 131213_SN711_0409_AC32B3ACXX:8:1101:1241:2082:0 length=202
CAATNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNACNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+SRR1514952.2 131213_SN711_0409_AC32B3ACXX:8:1101:1241:2082:0 length=202
<<<@######################################################################################################################################################################################################
ADD REPLY
0
Entering edit mode

Do you think there's a way to salvage the SAM I have? Or do I need to fix the FASTQs and then realign them?

ADD REPLY
0
Entering edit mode

I would play it safe and fix the fastq and re-aling them. You don't know how the broken headers altered bwa interpretation of the file.

ADD REPLY
2
Entering edit mode
5.7 years ago
h.mon 35k

fastq-dump will create file names and output reads to those files, so don't use redirection to stdout (>) - if you want to use it, then you have to use -Z:

  -Z|--stdout                      Output to stdout, all split data become 
                                   joined into single stream

There is a working example of fastq-dump for the very same dataset you are using here:

Aligning paired end fastq files dumped from SRA

fastq-dump --origfmt -I --split-files --gzip SRR1514950.sra

This will output two files one for R1 and other for R2, so you will need to modify your bwa mem command.

ADD COMMENT

Login before adding your answer.

Traffic: 3001 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6