Question: Weird BWA MEM Behavior: Zero-Length Read
1
gravatar for aostern
14 months ago by
aostern30
aostern30 wrote:

I tried to use BWA MEM to map reads from an interleaved FASTQ.

fastq="all.fastq"
fasta="/share/PI/apps/bcbio/genomes/Hsapiens/GRCh37/seq/GRCh37.fa"
bwa="/share/PI/apps/bcbio/anaconda/bin/bwa"
nThreads="12"

#Run BWA MEM
#IMPORTANT: NEED -p since "$fastq" is an interleaved fastq
readGroup="@RG\tID:CHM1\tSM:CHM1\tPL:Illumina"
sam="CHM1.sam"
"$bwa" mem -R "$readGroup" -t "$nThreads" -p "$fasta" "$fastq" -o "$sam"

(The FASTQs are CHM1; I used prefetch to fetch .sra files from three different runs from NCBI, then used fastq-dump to convert the SRAs to FASTQs, then cated them all together into one FASTQ.)

The SAM is 515Gb but has no obvious problems. samtools quickcheck says it's valid. But when I run GATK4's FixMateInformation or ValidateSamFile, I get output like this

ERROR: Record 1, Read name ######################################################################################################################################################################################################, Zero-length read without FZ, CS or CQ tag
WARNING: Record 1, Read name ######################################################################################################################################################################################################, QUAL field is set to * (unspecified quality scores), this is allowed by the SAM specification but many tools expect reads to include qualities 
ERROR: Record 421522661, Read name ######################################################################################################################################################################################################, Zero-length read without FZ, CS or CQ tag

There may be even more errors, but this is what I got after two hours.

I can, in fact, see that the first line in the SAM file is

######################################################################################################################################################################################################  4       *       0       0       *       *       0       0       *       *       AS:i:0  XS:i:0  RG:Z:CHM1

Is this SAM really invalid? Or is there something I need to do so GATK4 will accept it?

ADD COMMENTlink modified 14 months ago by h.mon27k • written 14 months ago by aostern30

Your sam file is invalid.

How was your fastq-dump command-line?

The SAM is 515Gb

How large are the concatenated fastqs? What is the output of:

head all.fastq
tail all.fastq
ADD REPLYlink written 14 months ago by h.mon27k

all.fastq is 342G. For each of three .sra files, I did fastq-dump SRR1514950.sra > SRR1514950.fastq

head all.fastq:

Read 216120040 spots for SRR1514950.sra
Written 216120040 spots for SRR1514950.sra
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+SRR1514950.1 131213_SN711_0409_AC32B3ACXX:6:1101:1206:2058:0 length=202
<#4@######################################################################################################################################################################################################
@SRR1514950.2 131213_SN711_0409_AC32B3ACXX:6:1101:1191:2090:0 length=202
TGAGGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+SRR1514950.2 131213_SN711_0409_AC32B3ACXX:6:1101:1191:2090:0 length=202
(89?######################################################################################################################################################################################################
@SRR1514950.3 131213_SN711_0409_AC32B3ACXX:6:1101:1227:2111:0 length=202

tail all.fastq

+SRR1514952.214054906 131213_SN711_0409_AC32B3ACXX:8:2316:21322:100963:1 length=202
#+1AD=DDGFFHGEIJIJIIIIIIIIIJBGG>HEHGHBFHGGIDHHDGHIEHGGEE@=FGIHDHFDFEED@AB'82=:>>@9CB7<99@9@>BBDDD####B<+ADFF>AFHFFHGGIGGJDCB;EEE;;?BB1?;0BD?;BFFG*09?=;(7=(.@E;.=E;EFFFEDEFCD?2@C2=(;A39>:95(:?:44:(:>C4:>
@SRR1514952.214054907 131213_SN711_0409_AC32B3ACXX:8:2316:21296:100963:1 length=202
NAAGTAAAGTATGCTTAGAGGAGGGCCAAGCAGGCAACAGGAGAGATCCAAGTGCCTTGACGAATTATATCATGTCATATTTTTATTCATTAGCATAGTTTAACTGAAAAGAATGCAGATAGACAATTAAACACGTTTGAAAAGTGATGCATGAACAAAGTGAGGTAAAACTATAAAAAAAAAAAAACCTTTCACATAAAAA
+SRR1514952.214054907 131213_SN711_0409_AC32B3ACXX:8:2316:21296:100963:1 length=202
#1:A=ADDDADFFEHGGHHGIGCDEGGEHHIDIFIHGI@?DHIGI?GFEHIGIEF@:CDE@BE:?=DBCCCC>366@>>@DDECBCCD;CDD:?:@@:A:A@@@DFAFFDFD?HFGHEBFIHCHIEGGHHGIEHIE9FDCDHE999BBHHHDGFCGICBGIIIIA@@FE@DGHHJCEHGFCB@BD#################
@SRR1514952.214054908 131213_SN711_0409_AC32B3ACXX:8:2316:21349:100963:0 length=202
NTTGACTTCTTTAGCAAAACAATACAGTAAGAGGGCCCTTTTCTCAATGAATGATAGGCTAACACAACAGAGATCTTTCAGTTAAAGCAGAATCTTAAGTAACTCGTCCTCATGGCACAGTCCCTCATGGCTTCCTTTGGCTAGGGGAGGGAGTAGACTGACCCCTTGCGCTTCCCAGGTGAGGAAACGCCACACTCTGCTT
+SRR1514952.214054908 131213_SN711_0409_AC32B3ACXX:8:2316:21349:100963:0 length=202
#(0((2.;>?88><??:<>?9?;>??@<>;@;?5//==??>8?;>???>???=??=86=?@@>5;>=??@8>73=399===9?=>8(==?###########?@+=BD1?BHD?DF>GB9?<FGHGGC???AEGHIGBF<3:BG38)).7<F;@#################################################
ADD REPLYlink modified 14 months ago • written 14 months ago by aostern30

Your fastq is corrupted, it should start with a header (like @SRR1514950.2 131213_SN711_0409_AC32B3ACXX:6:1101:1191:2090:0 length=202), but it starts with:

Read 216120040 spots for SRR1514950.sra

Written 216120040 spots for SRR1514950.sra

And the first sequence is missing its header, which should be @SRR1514950.1 131213_SN711_0409_AC32B3ACXX:6:1101:1206:2058:0 length=202.

Did you use nohup with fastq-dump?

ADD REPLYlink modified 14 months ago • written 14 months ago by h.mon27k

Huh, you're right. I guess I'm somehow using fastq-dump wrong. All my FASTQs start like that.

head *.fastq:

==> SRR1514950.fastq <==
Read 216120040 spots for SRR1514950.sra
Written 216120040 spots for SRR1514950.sra
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+SRR1514950.1 131213_SN711_0409_AC32B3ACXX:6:1101:1206:2058:0 length=202
<#4@######################################################################################################################################################################################################
@SRR1514950.2 131213_SN711_0409_AC32B3ACXX:6:1101:1191:2090:0 length=202
TGAGGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+SRR1514950.2 131213_SN711_0409_AC32B3ACXX:6:1101:1191:2090:0 length=202
(89?######################################################################################################################################################################################################
@SRR1514950.3 131213_SN711_0409_AC32B3ACXX:6:1101:1227:2111:0 length=202

==> SRR1514951.fastq <==
Read 213892051 spots for SRR1514951.sra
Written 213892051 spots for SRR1514951.sra
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+SRR1514951.1 131213_SN711_0409_AC32B3ACXX:7:1101:1242:2066:0 length=202
<<<@######################################################################################################################################################################################################
@SRR1514951.2 131213_SN711_0409_AC32B3ACXX:7:1101:1200:2068:0 length=202
TGTANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+SRR1514951.2 131213_SN711_0409_AC32B3ACXX:7:1101:1200:2068:0 length=202
<<<@######################################################################################################################################################################################################
@SRR1514951.3 131213_SN711_0409_AC32B3ACXX:7:1101:1183:2095:0 length=202

==> SRR1514952.fastq <==
Read 214054908 spots for SRR1514952.sra
Written 214054908 spots for SRR1514952.sra
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+SRR1514952.1 131213_SN711_0409_AC32B3ACXX:8:1101:1190:2056:0 length=202
##########################################################################################################################################################################################################
@SRR1514952.2 131213_SN711_0409_AC32B3ACXX:8:1101:1241:2082:0 length=202
CAATNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNACNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+SRR1514952.2 131213_SN711_0409_AC32B3ACXX:8:1101:1241:2082:0 length=202
<<<@######################################################################################################################################################################################################
ADD REPLYlink modified 14 months ago • written 14 months ago by aostern30

Do you think there's a way to salvage the SAM I have? Or do I need to fix the FASTQs and then realign them?

ADD REPLYlink written 14 months ago by aostern30

I would play it safe and fix the fastq and re-aling them. You don't know how the broken headers altered bwa interpretation of the file.

ADD REPLYlink written 14 months ago by h.mon27k
2
gravatar for h.mon
14 months ago by
h.mon27k
Brazil
h.mon27k wrote:

fastq-dump will create file names and output reads to those files, so don't use redirection to stdout (>) - if you want to use it, then you have to use -Z:

  -Z|--stdout                      Output to stdout, all split data become 
                                   joined into single stream

There is a working example of fastq-dump for the very same dataset you are using here:

Aligning paired end fastq files dumped from SRA

fastq-dump --origfmt -I --split-files --gzip SRR1514950.sra

This will output two files one for R1 and other for R2, so you will need to modify your bwa mem command.

ADD COMMENTlink modified 14 months ago • written 14 months ago by h.mon27k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2409 users visited in the last hour