Question: adding pair info to fastq files, for BWA
0
gravatar for sullis02
9 days ago by
sullis020
sullis020 wrote:

I downloaded a file of Illumina paired reads from SRA. When split into _1 and _2 using the sratools fastq_dump --split-files, the fastq record IDs looks like this (I'm showing just the identifier lines of the first record in each file)

_1.fastq

    @SRR4734558.1 HWI-ST1117:138:C1HR1ACXX:5:1101:1451:1979 length=100


_2.fastq

    @SRR4734558.1 HWI-ST1117:138:C1HR1ACXX:5:1101:1451:1979 length=100

i.e., they're exactly the same in files _1 and _2. BWA-mem (v 0.7.15) is giving me error messages with these files, saying it can't find any FR pairs (and soon after, a core dump). This seems to be because there isn't any indication of '1' and '2' in the IDs. I tried adding '1:' and '2:' before the 'HWI' (using sed)

_1.fastq
@SRR4734558.1 1:HWI-ST1117:138:C1HR1ACXX:5:1101:1451:1979 length=100

_2.fastq
@SRR4734558.1 2:HWI-ST1117:138:C1HR1ACXX:5:1101:1451:1979 length=100

but BWA still didn't find FR pairs.

I also tried reducing the ID to a string with no space, and adding /1 and /2 to the end of the ID lines

_1.fastaq
@HWI-ST1117:138:C1HR1ACXX:5:1101:1451:1979/1
_2.fastq
@HWI-ST1117:138:C1HR1ACXX:5:1101:1451:1979/2

But BWA still does not see them as a FR pair. (It only sees reads as FF and RR in all of these cases)

So, what is the proper way to indicate '1' and '2' so that BWA still them as FR? Or is there something wrong with this version of BWA? (I am requesting our HPC IT to install the latest) . NB I confirmed that there are no duplicate IDs within either file.

sra bwa illumina • 125 views
ADD COMMENTlink modified 8 days ago • written 9 days ago by sullis020

I've tried (almost) everything you suggested (I reduced the fastq sizes down to 20000 lines, i.e., just the first 5000 fastq records, just to make testing faster). Nothing was recognized as FR. Here for example are three different ID header versions of the first record. The first one is the header generated by fastq-dump -F. The second and third versions are created from that by BBmap's reformat.sh by manipulating addslash and spaceslash parameters (There was no way I could find to generate "1:N" format.)

v1: @HWI-ST1117:138:C1HR1ACXX:5:1101:1451:197
v2: @HWI-ST1117:138:C1HR1ACXX:5:1101:1451:197/1
v3: @HWI-ST1117:138:C1HR1ACXX:5:1101:1451:197 /1


v1:@HWI-ST1117:138:C1HR1ACXX:5:1101:1451:197
v2:@HWI-ST1117:138:C1HR1ACXX:5:1101:1451:197/2
v3:@HWI-ST1117:138:C1HR1ACXX:5:1101:1451:197 /2

here is my typical command line

bwa mem -t 2 <my_indexed_assembly> <reads_1.fastq> <reads_2.fastq> > output.sam

I'm using the latest version of BWA (0.7.17)

run output regardless of ID formatting:

[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 10000 sequences (1000000 bp)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (726, 0, 0, 693)
[M::mem_pestat] analyzing insert size distribution for orientation FF...
[M::mem_pestat] (25, 50, 75) percentile: (9, 23, 42)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 108)
[M::mem_pestat] mean and std.dev: (27.86, 24.13)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 141)
[M::mem_pestat] skip orientation FR as there are not enough pairs
[M::mem_pestat] skip orientation RF as there are not enough pairs
[M::mem_pestat] analyzing insert size distribution for orientation RR...
[M::mem_pestat] (25, 50, 75) percentile: (9, 21, 46)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 120)
[M::mem_pestat] mean and std.dev: (29.52, 26.02)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 157)
[M::mem_process_seqs] Processed 10000 reads in 3.464 CPU sec, 1.733 real sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa mem -t 2 PGA_assembly_jelly origfmt_20k_1.slash.fastq origfmt_20k_2.slash.fastq

(I presume the 10,000 reads refers to the 5000 * 2?)

An output.sam file was indeed created, it's 4MB in size, beyond that I don't know whether it's 'correct' or not.

I confess I don't understand what BWA is looking for re: 'candidate unique pairs'.

ADD REPLYlink modified 8 days ago • written 8 days ago by sullis020

You missed this reformat.sh option :

addcolon=f              Append ' 1:' and ' 2:' to read names, if not already present.

Have you checked to make sure those reads are in sync in R1/R2 files?

Also try downloading the fastq files directly from ENA to see if they don't have this problem.

Note: Just to be sure. You are downloading a public dataset to align against your assembly of same (or similar genome)?

ADD REPLYlink modified 8 days ago • written 8 days ago by genomax65k

I also tried a completely different pair of fastq files, generated by our sequencing core (not downloaded), with headers like this

R1.fastq
@HWI-ST911:230:C4MVUACXX:2:1101:1483:2058 1:N:0:ACAGTG

R2.fastq
@HWI-ST911:230:C4MVUACXX:2:1101:1483:2058 2:N:0:ACAGTG

run:

[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 10000 sequences (1010000 bp)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (1758, 0, 0, 1687)
[M::mem_pestat] analyzing insert size distribution for orientation FF...
[M::mem_pestat] (25, 50, 75) percentile: (9, 23, 46)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 120)
[M::mem_pestat] mean and std.dev: (30.13, 26.54)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 157)
[M::mem_pestat] skip orientation FR as there are not enough pairs
[M::mem_pestat] skip orientation RF as there are not enough pairs
[M::mem_pestat] analyzing insert size distribution for orientation RR...
[M::mem_pestat] (25, 50, 75) percentile: (10, 22, 44)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 112)
[M::mem_pestat] mean and std.dev: (28.84, 25.18)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 146)
[M::mem_process_seqs] Processed 10000 reads in 1.308 CPU sec, 0.655 real sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa mem -t 2 PGA_assembly_jelly SDW_G3_1_ACAGTG_L002_R1_001_20K.fastq SDW_G3_1_ACAGTG_L002_R2_001_20K.fastq

So I'm stumped.

ADD REPLYlink modified 8 days ago • written 8 days ago by sullis020

Please use ADD REPLY/ADD COMMENT to keep threads logically organized. SUBMIT ANSWER is for new answers to original question.

Since you have BBMap suite downloaded I am going to suggest that you try bbmap.sh the aligner (http://seqanswers.com/forums/showthread.php?t=41057 ).

ADD REPLYlink modified 8 days ago • written 8 days ago by genomax65k
0
gravatar for genomax
9 days ago by
genomax65k
United States
genomax65k wrote:

I suggest that you re-run fastq-dump with -F option. That should restore original Illumina style fastq headers. If that does not work for some reason you can use reformat.sh from BBMap suite to add either /1 or 1:N style designations to read headers.

ADD COMMENTlink written 9 days ago by genomax65k

Sorry, my reply should have gone here. Instead it's following my first post up there.

ADD REPLYlink written 8 days ago by sullis020

Problem solved. It was due to incorrect bwa indexing. By default (I use v 0.7.17, though 0.7.15 exhibited the same behavior), the bwa index -a option is set to 'auto', which means bwa index itself decides whether to use the 'bwtsw' (large reference sequence) or 'is' (small) setting (or 'rb2' but I don't know what that is).

$ bwa index

Usage:   bwa index [options] <in.fasta>

Options: -a STR    BWT construction algorithm: bwtsw, is or rb2 [auto]
         -p STR    prefix of the index [same as fasta name]
         -b INT    block size for the bwtsw algorithm (effective with -a bwtsw) [10000000]
         -6        index files named as <in.fasta>.64.* instead of <in.fasta>.*

Warning: `-a bwtsw' does not work for short genomes, while `-a is' and
         `-a div' do not work not for long genomes.

In my case it was choosing bwtsw, incorrectly, generating index files that broke bwa mem and caused it to fail to recognize FR reads. Explicitly setting -a to small

$ bwa index -a is <assembly.fa>

generated the correct indices, and allowed bwa mem to recognize nearly all the reads as FR.

My assembly is of a large (though hardly human-sized) protist genome, so I suggest others working on such genomes make sure to set bwa index -a to 'is', or at least, to check their run logs to make sure FR reads are being recognized as such.

Btw the online BWA manual at http://bio-bwa.sourceforge.net/bwa.shtml is obsolete in this regard.

ADD REPLYlink modified 7 days ago • written 7 days ago by sullis020

NB that even if bwa index chose the wrong -a option, bwa mem will still output a .sam file. So it's worth checking.

ADD REPLYlink written 7 days ago by sullis020
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1728 users visited in the last hour