BWA 0.7.12 and Unique reads, -r, -c options, XA: and SA: optional tags on SAM output,
2
3
Entering edit mode
6.4 years ago
theodore ▴ 90
So, I am using the following version of BWA

Program: bwa (alignment via Burrows-Wheeler transformation)
Version: 0.7.12-r1039

I am performing ChIP or 4C seq. I get single end reads. I would like to get uniquely mapped reads.
For the 4C I am trimming the viewpoint.

this is the command I am using:
bwa mem -t 8 -M -k 19 -r 1 -c 1 $ref_genome$1/$base1".trimmed" >$3/NOEXT1".sam" and those are the explanation for every string I included: #-c keep those that have unique alignment (Discard a MEM if it has more than INT occurence in the genome. This is an insensitive parameter. [10000]) #-r by reducing it to 1, I get bwa mem to work things out more intensively, like the --best command for bowtie (Larger value yields fewer seeds, which leads to faster alignment speed but lower accuracy) #-k seed length (I thing that 32 is to big????) #-M keep it compatible with picard tools #-t threads to use What are your opinion on the -c and -r strings? then I get to see the produced file and it has sequences that are marked with the optional field of XA:Z and SA:Z XA: Alternative hits; format: (chr,pos,CIGAR,NM;)* EDIT: in anny case I have failed to find the XT:A:U optional flag SA: Z, not sure what it is but, it almost laways coinside with the 256 flag = not primary alignment so I believe that this is NOT a unique aligned read. The question is mostly on XA:. there is no XA:U if it exist it is XA:Z and is followed by 3 or more genomic coordinates, and that makes me believe that this is not a uniquely mapped read. To wrap it up I remove from the SAM filewith the following: samtools view -h -q 1 -F 4 -F 2563/$NOEXT1".sam" | grep -v "XA:Z" | grep -v "SA:Z" >$3/NOEXT1"_mem.sam" -F 4 : remove not mapped -F 256 : remove not primary aligned -q 1: remove reads with MAPQ quality less than 1 (I thing that this is quite necessary) I am rephrasing. Does anyone have any info our input regarding the SA: nad XA: tags, especially in the absence of the XA:U tag? Does the removal of those tags makes sense or I am loosing valuable reads? EDIT: here are some reads as an example: those include both the SA: and the XA optional flag BIOMICS-HISEQHI:522:HCYM5ADXX:2:2115:14837:88360 16 chr12 87311997 1 34M17S * 0 0 TCTATAAACACCTCTAAGAAAATAAACTAGAAAATCCAGATGAAATGGATA BBBBBBABB>A<B>B>A?7BBB?A=0BBA?A?BBBBBBBA=BBBBBBBB?B NM:i:1 MD:Z:0A33 AS:i:33 XS:i:31 SA:Z:chr2,161581931,+,32M19S,1,0; XA:Z:chr9,-104599379,51M,5;chr3,+170653467,51M,5; BIOMICS-HISEQHI:522:HCYM5ADXX:2:1102:3525:68438 16 chr12 87311999 0 32M17S * 0 0 TATAAACACCTCTAAGAAAATAAACTAGAAAATCCAGATGAAATGGATA AA==5.DB8BB8=/??*<D99D??9?D<BEBD@FDD????E<E?E<CAC NM:i:0 MD:Z:32 AS:i:32 XS:i:30 SA:Z:chr2,161581931,+,32M17S,0,0; XA:Z:chr9,-104599381,49M,4;chr3,+170653467,49M,4;chr12,+46991828,49M,5; BIOMICS-HISEQHI:522:HCYM5ADXX:2:1209:6904:71888 256 chr17 59076122 0 33M18H * 0 0 TTTTCCCATTCTGTAGATTGTCTGTTTACTCTG IGGIIIIIIIIGA@GGGHFIIIIIFFHEIHIHF NM:i:0 MD:Z:33 AS:i:33 XS:i:32 SA:Z:chr11,22585338,+,18S33M,0,0; XA:Z:chr4,-151801895,19S32M,0; (END) those got only the XA: optional flag BIOMICS-HISEQHI:522:HCYM5ADXX:2:2104:11639:55952 0 chr1 8120580 1 15S31M5S * 0 0 TCTTGGGATGGACATGTGAGCTGAAATGGCGCCATTGCACTCCAGCTTGGG GIIIGCGFGCGHEGFHIICCGIHDD>HB<CHHGAHHIIHHEHHFFFFFEDD NM:i:0 MD:Z:31 AS:i:31 XS:i:29 XA:Z:chr8,+98304845,30M21S,1;chr20,+51943978,23S28M,0;chr16,+70407077,14S37M,2;chr8,+55962122,15S36M,2;chrX,-70173856,26M25S,0; BIOMICS-HISEQHI:522:HCYM5ADXX:2:2105:16548:32340 0 chr1 8120580 6 15S34M * 0 0 CCTTGGGATGGACATGTGAGCTGAAATGGCGCCATTGCACTCCAGCCTG HHIIHIEIBF1?DHGGDDDF<FB?BGIHI>FGI<EHC@CCHEFE>@BCC NM:i:0 MD:Z:34 AS:i:34 XS:i:30 XA:Z:chr8,+98304845,30M19S,0;chr16,+70407077,14S35M,1;chr8,+55962122,15S34M,1; BIOMICS-HISEQHI:522:HCYM5ADXX:2:2106:5931:90491 0 chr1 8120580 6 14S32M * 0 0 CTTGGGATGTACATGTGAGCTGAAATGGCGCCATTGCACTCCAGCC @H4EC381)*1*:?B:*?:0:?BD?G*?8'-<-5C4=;;D.?7A>E NM:i:0 MD:Z:32 AS:i:32 XS:i:28 XA:Z:chr16,+70407077,13S33M,1;chr8,+55962122,14S32M,1; BIOMICS-HISEQHI:522:HCYM5ADXX:2:2108:7261:70911 0 chr1 8120580 0 15S36M * 0 0 TCTTGGGATGGACATGTGAGCTGAAATGGCGCCATTGCACTCCAGTCTGGG :>33@@1(.=?1>99?3.=33.6=3:>)=;533-5=9;;3>9?;?8=?337 NM:i:2 MD:Z:30C3A1 AS:i:30 XS:i:29 XA:Z:chr8,+98304845,30M21S,1;chr1,-8936944,28M23S,0;chr16,+70407077,14S37M,2;chr8,+55962122,15S36M,2;chr10,-95309935,26M25S,0; BIOMICS-HISEQHI:522:HCYM5ADXX:2:2113:12753:98209 0 chr1 8120580 9 15S36M * 0 0 CCTTGGGATGGACATGTGAGCTGAAATGGCGCCATTGCACTCCAGCCTGAG ><6)<2:86-97;3:?>?==<?>?1=?3>75;?3=?;?;=???;?==>7;> NM:i:0 MD:Z:36 AS:i:36 XS:i:30 XA:Z:chr8,+98304845,30M21S,0;chr16,+70407077,14S37M,2; Any help would be appreciated BWA 0.7.12 unique reads mapping SAM • 18k views ADD COMMENT 0 Entering edit mode no one??? anyone??? ADD REPLY 0 Entering edit mode I will probably also post this question over to seqansewers ADD REPLY 0 Entering edit mode Personally I find your post quite confusing, it's unclear to me what you are asking... To the question "What are your opinion on the -c and -r strings? I haven't played with them but default seems ok for my needs. "Am, I missing something, Do I do something wrong?" hard to tell without more background about your task... ADD REPLY 0 Entering edit mode what would you think a proper background would be? experiment? propose of the analysis? regarding the "what am I missing" question, it refers mostly to the observation of the existence of the SA:Z and XA:Z, optional flags, although I have followed the recommendations described in nummerous posts in the forums. I could add that I have single end reads... ADD REPLY 0 Entering edit mode Dear Theodore; I am having equal difficulty filtering for uniquely aligned reads using bwa0.7.10. I simply want to extract all reads that align exactly once to the genome. In STAR, MAPQ is calculated as a function of how many times it can be potentially aligned, so I can simply filter based on that. I'm not sure what strategy is appropriate for bwa, where MAPQ is calculated differently. Would outputing only the highest alignments do the trick? I am hoping you have made progress on this front. Regards, Julien ADD REPLY 0 Entering edit mode do you have figured out the question to extract all reads that align exactly once to the genome successfully ? ADD REPLY 0 Entering edit mode if I am not mistaken the following as noted above should do it try to map with: bwa mem -t 8 -M -k 19 -r 1 -c 1  and then from the sam file, before converting it to bam do: samtools view -h -q 1 -F 4 -F 2563/$NOEXT1".sam" | grep -v "XA:Z" | grep -v "SA:Z" >$3/NOEXT1"_mem.sam"  the -F 256 applies only if the -M has been used during mapping I hope that it helps ADD REPLY 4 Entering edit mode 6.0 years ago Carambakaracho ★ 2.9k Not sure whether I understood your original question, but in case what you're asking is "what do the XA and SA flags mean" this is what bwa announced: • Other hits part of a chimeric alignment are now reported in the SA tag, conforming to the SAM spec v1.5. (See also SAM format specifications) • Output alternative hits in the XA tag if there are not so many of them. This is a BWA-backtrack feature It took me a while to figure this out as well. ADD COMMENT 2 Entering edit mode 3.5 years ago Samir ▴ 190 Copied from my reply at an another forum For much faster and stable implementation, sambamba can handle this using following one-liner:

sambamba view -t 12 -h -f bam -F "mapping_quality >= 1 and not (unmapped or secondary_alignment) and not ([XA] != null or [SA] != null)" test.bwa-mem.bam -o test-uniq.bam


See manpage and sambamba synatx for filters for details on further usage.

\$: Besides grep based search being slow, it may return false positive hits (@gringer's reply above). I also do not find awk based approach reliable because fields like XA, SA, etc. are optional fields in SAM format, and not positionally fixed fields.