Question: BWA 0.7.12 and Unique reads, -r, -c options, XA: and SA: optional tags on SAM output,
3
gravatar for theodore
4.2 years ago by
theodore40
Germany
theodore40 wrote:
So, I am using the following version of BWA

Program: bwa (alignment via Burrows-Wheeler transformation)
Version: 0.7.12-r1039

I got through a lot threads read answers and so on... still I am confused.
I am performing ChIP or 4C seq. I get single end reads. I would like to get uniquely mapped reads.
For the 4C I am trimming the viewpoint.

this is the command I am using:
bwa mem -t 8 -M -k 19 -r 1 -c 1 $ref_genome $1/$base1".trimmed" > $3/$NOEXT1".sam"

and those are the explanation for every string I included:
#-c keep those that have unique alignment (Discard a MEM if it has more than INT occurence in the genome. This is an insensitive parameter. [10000]) 
#-r by reducing it to 1, I get bwa mem to work things out more intensively, like the --best command for bowtie (Larger value yields fewer seeds, which leads to faster alignment speed but lower accuracy)
#-k seed length (I thing that 32 is to big????)
#-M keep it compatible with picard tools
#-t threads to use

What are your opinion on the -c and -r strings?

then I get to see the produced file and it has sequences that are marked with the optional field of XA:Z and SA:Z

XA: Alternative hits; format: (chr,pos,CIGAR,NM;)* EDIT: in anny case I have failed to find the XT:A:U optional flag

SA: Z, not sure what it is but, it almost laways coinside with the 256 flag = not primary alignment so I believe that this is NOT a unique aligned read.

The question is mostly on XA:. there is no XA:U if it exist it is XA:Z and is followed by 3 or more genomic coordinates, and that makes me believe that this is not a uniquely mapped read.

To wrap it up I remove from the SAM filewith the following:
samtools view -h -q 1 -F 4 -F 256 $3/$NOEXT1".sam" | grep -v "XA:Z" | grep -v "SA:Z" > $3/$NOEXT1"_mem.sam"

-F 4 : remove not mapped
-F 256 : remove not primary aligned
-q 1: remove reads with MAPQ quality less than 1 (I thing that this is quite necessary)

 

I am rephrasing. Does anyone have any info our input regarding the SA: nad XA: tags, especially in the absence of the XA:U tag? Does the removal of those tags makes sense or I am loosing valuable reads?

 

EDIT:
here are some reads as an example:

those include both the SA: and the XA optional flag

BIOMICS-HISEQHI:522:HCYM5ADXX:2:2115:14837:88360        16      chr12   87311997        1       34M17S  *       0       0       TCTATAAACACCTCTAAGAAAATAAACTAGAAAATCCAGATGAAATGGATA     BBBBBBABB>A<B>B>A?7BBB?A=0BBA?A?BBBBBBBA=BBBBBBBB?B     NM:i:1  MD:Z:0A33       AS:i:33 XS:i:31 SA:Z:chr2,161581931,+,32M19S,1,0;       XA:Z:chr9,-104599379,51M,5;chr3,+170653467,51M,5;
BIOMICS-HISEQHI:522:HCYM5ADXX:2:1102:3525:68438 16      chr12   87311999        0       32M17S  *       0       0       TATAAACACCTCTAAGAAAATAAACTAGAAAATCCAGATGAAATGGATA       AA==5.DB8BB8=/??*<D99D??9?D<BEBD@FDD????E<E?E<CAC       NM:i:0  MD:Z:32 AS:i:32 XS:i:30 SA:Z:chr2,161581931,+,32M17S,0,0;       XA:Z:chr9,-104599381,49M,4;chr3,+170653467,49M,4;chr12,+46991828,49M,5;
BIOMICS-HISEQHI:522:HCYM5ADXX:2:1209:6904:71888 256     chr17   59076122        0       33M18H  *       0       0       TTTTCCCATTCTGTAGATTGTCTGTTTACTCTG       IGGIIIIIIIIGA@GGGHFIIIIIFFHEIHIHF       NM:i:0  MD:Z:33 AS:i:33 XS:i:32 SA:Z:chr11,22585338,+,18S33M,0,0;       XA:Z:chr4,-151801895,19S32M,0;
(END)

those got only the XA: optional flag

BIOMICS-HISEQHI:522:HCYM5ADXX:2:2104:11639:55952        0       chr1    8120580 1       15S31M5S        *       0       0       TCTTGGGATGGACATGTGAGCTGAAATGGCGCCATTGCACTCCAGCTTGGG     GIIIGCGFGCGHEGFHIICCGIHDD>HB<CHHGAHHIIHHEHHFFFFFEDD     NM:i:0  MD:Z:31 AS:i:31 XS:i:29 XA:Z:chr8,+98304845,30M21S,1;chr20,+51943978,23S28M,0;chr16,+70407077,14S37M,2;chr8,+55962122,15S36M,2;chrX,-70173856,26M25S,0;
BIOMICS-HISEQHI:522:HCYM5ADXX:2:2105:16548:32340        0       chr1    8120580 6       15S34M  *       0       0       CCTTGGGATGGACATGTGAGCTGAAATGGCGCCATTGCACTCCAGCCTG       HHIIHIEIBF1?DHGGDDDF<FB?BGIHI>FGI<EHC@CCHEFE>@BCC       NM:i:0  MD:Z:34 AS:i:34 XS:i:30 XA:Z:chr8,+98304845,30M19S,0;chr16,+70407077,14S35M,1;chr8,+55962122,15S34M,1;
BIOMICS-HISEQHI:522:HCYM5ADXX:2:2106:5931:90491 0       chr1    8120580 6       14S32M  *       0       0       CTTGGGATGTACATGTGAGCTGAAATGGCGCCATTGCACTCCAGCC  @H4EC381)*1*:?B:*?:0:?BD?G*?8'-<-5C4=;;D.?7A>E  NM:i:0  MD:Z:32 AS:i:32 XS:i:28 XA:Z:chr16,+70407077,13S33M,1;chr8,+55962122,14S32M,1;
BIOMICS-HISEQHI:522:HCYM5ADXX:2:2108:7261:70911 0       chr1    8120580 0       15S36M  *       0       0       TCTTGGGATGGACATGTGAGCTGAAATGGCGCCATTGCACTCCAGTCTGGG     :>33@@1(.=?1>99?3.=33.6=3:>)=;533-5=9;;3>9?;?8=?337     NM:i:2  MD:Z:30C3A1     AS:i:30 XS:i:29 XA:Z:chr8,+98304845,30M21S,1;chr1,-8936944,28M23S,0;chr16,+70407077,14S37M,2;chr8,+55962122,15S36M,2;chr10,-95309935,26M25S,0;
BIOMICS-HISEQHI:522:HCYM5ADXX:2:2113:12753:98209        0       chr1    8120580 9       15S36M  *       0       0       CCTTGGGATGGACATGTGAGCTGAAATGGCGCCATTGCACTCCAGCCTGAG     ><6)<2:86-97;3:?>?==<?>?1=?3>75;?3=?;?;=???;?==>7;>     NM:i:0  MD:Z:36 AS:i:36 XS:i:30 XA:Z:chr8,+98304845,30M21S,0;chr16,+70407077,14S37M,2;

Any help would be appreciated

ADD COMMENTlink modified 15 months ago by Samir170 • written 4.2 years ago by theodore40

no one??? anyone???

ADD REPLYlink written 4.2 years ago by theodore40

I will probably also post this question over to seqansewers

 

ADD REPLYlink written 4.2 years ago by theodore40

Personally I find your post quite confusing, it's unclear to me what you are asking... To the question "What are your opinion on the -c and -r strings?" I haven't played with them but default seems ok for my needs. "Am, I missing something, Do I do something wrong?" hard to tell without more background about your task...

ADD REPLYlink written 4.2 years ago by dariober10k

what would you think a proper background would be?
experiment?
propose of the analysis?
 

regarding the "what am I missing" question, it refers mostly to the observation of the existence of the SA:Z and XA:Z, optional flags, although I have followed the recommendations described in nummerous posts in the forums.

I could add that I have single end reads...

ADD REPLYlink written 4.2 years ago by theodore40

Dear Theodore; I am having equal difficulty filtering for uniquely aligned reads using bwa0.7.10. I simply want to extract all reads that align exactly once to the genome. In STAR, MAPQ is calculated as a function of how many times it can be potentially aligned, so I can simply filter based on that. I'm not sure what strategy is appropriate for bwa, where MAPQ is calculated differently. Would outputing only the highest alignments do the trick? I am hoping you have made progress on this front. Regards, Julien

ADD REPLYlink written 3.5 years ago by jrichardalbert0

do you have figured out the question to extract all reads that align exactly once to the genome successfully ?

ADD REPLYlink written 3.2 years ago by winter_li50

if I am not mistaken the following as noted above should do it try to map with: bwa mem -t 8 -M -k 19 -r 1 -c 1

and then from the sam file, before converting it to bam do: samtools view -h -q 1 -F 4 -F 256 $3/$NOEXT1".sam" | grep -v "XA:Z" | grep -v "SA:Z" > $3/$NOEXT1"_mem.sam" the -F 256 applies only if the -M has been used during mapping

I hope that it helps

ADD REPLYlink written 3.2 years ago by theodore40
2
gravatar for Carambakaracho
3.8 years ago by
Carambakaracho1.6k
Switzerland/Basel
Carambakaracho1.6k wrote:

Not sure whether I understood your original question, but in case what you're asking is "what do the XA and SA flags mean" this is what bwa announced:

Release 0.7.5 (29 May, 2013):

  • Other hits part of a chimeric alignment are now reported in the SA tag, conforming to the SAM spec v1.5. (See also SAM format specifications)

Release 0.7.9 (19 May, 2014):

  • Output alternative hits in the XA tag if there are not so many of them. This is a BWA-backtrack feature

It took me a while to figure this out as well.

 

ADD COMMENTlink written 3.8 years ago by Carambakaracho1.6k
2
gravatar for Samir
15 months ago by
Samir170
United States
Samir170 wrote:

Copied from my reply at an another forum


For much faster and stable$ implementation, sambamba can handle this using following one-liner:

sambamba view -t 12 -h -f bam -F "mapping_quality >= 1 and not (unmapped or secondary_alignment) and not ([XA] != null or [SA] != null)" test.bwa-mem.bam -o test-uniq.bam

See manpage and sambamba synatx for filters for details on further usage.

$: Besides grep based search being slow, it may return false positive hits (@gringer's reply above). I also do not find awk based approach reliable because fields like XA, SA, etc. are optional fields in SAM format, and not positionally fixed fields.

ADD COMMENTlink written 15 months ago by Samir170
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1931 users visited in the last hour