So, I am using the following version of BWA Program: bwa (alignment via Burrows-Wheeler transformation) Version: 0.7.12-r1039
I got through a lot threads read answers and so on... still I am confused.
I am performing ChIP or 4C seq. I get single end reads. I would like to get uniquely mapped reads.
For the 4C I am trimming the viewpoint.
this is the command I am using:
bwa mem -t 8 -M -k 19 -r 1 -c 1 $ref_genome $1/$base1".trimmed" > $3/$NOEXT1".sam"
and those are the explanation for every string I included: #-c keep those that have unique alignment (Discard a MEM if it has more than INT occurence in the genome. This is an insensitive parameter. [10000]) #-r by reducing it to 1, I get bwa mem to work things out more intensively, like the --best command for bowtie (Larger value yields fewer seeds, which leads to faster alignment speed but lower accuracy) #-k seed length (I thing that 32 is to big????) #-M keep it compatible with picard tools #-t threads to use
What are your opinion on the -c and -r strings?
then I get to see the produced file and it has sequences that are marked with the optional field of XA:Z and SA:Z
XA: Alternative hits; format: (chr,pos,CIGAR,NM;)* EDIT: in anny case I have failed to find the XT:A:U optional flag
SA: Z, not sure what it is but, it almost laways coinside with the 256 flag = not primary alignment so I believe that this is NOT a unique aligned read.
The question is mostly on XA:. there is no XA:U if it exist it is XA:Z and is followed by 3 or more genomic coordinates, and that makes me believe that this is not a uniquely mapped read. To wrap it up I remove from the SAM filewith the following: samtools view -h -q 1 -F 4 -F 256 $3/$NOEXT1".sam" | grep -v "XA:Z" | grep -v "SA:Z" > $3/$NOEXT1"_mem.sam" -F 4 : remove not mapped -F 256 : remove not primary aligned -q 1: remove reads with MAPQ quality less than 1 (I thing that this is quite necessary)
I am rephrasing. Does anyone have any info our input regarding the SA: nad XA: tags, especially in the absence of the XA:U tag? Does the removal of those tags makes sense or I am loosing valuable reads?
EDIT:
here are some reads as an example:
those include both the SA: and the XA optional flag
BIOMICS-HISEQHI:522:HCYM5ADXX:2:2115:14837:88360 16 chr12 87311997 1 34M17S * 0 0 TCTATAAACACCTCTAAGAAAATAAACTAGAAAATCCAGATGAAATGGATA BBBBBBABB>A<B>B>A?7BBB?A=0BBA?A?BBBBBBBA=BBBBBBBB?B NM:i:1 MD:Z:0A33 AS:i:33 XS:i:31 SA:Z:chr2,161581931,+,32M19S,1,0; XA:Z:chr9,-104599379,51M,5;chr3,+170653467,51M,5; BIOMICS-HISEQHI:522:HCYM5ADXX:2:1102:3525:68438 16 chr12 87311999 0 32M17S * 0 0 TATAAACACCTCTAAGAAAATAAACTAGAAAATCCAGATGAAATGGATA AA==5.DB8BB8=/??*<D99D??9?D<BEBD@FDD????E<E?E<CAC NM:i:0 MD:Z:32 AS:i:32 XS:i:30 SA:Z:chr2,161581931,+,32M17S,0,0; XA:Z:chr9,-104599381,49M,4;chr3,+170653467,49M,4;chr12,+46991828,49M,5; BIOMICS-HISEQHI:522:HCYM5ADXX:2:1209:6904:71888 256 chr17 59076122 0 33M18H * 0 0 TTTTCCCATTCTGTAGATTGTCTGTTTACTCTG IGGIIIIIIIIGA@GGGHFIIIIIFFHEIHIHF NM:i:0 MD:Z:33 AS:i:33 XS:i:32 SA:Z:chr11,22585338,+,18S33M,0,0; XA:Z:chr4,-151801895,19S32M,0; (END)
those got only the XA: optional flag
BIOMICS-HISEQHI:522:HCYM5ADXX:2:2104:11639:55952 0 chr1 8120580 1 15S31M5S * 0 0 TCTTGGGATGGACATGTGAGCTGAAATGGCGCCATTGCACTCCAGCTTGGG GIIIGCGFGCGHEGFHIICCGIHDD>HB<CHHGAHHIIHHEHHFFFFFEDD NM:i:0 MD:Z:31 AS:i:31 XS:i:29 XA:Z:chr8,+98304845,30M21S,1;chr20,+51943978,23S28M,0;chr16,+70407077,14S37M,2;chr8,+55962122,15S36M,2;chrX,-70173856,26M25S,0; BIOMICS-HISEQHI:522:HCYM5ADXX:2:2105:16548:32340 0 chr1 8120580 6 15S34M * 0 0 CCTTGGGATGGACATGTGAGCTGAAATGGCGCCATTGCACTCCAGCCTG HHIIHIEIBF1?DHGGDDDF<FB?BGIHI>FGI<EHC@CCHEFE>@BCC NM:i:0 MD:Z:34 AS:i:34 XS:i:30 XA:Z:chr8,+98304845,30M19S,0;chr16,+70407077,14S35M,1;chr8,+55962122,15S34M,1; BIOMICS-HISEQHI:522:HCYM5ADXX:2:2106:5931:90491 0 chr1 8120580 6 14S32M * 0 0 CTTGGGATGTACATGTGAGCTGAAATGGCGCCATTGCACTCCAGCC @H4EC381)*1*:?B:*?:0:?BD?G*?8'-<-5C4=;;D.?7A>E NM:i:0 MD:Z:32 AS:i:32 XS:i:28 XA:Z:chr16,+70407077,13S33M,1;chr8,+55962122,14S32M,1; BIOMICS-HISEQHI:522:HCYM5ADXX:2:2108:7261:70911 0 chr1 8120580 0 15S36M * 0 0 TCTTGGGATGGACATGTGAGCTGAAATGGCGCCATTGCACTCCAGTCTGGG :>33@@1(.=?1>99?3.=33.6=3:>)=;533-5=9;;3>9?;?8=?337 NM:i:2 MD:Z:30C3A1 AS:i:30 XS:i:29 XA:Z:chr8,+98304845,30M21S,1;chr1,-8936944,28M23S,0;chr16,+70407077,14S37M,2;chr8,+55962122,15S36M,2;chr10,-95309935,26M25S,0; BIOMICS-HISEQHI:522:HCYM5ADXX:2:2113:12753:98209 0 chr1 8120580 9 15S36M * 0 0 CCTTGGGATGGACATGTGAGCTGAAATGGCGCCATTGCACTCCAGCCTGAG ><6)<2:86-97;3:?>?==<?>?1=?3>75;?3=?;?;=???;?==>7;> NM:i:0 MD:Z:36 AS:i:36 XS:i:30 XA:Z:chr8,+98304845,30M21S,0;chr16,+70407077,14S37M,2;
Any help would be appreciated
no one??? anyone???
I will probably also post this question over to seqansewers
Personally I find your post quite confusing, it's unclear to me what you are asking... To the question "What are your opinion on the
-c
and-r
strings? I haven't played with them but default seems ok for my needs. "Am, I missing something, Do I do something wrong?" hard to tell without more background about your task...what would you think a proper background would be?
experiment?
propose of the analysis?
regarding the "what am I missing" question, it refers mostly to the observation of the existence of the SA:Z and XA:Z, optional flags, although I have followed the recommendations described in nummerous posts in the forums.
I could add that I have single end reads...
Dear Theodore; I am having equal difficulty filtering for uniquely aligned reads using bwa0.7.10. I simply want to extract all reads that align exactly once to the genome. In STAR, MAPQ is calculated as a function of how many times it can be potentially aligned, so I can simply filter based on that. I'm not sure what strategy is appropriate for bwa, where MAPQ is calculated differently. Would outputing only the highest alignments do the trick? I am hoping you have made progress on this front. Regards, Julien
do you have figured out the question to extract all reads that align exactly once to the genome successfully ?
if I am not mistaken the following as noted above should do it
try to map with:
and then from the sam file, before converting it to bam do:
the
-F 256
applies only if the-M
has been used during mappingI hope that it helps