Questions:
How are sites identified to find the fraction of bases that are different from the reference base at each position? It seems from the documentation that . and , is equal to a match on the reference base.
How is the the fraction of sites with all bases equal to the reference genome calculated ?
How is the fraction of positions where all the bases are different from the base in the reference calculated? It seems that a letter A,C,G,or T indicates a difference in the reference base.
My output from samtools mpileup is:
gi|110645304|ref|NC_002516|pseudocap|136 1 T 19 ^].^],^],^],^],^],^],^],^],^],^],^],^],^],^],^],^],^],^], @AAAA>AAAAAAAAAAAA< ]]]]]]]]]]]]]]]]]]]
gi|110645304|ref|NC_002516|pseudocap|136 2 T 25 .......,,,,,,,,,,,,,,,,,, /////@/BBBABBBBBBBBBBABBB ]]]]]]]]]]]]]]]]]]]]]]]]]
gi|110645304|ref|NC_002516|pseudocap|136 3 T 26 .......,,,,,,,,,,,,,,,,,,^]. 55555@5EEECAEEEEE@DCEEEEE@ ]]]]]]]]]]]]]]]]]]]]]]]]]]
gi|110645304|ref|NC_002516|pseudocap|136 4 A 26 .......,,,,,,,,,,,,,,,,,,. FHFFAFAJGIJGIJIJJIHIIJIJI? ]]]]]]]]]]]]]]]]]]]]]]]]]]
gi|110645304|ref|NC_002516|pseudocap|136 5 A 27 .......,,,,,,,,,,,,,,,,,,.^]. FGFFGDHJHIGHHJHJJHGGIJJIHB> ]]]]]]]]]]]]]]]]]]]]]]]]]]]
gi|110645304|ref|NC_002516|pseudocap|136 6 A 28 .......,,,,,,,,,,,,,,,,,,..^]. FHHFHEHJGIGFHIGJJGGCGJJIBB>@ ]]]]]]]]]]]]]]]]]]]]]]]]]]]]
gi|110645304|ref|NC_002516|pseudocap|136 7 G 28 .......,,,,,,,,,,,,,,,,,,... FJHFHFGJIIGBFICJJGGHHJJJDF@@ ]]]]]]]]]]]]]]]]]]]]]]]]]]]]
gi|110645304|ref|NC_002516|pseudocap|136 8 A 29 .......,,,,,,,,,,,,,,,,,,...^]. FJHHIFJJGIH?FIFJIHEGGJIJ?DDCC ]]]]]]]]]]]]]]]]]]]]]]]]]]]]]
gi|110645304|ref|NC_002516|pseudocap|136 9 G 31 .......,,,,,,,,,,,,,,,,,,....^].^]. HJGHIHJJDHFF@IGIIGFAHIHI:DDFCCB ]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]
gi|110645304|ref|NC_002516|pseudocap|136 10 A 31 .......,,,,,,,,,,,,,,,,,,...... HJFHIDJI0DC?@HCHH@??EIAH0B?FCC@ ]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]
From: http://www.htslib.org/doc/samtools-1.7.html
In the pileup format (without -u or -g), each line represents a genomic position, consisting of chromosome name, 1-based coordinate, reference base, the number of reads covering the site, read bases, base qualities and alignment mapping qualities. Information on match, mismatch, indel, strand, mapping quality and start and end of a read are all encoded at the read base column. At this column, a dot stands for a match to the reference base on the forward strand, a comma for a match on the reverse strand, a '>' or '<' for a reference skip, ACGTN' for a mismatch on the forward strand and
acgtn' for a mismatch on the reverse strand. A pattern `\+[0-9]+[ACGTNacgtn]+' indicates there is an insertion between this reference position and the next reference position.
Hello M.O.L.S ,
could you please reformulate your question a bit. I don't understand them completely.
But it seems, the way to your answer is to count the numbers of occurrence of
.
and,
in each read column (column 5), as these indicates a match. The number of reads covering the position, is given in column 4. By subtracting the number of matches from this column, you get the number of reads that doesn't have a match. Now you can calculate any fraction you like.fin swimmer
Hello M.O.L.S ,
Please use the formatting bar (especially the
code
option) to present your post better. I've done it for you this time.Thank you!
This is something I have been struggling with for a while by myself, and need some help with.
I have:
You can avoid these steps and intermediate files by piping the output of your aligner directly to samtools sort:
Thanks, that command makes sense.
Hello M.O.L.S ,
is this an assignment? What have you done already to answer these questions?
samtools mpileup
is deprecated since v1.9 and should be replaced bybcftools mpileup
. There the output format has changed.fin swimmer
Thanks, but I need to still use samtools mpileup version 1.7