Going From Cigar String In Sam To Genomic Coordinates?
2
4
Entering edit mode
12.4 years ago
User 9996 ▴ 820

How can I go from a CIGAR string, given in the SAM output format, to a set of start/end genomic coordinates for paired-end sequences? The SAM format gives the start coordinate but I need to find the end coordinate as well. Thanks.

sam parsing cigar alignment • 9.3k views
8
Entering edit mode
12.4 years ago

Have a look at the Cigar.java source code in Picard. For example:

/**
* @return The number of reference bases that the read covers, including padding.
*/
int length = 0;
for (final CigarElement element : cigarElements) {
switch (element.getOperator()) {
case M:
case D:
case N:
case EQ:
case X:
case P:
length += element.getLength();
}
}
return length;
}

2
Entering edit mode

Once you have length, the position of the last base is: end = start + length - 1. Don't forget the minus one, since we're "counting fenceposts" (google it).

4
Entering edit mode
12.4 years ago

The bamToBed program in BEDTools will do this by create BED entries for each BAM alignment where the end end coordinate reflects the CIGAR string. Moreover, if you want to create separate BED entries for "spliced" or "split" alignments (i.e., when there is an "N: CIGAR op present), use the -split option.

2
Entering edit mode

Just remember, BED entries are 0-indexed, half-open ranges (up to but not including the end position). A 5 bp read starting at (1-based) position 10 would be "chr1 9 15" in BED.

0
Entering edit mode

Hi

I am trying the split option on my bam file and it is only reporting 6 fields in the output and that too for all the alignments. Can you suggest what is going wrong??