Parsing a SAM file efficiently
3
1
Entering edit mode
6.4 years ago
rrhaines20 ▴ 20

Hi,

I want to parse a SAM file based on QNAME, however the python script I designed to do so takes too long.

To visualize my issue, here are the first 5 lines of my 4.25 GB SAM file:

HWI-D00372:292:C6U8JANXX:2:1310:9053:40752    99    chr1_gl000191_random    623    255    51M    =    737    165    CTTTGGGAGGCTGAGGCAGGTGGATCACCTGAGGTCAGGAGTTCGAGACCA    /<BBBFFFFFFFFFFFFFFFFFFBFFFFFFFBFFFFFFFFFFFFFFBFFFF    XA:i:0    MD:Z:51    PG:Z:MarkDuplicates    NM:i:0
HWI-D00372:292:C6U8JANXX:2:2108:14511:86999    99    chr1_gl000191_random    721    255    51M    =    769    99    TTAGCTGGGGCTGGTGGCGGATGCCTGTAATCCCAGCTACTCAGGAGGCTG    BBBBBFFFFFFFFFFFFFFFFFFFFFFFF<FFFFFFFFFFB<BFFFFBFFF    XA:i:0    MD:Z:51    PG:Z:MarkDuplicates    NM:i:0
HWI-D00372:292:C6U8JANXX:2:1310:9053:40752    147    chr1_gl000191_random    737    255    51M    =    623    -165    GCGGATGCCTGTAATCCCAGCTACTCAGGAGGCTGAGGCTGGGGAATCACT    FFFFFFFFFFFFFFFFFFFFFFFFBFFFFFFF/<FFFFFFFFB/FFBBBBB    XA:i:0    MD:Z:51    PG:Z:MarkDuplicates    NM:i:0
HWI-D00372:292:C6U8JANXX:2:2108:14511:86999    147    chr1_gl000191_random    769    255    51M    =    721    -99    CTGAGGCTGGGGAATCACTTGAACCTGGAAGGCGGAGGTTGCAGTGAGCTG    F<FFFFBFFFFFBB<F/FFFFF<FFBFFFFFFFFFFFFFFFBFF<FBBBBB    XA:i:0    MD:Z:51    PG:Z:MarkDuplicates    NM:i:0
HWI-D00372:292:C6U8JANXX:2:1313:1300:84021    163    chr1_gl000191_random    833    255    51M    =    877    95    GCACTCCAGCCTACGCAACAAGAGCGAAACTCTGTCTCAAAAAATAAAAAA    BBBBBFFFFFFFFFFFFFFFFFFFFFFFBFFFFFFFFFFFFFFFFFFFFFF    XA:i:0    MD:Z:51    PG:Z:MarkDuplicates    NM:i:0

The QNAME is the "HWI-D00372:292:C6U8JANXX:2:1310:9053:40752" part of the SAM file line.

I have a list of about 1 million QNAMEs that I want to iterate through, pick out each SAM file line that corresponds to each QNAME (each QNAME corresponds with 2 separate SAM file lines because this is paired-end sequencing alignment data), then write these lines out into a new SAM file.

The best python script I could design to do this task is the following:

parsedSamData = "" # initialize parsedSamData variable

# read SAM file in as a single string
file_handle = open(samFile, 'r')
file_handle.close()

for QNAME in QNAME_list: # iterate through each QNAME
regex = re.escape(QNAME) + r".*\n" # design a regular expression to match the SAM file lines with the QNAME
mate_pairs = re.finditer(regex, sam_file_contents_string) # match QNAME with the SAM file lines
for entry in mate_pairs: # access the match object
parsedSamData += entry.group() # append the 2 lines to the parsedSamData variable

# save the parsed SAM file data
fh = open(parsed_SAM_file_path, "w")
fh.write(parsedSamData)
fh.close()

This definitely does what I want, however the time it takes to go through each QNAME is about 5 seconds. This translates to my script running for about 58 days for this particular SAM file. I would like to perform this task in a few days or less, as I have multiple SAM files to perform this on that range in size from 4-12 GB.

I appreciate and thank you for any help!

next-gen sequencing programming python SAM • 6.2k views
6
Entering edit mode
6.4 years ago
John 13k

Eek, you load the whole SAM file in to RAM as a single gigantic string, which you then regex search n times where n is the number of QNAMEs you're interested in! Thats going to need a LOT of memory and will be crayz slow! Try this:

mate_pairs = {}
QNAME_list = set(QNAME_list) # make the checking a tiny bit faster by making the list a set
with open(pathToSam, 'rb') as samFile:
for line in samFile:
QNAME = line.split('\t')[0]
if QNAME in QNAME_list:
try:
mate_pairs[QNAME].append(line)
except KeyError:
mate_pairs[QNAME] = [line]
with open(parsed_SAM_file_path, "wb") as outfile:
outfile.write(read + '\n' + mate + '\n')

This will fail if you dont have exactly two reads per QNAME (which is probably a good thing..) and should take about 5-20 minutes to run.

EDIT:

Also, dont give up on Python just because your first foray was thwarted by a pretty sweet grep!
Yes you can solve many easy problems with grep/cat/sed/awk one-liners - but I always found there are more -weird -flags in those tools to -remember, than there are programming paradigms that will never change.

1
Entering edit mode

I rarely use grep for anything, but this really does seem like a good use :)

1
Entering edit mode

Yeah well thats the thing about shell commands piped together - they seem like they work fine and they're a good use, but unless we walk through the logic ourselves in at least pseudocode, one tends to miss things. The problem is, they're impossible to unit test and write exceptions for.
This is a pretty good example actually. If you have a pre-filtered SAM file and only 1 of your pairs is in the file (or worse, more than 2!) a programming language would raise an exception while the grep would pass whatever it found through without incident.

1
Entering edit mode

I used the grep method based on another post and, surprisingly, it was taking as long if not longer than my original python script. I then tried your python script and it took ~30 seconds! Also, it worked perfectly. Thank you very much!

5
Entering edit mode
6.4 years ago

I am curious why you didn't try for already existing tools. Any tool that work with the BAM format will be much faster. I know of Picard's FilterSamReads (https://broadinstitute.github.io/picard/command-line-overview.html) that exactly does what you are trying to achieve using your Python's script.