How Do I Do The Intersection Of Two Sam Files
2
2
Entering edit mode
12.4 years ago
KCC ★ 4.1k

I have two sam files produced by alignment to two different versions of the reference genome. Is there a way to get the intersection of the sam files so I get a sam with the sequences that have been mapped to both? (So, I would ideally want to discard any sequences that are unmapped in an alignment to at least one of the genomes.)

sam • 3.7k views
ADD COMMENT
3
Entering edit mode
12.4 years ago

You can try using this python script. I am not 100% sure it will work though. I am at home right now and don't have access to any of my SAM files to test it out:

import sys

inFileA = open(sys.argv[1],'r')
inFileB = open(sys.argv[2],'r')

reads = {}
for line in inFileA:
    if line[0] != "@": 
        data = line.strip().split('\t')
        name = data[0]
        flag = int(data[1])
        if not flag & 4:
            reads[name] = True

for line in inFileB:
    if line[0] != "@": 
        data = line.strip().split('\t')
        name = data[0]
        flag = int(data[1])
        if not flag & 4:
            if reads.has_key(name):
                print name

Use it by: python theScript.py fileA.sam fileB.sam > listOfReadNames

ADD COMMENT
2
Entering edit mode
12.4 years ago
  • use samtools view with option '-f' to dump the sequences mapped on both assemblies
  • pipe in cut to extract the names of the reads
  • pipe in sort to order the names
  • redirect to file

use comm or join to list the common names in both files

ADD COMMENT

Login before adding your answer.

Traffic: 2541 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6