Command line .sam file processing
2
0
Entering edit mode
2.5 years ago

Hi,

I have done some alignments using bowtie2 and I am having trouble with the processing of the .sam file. As explained in the bowtie2 Manual "When Bowtie 2 prints a SAM alignment for a pair, it prints two records (i.e. two lines of output), one for each mate. The first record describes the alignment for mate 1 and the second record describes the alignment for mate 2". I therefore need to "maintain pairs" to continue with the analysis, but the problem is that not all pairs are valid. For me, a pair is valid when the two records are 76M and have the flags XM:i:0 and XN:i:0. I have done the processing with Excel but the problem is that by doing so, I cannot use samtools sort to convert the .sam file to .bam file and continue with the analysis.

I explain it better with an example:

OUTPUT 1

NB501214:74:H2MLWAFXY:1:11101:9943:2048 97 CHRVI 7025 39 76M CHRXIV 9092 0 GTTTAATTCTGAGAAATCATGGAGCCCTGTTGGCCTGGAAGATGCTAAACTTCCCAAGGAAGCTTACCGATTTAAG AAAAAEEEEEEEEEEEEEEEEEEEEAAEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEAEEEEEE AS:i:0 XS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:76 YT:Z:UP
NB501214:74:H2MLWAFXY:1:11101:9943:2048 145 CHRXIV 9092 42 76M CHRVI 7025 0 ATGTGCCTTCTATTACGCACCTTTTATCTCGGGTGGATTCTTTTCATGTTGGTACAAGGTTTCCAAAACATGAGGA EEEEEE/EEEEEAEEAEAEEEEEEEE/EE/EAEAEEEEAEEEEEEEEEEEAEEEAEEEEEEEEEE/EEEEEAAAAA AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:76 YT:Z:UP

Valid pair because both records are 76M, XM:i:0 and XN:i:0.

OUTPUT 2

NB501214:74:H2MLWAFXY:1:11101:11467:2768 65 CHRXII 1076743 39 76M CHRII 1044 0 ATTCCATCCAGTCACCGACATTAACAAAGAGTCGTATAAGCGGAAAGGGAGTCAAATGGTTTTGCTAGAGAGAAAG AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE/EEEEEEEEE AS:i:0 XS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:76 YT:Z:UP
NB501214:74:H2MLWAFXY:1:11101:11467:2768 129 CHRII 1044 42 75M CHRXII 1076743 0 CTACAGGGTCCCCATGATATGGCTCGATGTCTTCCAAGTATTCTTTGTATTCCTCATCATTTCGCAGCATTCTCT AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEAEEEEEEEE/EEEEEEAAEEAEEEE/EEEEAEEEEE AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:75 YT:Z:UP

Not valid because one record is 76M and the other one is 75M.

Is there any way to do this processing without Excel (i.e. command line)? I know how to extract lines that have the flags that I want, but I do not know how to maintain pairs and do the processing taking into account that when one of the records of a pair is not valid, the whole pair has to be deleted.

Thank you very much.

bowtie2 pairs sam • 1.5k views
ADD COMMENT
0
Entering edit mode

I therefore need to "maintain pairs" to continue with the analysis, but the problem is that not all pairs are valid.

hum.. what is "the analysis" ? why to you need to keep all the "valid" pairs ?

ADD REPLY
0
Entering edit mode

I need to get a coverage file but I am interested in discordant reads (only those that have no mismatch), so I need to filter out everything that doesn't fit with it.

ADD REPLY
0
Entering edit mode

That 75M read is likely a result of trimming. Maybe you should not trim if it's causing you to throw things out that you would otherwise keep.

ADD REPLY
0
Entering edit mode
2.5 years ago

The proper way to do this is to write a simple program, say with SimpleSam where you keep alignments based on the CIGAR string

https://github.com/mdshw5/simplesam

the file be maintained in the SAM format

your code will look approximately like this (not tested just banging it out live):

from simplesam import Reader, Writer

# Keeps track of valid pairs
store = []

inbam = Reader("input.bam")
for rec in inbam:
    cond = rec.CIGAR == '76M'
    if rec in store:
        store[rec.QNAME] = cond
    else:
        store[rec.QNAME] &= cond

outbam = Writer("out.bam", inbam.header)
# Read the BAM file again and output valid pairs only
for rec in Reader('test.bam'):
    if rec.QNAME in store:
        outbam.write(rec)
ADD COMMENT
0
Entering edit mode

Thank you very much Istvan, I´ve been trying what you suggested and I think it could work (I'm starting with bioinformatics and coding is totally new for me).

Thanks!

ADD REPLY
0
Entering edit mode
2.5 years ago

using picard and samjdk in paired mode.

 java -jar ${PICARD_JAR} SortSam  SO=queryname I=input.bam O=/dev/stdout |\
java -jar ${JVARKIT_DIST}/samjdk.jar --pair -e 'return records.stream().filter(R->!R.isSecondaryOrSupplementary()).allMatch(R->!R.getReadUnmappedFlag() && R.getCigarLength()==1 && R.hasAttribute("XM") && R.getIntegerAttribute("XM")==0 && R.hasAttribute("XN") && R.getIntegerAttribute("XN")==0);'
ADD COMMENT

Login before adding your answer.

Traffic: 1834 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6