How to extract only continuously aligned reads from a SAMfile
7.0 years ago
yarmda ▴ 40

We found that in a large number of samples, most reads are chimeric: they were ligated together from multiple fragments. I am trying to identify and salvage the reads that slipped through intact, i.e. those that are whole and continuous.

When I visualize the alignments, I can see large gaps within many reads that map to a single organism, as well as reads that are split between two organisms. I think this is sufficient proof of the problem, and also proof that the information needed to identify continuous reads is present in the SAM file.

Can anyone help me identify reads that mapped continuously (barring reasonable INDELs)? These are 150 bp reads, and my threshold for continuity is flexible: roughly, no gap larger than the read length. Alternatively, identifying all reads that are not continuous would get the job done just as well.
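One way to express that criterion is to walk each alignment's CIGAR string and reject any record with a long clip or a long gap. The following is only a sketch under stated assumptions: `filter_continuous`, `MAX_CLIP`, and `MAX_GAP` are hypothetical names, and the default thresholds (5 clipped bases, 150 bp gap) are illustrative rather than values settled on in this thread. Many aligners report a large gap as clipping plus a separate alignment record instead of one long D/N operation, which is why the clip check does most of the work here.

```shell
# Keep SAM records whose CIGAR has no long clip and no long gap.
# MAX_CLIP and MAX_GAP are illustrative thresholds (assumptions).
filter_continuous() {
  awk -v max_clip="${MAX_CLIP:-5}" -v max_gap="${MAX_GAP:-150}" '
    BEGIN { FS = OFS = "\t" }
    /^@/ { print; next }                 # pass header lines through
    $6 == "*" { next }                   # no CIGAR available, skip the record
    {
      cigar = $6; ok = 1
      # Consume one <length><operation> pair per iteration.
      while (match(cigar, /^[0-9]+[MIDNSHP=X]/)) {
        len = substr(cigar, 1, RLENGTH - 1) + 0
        op  = substr(cigar, RLENGTH, 1)
        if ((op == "S" || op == "H") && len > max_clip) ok = 0
        if ((op == "D" || op == "N" || op == "I") && len > max_gap) ok = 0
        cigar = substr(cigar, RLENGTH + 1)
      }
      if (ok) print
    }
  '
}

# Possible usage (requires samtools; file names are placeholders):
#   samtools view -h Input.bam | filter_continuous > continuous.sam
```

Small indels pass the filter, so a read aligned as 70M2D80M is kept while 80M70S is dropped. Lowering `MAX_GAP` tightens what counts as a "reasonable" indel.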

samtools sam alignment • 1.2k views

Have you considered trimming the reads to 50bp, for example? Those would be far less likely to be chimeric. It might be easier and more productive than throwing out all the problematic ones (assuming that most reads are affected).
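For illustration, trimming could be done before alignment with a short awk filter. This is a minimal sketch, assuming plain four-line FASTQ records (no wrapped sequence lines); `trim_fastq` is a hypothetical helper name, and dedicated trimming tools would be the more usual choice.

```shell
# Trim every read (and its quality string) to the first N bases.
# Assumes 4-line FASTQ records; N defaults to 50.
trim_fastq() {
  awk -v n="${1:-50}" '
    NR % 4 == 2 || NR % 4 == 0 { print substr($0, 1, n); next }  # sequence and quality lines
    { print }                                                    # header and "+" lines unchanged
  '
}

# Possible usage (file names are placeholders):
#   trim_fastq 50 < reads.fastq > reads.50bp.fastq
```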


We are considering that option, but that still leaves a risk of chimeric reads being included, depending on how the trimming worked out.

We're hoping to be able to identify about 30-40% of the reads as non-chimeric. If we can't identify even 15% as non-chimeric, we'll probably resort to trimming.

We also have a need for as complete a read as possible. There is a lot of value for our particular experiment in getting 130-150 bp alignments that are complete and continuous.

Thanks for the suggestion!


See this discussion about split reads: Split Read in Samtools

That might be what you are looking for.


The solution is listed as

samtools view -f 256 Input.bam | awk '$6 ~/S/ && $7 == "=" {print $0}' > Secondary_clipped.sam

However, this selects secondary alignments (-f 256) and then keeps those whose CIGAR contains a soft clip and whose mate maps to the same reference ($7 == "="). Secondary alignments may occur, but this won't necessarily capture all chimeric reads. Soft clipping is also likely, but I don't know whether it will capture everything I'm looking for.
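An alternative that does not depend on secondary alignments is to look for supplementary records (flag bit 0x800) and the SA optional tag, which the SAM specification uses to describe the other parts of a chimeric alignment. The sketch below assumes the aligner reports split reads that way (BWA-MEM does by default, as far as I know); `chimeric_names` is a hypothetical helper name, and I have not checked it against this data.

```shell
# Print the names of reads that have a supplementary record or an SA tag.
chimeric_names() {
  awk '
    BEGIN { FS = "\t" }
    /^@/ { next }                         # skip header lines
    {
      supp = int($2 / 2048) % 2           # flag bit 0x800: supplementary alignment
      has_sa = 0
      for (i = 12; i <= NF; i++)          # optional fields start at column 12
        if ($i ~ /^SA:Z:/) has_sa = 1
      if (supp || has_sa) print $1
    }
  ' | sort -u
}

# Possible usage (requires samtools; file names are placeholders):
#   samtools view Input.bam | chimeric_names > chimeric_read_names.txt
```

Because mates share a read name, a name on this list flags the whole pair. Reads that are clipped but have no supplementary record would not be caught by this check.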
