Hi! I was wondering how I can extract reads from a bam file (the paired end reads were mapped to the reference genome and then the sam file was converted to bam and sorted) that have a high GC% content. I loaded my bam file into FASTQC are I have 2 distinct peaks when analysing GC content. I would like to BLAST these reads and see where they are mapping.
I don't know if there is a script to do this, but you can do it with pysam. You should check that there isn't a large number of unmapped reads though - often if you sample is contaminated it will show up as as a bi-model %GC graph and a large number of unmapped reads.
import pysam bamfile = pysam.AlignmentFile("mybam.bam") outfile = pysam.ALignmentFile("outfile.bam", "w", template=bamfile) for read in bamfile.fetch(until_eof=True): sequence = read.query_sequence.upper() GC = float(sequence.count('C') + sequence.count('G'))/length(sequence) if GC > 0.65: outfile.write(read) outfile.close()
using samjdk http://lindenb.github.io/jvarkit/SamJdk.html
java -jar dist/samjdk.jar -e 'return !record.getReadUnmappedFlag() && record.getReadString().replaceAll("[^GCgcSs]","").length()/(double)record.getReadLength() > 0.9;' input.bam