Question

Extract read with high GC content from a bam file

1

Entering edit mode

5.2 years ago

popescuiofelia ▴ 10

Hi! I was wondering how I can extract reads from a bam file (the paired end reads were mapped to the reference genome and then the sam file was converted to bam and sorted) that have a high GC% content. I loaded my bam file into FASTQC are I have 2 distinct peaks when analysing GC content. I would like to BLAST these reads and see where they are mapping.

GC content reads • 4.1k views

ADD COMMENT • link updated 5.2 years ago by Pierre Lindenbaum 161k • written 5.2 years ago by popescuiofelia ▴ 10

0

Entering edit mode

Do your bam also contain unmapped reads? What is the organism and type of sequencing / library? There are a number of reasons which could account for this pattern, and knowing the above information would help provide more accurate / direct solutions to your problem.

As for your question, you can do this with reformat.sh from BBTools / BBMap:

reformat.sh -Xmx1g threads=1 in=infile.bam out=outfile.bam mingc=0.7 maxgc=1

Adjust mingc and maxgc as needed.

ADD REPLY • link 5.2 years ago by h.mon 35k

0

Entering edit mode

Thank you very much. I first extracted mapped reads from the sorted bam file (that came from bowtie) and I added the mapped reads to the FASTQC and I still have a peak of around 55%GC content. I am working on Dictyostelium discoideum which is very AT rich (27% GC content) and the sequencing was done on DNA. I should also say that even in the sequences that map really well (>90%) I see the same pattern. I am interested to see where are located and maybe if they are the same between strains. This is a link to the image I get image

ADD REPLY • link 5.2 years ago by popescuiofelia ▴ 10

score 5 · Answer 1 · 2019-02-06

5

Entering edit mode

5.2 years ago

i.sudbery 19k

I don't know if there is a script to do this, but you can do it with pysam. You should check that there isn't a large number of unmapped reads though - often if you sample is contaminated it will show up as as a bi-model %GC graph and a large number of unmapped reads.

import pysam
bamfile = pysam.AlignmentFile("mybam.bam")
outfile = pysam.ALignmentFile("outfile.bam", "w", template=bamfile)
for read in bamfile.fetch(until_eof=True):
    sequence = read.query_sequence.upper()
    GC = float(sequence.count('C') + sequence.count('G'))/length(sequence)
    if GC > 0.65:
        outfile.write(read)
outfile.close()

ADD COMMENT • link 5.2 years ago by i.sudbery 19k

0

Entering edit mode

Thank you very much! I have tried that but I am still trying to make it work. It outputs an error which looks like this:

  Traceback (most recent call last):
  File "pysam.py", line 1, in <module>
    import pysam
  File "/home/ofelia/Documents/Sequences hard drive/Seqeuences/pysam.py", line 2, in <module>
    bamfile = pysam.AlignmentFile("mapped.bam")
AttributeError: module 'pysam' has no attribute 'AlignmentFile'

I have read online and it should be sth with the versions but it is the same as on their website: pysam website.

ADD REPLY • link 5.2 years ago by popescuiofelia ▴ 10

0

Entering edit mode

This is because you've called your file "pysam.py". When you do "import pysam", you are importing the "pysam.py" you just created.

ADD REPLY • link 5.2 years ago by i.sudbery 19k

score 1 · Answer 2 · 2019-02-06

1

Entering edit mode

5.2 years ago

Pierre Lindenbaum 161k

using samjdk http://lindenb.github.io/jvarkit/SamJdk.html

java -jar dist/samjdk.jar -e 'return !record.getReadUnmappedFlag() && record.getReadString().replaceAll("[^GCgcSs]","").length()/(double)record.getReadLength() > 0.9;' input.bam