Question: Extract read with high GC content from a bam file
1
gravatar for popescuiofelia
20 months ago by
London, UK
popescuiofelia10 wrote:

Hi! I was wondering how I can extract reads from a bam file (the paired end reads were mapped to the reference genome and then the sam file was converted to bam and sorted) that have a high GC% content. I loaded my bam file into FASTQC are I have 2 distinct peaks when analysing GC content. I would like to BLAST these reads and see where they are mapping.

gc content reads • 1.1k views
ADD COMMENTlink modified 20 months ago by Pierre Lindenbaum131k • written 20 months ago by popescuiofelia10

Do your bam also contain unmapped reads? What is the organism and type of sequencing / library? There are a number of reasons which could account for this pattern, and knowing the above information would help provide more accurate / direct solutions to your problem.

As for your question, you can do this with reformat.sh from BBTools / BBMap:

reformat.sh -Xmx1g threads=1 in=infile.bam out=outfile.bam mingc=0.7 maxgc=1

Adjust mingc and maxgc as needed.

ADD REPLYlink modified 20 months ago • written 20 months ago by h.mon31k

Thank you very much. I first extracted mapped reads from the sorted bam file (that came from bowtie) and I added the mapped reads to the FASTQC and I still have a peak of around 55%GC content. I am working on Dictyostelium discoideum which is very AT rich (27% GC content) and the sequencing was done on DNA. I should also say that even in the sequences that map really well (>90%) I see the same pattern. I am interested to see where are located and maybe if they are the same between strains. This is a link to the image I get image

ADD REPLYlink modified 20 months ago • written 20 months ago by popescuiofelia10
3
gravatar for i.sudbery
20 months ago by
i.sudbery9.4k
Sheffield, UK
i.sudbery9.4k wrote:

I don't know if there is a script to do this, but you can do it with pysam. You should check that there isn't a large number of unmapped reads though - often if you sample is contaminated it will show up as as a bi-model %GC graph and a large number of unmapped reads.

import pysam
bamfile = pysam.AlignmentFile("mybam.bam")
outfile = pysam.ALignmentFile("outfile.bam", "w", template=bamfile)
for read in bamfile.fetch(until_eof=True):
    sequence = read.query_sequence.upper()
    GC = float(sequence.count('C') + sequence.count('G'))/length(sequence)
    if GC > 0.65:
        outfile.write(read)
outfile.close()
ADD COMMENTlink modified 20 months ago • written 20 months ago by i.sudbery9.4k

Thank you very much! I have tried that but I am still trying to make it work. It outputs an error which looks like this:

  Traceback (most recent call last):
  File "pysam.py", line 1, in <module>
    import pysam
  File "/home/ofelia/Documents/Sequences hard drive/Seqeuences/pysam.py", line 2, in <module>
    bamfile = pysam.AlignmentFile("mapped.bam")
AttributeError: module 'pysam' has no attribute 'AlignmentFile'

I have read online and it should be sth with the versions but it is the same as on their website: pysam website.

ADD REPLYlink modified 20 months ago • written 20 months ago by popescuiofelia10

This is because you've called your file "pysam.py". When you do "import pysam", you are importing the "pysam.py" you just created.

ADD REPLYlink written 20 months ago by i.sudbery9.4k
1
gravatar for Pierre Lindenbaum
20 months ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum131k wrote:

using samjdk http://lindenb.github.io/jvarkit/SamJdk.html

java -jar dist/samjdk.jar -e 'return !record.getReadUnmappedFlag() && record.getReadString().replaceAll("[^GCgcSs]","").length()/(double)record.getReadLength() > 0.9;' input.bam
ADD COMMENTlink written 20 months ago by Pierre Lindenbaum131k

Thank you very much! I will give it a try and let you know if it worked.

ADD REPLYlink written 20 months ago by popescuiofelia10
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1484 users visited in the last hour