Extract read with high GC content from a bam file
2
1
Entering edit mode
5.2 years ago

Hi! I was wondering how I can extract reads from a bam file (the paired end reads were mapped to the reference genome and then the sam file was converted to bam and sorted) that have a high GC% content. I loaded my bam file into FASTQC are I have 2 distinct peaks when analysing GC content. I would like to BLAST these reads and see where they are mapping.

GC content reads • 4.1k views
ADD COMMENT
0
Entering edit mode

Do your bam also contain unmapped reads? What is the organism and type of sequencing / library? There are a number of reasons which could account for this pattern, and knowing the above information would help provide more accurate / direct solutions to your problem.

As for your question, you can do this with reformat.sh from BBTools / BBMap:

reformat.sh -Xmx1g threads=1 in=infile.bam out=outfile.bam mingc=0.7 maxgc=1

Adjust mingc and maxgc as needed.

ADD REPLY
0
Entering edit mode

Thank you very much. I first extracted mapped reads from the sorted bam file (that came from bowtie) and I added the mapped reads to the FASTQC and I still have a peak of around 55%GC content. I am working on Dictyostelium discoideum which is very AT rich (27% GC content) and the sequencing was done on DNA. I should also say that even in the sequences that map really well (>90%) I see the same pattern. I am interested to see where are located and maybe if they are the same between strains. This is a link to the image I get image

ADD REPLY
5
Entering edit mode
5.2 years ago

I don't know if there is a script to do this, but you can do it with pysam. You should check that there isn't a large number of unmapped reads though - often if you sample is contaminated it will show up as as a bi-model %GC graph and a large number of unmapped reads.

import pysam
bamfile = pysam.AlignmentFile("mybam.bam")
outfile = pysam.ALignmentFile("outfile.bam", "w", template=bamfile)
for read in bamfile.fetch(until_eof=True):
    sequence = read.query_sequence.upper()
    GC = float(sequence.count('C') + sequence.count('G'))/length(sequence)
    if GC > 0.65:
        outfile.write(read)
outfile.close()
ADD COMMENT
0
Entering edit mode

Thank you very much! I have tried that but I am still trying to make it work. It outputs an error which looks like this:

  Traceback (most recent call last):
  File "pysam.py", line 1, in <module>
    import pysam
  File "/home/ofelia/Documents/Sequences hard drive/Seqeuences/pysam.py", line 2, in <module>
    bamfile = pysam.AlignmentFile("mapped.bam")
AttributeError: module 'pysam' has no attribute 'AlignmentFile'

I have read online and it should be sth with the versions but it is the same as on their website: pysam website.

ADD REPLY
0
Entering edit mode

This is because you've called your file "pysam.py". When you do "import pysam", you are importing the "pysam.py" you just created.

ADD REPLY
1
Entering edit mode
5.2 years ago

using samjdk http://lindenb.github.io/jvarkit/SamJdk.html

java -jar dist/samjdk.jar -e 'return !record.getReadUnmappedFlag() && record.getReadString().replaceAll("[^GCgcSs]","").length()/(double)record.getReadLength() > 0.9;' input.bam
ADD COMMENT
0
Entering edit mode

Thank you very much! I will give it a try and let you know if it worked.

ADD REPLY

Login before adding your answer.

Traffic: 2254 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6