Question

Getting reads that end at a digest site from a sam/bam file

0

Entering edit mode

7.1 years ago

Jautis ▴ 530

Hi, I have bisulfite converted sequencing reads and, from my bam file, would like to select reads that end at a Msp1 digest site (C^CGG). The way I would think to go about doing that is making a bed file of CCGG sites in the genome (from the fasta file) and then selecting reads from the bam file that end at those sites. However, I'm not sure how to do either part of this.

Can anyone help me out or do you have better ideas? Thanks!

bisulfite digest • 1.6k views

ADD COMMENT • link updated 7.1 years ago by Pierre Lindenbaum 161k • written 7.1 years ago by Jautis ▴ 530

score 0 · Answer 1 · 2017-03-14

0

Entering edit mode

7.1 years ago

Devon Ryan 104k

You presumably want to create a BED (not BAM) file of the cut sites and find reads that overlap. Of course you'll get reads that span over a site, but that'll be a minor portion.

ADD COMMENT • link 7.1 years ago by Devon Ryan 104k

0

Entering edit mode

Yes, that would be correct- I mistyped in the initial question and have edited it.

How would I go about constructing that bed file? As for getting the reads that span that site, would I do that by reformatting the bam file into a bed-like format (adding the length of the read to the inital site so I check the end) or is there a tool I could use on the bam file itself?

ADD REPLY • link 7.1 years ago by Jautis ▴ 530

0

Entering edit mode

For generating the BED file, you can use biopython, which can determine cut sites in DNA.

For finding the reads, just use samtools with the BED file and the BAM file. If you want anything more exact, then write a little script in pysam, though that'd be largely overkill.

ADD REPLY • link 7.1 years ago by Devon Ryan 104k

0

Entering edit mode

I don't think samtools does what I want it to here. I want reads that end at an Msp1 digest site, not those that include a digest site anywhere within the read.

Using biopython works perfectly for the first part, thank you!

ADD REPLY • link 7.1 years ago by Jautis ▴ 530

0

Entering edit mode

I know, but you'll probably find that 99% of the overlapping alignments end on a cut site, so you can save yourself the added time and ignore the 1% extra. If the extra precision is important then you can code something with pysam.

ADD REPLY • link 7.1 years ago by Devon Ryan 104k

score 0 · Answer 2 · 2017-03-14

0

Entering edit mode

7.1 years ago

Pierre Lindenbaum 161k

using awk

 curl -s "http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/hgsv_sv_discovery/working/20151021_3.5kb_jumping/SRS000213.bwabacktrack_GRCh38Primary.20151001.YRI.jumping_library.bam" | samtools view -h | awk -F '\t' '($0 ~ /^@/ || $10 ~ /CCGG$/)'

Using samjs

curl -s "http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/hgsv_sv_discovery/working/20151021_3.5kb_jumping/SRS000213.bwabacktrack_GRCh38Primary.20151001.YRI.jumping_library.bam" | java -jar ~/src/jvarkit-git/dist/samjs.jar -e 'record.getReadString().endsWith("CCGG")'
(...)
SL-HAC:C7FRWANXX150920:C7FRWANXX:5:2211:11368:69901 161 1   11741   0   25M 12  129355703   0   ACTGATGATTTTGCTGCATGGCCGG   FF

ADD COMMENT • link 7.1 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

Hi Pierre, because the library was digested I don't expect the reads to actually end with the full Msp1 cut site and would like to retain a read that only had the C or CCG. Additionally, the library is bisulfite converted, so just looking for CCGG at the ends of a read won't work. Rather, I believe I need to (i) generate a list of CCGG sites and (ii) select for reads that end within the cut site, but I'm not sure how to do either task

ADD REPLY • link 7.1 years ago by Jautis ▴ 530