Question

BAM File: Trim Reads So Only First Base Remains

1

Entering edit mode

10 months ago

vanbelj ▴ 40

I mapped single-end 50-bp reads to a genome and now have BAM files. I need to trim the reads within the BAM file so that only the first base remains (the first base called by the sequencer). Ideally it would output another BAM or bigwig file.

Does anyone know a package that can do this?

NGS BAM Trimming • 1.8k views

ADD COMMENT • link updated 10 months ago by GenoMax 141k • written 10 months ago by vanbelj ▴ 40

0

Entering edit mode

I need to trim the reads within the BAM file so that only the first base remains (the first base called by the sequencer)

Can you further clarify or illustrate?

ADD REPLY • link 10 months ago by GenoMax 141k

0

Entering edit mode

Essentially I want to use the full-length of the read to accurately map the read to the genome, but only want the first base to contribute to coverage counts. So after mapping reads to the genome, I want to trim each read so only its first base remains.

For instance in the attached picture I've highlight two reads. One maps to the Forward strand and the other maps to the Reverse strand.

The Forward read would be trimmed so that only the 5' T is retained. The Reverse read would be trimmed so that only the 3' A is retained.

If these were the only two reads present, the coverage plot would have values of 1 at base # 590 and 642 and coverage values of 0 everywhere else.

If you're wondering why I want to do this, the technique I use ligates an adapter to the piece of DNA that is closest to a protein of interest. Libraries are sequenced from this adapter, so the first base represents the closest location to the protein of interest. The other bases increase background signal.

Thanks for any help.

Example

ADD REPLY • link 10 months ago by vanbelj ▴ 40

0

Entering edit mode

Since you are interested in only position 1 where a mapping starts you could simply take that position from your BAM file (column 4 in SAM format) and add those up.

ADD REPLY • link 10 months ago by GenoMax 141k

0

Entering edit mode

Is this something that could be done with the samtools package, or would I need to use text processing such as awk? Does column 4 take strandedness into account? The description of column 4 says "leftmost mapping position" - does this mean the 5' base of the read?

ADD REPLY • link 10 months ago by vanbelj ▴ 40

0

Entering edit mode

Since the reads are always in 5' --> 3' direction it will always be the 5' end. In case you need to take strand information into account then you can process the bitwise flags in column 2 (16 for reverse strand).

ADD REPLY • link 10 months ago by GenoMax 141k

0

Entering edit mode

10 months ago

Jeremy ▴ 910

If you're willing to convert your file to SAM format, you can use the following AWK solution:

grep -v '^@' old.sam | awk '{$10=substr($10,1,1); print}' > new.sam

This will remove the header. To add the header back, you can use the following:

samtools view -H old.sam > header.sam
cat header.sam new.sam > final.sam

ADD COMMENT • link 10 months ago by Jeremy ▴ 910

0

Entering edit mode

While this may trim the sequence it is going to leave the quality string as is. So you will end up with a malformed SAM file. Also take into consideration the explanation that OP provided above.

ADD REPLY • link 10 months ago by GenoMax 141k

0

Entering edit mode

10 months ago

Pierre Lindenbaum 161k

I wrote https://jvarkit.readthedocs.io/en/latest/Biostar9566948/

ADD COMMENT • link 10 months ago by Pierre Lindenbaum 161k

0

Entering edit mode

While OP had originally asked for this on further clarification it does not appear like they need to do this after all. ( see: BAM File: Trim Reads So Only First Base Remains )

ADD REPLY • link 10 months ago by GenoMax 141k

score 2 · Accepted Answer · 2023-06-20

R is good for this sort of thing. Very easy to do and examine many genomic manipulations.

library(GenomicRanges)
library(rtracklayer)

# read in BAM file
foo <- import("foo.bam")

# convert from GenomicAlignments to GRanges
foo <- GRanges(foo)

# look at first 3 elements
foo[1:3]

GRanges object with 3 ranges and 0 metadata columns:
  seqnames              ranges strand
<Rle>           <IRanges>  <Rle>
  [1]     chr1         28815-28854      -
  [2]     chr6 172095647-172095686      +
  [3]     chr7       101791-101830      -
  -------
  seqinfo: 24 sequences from an unspecified genome

# get just the first base
foo <- resize(foo, width=1, fix='start')

# examine after resizing to confirm strand
foo[1:3]

GRanges object with 3 ranges and 0 metadata columns:
  seqnames    ranges strand
<Rle> <IRanges>  <Rle>
  [1]     chr1     28854      -
  [2]     chr6 172095647      +
  [3]     chr7    101830      -
  -------
  seqinfo: 24 sequences from an unspecified genome

# export as bed
export(foo, "foo.bed")

# get coverage
foo_cov <- coverage(foo)

# export as bigwig
export(foo_cov, "foo_cov.bw")

If, for instance, you want to count how many reads overlap all your proteins of interest, you could import a bed file or GFF file describing the genomic locations of all your proteins and get counts as follows:

# count the number of overlapping reads for each protein    
proteinCounts <- countOverlaps(allmyproteins, foo)

The result would be a vector with the same number of elements as "allmyproteins" but each position would hold how many reads overlapped with the protein coordinates.

If you want a normalized track, also fairly simple:

# get total reads
nreads <- length(foo)

# rpm coverage, limit to 3 signficant digits
foo_rpm <- lapply(foo_cov, function(x) signif(10^6 * x/nreads,3))

# recast as SimpleRleList
foo_rpm <- as(foo_rpm,"SimpleRleList")

export.bw(foo_rpm, "foo_rpm.bw")