B/Sam To Fasta In R
1
1
Entering edit mode
11.6 years ago
laemtao ▴ 40

I'm looking for a way in Bioconductor to convert a bam file to a filtered fasta file.

Currently, I've come across two methods to read bam files in R:

CDS1 <- readGappedAlignments("aln.bam")
CDS2 <- scanBam("aln.bam")

I like using the first method as I can seamlessly utilize the *ranges packages. Is there a way to retrieve sequence data from a readGappedAlignments object? If not, what is the preferred method to construct a data.frame with: "rname", "strand", "start", "stop", "width", "sequence" from a bam file inside of R?

Thanks!

r bioconductor bam sam fasta • 4.4k views
ADD COMMENT
3
Entering edit mode
11.6 years ago

You can configure your readGappedAlignments call with a custom param object that returns the things you like. So if you are asking if you want to get the sequences of the reads aligned you can do so like this:

library(GenomicRanges)
library(Rsamtools)
bam <- system.file("extdata", "ex1.bam", package="Rsamtools")
ga <- readGappedAlignments(bam, param=ScanBamParam(what="seq"))
head(ga)
GappedAlignments with 6 alignments and 1 metadata column:
      seqnames strand       cigar    qwidth     start       end     width      ngap |                                  seq
         <Rle>  <Rle> <character> <integer> <integer> <integer> <integer> <integer> |                       <DNAStringSet>
  [1]     seq1      +         36M        36         1        36        36         0 | CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG
  [2]     seq1      +         35M        35         3        37        35         0 |  CTAGTGGCTCATTGTAAATGTGTGGTTTAACTCGT
  [3]     seq1      +         35M        35         5        39        35         0 |  AGTGGCTCATTGTAAATGTGTGGTTTAACTCGTCC
  [4]     seq1      +         36M        36         6        41        36         0 | GTGGCTCATTGTAATTTTTTGTTTTAACTCTTCTCT
  [5]     seq1      +         35M        35         9        43        35         0 |  GCTCATTGTAAATGTGTGGTTTAACTCGTCCATGG
  [6]     seq1      +         35M        35        13        47        35         0 |  ATTGTAAATGTGTGGTTTAACTCGTCCCTGGCCCA
  ---
  seqlengths:
   seq1 seq2
   1575 1584

Converting this to the data.frame you are after should be pretty straight forward.

If, however, you are rather interested in getting the genomic sequences that are fenceposted by the ranges returned in your GappedAlignments, you can use the getSeq function from BSgenome package to get the genomic sequences you are after -- I suspect you aren't asking for this, so I'll leave this as an exercise to the reader.

ADD COMMENT
0
Entering edit mode

Thanks a lot Steve!

ADD REPLY

Login before adding your answer.

Traffic: 2637 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6