Question: Extracting regions from .bam file using a .gff or .gtf file
4
gravatar for Tom A
5.1 years ago by
Tom A90
United States
Tom A90 wrote:

Hello, 

What is the most efficient way to extract RNA-seq reads from a .bam file that match to a genomic region defined in a .gff or .gtf file? 

Im new to this so my terminology may be incorrect, but I basically want to "merge" a .bam file with a .gff (or .gtf) file to extract a list of the sequences that fall into that defined region. 

Thanks in advance. 

rna-seq bam gff gtf • 8.0k views
ADD COMMENTlink modified 5.1 years ago by Chris Fields2.1k • written 5.1 years ago by Tom A90

Regarding your terminology, I would call what you're trying to do "intersecting" the files instead of "merging"

ADD REPLYlink written 5.1 years ago by Katie D'Aco1000
2
gravatar for Alex Reynolds
5.1 years ago by
Alex Reynolds28k
Seattle, WA USA
Alex Reynolds28k wrote:

You could do this with the BEDOPS toolkit.

First, convert the datasets into BED files, e.g.:

$ bam2bed < reads.bam > reads.bed
$ gff2bed < annotations.gff > annotations.gff.bed
$ gtf2bed < annotations.gtf > annotations.gtf.bed

BEDOPS tools are written to work with standard input and output streams, so you can use piping to filter annotations for classes of elements, for instance:

$ gff2bed < annotations.gff | grep -w "gene" > genes.gff.bed
$ gtf2bed < annotations.gtf | grep -w "exon" > exons.gtf.bed
...

Then do set or map operations on BED files, e.g. to get all reads that overlap annotations, use bedops --element-of:

$ bedops --element-of -1 reads.bed annotations.gff.bed > justReadsThatOverlapAnnotations.bed

And to get both reads and their associated, overlapping annotations, use bedmap --echo --echo-map:

$ bedmap --echo --echo-map reads.bed annotations.gtf.bed > readsWithAssociatedAnnotations.bed

Note here that set operations return members of the reference set (or subsets thereof), while map operations return elements from both reference and map sets. The documentation goes into more detail about both operations and arguments for each application.

ADD COMMENTlink modified 5.1 years ago • written 5.1 years ago by Alex Reynolds28k

Thank you Alex! I will give it a shot. 

 

ADD REPLYlink written 5.1 years ago by Tom A90

Hope this helps! I check in here every few days, so if you have any questions about using these tools, feel free to post them here or on our user forum. 

ADD REPLYlink written 5.1 years ago by Alex Reynolds28k

Ok, so Im not very good with Unix. But after installing and running all of the "make" commands and the "cp bin/* /usr/local/bin" command, I get the following error: "[bam2bed] - The samtools binary could not be found in your user PATH -- please locate and install this binary" even though the binary is in the same directory I am in. Any thoughts?

ADD REPLYlink written 5.1 years ago by Tom A90

Make sure samtools is in /usr/local/bin or somewhere in your PATH where the script can find it (run echo $PATH to see which directories will be searched through, for binaries like samtools and others).

ADD REPLYlink written 5.1 years ago by Alex Reynolds28k
1
gravatar for Devon Ryan
5.1 years ago by
Devon Ryan90k
Freiburg, Germany
Devon Ryan90k wrote:

The simplest way would be to convert the GTF/GFF file to a BED format (you lose a lot of information, of course), and then use samtools view -hL regions.bed alignments.bam > alignments_in_regions.sam.

Now having said that, it'd be good to know what you plan to do with the reads as there's likely a better way to do what you actually want.

ADD COMMENTlink written 5.1 years ago by Devon Ryan90k

Thank you Devon. How does one convert to a BED format? Is there a samtools command for that as well? 

ADD REPLYlink written 5.1 years ago by Tom A90

awk '{printf("%s\t%s\t%s\t.\t%s\t%s\n",$1,$4,$5,$6,$7)}' file.gtf > file.bed  is the simplest method, though you probably want to only look at exons or something like that, so just add a test for that.

ADD REPLYlink modified 5.1 years ago • written 5.1 years ago by Devon Ryan90k

Not only is this easily done w/ samtools, but the user proably already samtools installed (or should if he doesn't ;)

ADD REPLYlink written 5.1 years ago by Katie D'Aco1000

http://ea-utils.googlecode.com/svn/trunk/clipper/gtf2bed

ADD REPLYlink written 4.8 years ago by earonesty220
1
gravatar for Chris Fields
5.1 years ago by
Chris Fields2.1k
University of Illinois Urbana-Champaign
Chris Fields2.1k wrote:

As Devon mentioned it completely depends on what you are trying to accomplish.  Are the regions pretty self-contained or discontinuous?  Is the sequence data from an RNA-Seq experiment, where you have to take splice junctions into consideration?  

For RNA-Seq I have used both htseq-count and featureCounts to get counts via GTF (I am more consistently using the latter as it takes multiple BAM files as input and the results are the same as htseq).  If you have GFF you can convert to GTF using Cufflink's excellent gffread tool.

You can also get counts from a BED file as mentioned above by Devon and Alex, though these (in general) require a little more fiddling (e.g. may require bed12 output).  bedtools is what we have traditionally used; not familiar with BEDOPS yet though I may have to give it a try.

ADD COMMENTlink written 5.1 years ago by Chris Fields2.1k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1080 users visited in the last hour