Question: Basic Bam File Annotation
14
gravatar for Jeremy Leipzig
7.9 years ago by
Philadelphia, PA
Jeremy Leipzig18k wrote:

What are the options for obtaining basic annotation information from BAM alignments of NGS data. The stuff everyone wants to know:

% intergenic
% genic
% intronic
% exonic
% within 5kbp 5' UTR
% within repeats
% within CpG islands

Please provide the basic command line syntax with your answer.

annotation bam • 7.1k views
ADD COMMENTlink modified 7.9 years ago by Joseph D0 • written 7.9 years ago by Jeremy Leipzig18k

i will also accept an expansion of the list of stuff everyone wants to know

ADD REPLYlink written 7.9 years ago by Jeremy Leipzig18k

I would add to the list: coding, 5' UTR, 3' UTR, synonymous & non-synonymous

ADD REPLYlink written 7.9 years ago by Casey Bergman18k

Uh, is this from RNASeq data?

ADD REPLYlink written 7.9 years ago by Ketil3.9k

anything, DNA-Seq, RNA-Seq, ChIP-Seq, exome, whatever

ADD REPLYlink written 7.9 years ago by Jeremy Leipzig18k
17
gravatar for Aaronquinlan
7.9 years ago by
Aaronquinlan11k
United States
Aaronquinlan11k wrote:

I have been tinkering with a new tool for BEDTools (it's currently alpha, so it is in the development repo) that attempts to do this. I pushed it to the public repository, if you'd like to play with it. The tool is called "tagBam" and works as follows. You supply a BAM file, and a list of annotation files and associated labels. For each alignment, it will search for overlaps with each annotation file. Each time there is an overlap, the label you supply will be appended to a custom "YB" tag that will be added to the BAM alignment. For example:

$ tagBam -i aln.bam -files exons.bed introns.bed cpg.bed utrs.bed \
                    -tags exonic intonic cpg utr \
                    > aln.tagged.bam

For alignments that have overlaps, you should see new BAM tags like "YB:Z:exonic", "YB:Z:cpg;utr"

I should emphasize that this is experimental, but I am hoping to make it available in the next release. As always, comments and suggestions are welcome.

Another option is to write a custom script using existing interfaces such as pysam, the BioPerl BAM interface, Picard, samtools C-API, bamtools C++-API, etc. These solutions are nice because you will inevitably run up against a nuanced rule for the annotations that can't be addressed in a one-stop-shop. In particular, if you are a Python person, the pybedtools suite or the HTSeq suite are good options.

ADD COMMENTlink modified 7.9 years ago • written 7.9 years ago by Aaronquinlan11k

Awesome! I have been having a lot of trouble not as much with computing these annotations but storing them in a coherent non-adhoc format that I can recall a few weeks later ;-) Putting it back right back into the BAM file is a phenomenal idea - I need to adopt this for other things as well.

ADD REPLYlink written 7.9 years ago by Istvan Albert ♦♦ 80k

Right. I plan to add an option such that one can define there own tag in addition to the default "YB" tag.

ADD REPLYlink written 7.9 years ago by Aaronquinlan11k

Wish I could edit my comment..."their", of course.

ADD REPLYlink written 7.9 years ago by Aaronquinlan11k

I agree, very cool. I guess you could just tabulate the YB tag as it streams to STDOUT to avoid an extra BAM around. One thing: what if some annos are stranded (genes) and some are not (cpg's)?

ADD REPLYlink written 7.9 years ago by brentp23k

Nice catch Brent. Currently, there is a "-s" option to enforce that overlaps are on the same strand. However, this rule applies to all annotation files. I'll give some thought to how to best handle that case.

ADD REPLYlink written 7.9 years ago by Aaronquinlan11k

Added ability to override the default "YB" tag with the -tag option. -tags is now -labels

ADD REPLYlink written 7.9 years ago by Aaronquinlan11k
5
gravatar for Ryan Dale
7.9 years ago by
Ryan Dale4.8k
Bethesda, MD
Ryan Dale4.8k wrote:

As Aaron mentioned, HTseq can be really useful (and importantly, flexible) for this sort of thing. As an example, this classify_reads.py script will count up

  • total reads in each of the combinatorial overlapping classes (e.g., exon only, intron only, and intron+exon which you can get from overlapping genes or multiple isoforms)
  • the total reads falling in each class (regardless of what other classes overlap, so "intron_all" and "exon_all")
  • the total reads
  • the total unannotated reads
  • total sequence space in each class

You can use the --exclude or --include options to specify what features in the GFF/GTF to look for, e.g.,

classify_reads.py --bam $BAM --gff $GFF \
                  --include intron exon repeat_region > report.txt
ADD COMMENTlink written 7.9 years ago by Ryan Dale4.8k

Nice. This is a solid example of how to effectively use HTseq.

ADD REPLYlink written 7.9 years ago by Aaronquinlan11k
4
gravatar for Jeremy Leipzig
7.9 years ago by
Philadelphia, PA
Jeremy Leipzig18k wrote:

Picard's RnaSeqMetrics:

wget <ftp://genome-ftp.cse.ucsc.edu/goldenPath/hg19/database/refFlat.txt.gz> .
gunzip refFlat.txt.gz
java -jar picard-tools-1.48/CollectRnaSeqMetrics.jar INPUT=myBamfile.bam REF_FLAT=refFlat.txt OUTPUT=myBamfile.bam.rnaSeqMetric STRAND=NONE VALIDATION_STRINGENCY=LENIENT

output:

## net.sf.picard.metrics.StringHeader
# net.sf.picard.analysis.CollectRnaSeqMetrics REF_FLAT=refFlat.txt STRAND_SPECIFICITY=NONE INPUT=myBamfile.bam OUTPUT=myBamfile.bam.rnaSeqMetric VALIDATION_STRINGENCY=LENIENT    ASSUME_SORTED=true STOP_AFTER=0 TMP_DIR=/tmp/leipzig VERBOSITY=IN
O QUIET=false COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
## net.sf.picard.metrics.StringHeader
# Started on: Mon Jun 20 15:18:21 EDT 2011

## METRICS CLASS    net.sf.picard.analysis.RnaSeqMetrics
PF_ALIGNED_BASES    RIBOSOMAL_BASES CODING_BASES    UTR_BASES   INTRONIC_BASES  INTERGENIC_BASES    CORRECT_STRAND_READS    INCORRECT_STRAND_READS  PCT_RIBOSOMAL_BASES PCT_CODING_BASES    PCT_UTR_BASES   PCT_INTRONIC_BASES  PCT_INTERGENIC_BASES    PCT_MRNA_BASES  PCT_CO
RRECT_STRAND_READS
974654759   0   19072146    8005286 407874405   539702922   0   0   0   0.019568    0.008213    0.418481    0.553738    0.027782    0
ADD COMMENTlink written 7.9 years ago by Jeremy Leipzig18k

Didn't know about CollectRnaSeqMetrics -- which made me go check out the rest of Picard. Lots of great stuff in there.

ADD REPLYlink written 7.9 years ago by Ryan Dale4.8k

oh man, thanks for that ftp link. anyone getting their reference from ensembl will want to add 'chr' to the chromosome column. not having it will result in java.lang.NullPointerException at net.sf.picard.annotation.RefFlatReader.makeTranscriptFromRefFlatLine(RefFlatReader.java:173).

ADD REPLYlink written 7.4 years ago by Joe Brown70
0
gravatar for Joseph D
7.4 years ago by
Joseph D0
Joseph D0 wrote:

Hi I tried classify_reads.py and got an error. please help

python classify_reads.py --bam my.bam --gff refFlat.gtf --include intron exon repeat_region > report.txt

Traceback (most recent call last): File "classify_reads.py", line 191, in [?] main() File "classify_reads.py", line 182, in main chroms=args.assembly) File "classify_reads.py", line 107, in classify for iv, step_set in gaos[read.iv].steps(): File "_HTSeq.pyx", line 523, in HTSeq._HTSeq.GenomicArray.getitem (src/_HTSeq.c:8927) KeyError: None

Joseph

ADD COMMENTlink written 7.4 years ago by Joseph D0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1408 users visited in the last hour