Basic Bam File Annotation
5
15
Entering edit mode
10.1 years ago

What are the options for obtaining basic annotation information from BAM alignments of NGS data. The stuff everyone wants to know:

% intergenic
% genic
% intronic
% exonic
% within 5kbp 5' UTR
% within repeats
% within CpG islands


bedtools bam htseq annotation rseqc • 9.0k views
1
Entering edit mode

I would add to the list: coding, 5' UTR, 3' UTR, synonymous & non-synonymous

0
Entering edit mode

i will also accept an expansion of the list of stuff everyone wants to know

0
Entering edit mode

Uh, is this from RNASeq data?

0
Entering edit mode

anything, DNA-Seq, RNA-Seq, ChIP-Seq, exome, whatever

0
Entering edit mode
17
Entering edit mode
10.1 years ago

I have been tinkering with a new tool for BEDTools (it's currently alpha, so it is in the development repo) that attempts to do this. I pushed it to the public repository, if you'd like to play with it. The tool is called "tagBam" and works as follows. You supply a BAM file, and a list of annotation files and associated labels. For each alignment, it will search for overlaps with each annotation file. Each time there is an overlap, the label you supply will be appended to a custom "YB" tag that will be added to the BAM alignment. For example:

tagBam -i aln.bam -files exons.bed introns.bed cpg.bed utrs.bed \ -tags exonic intonic cpg utr \ > aln.tagged.bam  For alignments that have overlaps, you should see new BAM tags like "YB:Z:exonic", "YB:Z:cpg;utr" I should emphasize that this is experimental, but I am hoping to make it available in the next release. As always, comments and suggestions are welcome. Another option is to write a custom script using existing interfaces such as pysam, the BioPerl BAM interface, Picard, samtools C-API, bamtools C++-API, etc. These solutions are nice because you will inevitably run up against a nuanced rule for the annotations that can't be addressed in a one-stop-shop. In particular, if you are a Python person, the pybedtools suite or the HTSeq suite are good options. ADD COMMENT 0 Entering edit mode Awesome! I have been having a lot of trouble not as much with computing these annotations but storing them in a coherent non-adhoc format that I can recall a few weeks later ;-) Putting it back right back into the BAM file is a phenomenal idea - I need to adopt this for other things as well. ADD REPLY 0 Entering edit mode Right. I plan to add an option such that one can define there own tag in addition to the default "YB" tag. ADD REPLY 0 Entering edit mode Wish I could edit my comment..."their", of course. ADD REPLY 0 Entering edit mode I agree, very cool. I guess you could just tabulate the YB tag as it streams to STDOUT to avoid an extra BAM around. One thing: what if some annos are stranded (genes) and some are not (cpg's)? ADD REPLY 0 Entering edit mode Nice catch Brent. Currently, there is a "-s" option to enforce that overlaps are on the same strand. However, this rule applies to all annotation files. I'll give some thought to how to best handle that case. ADD REPLY 0 Entering edit mode Added ability to override the default "YB" tag with the -tag option. -tags is now -labels ADD REPLY 6 Entering edit mode 10.1 years ago Ryan Dale 4.9k As Aaron mentioned, HTseq can be really useful (and importantly, flexible) for this sort of thing. As an example, this classify_reads.py script will count up • total reads in each of the combinatorial overlapping classes (e.g., exon only, intron only, and intron+exon which you can get from overlapping genes or multiple isoforms) • the total reads falling in each class (regardless of what other classes overlap, so "intron_all" and "exon_all") • the total reads • the total unannotated reads • total sequence space in each class You can use the --exclude or --include options to specify what features in the GFF/GTF to look for, e.g., classify_reads.py --bamBAM --gff \$GFF \
--include intron exon repeat_region > report.txt

0
Entering edit mode

Nice. This is a solid example of how to effectively use HTseq.

4
Entering edit mode
10.1 years ago

Picard's RnaSeqMetrics:

wget <ftp://genome-ftp.cse.ucsc.edu/goldenPath/hg19/database/refFlat.txt.gz> .
gunzip refFlat.txt.gz
java -jar picard-tools-1.48/CollectRnaSeqMetrics.jar INPUT=myBamfile.bam REF_FLAT=refFlat.txt OUTPUT=myBamfile.bam.rnaSeqMetric STRAND=NONE VALIDATION_STRINGENCY=LENIENT


output:

## net.sf.picard.metrics.StringHeader
# net.sf.picard.analysis.CollectRnaSeqMetrics REF_FLAT=refFlat.txt STRAND_SPECIFICITY=NONE INPUT=myBamfile.bam OUTPUT=myBamfile.bam.rnaSeqMetric VALIDATION_STRINGENCY=LENIENT    ASSUME_SORTED=true STOP_AFTER=0 TMP_DIR=/tmp/leipzig VERBOSITY=IN
O QUIET=false COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
# Started on: Mon Jun 20 15:18:21 EDT 2011

## METRICS CLASS    net.sf.picard.analysis.RnaSeqMetrics
PF_ALIGNED_BASES    RIBOSOMAL_BASES CODING_BASES    UTR_BASES   INTRONIC_BASES  INTERGENIC_BASES    CORRECT_STRAND_READS    INCORRECT_STRAND_READS  PCT_RIBOSOMAL_BASES PCT_CODING_BASES    PCT_UTR_BASES   PCT_INTRONIC_BASES  PCT_INTERGENIC_BASES    PCT_MRNA_BASES  PCT_CO
974654759   0   19072146    8005286 407874405   539702922   0   0   0   0.019568    0.008213    0.418481    0.553738    0.027782    0

0
Entering edit mode

Didn't know about CollectRnaSeqMetrics -- which made me go check out the rest of Picard. Lots of great stuff in there.

0
Entering edit mode

oh man, thanks for that ftp link. anyone getting their reference from ensembl will want to add 'chr' to the chromosome column. not having it will result in java.lang.NullPointerException at net.sf.picard.annotation.RefFlatReader.makeTranscriptFromRefFlatLine(RefFlatReader.java:173).

0
Entering edit mode
9.6 years ago
Joseph D • 0

python classify_reads.py --bam my.bam --gff refFlat.gtf --include intron exon repeat_region > report.txt


Traceback (most recent call last): File "classify_reads.py", line 191, in [?] main() File "classify_reads.py", line 182, in main chroms=args.assembly) File "classify_reads.py", line 107, in classify for iv, step_set in gaos[read.iv].steps(): File "_HTSeq.pyx", line 523, in HTSeq._HTSeq.GenomicArray.getitem (src/_HTSeq.c:8927) KeyError: None

Joseph

0
Entering edit mode
8 weeks ago