Question: Getting transcript annotation in BED format along with reference seq.
0
gravatar for darxsys
3.8 years ago by
darxsys190
Croatia
darxsys190 wrote:

I want to test how different RNA transcript abundance software work. I would like to use this simulator to

produce reads from transcripts because it enables to automatically generate transcript abundances.

To do that, I have to provide transcript annotation file in BED format and a reference sequence of the chromosome or a gene. My question is: How to correctly obtain a bed file for a gene and its transcripts and also the gene sequence from UCSC.  I have looked into How To Get Bed File Containing Exons Of Canonical Transcripts And Their Corresponding Gene Symbols answer, but I don't understand it entirely and it seems it is a different use-case. Thanks.

ADD COMMENTlink modified 3.8 years ago by Alex Reynolds28k • written 3.8 years ago by darxsys190
4
gravatar for Alex Reynolds
3.8 years ago by
Alex Reynolds28k
Seattle, WA USA
Alex Reynolds28k wrote:

As a general example, you could get a GFF file of annotations via GENCODE, and convert GFF to BED via BEDOPS gff2bed. If you're working with a recent build of the human genome, for instance:

$ wget -qO- ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_21/gencode.v21.annotation.gff3.gz \
    | gunzip --stdout - \
    | gff2bed - \
    > annotations.bed

BEDOPS bedops and bedmap facilitate making associations between BED-formatted reads and annotations. You could do bedops -e 1, for example, to get reads that overlap annotations. You could do further filtering with awk or grep to get subsets of these annotations, such as exons or genes.

To do sequence lookups, you could use samtools faidx against a directory of indexed FASTA files for your genome of interest (for instance, UCSC makes FASTA files available to download for human and other builds).

An example Perl script could parse a BED file containing reads-of-interest into sequence data, via indexed FASTA files:

#!/usr/bin/env perl

use strict;
use warnings;

# contains per-chromosome UCSC FASTA (.fa) files and samtools-indexed index (.fai) files
my $fastaDir = "/foo/bar/baz"; 

while (<STDIN>) {
    chomp;
    my ($chr, $start, $stop, $id, $score, $strand) = split("\t", $_);
    my $queryKey = "$chr:$start-$stop";
    my $queryResult = `samtools faidx $fastaDir/$chr.fa $queryKey`; chomp $queryResult;
    my @lines = split("\n", $queryResult);
    my @seqs = @lines[1..(scalar @lines - 1)];
    my $seq = join("", @seqs);
    if ($strand eq "-") { $seq = revdnacomp($seq); }
    my $header = ">".join(":",($chr, $start, $stop, $id, $score, $strand));
    print STDOUT $header."\n".uc($seq)."\n";
}

sub revdnacomp {
    my $dna = shift @_;
    my $revcomp = reverse($dna);
    $revcomp =~ tr/ACGTacgt/TGCAtgca/;
    return $revcomp;
}

Hopefully, these are useful starting points.

ADD COMMENTlink modified 3.8 years ago • written 3.8 years ago by Alex Reynolds28k

Unfortunately, it seems that a bed file produced by gff2bed is not how it should look like. I've compared bed fields from such files with those provided by the mentioned simulator's examples, and they differ quite a lot so I can't use the simulator on gff2bed files.

This is how simulator expects them to be: http://www.deviantpics.com/STx

This is how gtf2bed produces them: http://www.deviantpics.com/STr

ADD REPLYlink modified 3.8 years ago • written 3.8 years ago by darxsys190

Hii,

can you write a simple 'awk' script to convert the resultant files into the format you want?

ADD REPLYlink written 3.8 years ago by Manvendra Singh2.0k

Probably, If I knew what fields correspond to which in these two file formats. I am not very familiar with variations between bed files.

ADD REPLYlink written 3.8 years ago by darxsys190

file you want is bed12 format,

gtf2bed has produced bed file, without prefix "chr"

you can add it

you can check UCSC bed12 format, I am sure you can convert it

ADD REPLYlink written 3.8 years ago by Manvendra Singh2.0k

What I've noticed is that the simulator returns an error trying to convert field number 10 to exon length. I don't see any such corresponding field in this bed3 file.

ADD REPLYlink written 3.8 years ago by darxsys190

The documentation for gff2bed shows a table of how GFF fields are mapped to BED columns/indices.

ADD REPLYlink written 3.8 years ago by Alex Reynolds28k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 965 users visited in the last hour