Getting transcript annotation in BED format along with reference seq.
1
0
Entering edit mode
7.8 years ago
darxsys ▴ 220

I want to test how different RNA transcript abundance software work. I would like to use this simulator to

produce reads from transcripts because it enables to automatically generate transcript abundances.

To do that, I have to provide transcript annotation file in BED format and a reference sequence of the chromosome or a gene. My question is: How to correctly obtain a bed file for a gene and its transcripts and also the gene sequence from UCSC. I have looked into this answer, but I don't understand it entirely and it seems it is a different use-case. Thanks.

RNA-Seq genome annotation transcriptome • 6.1k views
4
Entering edit mode
7.8 years ago

As a general example, you could get a GFF file of annotations via GENCODE, and convert GFF to BED via BEDOPS gff2bed. If you're working with a recent build of the human genome, for instance:

$wget -qO- ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_21/gencode.v21.annotation.gff3.gz \ | gunzip --stdout - \ | gff2bed - \ > annotations.bed  BEDOPS bedops and bedmap facilitate making associations between BED-formatted reads and annotations. You could do bedops -e 1, for example, to get reads that overlap annotations. You could do further filtering with awk or grep to get subsets of these annotations, such as exons or genes. To do sequence lookups, you could use samtools faidx against a directory of indexed FASTA files for your genome of interest (for instance, UCSC makes FASTA files available to download for human and other builds). An example Perl script could parse a BED file containing reads-of-interest into sequence data, via indexed FASTA files: #!/usr/bin/env perl use strict; use warnings; # contains per-chromosome UCSC FASTA (.fa) files and samtools-indexed index (.fai) files my$fastaDir = "/foo/bar/baz";

while (<STDIN>) {
chomp;
my ($chr,$start, $stop,$id, $score,$strand) = split("\t", $_); my$queryKey = "$chr:$start-$stop"; my$queryResult = samtools faidx $fastaDir/$chr.fa $queryKey; chomp$queryResult;
my @lines = split("\n", $queryResult); my @seqs = @lines[1..(scalar @lines - 1)]; my$seq = join("", @seqs);
if ($strand eq "-") {$seq = revdnacomp($seq); } my$header = ">".join(":",($chr,$start, $stop,$id, $score,$strand));
print STDOUT $header."\n".uc($seq)."\n";
}

sub revdnacomp {
my $dna = shift @_; my$revcomp = reverse($dna);$revcomp =~ tr/ACGTacgt/TGCAtgca/;
return \$revcomp;
}


Hopefully, these are useful starting points.

0
Entering edit mode

Unfortunately, it seems that a bed file produced by gff2bed is not how it should look like. I've compared bed fields from such files with those provided by the mentioned simulator's examples, and they differ quite a lot so I can't use the simulator on gff2bed files.

This is how simulator expects them to be: http://www.deviantpics.com/STx

This is how gtf2bed produces them: http://www.deviantpics.com/STr

0
Entering edit mode

Hi,

Can you write a simple 'awk' script to convert the resultant files into the format you want?

0
Entering edit mode

Probably, If I knew what fields correspond to which in these two file formats. I am not very familiar with variations between bed files.

0
Entering edit mode

file you want is bed12 format,

gtf2bed has produced bed file, without prefix "chr"

you can check UCSC bed12 format, I am sure you can convert it

0
Entering edit mode

What I've noticed is that the simulator returns an error trying to convert field number 10 to exon length. I don't see any such corresponding field in this bed3 file.

0
Entering edit mode

The documentation for gff2bed shows a table of how GFF fields are mapped to BED columns/indices.