Question

Mapping RNA seq reads from bed file to genes

0

Entering edit mode

4.5 years ago

jrose ▴ 30

I have a bed file with columns: chrom, read_start, read_end, id, strand

where every row represents a read from an RNA seq data set (e.g. roadmap).

I would like to map every row to a gene (ensemble id/ gene name). I have an annotation file taken from ensembl (ftp://ftp.ensembl.org/pub/current_gtf/homo_sapiens/) . Is an appropriate strategy to find the rows in the annotation file which intersects the current bed file row, and look for the row in the annotation file where the column type value is "gene"?

RNAseq gene-expression tpm • 2.7k views

ADD COMMENT • link 4.5 years ago by jrose ▴ 30

0

Entering edit mode

4.5 years ago

heskett ▴ 110

It seems like you might be trying to reinvent the wheel here since an alignment tool like STAR will do all the counting for you given the raw reads and the gene GTFs. However if youre stuck with a bed file of reads, all you need to do is convert the GTF to bed simply using awk, and then do a bedtools intersect or bedtools map which will do the gene counting for you

ADD COMMENT • link 4.5 years ago by heskett ▴ 110

0

Entering edit mode

you are correct. I am stuck with the bed files (don't have raw reads due to privacy restrictions).

ADD REPLY • link 4.5 years ago by jrose ▴ 30

score 2 · Accepted Answer · 2021-04-25

To get genes into BED format:

$ wget -qO- ftp://ftp.ensembl.org//pub/release-103/gtf/homo_sapiens/Homo_sapiens.GRCh38.103.gtf.gz | gunzip -c | awk '{ print $0" transcript_id \"\";"}' | gtf2bed - | awk '($8=="gene")' > Homo_sapiens.GRCh38.103.gene.bed

It is necessary to add a blank transcript_id. Unfortunately, Ensembl GTF files do not follow specification, but the awk statement fixes that.

To prep reads:

$ sort-bed reads.bed > reads.sorted.bed

You may want to check that the chromosome field in reads.sorted.bed use the Ensembl chromosome naming scheme, i.e., 1, 2, etc. If your BED file uses the UCSC scheme (chr1, chr2, etc.) then the chr string needs to be removed, e.g. with awk or similar. Chromosome name schemes should match, or mapping will not work.

To map reads to each gene:

$ bedmap --echo --echo-map-id-uniq --skip-unmapped Homo_sapiens.GRCh38.103.gene.bed reads.sorted.bed > reads_mapped_to_genes.bed

To map genes to each read:

$ bedmap --echo --echo-map-id-uniq --skip-unmapped reads.sorted.bed Homo_sapiens.GRCh38.103.gene.bed > genes_mapped_to_reads.bed

Reference: