To get genes into BED format:
$ wget -qO- ftp://ftp.ensembl.org//pub/release-103/gtf/homo_sapiens/Homo_sapiens.GRCh38.103.gtf.gz | gunzip -c | awk '{ print $0" transcript_id \"\";"}' | gtf2bed - | awk '($8=="gene")' > Homo_sapiens.GRCh38.103.gene.bed
It is necessary to add a blank transcript_id. Unfortunately, Ensembl GTF files do not follow specification, but the awk statement fixes that.
To prep reads:
$ sort-bed reads.bed > reads.sorted.bed
You may want to check that the chromosome field in reads.sorted.bed use the Ensembl chromosome naming scheme, i.e., 1, 2, etc. If your BED file uses the UCSC scheme (chr1, chr2, etc.) then the chr string needs to be removed, e.g. with awk or similar. Chromosome name schemes should match, or mapping will not work.
To map reads to each gene:
$ bedmap --echo --echo-map-id-uniq --skip-unmapped Homo_sapiens.GRCh38.103.gene.bed reads.sorted.bed > reads_mapped_to_genes.bed
To map genes to each read:
$ bedmap --echo --echo-map-id-uniq --skip-unmapped reads.sorted.bed Homo_sapiens.GRCh38.103.gene.bed > genes_mapped_to_reads.bed
Reference:
Worked great! Much appreciated!
Thanks! Trying this out now!