Question

bed intersect to get gene names from bed

1

Entering edit mode

2.4 years ago

whb ▴ 60

Hi, I have a bed file for the baits and I want to get a list of targeted genes. So I downloaded a bed file from UCSC with the settings below and used bedtools to intersect. But I end up having >8000 genes (the panel has 353 genes). and GNB1 for example is not in the panel. Could you tell me what did I do wrong please? Thank you!

Clade: Mammals
assembly: hg38
group: Gene and Gene Predictions
track: NCBI RefSeq
table: RefSeq All (ncbiRefSeq)
output format: select from primary table (chromosomes, cdsStart, cdsEnd, name2)

Then I used:

bedtools intersect -a bait.bed -b UCSC.bed -wa -wb > intersect.bed

head intersect.bed

chr1    1819822 1820002 chr1:1819775-1819875    chr1    1787330 1825453 GNB1
chr1    1819822 1820002 chr1:1819775-1819875    chr1    1787330 1825453 GNB1
chr1    1819822 1820002 chr1:1819775-1819875    chr1    1787330 1825453 GNB1
chr1    1819822 1820002 chr1:1819775-1819875    chr1    1787330 1825453 GNB1
chr1    1819822 1820002 chr1:1819775-1819875    chr1    1787330 1825453 GNB1
chr1    1819822 1820002 chr1:1819775-1819875    chr1    1787330 1825453 GNB1
chr1    3683000 3683240 chr1:3683057-3683185    chr1    3682365 3732762 TP73
chr1    3683000 3683240 chr1:3683057-3683185    chr1    3682365 3733079 TP73
chr1    3683000 3683240 chr1:3683057-3683185    chr1    3682365 3733079 TP73
chr1    3683000 3683240 chr1:3683057-3683185    chr1    3682365 3731555 TP73
chr1    3683000 3683240 chr1:3683057-3683185    chr1    3682365 3732762 TP73

bedtools intersect • 1.5k views

ADD COMMENT • link updated 2.4 years ago by Alex Reynolds 35k • written 2.4 years ago by whb ▴ 60

0

Entering edit mode

check you have no duplicate in your input bed.

ADD REPLY • link 2.4 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

Hi Pierre,

thank you for the reply. Please let me know if I am wrong but it doesn't look like it has duplicates. I wanted to input the bed file to UCSC but there is a limit of 1000 entries and there are ~9000 lines in the bait.bed file.

head -20 bait.bed

chr1    819068  819248  chr1:819073-819173
chr1    1819822 1820002 chr1:1819775-1819875
chr1    2833781 2833961 chr1:2833822-2833922
chr1    3682319 3682479 chr1:3682365-3682435
chr1    3683000 3683240 chr1:3683057-3683185
chr1    3688157 3688337 chr1:3688198-3688297
chr1    3688932 3689112 chr1:3688973-3689072
chr1    3690846 3691006 chr1:3690905-3690949
chr1    3707530 3707810 chr1:3707546-3707796
chr1    3708893 3709073 chr1:3708934-3709033
chr1    3721974 3722254 chr1:3722018-3722212
chr1    3723292 3723532 chr1:3723351-3723474
chr1    3724468 3724648 chr1:3724509-3724608
chr1    3727070 3727270 chr1:3727112-3727229
chr1    3727579 3727819 chr1:3727625-3727775
chr1    3728073 3728273 chr1:3728126-3728222
chr1    3729268 3729508 chr1:3729324-3729453
chr1    3729954 3730194 chr1:3729997-3730153

head UCSC.bed

#chrom  cdsStart    cdsEnd  name2
chr1    67093579    67127240    C1orf141
chr1    67093004    67127240    C1orf141
chr1    67134970    67134970    C1orf141
chr1    67093004    67103382    C1orf141
chr1    67093004    67127240    C1orf141
chr1    67093004    67127240    C1orf141
chr1    67093004    67127240    C1orf141
chr1    67093004    67127240    C1orf141
chr1    67093569    67127240    C1orf141
chr1    67093579    67127240    C1orf141
chr1    67096270    67127240    C1orf141
chr1    201283702   201328836   PKP1
chr1    201283702   201328836   PKP1
chr1    8355086 8364133 RERE
chr1    8355086 8656297 RERE
chr1    8355086 8656297 RERE
chr1    33519517    34165097    CSMD2
chr1    33519517    34165813    CSMD2
chr1    33519517    34165097    CSMD2

ADD REPLY • link 2.4 years ago by whb ▴ 60

0

Entering edit mode

chr1    67093004    67127240    C1orf141
chr1    67093004    67127240    C1orf141

ADD REPLY • link 2.4 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

Sorry I missed that. But even if there are duplicates in the UCSC bed file. Why genes that are not targeted in the panel appear after intersecting? I previously counted the number of genes based on the unique gene names not line numbers. So I imagine removing the duplicates will not change the number of unique genes in the bed file after intersect? Thanks again

ADD REPLY • link 2.4 years ago by whb ▴ 60

score 0 · Answer 1 · 2021-11-13

Here's another option that might give you correct or expected results:

$ bedmap --echo --echo-map --skip-unmapped --delim '\t' <(sort-bed --unique bait.bed) <(sort-bed --unique UCSC.bed) > answer.bed

Passing --unique to sort-bed more quickly strips out duplicates after sorting.

Using --echo-map returns all of the mapped element. If you just want a list of unique gene names, you can replace --echo-map with --echo-map-id-uniq:

$ bedmap --echo --echo-map-id-uniq --skip-unmapped --delim '\t' <(sort-bed --unique bait.bed) <(sort-bed --unique UCSC.bed) > answer.bed

Ref: