all coding regions .bed file hg38 Whole Genome Sequencing
1
1
23 months ago
cocchi.e89 ▴ 190

Quick question: is there a .bed (or similar) file available that spans all coding regions in hg38 coordinates? I need to analyze some WGS samples.

Thank you very much in advance for any help!

wgs coding bed • 4.2k views
0

You could download the hg38 GTF file from GENCODE and extract relevant columns and records from it.

4
23 months ago
wget -O - "http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_35/gencode.v35.annotation.gtf.gz" |\
gunzip -c | grep 'transcript_type "protein_coding"' |\
awk '($3=="exon") {printf("%s\t%s\t%s\n",$1,int($4)-1,$5);}' |\
sort -T . -t $'\t' -k1,1 -k2,2n | bedtools merge

0

Thanks so much. I was wondering: which regions are filtered out between line 2 and line 3 (i.e. regions flagged as protein_coding that are not exons)?

0

You should download and take a look at the GTF file. Explore it so you understand what the above command does with real data.

0

I did, and I ran the command stage by stage ("de-piped" it); mine was more of a theoretical question.

1

Right, then what are the unique values you see in $3 when transcript_type is protein_coding? That should tell you what the non-exonic records in protein-coding transcripts are.
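
One way to check this on real data is to count the feature types in column 3 for records tagged as protein_coding. This is only a minimal sketch: `feature_types` is a hypothetical helper name, and it assumes the GTF has already been downloaded and is piped in uncompressed. In a GENCODE GTF you would probably see values such as transcript, exon, CDS, UTR, start_codon and stop_codon, but it is worth confirming against the release you actually use.

```shell
# Count feature types (GTF column 3) among records whose attributes
# contain transcript_type "protein_coding".
# Reads an uncompressed GTF on stdin; `feature_types` is a hypothetical name.
feature_types() {
  grep 'transcript_type "protein_coding"' |
  cut -f 3 |
  LC_ALL=C sort |
  uniq -c
}

# Usage (assumes the file from the command above is in the current directory):
#   gunzip -c gencode.v35.annotation.gtf.gz | feature_types
```

Every feature type other than exon in that list is what the awk step on line 3 drops.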

0

Don't the exon entries contain non-coding UTRs?
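
They do: exon records in a GTF cover the whole exon, including any 5' and 3' UTR portions, whereas CDS records cover only the translated part. If strictly coding sequence is wanted, one possible variation is to select CDS features instead of exons. The sketch below is an assumption-laden adaptation of the answer's command, not the answerer's own method; `gtf_cds_to_bed` is a hypothetical helper name.

```shell
# Convert CDS records of protein_coding transcripts to sorted BED3.
# Reads an uncompressed GTF on stdin; `gtf_cds_to_bed` is a hypothetical name.
gtf_cds_to_bed() {
  # GTF is 1-based and inclusive; BED is 0-based and half-open, hence start-1.
  awk -F '\t' '$3 == "CDS" && $9 ~ /transcript_type "protein_coding"/ {
    printf("%s\t%d\t%s\n", $1, $4 - 1, $5)
  }' |
  LC_ALL=C sort -k1,1 -k2,2n
}

# Usage (assumes the GTF is already downloaded and bedtools is installed):
#   gunzip -c gencode.v35.annotation.gtf.gz | gtf_cds_to_bed | bedtools merge > cds.hg38.bed
```

Note that in GTF files the stop codon is normally not part of the CDS feature, so add the stop_codon records as well if those three bases matter for your analysis.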

0

For future bioinformaticians who land on this page: please note that release 35 is no longer the latest release, so you may want to update the link to the gtf.gz file in the script above. You can find the latest release by inspecting http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/. For example, the current release (as of May 2022) is release 40, so the link is http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_40/gencode.v40.annotation.gtf.gz.
By the way, the coordinates are on the human reference genome hg38 (i.e. GRCh38).
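
To avoid editing the URL by hand each time, the release number can be kept in one place. This is a small sketch with a hypothetical helper name, `gencode_gtf_url`; it assumes other releases follow the same path pattern as the release 35 and release 40 links quoted in this thread, which you should verify in the directory listing.

```shell
# Build the annotation GTF URL for a given GENCODE human release number.
# `gencode_gtf_url` is a hypothetical helper; the path pattern is inferred
# from the release 35 and release 40 links in this thread.
gencode_gtf_url() {
  release="$1"
  printf 'http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_%s/gencode.v%s.annotation.gtf.gz\n' \
    "$release" "$release"
}

# Usage (needs network access):
#   wget -O - "$(gencode_gtf_url 40)" | gunzip -c | head
```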

0

Thank you for this. Can I ask, what is the best method to filter/extract only variants that appear in the exome (i.e. in the generated .bed file) from an annotated WGS vcf?

0

unrelated to the original question.

0

Apologies, I thought it was related as I imagined this is exactly what @cocchi.e89 was trying to do with the .bed file he was requesting.