How to create interval list from reference fasta or dict file?
2
0
Entering edit mode
3.2 years ago
MatthewP ★ 1.4k

I am using GATK pipeline on WGS data. My BAM files is aligned to GRCh38 from GENCODE. So I want to create interval file for this GRCh38 instead of download from GATKbundle, because some of their contigs have different names. For example "KI270706.1" in GENCODE's GRCh38 is "chr1_KI270706v1_random" in interval list downloaded from GATKbundle. I can't modify interval list downloaded from GATKbundle, because some contigs are different, for example exists in interval list but not GENCODE's GRCh38.

Can anyone tells me how to create my own interval list from GENCODE GRCh38? Thanks.

GATK • 5.7k views
ADD COMMENT
0
Entering edit mode

Hello @MatthewP , were you able to that for a fasta file for example? If you have a fasta file containing a targeted region, how to you generate a intervals.bed file for that fasta?

Thank you!

ADD REPLY
7
Entering edit mode
3.2 years ago
MatthewP ★ 1.4k

Ok I can answer myself.

# Create dict and faidx index for fasta
gatk CreateSequenceDictionary -R GRCh38.primary_assembly.genome.fa
samtools faidx GRCh38.primary_assembly.genome.fa
# get fasta bed file
awk -v FS="\t" -v OFS="\t" '{print $1 FS "0" FS ($2-1)}' GRCh38.primary_assembly.genome.fa.fai > GRCh38.primary_assembly.genome.bed
# remove blacklist
bedtools subtract -a GRCh38.primary_assembly.genome.bed -b ../hg38.blacklist.bed > GRCh38.primary_assembly.genome.interval.bed
# convert to interval list file
gatk BedToIntervalList -I GRCh38.primary_assembly.genome.interval.bed -O \ 
GRCh38.primary_assembly.genome.interval_list -SD GRCh38.primary_assembly.genome.dict

Note, this is picard style interval list, so named pattern is *.interval_list

ADD COMMENT
0
Entering edit mode

Hi.

Thanks for your scripts! It is awesome.

Because call of intervals used bed files, I was wondering if gtf could do it as well. So I test gtf2bed command from bedops to see if gencode gtf could be used for this purpose.

It output a hundreds Mb interval list with many repeated information and ensembl_id as well. I did not test whether this test interval gonna work.

Searching for the whole internet, here is the most accurate answer for this question.

Thanks again.

ADD REPLY
0
Entering edit mode
3.2 years ago
tothepoint ▴ 800

Use picard tool

java -jar picard.jar CreateSequenceDictionary \ 
      R=reference.fasta \ 
      O=reference.dict

This will do the job.

ADD COMMENT
1
Entering edit mode

Ok, I already created this dict file, but don't know this can serve as interval list, thanks.

ADD REPLY
1
Entering edit mode

Ok, this won't work, parse dict file to -L parameter of gatk command HaplotypeCaller will cause error.

ADD REPLY

Login before adding your answer.

Traffic: 3149 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6