Question

Batch download of hg19 sequences in bed file

1

Entering edit mode

7.4 years ago

bioguy24 ▴ 230

I am trying to identify runs of homopolymers in sequences of a bed and have a perl script that works for the desired output below. There are ~55,000 lines in the custom.bed, is there a way to download all the sequences in the bed in the desired output format? UCSC table browser will give the output in the correct format however there is a limit of 1,000 entries per query. The below seems close but the output is formatted incorrect. Thank you :).

custom.bed
chr1    948953  948956  chr1:948953-948956  .   ISG15

BED=custom.bed
for chr in `seq 1 22` X Y
do
wget -O - -q http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr1.fa.gz | gunzip -c >> hg19.fa
done

 fastaFromBed -fi hg19.fa -bed $BED -fo $BED.fasta

current output:

chr1 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

desired output of each line for fasta sequence:

>hg19_refGene_NM_002335_0 range=chr11:68080173-68080283 5'pad=10 3'pad=10 strand=+ repeatMasking=none
 gccggacaacATGGAGGCAGCGCCGCCCGGGCCGCCGTGGCCGCTGCTGC
 TGCTGCTGCTGCTGCTGCTGGCGCTGTGCGGCTGCCCGGCCCCCGCCGCG
 Ggtaggtgggc

ngs FASTA Bed • 3.0k views

ADD COMMENT • link updated 7.4 years ago by genecats.ucsc ▴ 580 • written 7.4 years ago by bioguy24 ▴ 230

1

Entering edit mode

Are the fields in custom.bed tab-separated? You are not going to get the header you want (hg19_refGene_NM_002335_0 range=chr11:68080173-68080283 5'pad=10 3'pad=10 strand=+ repeatMasking=none, unless this is a dummy example) from the command you are using.

ADD REPLY • link 7.4 years ago by GenoMax 141k

2

Entering edit mode

7.4 years ago

Chun-Jie Liu ▴ 280

From your script, you want to download all hg19 fasta, then use bedtools to extract from fasta. Your script is not correct.

for chr in `seq 1 22` X Y M
do
wget -O - -q http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr${chr}.fa.gz &
done
wait

gunzip -c *fa.gz > hg19.fasta

fastaFromBed -fi hg19.fa -bed $BED -fo $BED.fasta

ADD COMMENT • link 7.4 years ago by Chun-Jie Liu ▴ 280

score 5 · Accepted Answer · 2016-12-16

Hello, cmccabe.

Thank you for your question about getting FASTA sequences from the UCSC Genome Browser.

Are you uploading your "custom.bed" as a custom track and then attempting to get the FASTA sequence from the Table Browser? You are correct that there is a limit of 1000 regions if you are using the Table Browser's "define regions" option, but for custom tracks, there is no such limit.

Here are some basic steps that you could use to get the sequence you're interested in:

Upload your "custom.bed" as a custom track: https://genome.ucsc.edu/cgi-bin/hgCustom?db=hg19
After uploading, navigate to the Table Browser:https://genome.ucsc.edu/cgi-bin/hgTables.
Make the following selections: clade: Mammal genome: Human assembly: Feb. 2009 (GRCh37/hg19) group: Custom Tracks track: My Custom Track table: myCustomTrack output: sequence output file: enter a file name to save your results to a file, or leave blank to display results in your browser
Click "get output".
Select your sequence output options.
Click "get sequence".

You should end up with a FASTA with names similar to what you are looking for in your "desired output".

If you have any follow up questions, it would be helpful if you could post them to our Google Groups forum: https://groups.google.com/a/soe.ucsc.edu/forum/#!forum/genome, that way our whole team can see the question and help with an answer.

Thanks,

Matthew from the UCSC Genome Browser