Question: Batch download of hg19 sequences in bed file
0
gravatar for bioguy24
2.9 years ago by
bioguy24190
Chicago
bioguy24190 wrote:

I am trying to identify runs of homopolymers in sequences of a bed and have a perl script that works for the desired output below. There are ~55,000 lines in the custom.bed, is there a way to download all the sequences in the bed in the desired output format? UCSC table browser will give the output in the correct format however there is a limit of 1,000 entries per query. The below seems close but the output is formatted incorrect. Thank you :).

custom.bed
chr1    948953  948956  chr1:948953-948956  .   ISG15

BED=custom.bed
for chr in `seq 1 22` X Y
do
wget -O - -q http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr1.fa.gz | gunzip -c >> hg19.fa
done

 fastaFromBed -fi hg19.fa -bed $BED -fo $BED.fasta

current output:

chr1 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

desired output of each line for fasta sequence:

>hg19_refGene_NM_002335_0 range=chr11:68080173-68080283 5'pad=10 3'pad=10 strand=+ repeatMasking=none
 gccggacaacATGGAGGCAGCGCCGCCCGGGCCGCCGTGGCCGCTGCTGC
 TGCTGCTGCTGCTGCTGCTGGCGCTGTGCGGCTGCCCGGCCCCCGCCGCG
 Ggtaggtgggc
ngs bed fasta • 1.4k views
ADD COMMENTlink modified 2.9 years ago by genecats.ucsc570 • written 2.9 years ago by bioguy24190
1

Are the fields in custom.bed tab-separated? You are not going to get the header you want (hg19_refGene_NM_002335_0 range=chr11:68080173-68080283 5'pad=10 3'pad=10 strand=+ repeatMasking=none, unless this is a dummy example) from the command you are using.

ADD REPLYlink modified 2.9 years ago • written 2.9 years ago by genomax74k
5
gravatar for genecats.ucsc
2.9 years ago by
genecats.ucsc570
genecats.ucsc570 wrote:

Hello, cmccabe.

Thank you for your question about getting FASTA sequences from the UCSC Genome Browser.

Are you uploading your "custom.bed" as a custom track and then attempting to get the FASTA sequence from the Table Browser? You are correct that there is a limit of 1000 regions if you are using the Table Browser's "define regions" option, but for custom tracks, there is no such limit.

Here are some basic steps that you could use to get the sequence you're interested in:

  1. Upload your "custom.bed" as a custom track: https://genome.ucsc.edu/cgi-bin/hgCustom?db=hg19
  2. After uploading, navigate to the Table Browser:https://genome.ucsc.edu/cgi-bin/hgTables.
  3. Make the following selections: clade: Mammal genome: Human assembly: Feb. 2009 (GRCh37/hg19) group: Custom Tracks track: My Custom Track table: myCustomTrack output: sequence output file: enter a file name to save your results to a file, or leave blank to display results in your browser

  4. Click "get output".

  5. Select your sequence output options.
  6. Click "get sequence".

You should end up with a FASTA with names similar to what you are looking for in your "desired output".

If you have any follow up questions, it would be helpful if you could post them to our Google Groups forum: https://groups.google.com/a/soe.ucsc.edu/forum/#!forum/genome, that way our whole team can see the question and help with an answer.

Thanks,

Matthew from the UCSC Genome Browser

ADD COMMENTlink written 2.9 years ago by genecats.ucsc570

Thank you all very much :)

ADD REPLYlink written 2.9 years ago by bioguy24190
2
gravatar for Chun-Jie Liu
2.9 years ago by
Chun-Jie Liu260
US, Houston
Chun-Jie Liu260 wrote:

From your script, you want to download all hg19 fasta, then use bedtools to extract from fasta. Your script is not correct.

for chr in `seq 1 22` X Y M
do
wget -O - -q http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr${chr}.fa.gz &
done
wait

gunzip -c *fa.gz > hg19.fasta

fastaFromBed -fi hg19.fa -bed $BED -fo $BED.fasta
ADD COMMENTlink written 2.9 years ago by Chun-Jie Liu260
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1904 users visited in the last hour