Question: Batch download of hg19 sequences in bed file
0
gravatar for cmccabe
24 months ago by
cmccabe170
Chicago
cmccabe170 wrote:

I am trying to identify runs of homopolymers in sequences of a bed and have a perl script that works for the desired output below. There are ~55,000 lines in the custom.bed, is there a way to download all the sequences in the bed in the desired output format? UCSC table browser will give the output in the correct format however there is a limit of 1,000 entries per query. The below seems close but the output is formatted incorrect. Thank you :).

custom.bed
chr1    948953  948956  chr1:948953-948956  .   ISG15

BED=custom.bed
for chr in `seq 1 22` X Y
do
wget -O - -q http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr1.fa.gz | gunzip -c >> hg19.fa
done

 fastaFromBed -fi hg19.fa -bed $BED -fo $BED.fasta

current output:

chr1 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

desired output of each line for fasta sequence:

>hg19_refGene_NM_002335_0 range=chr11:68080173-68080283 5'pad=10 3'pad=10 strand=+ repeatMasking=none
 gccggacaacATGGAGGCAGCGCCGCCCGGGCCGCCGTGGCCGCTGCTGC
 TGCTGCTGCTGCTGCTGCTGGCGCTGTGCGGCTGCCCGGCCCCCGCCGCG
 Ggtaggtgggc
ngs bed fasta • 991 views
ADD COMMENTlink modified 24 months ago by genecats.ucsc560 • written 24 months ago by cmccabe170
1

Are the fields in custom.bed tab-separated? You are not going to get the header you want (hg19_refGene_NM_002335_0 range=chr11:68080173-68080283 5'pad=10 3'pad=10 strand=+ repeatMasking=none, unless this is a dummy example) from the command you are using.

ADD REPLYlink modified 24 months ago • written 24 months ago by genomax59k
4
gravatar for genecats.ucsc
24 months ago by
genecats.ucsc560
genecats.ucsc560 wrote:

Hello, cmccabe.

Thank you for your question about getting FASTA sequences from the UCSC Genome Browser.

Are you uploading your "custom.bed" as a custom track and then attempting to get the FASTA sequence from the Table Browser? You are correct that there is a limit of 1000 regions if you are using the Table Browser's "define regions" option, but for custom tracks, there is no such limit.

Here are some basic steps that you could use to get the sequence you're interested in:

  1. Upload your "custom.bed" as a custom track: https://genome.ucsc.edu/cgi-bin/hgCustom?db=hg19
  2. After uploading, navigate to the Table Browser:https://genome.ucsc.edu/cgi-bin/hgTables.
  3. Make the following selections: clade: Mammal genome: Human assembly: Feb. 2009 (GRCh37/hg19) group: Custom Tracks track: My Custom Track table: myCustomTrack output: sequence output file: enter a file name to save your results to a file, or leave blank to display results in your browser

  4. Click "get output".

  5. Select your sequence output options.
  6. Click "get sequence".

You should end up with a FASTA with names similar to what you are looking for in your "desired output".

If you have any follow up questions, it would be helpful if you could post them to our Google Groups forum: https://groups.google.com/a/soe.ucsc.edu/forum/#!forum/genome, that way our whole team can see the question and help with an answer.

Thanks,

Matthew from the UCSC Genome Browser

ADD COMMENTlink written 24 months ago by genecats.ucsc560

Thank you all very much :)

ADD REPLYlink written 24 months ago by cmccabe170
2
gravatar for Chun-Jie Liu
24 months ago by
Chun-Jie Liu260
US, Houston
Chun-Jie Liu260 wrote:

From your script, you want to download all hg19 fasta, then use bedtools to extract from fasta. Your script is not correct.

for chr in `seq 1 22` X Y M
do
wget -O - -q http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr${chr}.fa.gz &
done
wait

gunzip -c *fa.gz > hg19.fasta

fastaFromBed -fi hg19.fa -bed $BED -fo $BED.fasta
ADD COMMENTlink written 24 months ago by Chun-Jie Liu260
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1128 users visited in the last hour