How to get list of 150 bp sequences for each of 15,000 from chromosome positions
4
2
Entering edit mode
9.5 years ago
sebastiz ▴ 20

I have a long list of chromosome positions in the format below, except there are 15,000 of them. I would like to get the sequence for each of the spans (or alternatively 150bp up from the start position). Can anyone tell me how to do this for this number of positions?

chr15:60212080-60213230
chr10:60242850-60243000
chr11:60469240-60469390
chr19:60954240-60954390
chr12:61260820-61260970
chr21:61576770-61576920
chr1:61586420-61586570
chr1:61927840-61927990

Thanks in advance

sequencing genome • 2.8k views
ADD COMMENT
3
Entering edit mode
9.5 years ago
smilefreak ▴ 420

K hopefully this should be easy enough.

Follow these steps.

  1. samtools faidx <reference genome>
  2. and a little bash loop.

    while read region
    do
    samtools faidx <reference genome> $region >> output_file.txt
    done < FILE_REGIONS
    
ADD COMMENT
3
Entering edit mode
9.5 years ago
PoGibas 5.1k

Using bedtools getfasta. Input is: chromosome positions & reference genome ( https://www.biostars.org/p/1796/ ).

ADD COMMENT
2
Entering edit mode
9.5 years ago

Using Bioconductor/R:

  1. create GRanges object containing your ranges
  2. use the appropriate BSgenome package (or FaFile) and getSeq()

Using DAS after replacing "-" with ",". This URL gives an example result:

http://genome.ucsc.edu/cgi-bin/das/hg19/dna?segment=chr1:100000,200000

Many other possiblities!

ADD COMMENT
1
Entering edit mode
9.5 years ago

Using Python, you can use the pyfaidx module:

pip install --user pyfaidx
faidx genome.fasta --bed regions.bed
ADD COMMENT

Login before adding your answer.

Traffic: 2323 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6