how to quickly extract sequence from genome positions
4
1
Entering edit mode
8.4 years ago
Luyi Tian ▴ 120

Hi all, I have several millions of records with format chromosome position like:

chr1 10001
chr1 144244
chr2 ...

And I know it is from human hg19. Now I want to get the ATGC info for each position so it would become:

chr1 10001 A
chr1 144244 G
chr2 ...

Anyone know what is the best practice for such job. Tools like UCSC table work for single record, but too slow for a lot of them.

sequence DNA fasta • 8.3k views
ADD COMMENT
0
Entering edit mode

search this site for samtools faidx

ADD REPLY
2
Entering edit mode
8.4 years ago
iraun 6.2k

bedtools getfasta will also work for this job. You should create a bed file with the chromosome start and end positions that you'd like to extract:

chr1 10001 10002
chr1 144244 144245
...
ADD COMMENT
1
Entering edit mode

You'll want to make sure that the coordinates are 0-based for BED files, so probably subtract 1 from each start and end coordinate.

ADD REPLY
4
Entering edit mode
8.4 years ago
Danielk ▴ 640
samtools faidx genome.fasta chr1:10001-10002

also does this.

ADD COMMENT
0
Entering edit mode
8.4 years ago

You have two operations that need to be performed. 1) converting your coordinates into valid input for a tool - either 1-based chr:start-end or 0-based chr start end and 2) transposing the FASTA output so that each entry is on one line. You can do this using pyfaidx:

$ pip install pyfaidx
$ awk '{print $1,$2-1, $2}' records.txt | xargs faidx hg19.fa --bed - --transform transposed
ADD COMMENT
0
Entering edit mode
8.4 years ago

In R/Bioconductor you can use the BSGenome object. For example:

> library("BSgenome.Hsapiens.UCSC.hg19")
> getSeq(BSgenome.Hsapiens.UCSC.hg19, 'chr1', start=10000, end=10200)
  201-letter "DNAString" instance
seq: NTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA...CTAACCCTAACCCTAACCCTAACCCCTAACCCTAACCCTAACCCTAACCCTAACCTAACCCTAACCCTAACCCTAA

You can combine this with the TxDb object to get the sequences of all human genes:

> library(Homo.sapiens)
> getSeq(BSgenome.Hsapiens.UCSC.hg19, genes(TxDb.Hsapiens.UCSC.hg19.knownGene)[1:10])
  A DNAStringSet instance of length 10
      width seq                                                                                                                             names               
 [1]  16043 ATAGTTACGCAGGCGCAGTGGGGAGAAACGCGACGCCTTGGGCCGCTCTGCCGAATGCAACC...GGCTCTGTTTAAAATGTCTTTGGACTCCCAGGGCTGAGTGGGCTGGGATCTCGTGTCCTCAA 1
 [2]   9969 TGAGATCACTTCCCTTGCAGACTTTGGAAGGGAGAGCACTTTATTACAGACCTTGGAAGCAA...TATTGATTTATTGTTGAATTCATAGTAAATTTTTACTGGTAAATGAATAAAGAATATTGTGG 10
 [3]  32214 GGCCCGTTAAGAAGAGCGTGGCCGGCCGCGGCCACCGCTGGCCCCAGGGAAAGCCGAGCGGC...CTGTGGCAACTTGTACTGAAAATCTGGTGCTCAATAAAGAAGCCCATGGCTGGTGGCATGCA 100
 [4] 226516 GGGGAGCGCCATCCGCTCCACTTCCACCTCCACATCCTCCACCGGCCAAGGTCCCCGCCGCT...AGAATATAAATGATACACCTCTGACCCCAGCGTTCTGAATAAAATGCTAATTTTGGATCTGG 1000
 [5] 355352 CTCAAATACACATCACCAAACAAATTTTCTCTATTATTTGGGTAGGCGTGACTGGTTTTCTT...ACTCGAATTCTCACAAGGAAAAGGCCATTAAAGCTCAAGGTGCATTTCAAACTCCAGGCTAC 10000
 [6]  15729 TGTGAAATATGAGTTGGCGAGGAAGATCGACCTATTATTGGCCTAGACCAAGGCGCTATGTA...AATTTGTTCATTAAAATTCTCCCAATAAAGCTTTACAGCCTTCTGCAAAGAAGTCTTGCGCA 100008586
 [7]   2784 GTGGGAAGTGCGGCTCCTTTCGCGTCCCCCACCCTCTCGGCTCCGCCTGGCAGCAGCTCCGC...ACATTCAAGTTGGGTATAATGACTTTATTTTCCCAGTAATTCCCTAAATAAAACTATTACAA 100009676
 [8]  16428 GAAAGAGAACCTGTAAACGCTCTCGGAATTATGGCGGCGGTGGATATCCGAGGTATACTGTA...ACCTACACACTCACATGTTTACTGGTAGATATGTTTAAAAGCAAAATAAAGGTATTTGTATA 10001
 [9]   7704 AGAAATCTCCTCAAGCCAGAGCCTGTGCTGTGAGGGGCTTCGGGACCTTGGGGCAGCTCCTG...GTGCACTTTATTTTGTTAGACCTCAATAAAAAAGTAAAAAAAAAAAACAAAAAAAACCAGAA 10002
[10]  57962 GCGGAGGCCACCCAGAGCTCACAGCCTCCTGCCAGCGCGCTCTCTGTTTCTCTGCAGCCCCG...AAATATATATAAAATTTATTGTATGAAAATTAAGCCTCAATAAACGTGATTATAAAAAACAA 10003
ADD COMMENT

Login before adding your answer.

Traffic: 2706 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6