How to extract ~2M short sequences based on coordinates from a 3G fasta file?
3.8 years ago
kynnjo ▴ 70

(Sorry for the Bioinformatics 101 question!)

I have a file with ~2 million sets of coordinates (chromosome<tab>begin-position<tab>end-position), each corresponding to a short (~50 nt) human genomic sequence (hg19). I want to extract the actual sequences from an hg19 FASTA file (~3.0 GB).

I imagine this is a relatively common task, and therefore, that there must be standard tools to carry it out efficiently.

Sadly, my Google fu has not been up to the task of finding them.

I would appreciate not only the name of a tool to use, but also the command line one would use, especially if there are important flags and options I should be aware of when using such a tool.

genome assembly sequence • 703 views
3.8 years ago
kynnjo ▴ 70

Shortly after posting this question, I found that samtools faidx -r <regions_file> ... does what I need.
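
One practical detail: as far as I can tell from the samtools documentation, the file given to -r is expected to hold one region per line in chr:from-to form, with 1-based inclusive positions, not the tab-separated layout described in the question. Below is a minimal sketch of a converter, assuming the input has exactly the chromosome<tab>begin<tab>end columns from the question. The function names and the zero_based_start switch are my own for illustration, and whether your begin positions are 0-based (BED style) or 1-based is something only you can check.

```python
def to_region(line, zero_based_start=False):
    """Turn 'chrom<TAB>begin<TAB>end' into a 'chrom:begin-end' region string."""
    chrom, begin, end = line.rstrip("\n").split("\t")[:3]
    begin, end = int(begin), int(end)
    if zero_based_start:
        # Assumption: BED-style 0-based, half-open input.
        # Shifting the start by one gives 1-based, inclusive coordinates.
        begin += 1
    return f"{chrom}:{begin}-{end}"


def convert(lines, zero_based_start=False):
    """Yield one region string per non-blank input line."""
    for line in lines:
        if line.strip():
            yield to_region(line, zero_based_start)


def write_regions(in_path, out_path, zero_based_start=False):
    """Read a tab-separated coordinate file and write a regions file."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for region in convert(src, zero_based_start):
            dst.write(region + "\n")
```

After something like write_regions("coords.tsv", "regions.txt"), the extraction would be along the lines of samtools faidx -r regions.txt hg19.fa > sequences.fa, where coords.tsv, regions.txt, hg19.fa and sequences.fa are placeholder file names. If hg19.fa has no .fai index yet, samtools faidx builds one first.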
