How to extract ~2M short sequences based on coordinates from a 3G fasta file?
15 months ago
kynnjo ▴ 40

(Sorry for the Bioinformatics 101 question!)

I have a file with ~2 million sets of coordinates (chromosome<tab>begin-position<tab>end-position), corresponding to short (~50 nt) human genomic sequences (hg19). I want to extract the actual sequences from an hg19 human genome assembly FASTA file (~3.0 GB).

I imagine this is a relatively common task, and therefore, that there must be standard tools to carry it out efficiently.

Sadly, my Google-fu has not been up to the task of finding them.

I would appreciate not only the name of a tool to use, but also the command line one would use, especially if there are important flags and options I should be aware of when using such a tool.

genome assembly sequence • 269 views
15 months ago
kynnjo ▴ 40

Shortly after I posted this question, I found that samtools faidx -r <regions_file> ... does what I need.
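
For anyone landing here with the same problem: as far as I can tell, the regions file wants one region per line in chr:start-end form (1-based, inclusive), not three tab-separated columns, so the coordinate file from the question needs a small conversion first. Below is a minimal Python sketch of that step. The function names and the zero_based_start switch are my own, not part of samtools; whether you need the +1 depends on whether your begin positions are 0-based (BED-style) or already 1-based, which the question does not say.

```python
def to_region(line, zero_based_start=False):
    """Turn 'chrom<TAB>begin<TAB>end' into a 'chrom:start-end' region string.

    zero_based_start is a hypothetical switch: set it to True if the begin
    column is 0-based (BED-style), so it is shifted to the 1-based start
    that samtools region strings use.
    """
    chrom, begin, end = line.rstrip("\n").split("\t")[:3]
    start = int(begin) + 1 if zero_based_start else int(begin)
    return f"{chrom}:{start}-{int(end)}"


def convert(lines, zero_based_start=False):
    """Convert an iterable of coordinate lines, skipping blank lines."""
    return [to_region(l, zero_based_start) for l in lines if l.strip()]


def write_regions(coords_path, regions_path, zero_based_start=False):
    """Stream a coordinates file into a regions file, one region per line."""
    with open(coords_path) as src, open(regions_path, "w") as dst:
        for line in src:
            if line.strip():
                dst.write(to_region(line, zero_based_start) + "\n")
```

With the regions written to, say, regions.txt (a placeholder name, as are hg19.fa and out.fa), the call would look something like samtools faidx hg19.fa -r regions.txt -o out.fa. If no .fai index exists next to the FASTA file, samtools faidx builds one first, which is what makes the lookups fast.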

