Question: How to extract ~2M short sequences by coordinates from a 3 GB FASTA file?
Asked 4 weeks ago by kynnjo (United States):

(Sorry for the Bioinformatics 101 question!)

I have a file with ~2 million sets of coordinates (chromosome<tab>begin-position<tab>end-position), each corresponding to a short (~50 nt) human genomic sequence (hg19). I want to extract the actual sequences from an hg19 FASTA file (~3.0 GB).

I imagine this is a relatively common task, and therefore, that there must be standard tools to carry it out efficiently.

Sadly, my Google fu has not been up to the task of finding them.

I would appreciate not only the name of a tool to use, but also the command line one would use, especially if there are important flags and options I should be aware of when using such a tool.

Answered 4 weeks ago by kynnjo (United States):

Shortly after I posted this question, I found that samtools faidx -r <regions_file> ... does what I need.
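
For anyone landing here with the same tab-separated input: the regions file read by samtools faidx -r expects one region per line in the form chr:from-to (1-based, inclusive), so the coordinate file from the question needs a small conversion first. Below is a minimal Python sketch of that step, not a definitive recipe. The file names (coords.tsv, regions.txt, hg19.fa), the helper names, and the zero_based_half_open switch are all assumptions for illustration; the question does not say whether its coordinates are 1-based or BED-style 0-based, so check that before trusting the output.

```python
import subprocess
from typing import Iterable, Iterator, List


def to_region(line: str, zero_based_half_open: bool = False) -> str:
    """Turn 'chrom<TAB>begin<TAB>end' into a samtools region 'chrom:begin-end'.

    samtools regions are 1-based and inclusive. If the input is BED-style
    (0-based start, exclusive end), only the start needs shifting by one.
    """
    chrom, begin, end = line.rstrip("\n").split("\t")[:3]
    start = int(begin) + 1 if zero_based_half_open else int(begin)
    return f"{chrom}:{start}-{int(end)}"


def regions_from_lines(lines: Iterable[str],
                       zero_based_half_open: bool = False) -> Iterator[str]:
    """Yield one region string per data line, skipping blanks and '#' comments."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        yield to_region(line, zero_based_half_open)


def faidx_command(fasta_path: str, regions_path: str) -> List[str]:
    """Build the argument list for: samtools faidx -r <regions> <fasta>."""
    return ["samtools", "faidx", "-r", regions_path, fasta_path]


def extract(coords_path: str, fasta_path: str, regions_path: str,
            out_path: str, zero_based_half_open: bool = False) -> None:
    """Write the regions file, then run samtools and capture its FASTA output."""
    with open(coords_path) as src, open(regions_path, "w") as dst:
        for region in regions_from_lines(src, zero_based_half_open):
            dst.write(region + "\n")
    with open(out_path, "w") as out:
        # Requires samtools on PATH; faidx builds the .fai index if it is missing.
        subprocess.run(faidx_command(fasta_path, regions_path),
                       stdout=out, check=True)
```

With those hypothetical file names, a call would look like extract("coords.tsv", "hg19.fa", "regions.txt", "extracted.fa"). Each output record is named after its region (for example >chr1:100-150), which makes it easy to match sequences back to the input rows.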
