Simple way to extract or mask specific sites only from multi-FASTA alignments
0
0
Entering edit mode
6.6 years ago
Kame ▴ 20

Hi,

I have a large genome alignment file in multi-FASTA format and a list of positions of interest (let's say SNP positions). For what I want to do next, I either need to extract those sites from the initial alignment (in all sequences) and output them as a subset in a new multi-FASTA file, or mask them in the initial alignment (by inserting N in every sequence at the corresponding sites, for instance).

I first thought that tools like seqtk (for extracting) or bedtools (for masking) would work, and run using a BED file containing the sites as follows:

seqtk subseq input.fas sites.bed > output.fas
bedtools maskfasta -fi input.fas -bed sites.bed -fo output.fas

However, it would seem that in order to work on multi-FASTA alignments, these tools require the BED file to explicitly contain all sites for all individual sequences in the alignment, using their headers in the FASTA file as the first column of the BED file ("chromosome"). The problem is that I have a very large number of sites of interest, and a large number of sequences that are aligned, so this BED file would be insanely huge.

Am I missing something, or doing things wrong? Would there be another (simple) way to do this?

Many thanks for your help and your time.

KS

SNP bedtools FASTA seqtk • 2.6k views
ADD COMMENT

Login before adding your answer.

Traffic: 1855 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6