Question

Subset multiple FASTA with a set of intervals provided in a BED/GFF file?

0

Entering edit mode

7.7 years ago

ksw ▴ 40

Hi, I am wondering if there were a BioPython or other tool to help me subset a multiple FASTA file?

Thank you!

fasta GFF • 2.5k views

ADD COMMENT • link updated 7.6 years ago by Alex Reynolds 35k • written 7.7 years ago by ksw ▴ 40

2

Entering edit mode

Is there a reason you're not using bedtools getfasta?

ADD REPLY • link 7.7 years ago by Devon Ryan 104k

0

Entering edit mode

If they have a multifasta for a region across samples, say chr1 for 10 individuals, bedtools getfasta won't work unless the bed is modified accordingly, right? That said, something tells me that getfasta is what they want.

ADD REPLY • link 7.7 years ago by Brice Sarver ★ 3.8k

0

Entering edit mode

Correct, only a custom program would handle that.

ADD REPLY • link 7.7 years ago by Devon Ryan 104k

0

Entering edit mode

I assume the fasta file contains entire chromosomes/contigs? It would for sure be possible to write a Biopython script for this, but I have the feeling that a tool like that probably already exists.

ADD REPLY • link 7.7 years ago by WouterDeCoster 47k

0

Entering edit mode

Doh! ~~What criteria do you want to use for subsetting? Names, length, size of file?~~

ADD REPLY • link 7.7 years ago by GenoMax 141k

0

Entering edit mode

Based on the title my guess is he wants to use intervals in a BED/GFF file :)

ADD REPLY • link 7.7 years ago by WouterDeCoster 47k

score 1 · Answer 1 · 2016-08-25

Here's a script that takes (potentially stranded) BED-formatted intervals from standard input. The output is FASTA for the input intervals, written to standard output.

This requires samtools be findable from your shell. You also need to specify a folder containing whole-genome FASTA files and their associated index files, unless they are in your current working directory.

#!/usr/bin/env perl

use strict;
use warnings;
use Getopt::Long;

#
# bed2faidxsta.pl
# --
# Reads BED data from standard input, writes FASTA to standard output.
#
# Dependent on samtools and indexed FASTA data files located in $fastaDir 
# variable. Set --fastaDir=dir to set custom directory containing a source 
# of per-build FASTA and associated index (.fai) files, or leave unset 
# to use data in current working directory.
#

# test if samtools is available
`samtools --version` || die "Error: The samtools application is required to run this script.\n";

# default FASTA input directory is present working directory
my $fastaDir = `pwd`; chomp $fastaDir;
# default is to assume input coordinates use zero-based index scheme
my $oneBased;
# default is to leave IDs alone
my $useIDPrefixAsStrand;

GetOptions ('fastaDir=s' => \$fastaDir, 'oneBased' => \$oneBased, 'useIDPrefixAsStrand' => \$useIDPrefixAsStrand) or die("Error: Problem with command line arguments\n");

while (<STDIN>) {
    chomp;
    my ($chr, $start, $stop, $id, $score, $strand) = split("\t", $_);
    if (!defined($chr) || !defined($start) || !defined($stop)) { die "Error: No chromosome name, start or stop position defined\n"; }
    if (! -e "$fastaDir/$chr.fa.fai") { die "Error: Could not locate indexed FASTA file for chromosome $chr\n"; }
    if (!defined($id)) { $id = "."; }
    if (!defined($score)) { $score = "."; }
    if (!defined($strand)) { $strand = "+"; } else { $strand = substr($strand, 0, 1); }
    # adjust coordinates to one-based index, if necessary
    my ($queryChr, $queryStart, $queryStop) = ($chr, $start, $stop);
    if (!$oneBased) {
        $queryStart++;
    }
    # adjust strand if required
    if ($useIDPrefixAsStrand) {
        $strand = substr($id, 0, 1);
    }
    # lookup
    my $queryKey = "$queryChr:$queryStart-$queryStop";
    my $queryResult = `samtools faidx $fastaDir/$chr.fa $queryKey`; chomp $queryResult; 
    # linearize result
    my @lines = split("\n", $queryResult);
    my @seqs = @lines[1..(scalar @lines - 1)];
    my $seq = join("", @seqs);
    # handle reverse-stranded elements
    if ($strand eq "-") {
        $seq = rc_sequence($seq);
    }
    # print to standard output
    my $header = ">".join(":",($chr, $start, $stop, $id, $score, $strand));
    print STDOUT $header."\n".$seq."\n";
}

sub rc_sequence {
    my $seq = shift @_;
    my $reverse_complement = reverse($seq);
    $reverse_complement =~ tr/ACGTacgt/TGCAtgca/;
    return $reverse_complement;
}

Assuming you have your indexed FASTA in the current working directory, you might use it like so:

$ bed2faidxsta.pl < input.bed > output.fa

Or if you have indexed FASTA in another directory (/foo/bar/baz):

$ bed2faidxsta.pl --fastaDir=/foo/bar/baz < input.bed > output.fa

If you have GFF, you can use BEDOPS gff2bed to write GFF as BED to standard output, and pipe that to the standard input stream of bed2faidxsta.pl:

$ gff2bed < input.gff | bed2faidxsta.pl --fastaDir=/foo/bar/baz > output.fa

score 0 · Answer 2 · 2016-08-24

The gffread utility, included with cufflinks, is a workhorse for gff,gff3,gtf files, and allows to extract from your fasta file the subsequences sequences indexed by your gff formatted intervals with a command like:

gffread your_intervals.gff -g your.fasta -w your_subset.fasta

The manual page details many options including

   -g     full path to a multi-fasta file with the genomic  sequences  for
          all  input mappings, OR a directory with single-fasta files (one
          per genomic sequence, with file names matching sequence names)

   -w     write a fasta file with spliced exons for each GFF transcript

your_intervals do not have to be gene models to extract the underlying sequence. But, if the are, there are more nice options, such as:

   -x     write a fasta file with spliced CDS for each GFF transcript

   -W     for -w and -x options, also write for each fasta record the exon
          coordinates projected onto the spliced sequence

   -y     write  a protein fasta file with the translation of CDS for each
          record

Highly recommended!