Question

Fasta sequence extraction

0

Entering edit mode

8.4 years ago

Eva_Maria ▴ 190

I have tab limited file like this

GCF_000707685.1_2840 0 145
GCF_000706885.1_542 0 150
GCF_000593365.1_489 0 156
GCF_000593345.1_3957 256 289
GCF_000593325.1_3041 780 958

I want to extract sequence based on position like 0 to 145 , 256 to 289

>GCF_000707685.1_2840
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
>GCF_000706885.1_542
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

How to do this? awk or perl

blast awk perl genome alignment • 2.6k views

ADD COMMENT • link updated 20 months ago by Ram 43k • written 8.4 years ago by Eva_Maria ▴ 190

3

Entering edit mode

Hi,

The getfasta in bedtools can accomplish this job. Please follow the link http://bedtools.readthedocs.org/en/latest/content/tools/getfasta.html for details.

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 8.4 years ago by biolab ★ 1.4k

0

Entering edit mode

its not working for me (files about 7 gb). is there any programme or awk?

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 8.4 years ago by Eva_Maria ▴ 190

0

Entering edit mode

Hi, I write a perl script as follows. Usage: perl extr_seq.pl genome.fa list.txt. I am just wondering why bedtools does not work for you. Personally I think it's a good tool.

#!/usr/bin/perl -w
use strict;

my %data;
my $id = '';
my $seq = '';

open my $fasta, '<', $ARGV[0] or die $!;
while (<$fasta>) {
    chomp;
    s/\r//;
    next if /^\s*$/;
    if (/>(\S+)/) {
        $data{$id} = $seq if $seq ne '';
        $id = $1;
        $seq = '';
    } else {
        $seq .= $_;
    }
}
$data{$id} = $seq;
close $fasta;

open my $list, '<', $ARGV[1] or die $!;
while (<$list>) {
    chomp;
    s/\r//;
    next if /^\s*$/;
    my ($gene, $start, $end) = split;
    if (exists $data{$gene}) {
        my $extr_seq = substr($data{$gene}, $start - 1, $end - $start + 1);
        print ">$gene\n$extr_seq\n";
    }
}
close $list;

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 8.4 years ago by biolab ★ 1.4k

0

Entering edit mode

thanks it works

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 8.4 years ago by Eva_Maria ▴ 190

0

Entering edit mode

If bedtools is not working for you then try https://github.com/mdshw5/pyfaidx#cli-script-faidx

I think bedtools might read the entire FASTA into memory before extracting, but not sure.

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 8.4 years ago by Matt Shirley 10k

0

Entering edit mode

Here is an awk, one line solution to you problem.

awk 'FS="\t" {if(($2==0) && ($3==145) || ($2==256) && ($3 == 289)) print $1}' < foo.file.txt

if this is your file:

GCF_000707685.1_2840 0 145
GCF_000706885.1_542 0 150
GCF_000593365.1_489 0 156
GCF_000593345.1_3957 256 289
GCF_000593325.1_3041 780  958

the output is:

$ awk 'FS="\t" {if(($2==0) && ($3==145) || ($2==256) && ($3 == 289)) print $1}' < foo.file.txt
GCF_000707685.1_2840

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 8.4 years ago by Phil S. ▴ 700

0

Entering edit mode

I want to do as a Batch processing because i have 20 mb text file and 256 mb fasta

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 8.4 years ago by Eva_Maria ▴ 190

0

Entering edit mode

what do you mean by "batch processing"?

ADD REPLY • link 8.4 years ago by Phil S. ▴ 700