use a list of id to extract sequence from different genomics
1
0
Entering edit mode
7 weeks ago
LZH289 • 0

Hi all, I have a list of txt file

   123489
   12387
   16379

and i want to extract their sequence from a file

>Os01g3345.1 pacid=123489 polypeptide=Os01g3345.1 locus=Os01g3345 
ATTTTCGGGGAAATTTCCGGGGG
ATTGGCCTTAAA
>AT01g3345.1 pacid=123567 polypeptide=AT01g3345.1 locus=Os01g3345 
ATTTTCGGGGAAATTTCCGGGGG
ATTGGCCTTAAA

and so on.

My question is how to use pacid as a query to extract sequence match txt file?

I have tried something like this, only last match appear. But I want all matched result. Could any one help, thanks.

    #!/usr/bin/perl

use strict;
use warnings;

$ARGV[2] or die "use getSeqs.pl <File with IDs> <Input Fasta> <Output Fasta>\n";
my $list_file = shift @ARGV;
my $fasta_in = shift @ARGV;
my $fasta_out = shift @ARGV;

my %sel;
open (my $lh, "<", $list_file) or die;
while (<$lh>) {
    chomp;
    $sel{$_}++;
}
close $lh;

$/ = "\n>";
open (my $ih, "<", $fasta_in) or die;
open (my $oh, ">", $fasta_out) or die;
while (<$ih>) {
    s/>//g;
    my ($id_line, @seq) = split (/\n/, $_);
    if ($id_line =~ /pacid=(\w+)/) {
        my $id = $1;
        if (defined $sel{$id}) {
            print $oh ">$id\n";
            print $oh join "\n", @seq;
            print $oh "\n";
        }
    }
}
close $ih;
close $oh;
sequence perl • 407 views
ADD COMMENT
0
Entering edit mode

Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.
code_formatting

Thank you!

ADD REPLY
0
Entering edit mode

thank you!

ADD REPLY
0
Entering edit mode

If pacid's are unique, then you can use grep direct.

$ awk  '{if(NR==1) {print $0} else {if($0 ~ /^>/) {print "\n"$0} else {printf $0}}} END {print "\r"}' test.fa | grep -A 1 -wf test.txt 

>Os01g3345.1 pacid=123489 polypeptide=Os01g3345.1 locus=Os01g3345 
ATTTTCGGGGAAATTTCCGGGGGATTGGCCTTAAA
ADD REPLY
0
Entering edit mode

Hi, thanks for your reply. I can get result, but only last gene in the list are shown in the result file. Do you know why?

ADD REPLY
0
Entering edit mode
7 weeks ago

linearize and extract the ID at the same time, sort both list, join, restore the fasta:

join -t $'\t' -1 1 -2 1 \
     <(awk -F '[ =]+' '/^>/ {for(i=1;i<NF;i++){if($i=="pacid"){PAC=$(i+1);break;}} printf("%s%s\t%s\t",(N>0?"\n":""),PAC,$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' jeter.fa  | sort -t $'\t' -k1,1 ) \
     <(sort -t $'\t' -k1,1 list.id.txt ) |\
cut -f 2- | tr "\t" "\n"
ADD COMMENT
0
Entering edit mode

Hi, thanks for your reply. I can get result, but only last gene in the list are shown in the result file. Do you know why?

ADD REPLY
0
Entering edit mode

check you file is a ascii file (file list.id.txt), check there is no extra whitespaces in your files, etc...

ADD REPLY
0
Entering edit mode

I check txt file and no extra whitespaces in the files.

ADD REPLY

Login before adding your answer.

Traffic: 991 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6