Extract According To Row
6
0
Entering edit mode
11.6 years ago
2011101101 ▴ 110

I have a query document in fasta format. I want to extract some sequences, according to a row from another document. The two document are large.

For example. the query document is like below.enter link description here

>1
AAAAAAAAAAAAACAGTTGGCATG
>2
AAAAAAAAAAAAACCGAGTACCGTTCACGCC
>3
AAAAAAAAAAAAACCTTGAAC

The other document is like below.

motif    MZQ1    MZQ3    MZQ4    MZQ5    MZQ6    MZQ7    MZQ8    MZQ2
AAAAAAAAAAAAAGCTCGGAT    1    0    0    0    0    0    0    0
AAAAAAAAAAAAACAGTTGGCATG    0    0    0    0    0    1    0    0
AAAAAAAAAAAAACCGAGTACCGTTCACGCC    0    0    0    0    0    1    0    0
AAAAAAAAAAAAACCTTGAAC    0    0    0    0    0    0    0    1
AAAAAAAAAAAAACGGGATTC    0    0    0    0    1    0    0    0
AAAAAAAAAAAAACTCAGTTCTGCCT    0    0    0    0    0    1    0    0

The expected result is the following:

motif    MZQ1    MZQ3    MZQ4    MZQ5    MZQ6    MZQ7    MZQ8    MZQ2
AAAAAAAAAAAAACAGTTGGCATG    0    0    0    0    0    1    0    0
AAAAAAAAAAAAACCGAGTACCGTTCACGCC    0    0    0    0    0    1    0    0
AAAAAAAAAAAAACCTTGAAC    0    0    0    0    0    0    0    1
• 3.5k views
ADD COMMENT
2
Entering edit mode

It would be nice if you could try out suggested solutions and let us know which one performed best? I would be interested to see time comparison of solutions offered by Poe and Pierre Lindenbaum. Thanks

ADD REPLY
3
Entering edit mode
11.6 years ago
PoGibas 5.1k

head -n 1 another_document > result && grep -v '>' query_document | grep -F -f - another_document >> result

ADD COMMENT
0
Entering edit mode

grep -F -f - another_document

ADD REPLY
1
Entering edit mode

grep -F -f to extract patterns between documents - http://stackoverflow.com/a/11490467/1286528.
- is piped pattern that you want to extract (motif sequences).

ADD REPLY
2
Entering edit mode
11.6 years ago
head -n 1 other.tsv > result
sort -t '    ' -k1,1 other.tsv | join  -t '    ' -1 1 -2 1 <(grep -v ">" file.fa | sort -u ) - >> result
ADD COMMENT
0
Entering edit mode

+1 I was just thinking of using Join as well, but then I saw your answer :)

ADD REPLY
1
Entering edit mode
11.6 years ago

Put the FASTA sequences into a hash table, and print out rows from the matrix file if the motif field element is defined in the hash table:

#!/usr/bin/env perl

use strict;
use warnings;

my $fastaFn = $ARGV[0];
my $masterFn = $ARGV[1];

my $seqsRef;
my $header;
my $sequence;

open FASTA, "< $fastaFn" or die "could not open FASTA file\n";
while (<FASTA>) {
    chomp;
    if (/>/) {
        $header = $_;
        $header =~ s/^>//;
    }
    else {
        $sequence = $_;
        $seqsRef->{$sequence} = $header;
    }
}
close FASTA;

open MASTER, "< $masterFn" or die "could not open master file for filtering\n";
my $ln = <MASTER>;
print STDOUT "$ln\n";
while (<MASTER>) {
    chomp;
    my @elems = split("\t", $_);
    my $motif = $elems[0];
    if (defined $seqsRef->{$motif}) {
        print STDOUT "$_\n";
    }
}
close MASTER;

To use it:

$ filter.pl myQuerySeqs.fa myDataMatrix.mtx > myFilteredMatrix.mtx

The file myQuerySeqs.fa is your FASTA file. The myDataMatrix.mtx file is the "master" matrix file that you want to filter on sequences from the FASTA file. Output is sent to myFilteredMatrix.mtx.

This should be fairly fast, if memory-intensive, because hash table lookups are in constant time.

ADD COMMENT
0
Entering edit mode

Because my document is very large,how to get the myDataMatrix.mtx?

ADD REPLY
0
Entering edit mode

You already have myDataMatrix.mtx (at least, if I understand your original question correctly).

ADD REPLY
0
Entering edit mode

Yes,I understand ,thank you

ADD REPLY
0
Entering edit mode

Don't forget to accept your answer when when you find the right solution for you

ADD REPLY
0
Entering edit mode
11.6 years ago
Whetting ★ 1.6k

Not the most elegant, but this should do it (unless your Fasta file is too big for memory?)

from Bio import SeqIO
tags=[]
for seq_record in SeqIO.parse("in.fas", "fasta"):
    if str(seq_record.seq) not in tags:
        tags.append(str(seq_record.seq))


for tag in tags:
    with open("2.txt","rU") as f:
        for line in f:
            line=line.rstrip()
            if tag in line:
                print line
ADD COMMENT
0
Entering edit mode

how to use it ?

ADD REPLY
0
Entering edit mode

save the file as "rows.py" run in from terminal as "python rows.py"

ADD REPLY
0
Entering edit mode
11.6 years ago
Agatha ▴ 350

I am not sure how big are your files but if R can handle them, then you can use :

require("Biostrings")
sequence_data<<-read.DNAStringSet("file1.fasta")
motifs<-read.table("file2.txt",header=T)
tab3<-subset(motifs, motifs$motif%in%as.character(sequence_data))

> tab3
         motif MZQ1 MZQ3 MZQ4 MZQ5 MZQ6 MZQ7 MZQ8 MZQ2
2        AAAAAAAAAAAAACAGTTGGCATG    0    0    0    0    0    1    0    0
3 AAAAAAAAAAAAACCGAGTACCGTTCACGCC    0    0    0    0    0    1    0    0
4           AAAAAAAAAAAAACCTTGAAC    0    0    0    0    0    0    0    1
ADD COMMENT
0
Entering edit mode
ADD REPLY
1
Entering edit mode

That is true, if sequencedata ( DNAStringSet object) would be converted to a dataframe...

ADD REPLY
0
Entering edit mode
11.6 years ago
Naren ▴ 1000
 #!/usr/bin/perl -w
print"Enter REFERENCE file: ";
chomp($file=<STDIN>);
open(FH,$file);
@org_det=<FH>;
print"Enter QUERY file: ";
$hspfile=<STDIN>;
open(FH1,$hspfile);
@hsporg=<FH1>;
print "enter output file : ";
$OUT = <STDIN>;
chomp($OUT);
open(OUT1,">$OUT");

foreach(@hsporg)
{
    @org=split('\t',$_);
    chomp($org=$_);
    foreach(@org_det)
    {
        @orginfo=split('\t',$_);
        if($org=~$orginfo[0])

        {
            print OUT1 "$_";
        }
            }
    }
close FH;
close FH1;
close OUT1;
ADD COMMENT

Login before adding your answer.

Traffic: 2768 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6