Question: Extract According To Row
0
gravatar for 2011101101
6.3 years ago by
2011101101100
2011101101100 wrote:

I have a query document in fasta format. I want to extract some sequences, according to a row from another document. The two document are large.

For example. the query document is like below.enter link description here

>1
AAAAAAAAAAAAACAGTTGGCATG
>2
AAAAAAAAAAAAACCGAGTACCGTTCACGCC
>3
AAAAAAAAAAAAACCTTGAAC

The other document is like below.

motif    MZQ1    MZQ3    MZQ4    MZQ5    MZQ6    MZQ7    MZQ8    MZQ2
AAAAAAAAAAAAAGCTCGGAT    1    0    0    0    0    0    0    0
AAAAAAAAAAAAACAGTTGGCATG    0    0    0    0    0    1    0    0
AAAAAAAAAAAAACCGAGTACCGTTCACGCC    0    0    0    0    0    1    0    0
AAAAAAAAAAAAACCTTGAAC    0    0    0    0    0    0    0    1
AAAAAAAAAAAAACGGGATTC    0    0    0    0    1    0    0    0
AAAAAAAAAAAAACTCAGTTCTGCCT    0    0    0    0    0    1    0    0

The expected result is the following:

motif    MZQ1    MZQ3    MZQ4    MZQ5    MZQ6    MZQ7    MZQ8    MZQ2
AAAAAAAAAAAAACAGTTGGCATG    0    0    0    0    0    1    0    0
AAAAAAAAAAAAACCGAGTACCGTTCACGCC    0    0    0    0    0    1    0    0
AAAAAAAAAAAAACCTTGAAC    0    0    0    0    0    0    0    1
• 1.7k views
ADD COMMENTlink modified 6.3 years ago by Pierre Lindenbaum120k • written 6.3 years ago by 2011101101100
2

It would be nice if you could try out suggested solutions and let us know which one performed best? I would be interested to see time comparison of solutions offered by Poe and Pierre Lindenbaum. Thanks

ADD REPLYlink written 6.3 years ago by zx87547.5k
3
gravatar for PoGibas
6.3 years ago by
PoGibas4.8k
Vilnius
PoGibas4.8k wrote:

head -n 1 another_document > result && grep -v '>' query_document | grep -F -f - another_document >> result

ADD COMMENTlink written 6.3 years ago by PoGibas4.8k

grep -F -f - another_document

ADD REPLYlink modified 6.3 years ago • written 6.3 years ago by 2011101101100
1

grep -F -f to extract patterns between documents - http://stackoverflow.com/a/11490467/1286528.
- is piped pattern that you want to extract (motif sequences).

ADD REPLYlink modified 6.3 years ago • written 6.3 years ago by PoGibas4.8k
2
gravatar for Pierre Lindenbaum
6.3 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum120k wrote:
head -n 1 other.tsv > result
sort -t '    ' -k1,1 other.tsv | join  -t '    ' -1 1 -2 1 <(grep -v ">" file.fa | sort -u ) - >> result
ADD COMMENTlink written 6.3 years ago by Pierre Lindenbaum120k

+1 I was just thinking of using Join as well, but then I saw your answer :)

ADD REPLYlink written 6.3 years ago by zx87547.5k
1
gravatar for Alex Reynolds
6.3 years ago by
Alex Reynolds28k
Seattle, WA USA
Alex Reynolds28k wrote:

Put the FASTA sequences into a hash table, and print out rows from the matrix file if the motif field element is defined in the hash table:

#!/usr/bin/env perl

use strict;
use warnings;

my $fastaFn = $ARGV[0];
my $masterFn = $ARGV[1];

my $seqsRef;
my $header;
my $sequence;

open FASTA, "< $fastaFn" or die "could not open FASTA file\n";
while (<FASTA>) {
    chomp;
    if (/>/) {
        $header = $_;
        $header =~ s/^>//;
    }
    else {
        $sequence = $_;
        $seqsRef->{$sequence} = $header;
    }
}
close FASTA;

open MASTER, "< $masterFn" or die "could not open master file for filtering\n";
my $ln = <MASTER>;
print STDOUT "$ln\n";
while (<MASTER>) {
    chomp;
    my @elems = split("\t", $_);
    my $motif = $elems[0];
    if (defined $seqsRef->{$motif}) {
        print STDOUT "$_\n";
    }
}
close MASTER;

To use it:

$ filter.pl myQuerySeqs.fa myDataMatrix.mtx > myFilteredMatrix.mtx

The file myQuerySeqs.fa is your FASTA file. The myDataMatrix.mtx file is the "master" matrix file that you want to filter on sequences from the FASTA file. Output is sent to myFilteredMatrix.mtx.

This should be fairly fast, if memory-intensive, because hash table lookups are in constant time.

ADD COMMENTlink modified 6.3 years ago • written 6.3 years ago by Alex Reynolds28k

Because my document is very large,how to get the myDataMatrix.mtx?

ADD REPLYlink written 6.3 years ago by 2011101101100

You already have myDataMatrix.mtx (at least, if I understand your original question correctly).

ADD REPLYlink written 6.3 years ago by Alex Reynolds28k

Yes,I understand ,thank you

ADD REPLYlink written 6.3 years ago by 2011101101100

Don't forget to accept your answer when when you find the right solution for you

ADD REPLYlink written 6.3 years ago by Agatha340
0
gravatar for Whetting
6.3 years ago by
Whetting1.5k
Bethesda, MD
Whetting1.5k wrote:

Not the most elegant, but this should do it (unless your Fasta file is too big for memory?)

from Bio import SeqIO
tags=[]
for seq_record in SeqIO.parse("in.fas", "fasta"):
    if str(seq_record.seq) not in tags:
        tags.append(str(seq_record.seq))


for tag in tags:
    with open("2.txt","rU") as f:
        for line in f:
            line=line.rstrip()
            if tag in line:
                print line
ADD COMMENTlink written 6.3 years ago by Whetting1.5k

how to use it ?

ADD REPLYlink written 6.3 years ago by 2011101101100

save the file as "rows.py" run in from terminal as "python rows.py"

ADD REPLYlink written 6.3 years ago by Whetting1.5k
0
gravatar for Agatha
6.3 years ago by
Agatha340
Agatha340 wrote:

I am not sure how big are your files but if R can handle them, then you can use :

require("Biostrings")
sequence_data<<-read.DNAStringSet("file1.fasta")
motifs<-read.table("file2.txt",header=T)
tab3<-subset(motifs, motifs$motif%in%as.character(sequence_data))

> tab3
         motif MZQ1 MZQ3 MZQ4 MZQ5 MZQ6 MZQ7 MZQ8 MZQ2
2        AAAAAAAAAAAAACAGTTGGCATG    0    0    0    0    0    1    0    0
3 AAAAAAAAAAAAACCGAGTACCGTTCACGCC    0    0    0    0    0    1    0    0
4           AAAAAAAAAAAAACCTTGAAC    0    0    0    0    0    0    0    1
ADD COMMENTlink modified 6.3 years ago • written 6.3 years ago by Agatha340

We can also use merge instead of subset: http://stackoverflow.com/questions/1299871/how-to-join-data-frames-in-r-inner-outer-left-right

ADD REPLYlink written 6.3 years ago by zx87547.5k
1

That is true, if sequencedata ( DNAStringSet object) would be converted to a dataframe...

ADD REPLYlink modified 6.3 years ago • written 6.3 years ago by Agatha340
0
gravatar for Nari
6.3 years ago by
Nari870
United States
Nari870 wrote:
 #!/usr/bin/perl -w
print"Enter REFERENCE file: ";
chomp($file=<STDIN>);
open(FH,$file);
@org_det=<FH>;
print"Enter QUERY file: ";
$hspfile=<STDIN>;
open(FH1,$hspfile);
@hsporg=<FH1>;
print "enter output file : ";
$OUT = <STDIN>;
chomp($OUT);
open(OUT1,">$OUT");

foreach(@hsporg)
{
    @org=split('\t',$_);
    chomp($org=$_);
    foreach(@org_det)
    {
        @orginfo=split('\t',$_);
        if($org=~$orginfo[0])

        {
            print OUT1 "$_";
        }
            }
    }
close FH;
close FH1;
close OUT1;
ADD COMMENTlink written 6.3 years ago by Nari870
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1405 users visited in the last hour