Get Filenames Of All Files With 'Atg' In Fasta Files
2
3
Entering edit mode
12.2 years ago
Sam ▴ 70

I have a bunch of files which are similar to following:

File00
>A18178 1
atgcaccaataaaaaaacaagcttaacctaattc

File01
>A21196 1
cggccagatcta

File02
>A21197 1
agcttagatctggccgggg

File03
>AX557348 1
gcggatttactcaggggagagcccagataa
atggagtctgtgcgtccaca

I need to look into each file and check if the 2nd line starts with 'atg'. for example, only File00 starts with atg. This needs to be done as a shell script.

To be honest, this is a school assignment but I have been thinking about this for last 2-3 days and I cant seem to get it. I have put together following command:

grep -H -m 1 '^atg' ./mrna_split/File* |awk -F ":" '{print $1}'

The problem with the above command is that it also gives file names that have atg on the second line. For example, File03 will also be outputed. The grep outputs: ./mrna_split/File00:atgcaccaataaaaaaacaagcttaacctaattc. The awk command then gets only the filename.

Any input is appreciated.

Thank you.

awk script • 2.6k views
ADD COMMENT
5
Entering edit mode
12.2 years ago

try:

find ./ -name "File[0-9]*" -exec awk '/^>/ {name=$0; next;} NR==2 {if(substr($0,1,3)=="atg") print name;}' '{}' ';'

(loop over each files, execute awk, get the fasta header, print it if the 2nd line starts with "atg")

ADD COMMENT
2
Entering edit mode

I think he only needs the filenames, not the sequence ids:

find ./ -name "File[0-9]*" -exec awk 'NR==2 {if(substr($0,1,3)=="atg") print FILENAME; exit}' '{}' ';'

should work

ADD REPLY
1
Entering edit mode

I suppose it's possible that the first line of sequence be "AT" or even just "A", since the specification states only that lines should be shorter than 80 characters. However, I'd regard that as a very odd FASTA file.

ADD REPLY
0
Entering edit mode

what if one atg sequence is splitted over two lines? e.g. ...ATnG...

ADD REPLY
0
Entering edit mode

Hey guys. Thanks for the reply. The above command (both andreas and Pierre) works perfectly. In the end, I am supposed to make a atg.fa file which has all the sequence that start with atg. right now, after i get the filename from the above script, i use for loop in shell script to cat the matched file with atg.fa . Is there a better way?

ADD REPLY
4
Entering edit mode
12.2 years ago
Neilfws 49k

Bioperl solution using Seq::IO. Save code as findatg.pl. Assumes 1 sequence per file (otherwise will print filename for every matching occurrence in file).

Now go learn some Bioperl, you won't regret it.

#!/usr/bin/perl -w
use strict;
use Bio::SeqIO;

my $file = shift || die("Usage: findatg.pl <fasta file>\n");
my $seqio = Bio::SeqIO->new(-file => $file, -format => "fasta");

while(my $fa = $seqio->next_seq) {
  print "$file\n" if($fa->seq =~ /^atg/i);
}

Then run something like:

find ./ -name "File*" -exec perl findatg.pl {} \;
ADD COMMENT
7
Entering edit mode

Yes, I see your question tags now. There are 2 kinds of solutions in bioinformatics: "whatever works" and "whatever I was told I had to use." As a researcher, I favour the first one :)

ADD REPLY
0
Entering edit mode

I know bioperl. However, I needed to do this as a shell script as i mentioned in my OP. Otherwise it was an easy task even with just perl.Thank you for your reply.

ADD REPLY

Login before adding your answer.

Traffic: 3873 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6