Filter sequences by ID and copy to a new file
3.4 years ago
bionewbie • 0

Hi,

I'm really new to bioinformatics and I have a large file with sequences. I only need specific sequences for my analysis. How can I filter them by accession number?

Thanks!

sequence • 676 views
3.4 years ago
Fatima ▴ 1000

If each of your sequences sits on a single line, you can use this command:

while read -r line; do grep -A 1 "${line}" inputfile.fasta >> outputfile.fasta; done < IDs.txt

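This command only works when accession numbers do not overlap, that is, when no ID is a substring of another (for example, AB123 would also match AB1234). Also make sure IDs.txt has no empty lines: an empty pattern matches every line, so the whole file would be copied. Each line of IDs.txt should contain exactly one accession number, with no extra spaces.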
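
The requirements on IDs.txt can be enforced with a quick clean-up instead of checking by eye. This is only a sketch: it assumes the list may carry Windows line endings, stray spaces, or blank lines, and the name IDs.raw.txt and the two example IDs are made up for illustration.

```shell
# Work in a scratch directory so no real files are overwritten.
tmp=$(mktemp -d)

# Example ID list with typical problems (hypothetical IDs):
# a trailing space, a Windows line ending (\r), and an empty line.
printf 'AB123 \r\n\nCD456\n' > "$tmp/IDs.raw.txt"

# Strip carriage returns, trim leading/trailing whitespace, drop empty lines.
tr -d '\r' < "$tmp/IDs.raw.txt" \
  | sed 's/^[[:space:]]*//; s/[[:space:]]*$//' \
  | grep -v '^$' > "$tmp/IDs.txt"

cat "$tmp/IDs.txt"
```

The cleaned file then has one ID per line and can be fed to the loop above.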

If your sequences span multiple lines, convert the file to one line per sequence first and then use the above command:

awk '/^>/ {if (NR>1) printf("\n"); printf("%s\n",$0); next} {printf("%s",$0)} END {printf("\n")}' file.fasta > inputfile.fasta
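
To see what the conversion does, the sketch below runs an awk one-liner of the same shape on a two-record example, written so that no empty first line is emitted. The records and the scratch directory are invented for the demonstration; only the names file.fasta and inputfile.fasta come from the answer.

```shell
# Scratch directory with a small multi-line FASTA (made-up records).
tmp=$(mktemp -d)
printf '>seq1 first\nACGT\nTTGA\n>seq2 second\nGGCC\nAA\n' > "$tmp/file.fasta"

# Header lines are printed on their own line; sequence lines are
# concatenated until the next header (or the end of the file).
awk '/^>/ { if (NR > 1) printf("\n"); printf("%s\n", $0); next }
     { printf("%s", $0) }
     END { printf("\n") }' "$tmp/file.fasta" > "$tmp/inputfile.fasta"

cat "$tmp/inputfile.fasta"
# >seq1 first
# ACGTTTGA
# >seq2 second
# GGCCAA
```

Each record now takes exactly two lines, which is what grep -A 1 relies on.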

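When IDs overlap, anchor the pattern with $ or a delimiter such as \t, depending on the format of the header. $ works with plain grep; \t requires --perl-regexp (-P), which is available in GNU grep:

grep -A 1 --perl-regexp "${line}$" inputfile.fasta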

grep -A 1 prints each line that matches the pattern plus the one line after it, which is why every sequence must be on a single line.
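
The overlap problem and the anchored fix can be reproduced on a toy file. This sketch assumes headers of the form ">ID description", so the delimiter after the ID is a space or the end of the line; the IDs, sequences, and output file names are hypothetical. IDs containing regex metacharacters such as "." would need escaping, which is not handled here.

```shell
tmp=$(mktemp -d)
# One-line-per-sequence FASTA in which AB123 is a prefix of AB1234.
printf '>AB123 sample A\nACGT\n>AB1234 sample B\nGGCC\n>CD456 sample C\nTTAA\n' > "$tmp/inputfile.fasta"
printf 'AB123\nCD456\n' > "$tmp/IDs.txt"

# Unanchored: the pattern AB123 also matches the AB1234 header.
while read -r line; do
    grep -A 1 "${line}" "$tmp/inputfile.fasta"
done < "$tmp/IDs.txt" > "$tmp/unanchored.fasta"

# Anchored: the ID must follow ">" at the start of the line and be
# followed by a space or the end of the line.
while read -r line; do
    grep -E -A 1 "^>${line}( |\$)" "$tmp/inputfile.fasta"
done < "$tmp/IDs.txt" > "$tmp/anchored.fasta"

grep -c '^>' "$tmp/unanchored.fasta"   # 3 records: AB1234 slipped in
grep -c '^>' "$tmp/anchored.fasta"     # 2 records: only the requested IDs
```

Counting headers in the output against the number of lines in IDs.txt is a cheap sanity check that nothing extra was pulled in.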


In this line, which file is meant to be "file.fasta"? The output file?

   awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);}  END {printf("\n");}' <  file.fasta >> inputfile.fasta

file.fasta is your input file.


Yes, I called the output file inputfile.fasta, because it will be used as input in the other command.
