Question

Filter sequences by ID and copy to a new file

3.4 years ago

bionewbie • 0

Hi,

I'm really new to bioinformatics and I have a large file with sequences. I only need specific sequences for my analysis. How can I filter them by accession number?

thanks!

sequence • 676 views

updated 3.4 years ago by Fatima ▴ 1000 • written 3.4 years ago by bionewbie • 0

score 1 · Answer 1 · 2020-11-13

3.4 years ago

Fatima ▴ 1000

If each of your sequences occupies a single line (one header line followed by one sequence line), you can use this command:

while IFS= read -r line ; do grep -A 1 "${line}" inputfile.fasta >> outputfile.fasta ; done < IDs.txt

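This command only works reliably when no accession number is contained in another one (for example, the pattern ABC1 also matches ABC12). Also, make sure IDs.txt has no empty lines, because an empty pattern matches every line of the FASTA file. Each line in IDs.txt should contain exactly one accession number, with no extra whitespace.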
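As a minimal sketch of how the loop behaves, here it is on made-up data. The `demo_filter` directory, the IDs `seqA` to `seqC`, and the sequences are hypothetical and only serve as an illustration; the loop reads IDs.txt line by line and appends each hit to the output file.

```shell
# Keep the demo files in their own directory (hypothetical name).
mkdir -p demo_filter

# Hypothetical single-line FASTA: each header is followed by exactly one sequence line.
printf '>seqA\nACGT\n>seqB\nGGCC\n>seqC\nTTAA\n' > demo_filter/inputfile.fasta

# One accession per line, no blank lines, no trailing spaces.
printf 'seqA\nseqC\n' > demo_filter/IDs.txt

# Start from an empty output file, because the loop appends with >>.
: > demo_filter/outputfile.fasta

# For each ID, print the matching header line and the one line after it.
while IFS= read -r line; do
    grep -A 1 "${line}" demo_filter/inputfile.fasta >> demo_filter/outputfile.fasta
done < demo_filter/IDs.txt

# Show the result: the seqA and seqC records, in the order listed in IDs.txt.
cat demo_filter/outputfile.fasta
```

Because the loop appends, rerunning it without emptying outputfile.fasta first would duplicate the records.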

If your sequences span multiple lines, you can first convert the file to a single-line FASTA file and then use the command above:

awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);}  END {printf("\n");}' <  file.fasta >> inputfile.fasta
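To see what the conversion does, here is a small sketch on a made-up file. The `demo_linearize` directory and the records are hypothetical. This variant differs slightly from the command above: it skips the newline before the first header, so the output has no leading blank line, and it writes with `>` so that an existing output file is replaced rather than appended to.

```shell
# Keep the demo files in their own directory (hypothetical name).
mkdir -p demo_linearize

# Hypothetical multi-line FASTA: the sequence of seq1 is wrapped over two lines.
printf '>seq1\nACGT\nTTGG\n>seq2\nCCAA\n' > demo_linearize/file.fasta

# Header lines are printed on their own line; sequence lines are joined
# without a newline until the next header (or the end of the file).
awk '/^>/ { if (NR > 1) printf("\n"); printf("%s\n", $0); next }
     { printf("%s", $0) }
     END { printf("\n") }' demo_linearize/file.fasta > demo_linearize/inputfile.fasta

# Each record is now exactly two lines: header, then the full sequence.
cat demo_linearize/inputfile.fasta
```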

When IDs overlap, add --perl-regexp (short form -P) and anchor the pattern with $, \t, or another delimiter, depending on the format of the header:

grep -A 1 --perl-regexp "${line}$" inputfile.fasta >> outputfile.fasta

grep -A 1 prints each line that matches the pattern, plus the one line after it.
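The overlap problem and the anchored fix can be sketched on made-up data. The `demo_anchor` directory and the IDs are hypothetical, and the sketch assumes that each header ends right after the ID. Since `$` is also an anchor in grep's default pattern syntax, the Perl option is left out here; it becomes necessary for delimiters such as \t.

```shell
# Keep the demo files in their own directory (hypothetical name).
mkdir -p demo_anchor

# Hypothetical single-line FASTA in which one ID (seq1) is a prefix of another (seq10).
printf '>seq1\nAAAA\n>seq10\nCCCC\n>seq2\nGGGG\n' > demo_anchor/inputfile.fasta
printf 'seq1\n' > demo_anchor/IDs.txt

# Unanchored: "seq1" matches both ">seq1" and ">seq10", so this prints 2.
grep -c "seq1" demo_anchor/inputfile.fasta

# Start from an empty output file, because the loop appends with >>.
: > demo_anchor/outputfile.fasta

# Anchored at the end of the line: only the header ending in "seq1" matches.
while IFS= read -r line; do
    grep -A 1 "${line}$" demo_anchor/inputfile.fasta >> demo_anchor/outputfile.fasta
done < demo_anchor/IDs.txt

# Only the seq1 record is written.
cat demo_anchor/outputfile.fasta
```

If the headers carry a description after the ID, the `$` anchor no longer fits and the delimiter that follows the ID in your file (a space or a tab, for instance) has to be used instead.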

In this line, which file is meant to be "file.fasta"? The output file?

   awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);}  END {printf("\n");}' <  file.fasta >> inputfile.fasta

3.4 years ago by bionewbie • 0

file.fasta is your input file.

3.4 years ago by GenoMax 141k

Yes, I called the output file inputfile.fasta, because it will be used as input in the other command.

3.4 years ago by Fatima ▴ 1000

score 0 · Answer 2 · 2020-11-13

3.4 years ago

GenoMax 141k

Use this solution: C: How do I extract Fasta Sequences based on a list of IDs?