How to extract contigs from FASTA file which contains specific sequence
5
3
Entering edit mode
6.9 years ago
Paul ★ 1.4k

Dear all,

do you have any idea, how to easy extract contigs from fasta file wich contains specific sequence?

For example:

my sequnce:

ACCGTACCC

my FASTA:

>c1042
ACCGTACCC
>c1043
GCTACAGTTGAAAGGGGACCGTACCC
>c1044
ATGAATAAAATAATTTTGTATCATAAATCGAGCTGTTAATTATT
>c1044
TTCATATTTGTAGCTAAGCAGAGGCGAAGCGTTCTTGTATCG

My output:

>c1042
ACCGTACCC
>c1043
GCTACAGTTGAAAGGGGACCGTACCC

Thank you so much for any ideas and help.

fasta contig extraction find • 5.1k views
0
Entering edit mode

Hello. Is there some way to do the same with biopython? Thanks

0
Entering edit mode

Please see @Devon Ryan answer for suggestions with biopython, or open a new question, with examples, and what you have tried.

0
Entering edit mode

This is not an answer. This should be a comment or a new post. If you're creating a new post, you should reference this post in addition to elaborating on what you've tried.

4
Entering edit mode
6.9 years ago
Ram 35k

You can use sed + grep (as suggested by NicoBxl and Devon) or BioPerl/BioPython (as suggested by Devon) or Heng Li's bioawk:

bioawk -c fastx '$seq ~ /ACCGTACCC/ { print ">"$name"\n"\$seq; }' #might need a bit of tweaking
1
Entering edit mode
6.9 years ago
cat fasta.fa | grep -B1 "ACCGTACCC" > out.fa
1
Entering edit mode

It should be noted that this won't work if there are multi-line entries (though one could use sed to reformat things to get around that).

0
Entering edit mode

of course. for that you could use fastx (fasta-formatter)

1
Entering edit mode
6.9 years ago

Use biopython or bioperl. With biopython, you could either use the re module or even just find() on the str() representation of each sequence. Either of these should be simple enough if you're familiar with either perl or python.

1
Entering edit mode
6.9 years ago

In this case I've changed the duplicate defline since pyfaidx requires unique sequence ids. You could mangle all the key names by passing your own key_function if you like.

1
Entering edit mode
4.3 years ago

Another option, using BBMap:

bbduk.sh in=file.fa out=unmatched.fa outm=matched.fa literal=ACCGTACCC mm=f k=9 rcomp=f


This optionally allows some number of mismatches, and matching reverse-complements (if rcomp=t), which are often helpful.