How to extract contigs that contain a specific sequence from a FASTA file

3

10.5 years ago

Paul ★ 1.5k

Dear all,

Do you have any idea how to easily extract the contigs that contain a specific sequence from a FASTA file?

For example:

My sequence:

ACCGTACCC

My FASTA:

>c1042
ACCGTACCC
>c1043
GCTACAGTTGAAAGGGGACCGTACCC
>c1044
ATGAATAAAATAATTTTGTATCATAAATCGAGCTGTTAATTATT
>c1044
TTCATATTTGTAGCTAAGCAGAGGCGAAGCGTTCTTGTATCG

My output:

>c1042
ACCGTACCC
>c1043
GCTACAGTTGAAAGGGGACCGTACCC

Thank you so much for any ideas and help.

fasta find extraction contig • 7.2k views

Updated 3.3 years ago by Ram 45k • written 10.5 years ago by Paul ★ 1.5k

0

Hello. Is there some way to do the same with biopython? Thanks

7.9 years ago by joselu ▴ 110

0

Please see Devon Ryan's answer for suggestions on using Biopython, or open a new question with examples and a description of what you have tried.

7.9 years ago by st.ph.n ★ 2.7k

0

This is not an answer. This should be a comment or a new post. If you're creating a new post, you should reference this post in addition to elaborating on what you've tried.

7.9 years ago by Ram 45k

4

10.5 years ago

Ram 45k

You can use sed + grep (as suggested by NicoBxl and Devon) or BioPerl/BioPython (as suggested by Devon) or Heng Li's bioawk:

bioawk -c fastx '$seq ~ /ACCGTACCC/ { print ">"$name"\n"$seq; }' #might need a bit of tweaking

Updated 3.3 years ago by Ram 45k

1

10.5 years ago

Nicolas Rosewick 11k

grep -B1 "ACCGTACCC" fasta.fa | grep -v '^--$' > out.fa

Updated 3.3 years ago by Ram 45k • written 10.5 years ago by Nicolas Rosewick 11k

1

It should be noted that this won't work if there are multi-line entries (though one could use sed to reformat things to get around that).
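As an alternative to reformatting with sed, here is a minimal pure-Python sketch that joins wrapped sequence lines before searching, so multi-line entries are handled. The function names `read_fasta` and `filter_fasta` are made up for illustration, and the example records are shortened, wrapped versions of the ones in the question.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs, joining wrapped sequence lines."""
    header = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_fasta(lines: Iterable[str], motif: str) -> List[Tuple[str, str]]:
    """Return the records whose full (unwrapped) sequence contains motif."""
    return [(h, s) for h, s in read_fasta(lines) if motif in s]


if __name__ == "__main__":
    # Wrapped entries: the motif spans a line break in both matching records.
    example = """>c1042
ACCGT
ACCC
>c1043
GCTACAGTTGAAAGGGGACC
GTACCC
>c1044
ATGAATAAAATAATTTTG
""".splitlines()
    for header, seq in filter_fasta(example, "ACCGTACCC"):
        print(f">{header}\n{seq}")
```

To run it on a real file, pass an open file handle instead of the example list, since a file object is also an iterable of lines.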

Updated 3.3 years ago by Ram 45k • written 10.5 years ago by Devon Ryan 105k

0

Of course. For that you could use fasta_formatter from the FASTX-Toolkit.

Updated 3.3 years ago by Ram 45k • written 10.5 years ago by Nicolas Rosewick 11k

1

10.5 years ago

Devon Ryan 105k

Use biopython or bioperl. With biopython, you could either use the re module or even just find() on the str() representation of each sequence. Either of these should be simple enough if you're familiar with either perl or python.
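For anyone who wants the Biopython route spelled out, here is a minimal sketch of both variants mentioned above, find() on the string form of each sequence and the re module. It assumes Biopython is installed; the function names and the file names `contigs.fa` and `out.fa` are placeholders, not anything from the original post.

```python
import re

from Bio import SeqIO


def contigs_with_motif(fasta_path, motif):
    """Yield records whose sequence contains motif (plain substring search)."""
    for record in SeqIO.parse(fasta_path, "fasta"):
        # str.find() returns -1 when the motif is absent.
        if str(record.seq).find(motif) != -1:
            yield record


def contigs_matching_regex(fasta_path, pattern):
    """Yield records whose sequence matches a regular expression."""
    regex = re.compile(pattern)
    for record in SeqIO.parse(fasta_path, "fasta"):
        if regex.search(str(record.seq)):
            yield record


if __name__ == "__main__":
    # Placeholder file names; replace with your own paths.
    hits = contigs_with_motif("contigs.fa", "ACCGTACCC")
    count = SeqIO.write(hits, "out.fa", "fasta")
    print(f"wrote {count} record(s)")
```

SeqIO.parse joins wrapped sequence lines itself, so multi-line entries need no special handling here.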

Updated 3.3 years ago by Ram 45k • written 10.5 years ago by Devon Ryan 105k

1

10.5 years ago

Matt Shirley 10k

	>c1042
	ACCGTACCC
	>c1043
	GCTACAGTTGAAAGGGGACCGTACCC
	>c1044
	ATGAATAAAATAATTTTGTATCATAAATCGAGCTGTTAATTATT
	>c1045
	TTCATATTTGTAGCTAAGCAGAGGCGAAGCGTTCTTGTATCG

File: 125610.fa

	from pyfaidx import Fasta
	fa = Fasta('125610.fa')
	for sequence in fa:
	    if 'ACCGTACCC' in str(sequence):
	        print('>' + sequence.name)
	        print(sequence)

File: 125610.py

In this case I've changed the duplicate defline since pyfaidx requires unique sequence ids. You could mangle all the key names by passing your own key_function if you like.

Updated 3.3 years ago by Ram 45k • written 10.5 years ago by Matt Shirley 10k

1

7.9 years ago

Brian Bushnell 20k

Another option, using BBMap:

bbduk.sh in=file.fa out=unmatched.fa outm=matched.fa literal=ACCGTACCC mm=f k=9 rcomp=f

This optionally allows some number of mismatches, and matching reverse-complements (if rcomp=t), which are often helpful.
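To illustrate what mismatch-tolerant and reverse-complement matching mean, here is a small pure-Python sketch. It is not how BBDuk works internally (BBDuk is k-mer based); it is a naive sliding-window comparison, and the names `reverse_complement`, `has_match`, `max_mismatches`, and `rcomp` are chosen only for this example.

```python
# Translation table for complementing DNA bases (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def has_match(seq, motif, max_mismatches=0, rcomp=False):
    """Return True if motif occurs in seq with at most max_mismatches
    substitutions; with rcomp=True the reverse complement is also tried."""
    queries = [motif]
    if rcomp:
        queries.append(reverse_complement(motif))
    k = len(motif)
    for query in queries:
        # Compare the query against every window of length k.
        for start in range(len(seq) - k + 1):
            window = seq[start:start + k]
            mismatches = sum(a != b for a, b in zip(window, query))
            if mismatches <= max_mismatches:
                return True
    return False


if __name__ == "__main__":
    contig = "GCTACAGTTGAAAGGGGACCGTACCC"
    print(has_match(contig, "ACCGTACCC"))
    print(has_match(reverse_complement(contig), "ACCGTACCC", rcomp=True))
```

The window scan is quadratic in the worst case, which is fine for a handful of contigs but much slower than a dedicated tool on large assemblies.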

7.9 years ago by Brian Bushnell 20k
