How to extract sequence from fasta by sequence similarity
5
0
7.5 years ago
Janey ▴ 30

Hello, I have two FASTA files with different IDs, which belong to two genotypes. The first FASTA file consists of 100 contigs, while the second file includes 100,000 contigs. I want to extract from the second file the contigs that are the same as those in the first file. Thank you for your suggestions.

Thank you

alignment • 2.5k views
0

Hi,

You can convert your 100,000-contig FASTA to TSV using fasta_formatter from the FASTX-Toolkit.

Then use grep with the --file option to supply your text file (list of 100 IDs) of patterns.

You can use Galaxy, too.
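
A minimal sketch of this ID-based route, assuming the IDs are the same in both files. Two things here are substitutions, not what the post prescribes: plain awk does the FASTA-to-TSV step in case fasta_formatter is not installed, and an exact match on the ID column replaces grep --file so that an ID like c1 cannot also pick up c10. The file names (big.fa, ids.txt, out.fa) and the toy records are made up for illustration.

```shell
# Toy inputs (hypothetical names and sequences, for illustration only).
cd "$(mktemp -d)" || exit 1
printf '>c1\nACGT\nACGT\n>c2\nGGCC\n>c10\nTTTT\n' > big.fa
printf 'c1\nc10\n' > ids.txt

# 1) FASTA -> TSV: one "ID<TAB>sequence" line per record.
awk '/^>/ { if (id != "") print id "\t" seq; id = substr($1, 2); seq = ""; next }
     { seq = seq $0 }
     END { if (id != "") print id "\t" seq }' big.fa > big.tsv

# 2) Keep only the rows whose first column is listed in ids.txt.
awk -F '\t' 'NR == FNR { want[$1]; next } $1 in want' ids.txt big.tsv > hits.tsv

# 3) TSV -> FASTA.
awk -F '\t' '{ print ">" $1; print $2 }' hits.tsv > out.fa
```

The ID is taken as the first whitespace-delimited word of each header line, so any description after the ID is dropped.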

1
7.5 years ago
venu 7.1k

Assuming each sequence is on a single line (otherwise, linearize both FASTA files first):

sed '/^>/d' file_1.fa | while read -r line; do grep -B 1 "$line" file_2.fa >> foo.res.txt; done

With this approach you don't need to worry if the headers differ for the same contig in the two files. If the headers are the same in both files, you can proceed with faSomeRecords as mentioned in the other answers.
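
The "linearize" step can be done with awk. A small sketch, assuming uncompressed plain-text FASTA: `linearize` is a hypothetical helper name, and the toy contents of file_1.fa and file_2.fa are made up. The loop is the one above, with `-F` added so that each sequence is treated as a fixed string and not as a regular expression.

```shell
# Toy inputs (made-up sequences; a1 and x9 carry the same sequence).
cd "$(mktemp -d)" || exit 1
printf '>a1\nACGT\nTTAA\n' > file_1.fa
printf '>x7\nGGGG\nCCCC\n>x9\nACGTTT\nAA\n' > file_2.fa

# Join the wrapped sequence lines of each record into one line.
linearize() {
  awk '/^>/ { if (seq != "") print seq; print; seq = ""; next }
       { seq = seq $0 }
       END { if (seq != "") print seq }' "$1"
}

linearize file_1.fa > file_1.lin.fa
linearize file_2.fa > file_2.lin.fa

# For every sequence in file 1, print the matching record of file 2
# (header line plus sequence line).
sed '/^>/d' file_1.lin.fa | while read -r line; do
  grep -B 1 -F -- "$line" file_2.lin.fa
done > foo.res.txt
```

Like the original one-liner, this also reports contigs of file 2 that merely contain a file 1 sequence as a substring. Adding `-x` to grep would restrict it to whole-line (full-length) matches.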

0
7.5 years ago
Sej Modha 5.3k

You can either BLAST them against each other or use a clustering program like CD-HIT to cluster identical sequences together.

0
7.5 years ago
Farbod ★ 3.4k

Dear Janey, hi,

You can create a list of your 100 IDs (a text file with one ID per line; this is your listFile) and then use a script or tool such as faSomeRecords to extract the sequences with those IDs from the 100,000-contig file (called in.fa here):

./faSomeRecords in.fa listFile out.fa

I hope I understood your point correctly.

~ Best

0
7.5 years ago
Janey ▴ 30

Thanks for the answers, my friends, but I need a tool or software that finally tells me: ID 23 from file 1 has a similar sequence to ID 666 from file 2.

0

I think your title was not very clear ;-)

And the threshold of "similarity" is a problem here.

Are you searching for exact matches?

0

Hi Farbod, yes, about 98-100% similarity.

0

Then just search the second file (think of it as the "reference") with the contigs from the first, using any NGS aligner (and look for 100% matches?). bowtie v.1 may be the best tool if these are raw Illumina sequences.
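
For the strict 100% case, a pairing of the kind "ID 23 from file 1 matches ID 666 from file 2" can be sketched without an aligner. This is only an illustration under stated assumptions: sequences must be identical over their full length (case is ignored), reverse complements are not considered, and file_1.fa is non-empty and small enough to hold in memory. The file names, the output name pairs.tsv, and the toy records are made up. Matches in the 98-99% range still need an aligner or BLAST, as suggested in this thread.

```shell
# Toy inputs (made-up sequences; 23 and 666 carry the same sequence).
cd "$(mktemp -d)" || exit 1
printf '>23\nACGT\nTTAA\n>24\nCCCC\n' > file_1.fa
printf '>665\nGGGG\n>666\nacgttt\naa\n' > file_2.fa

# Pass 1 (file_1.fa): remember which ID owns each sequence.
# Pass 2 (file_2.fa): print "file1_ID<TAB>file2_ID" for identical sequences.
awk '
  function emit() {
    if (id == "") return
    if (pass == 1) owner[seq] = id
    else if (seq in owner) print owner[seq] "\t" id
  }
  FNR == 1 { emit(); id = ""; seq = ""; pass++ }   # a new input file starts
  /^>/     { emit(); id = substr($1, 2); seq = ""; next }
           { seq = seq toupper($0) }               # join wrapped lines
  END      { emit() }
' file_1.fa file_2.fa > pairs.tsv

cat pairs.tsv
```

If two contigs in file 1 share the same sequence, only the last one is kept as the owner, so duplicated inputs would need extra handling.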

0
7.5 years ago
nuketbilgen ▴ 40

Hi, how about zipped FASTQ files? The zgrep command is not working. :/
