Question: How to extract sequence from fasta by sequence similarity
2.7 years ago by
Janey30
USA
Janey30 wrote:

Hello, I have two FASTA files with different IDs, which belong to two genotypes. The first FASTA file consists of 100 contigs, while the second file includes 100,000 contigs. I want to extract from the second file the contigs that are the same as those in the first file. I thank you for your suggestions.

Thank you

alignment • 1.1k views
ADD COMMENTlink modified 2.7 years ago by nuketbilgen30 • written 2.7 years ago by Janey30

Hi,

You can convert your 100,000-contig FASTA file to TSV using fasta_formatter from the FASTX-Toolkit.

Then use grep with the --file option to supply your text file (list of 100 IDs) of patterns.

You can use Galaxy, too.
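As a rough sketch of the two steps above, assuming the IDs in your list really do occur in the headers of the large file: the awk step below stands in for fasta_formatter (it is not the FASTX-Toolkit itself), and the function names fasta_to_tsv and extract_by_id are made up for illustration.

```shell
# fasta_to_tsv FASTA: print "header<TAB>sequence" per record, joining wrapped lines.
# (Hypothetical helper; a stand-in for fasta_formatter's tabular output.)
fasta_to_tsv() {
  awk '/^>/ { if (n++) print hdr "\t" seq; hdr = substr($0, 2); seq = ""; next }
       { seq = seq $0 }
       END { if (n) print hdr "\t" seq }' "$1"
}

# extract_by_id IDS FASTA: keep records matching any ID listed in IDS (one per line).
# -F treats each ID as a fixed string, -w requires a whole-word match, so "contig1"
# does not pick up "contig10". Note that grep looks at the whole line, not only
# the header column.
extract_by_id() {
  fasta_to_tsv "$2" |
    grep -F -w -f "$1" |
    awk -F '\t' '{ print ">" $1; print $2 }'
}
```

With a list called ids.txt (an assumed name), this would be run as: extract_by_id ids.txt in.fa > out.fa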

ADD REPLYlink modified 2.7 years ago • written 2.7 years ago by Farbod3.3k
2.7 years ago by
venu6.2k
Germany
venu6.2k wrote:

Assuming each sequence is on a single line (otherwise, linearize both FASTA files first):

sed '/^>/d' file_1.fa | while IFS= read -r line; do grep -F -B 1 -- "$line" file_2.fa >> foo.res.txt; done

With this approach you don't need to worry if the headers differ for the same contig in the two files. Note that it also reports contigs in file_2.fa that merely contain a file_1.fa sequence as a substring. If the headers are the same in both files, you can proceed with faSomeRecords as mentioned in the other answers.
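In case your sequences are wrapped over several lines, linearizing could look roughly like this. It is only a sketch: linearize_fasta is a made-up name, and it assumes plain FASTA with ">" header lines.

```shell
# linearize_fasta FASTA: print every record as one header line followed by
# exactly one sequence line (wrapped sequence lines are joined).
linearize_fasta() {
  awk '/^>/ { if (n++) print seq; print; seq = ""; next }
       { seq = seq $0 }
       END { if (n) print seq }' "$1"
}
```

For example: linearize_fasta file_1.fa > file_1.lin.fa, and the same for file_2.fa, before running the loop on the linearized files.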

ADD COMMENTlink modified 2.7 years ago • written 2.7 years ago by venu6.2k
2.7 years ago by
Sej Modha4.2k
Glasgow, UK
Sej Modha4.2k wrote:

You can either BLAST them against each other or use a clustering program such as CD-HIT to cluster identical sequences together.
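A minimal sketch of the BLAST route, assuming BLAST+ (blastn) is installed. The function names blast_contigs and report_pairs are invented for this example, and the identity cutoff is whatever you pass in. Keep in mind that the percent identity in column 3 describes a local alignment, so you may also want to check the alignment length in column 4.

```shell
# blast_contigs QUERY SUBJECT: tabular hits (outfmt 6) of QUERY contigs against SUBJECT contigs.
blast_contigs() {
  blastn -query "$1" -subject "$2" -outfmt 6
}

# report_pairs MIN_IDENTITY: read outfmt 6 lines on stdin and report each
# query/subject pair once if a hit reaches the cutoff.
# Columns used: 1 = query ID, 2 = subject ID, 3 = percent identity.
report_pairs() {
  awk -F '\t' -v min="$1" '
    $3 + 0 >= min + 0 && !seen[$1 FS $2]++ {
      printf "ID %s from file 1 is similar to ID %s from file 2 (%.2f%% identity)\n", $1, $2, $3
    }'
}
```

It could then be run as: blast_contigs file_1.fa file_2.fa | report_pairs 98. For 100,000 subject contigs, building a database with makeblastdb and searching it with -db is probably faster than -subject.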

ADD COMMENTlink written 2.7 years ago by Sej Modha4.2k
2.7 years ago by
Farbod3.3k
Toronto
Farbod3.3k wrote:

Dear Janey, Hi

You can create a list of your 100 IDs (a text file with each ID on its own line; this is your listFile) and then use a script or tool such as faSomeRecords to extract the sequences with those IDs from the 100,000-contig file (called in.fa here):

./faSomeRecords in.fa listFile out.fa
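If you do not have the ID list yet, it could be pulled out of the 100-contig file along these lines. This is a sketch only: list_ids is a made-up helper name, and it assumes the ID is the first word after ">" in each header.

```shell
# list_ids FASTA: print the ID (first word after ">") of every record, one per line.
list_ids() {
  sed -n 's/^>\([^[:space:]]*\).*/\1/p' "$1"
}
```

For example: list_ids file_1.fa > listFile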

I hope I understood your question correctly.

~ Best

ADD COMMENTlink modified 2.7 years ago • written 2.7 years ago by Farbod3.3k
2.7 years ago by
Janey30
USA
Janey30 wrote:

Thanks for the answers, my friends, but I need a tool or software that finally tells me: ID 23 from file 1 has a similar sequence to ID 666 from file 2.

ADD COMMENTlink written 2.7 years ago by Janey30

I think your title was not very clear ;-)

And the threshold of "similarity" is a problem here.

Are you searching for exact matches ?

ADD REPLYlink modified 2.7 years ago • written 2.7 years ago by Farbod3.3k

Hi Farbod, yes, about 98-100% similarity.

ADD REPLYlink written 2.7 years ago by Janey30

Then just search the second file (think of it as the "reference") with the first file as the query, using any NGS aligner (and look for 100% matches?). bowtie v.1 may be the best tool if these are raw Illumina sequences.

ADD REPLYlink written 2.7 years ago by genomax69k
2.7 years ago by
nuketbilgen30
United Kingdom
nuketbilgen30 wrote:

Hi, how about zipped FASTQ files? The zgrep command is not working. :/

ADD COMMENTlink modified 2.7 years ago • written 2.7 years ago by nuketbilgen30