Extract Fasta Sequences From File Which Are Not Present In Other Fasta File
4
0
Entering edit mode
11.0 years ago

I have two fasta files say for example a1.fna and a2.fna a1.fna file's all reads are present in a2.fna now i want to extract reads along with its header and sequence from a2.fna which are not present in a1.fna means the reads other than those who are present in a1.fna can i do this using samtools or any other linux commands??

a1.fna

>H8X10ID02GRZSV length=47 xy=2662_0365 region=2 run=R_2013_05_04_16_13_20_    
GAGTTTTGATCCTCGCGTCAGGAGGTGGTGGACGAGATATAACGACG    
>H8X10ID02G9ZWR length=60 xy=2867_0633 region=2 run=R_2013_05_04_16_13_20_    
AGGTTCGATTCCGGAGAGCGAAGCATTTGCCAAGTCGACTGCCAAGGCACACAGGGAGTA

a2.fna

>H8X10ID02GRZSV length=47 xy=2662_0365 region=2 run=R_2013_05_04_16_13_20_    
GAGTTTTGATCCTCGCGTCAGGAGGTGGTGGACGAGATATAACGACG    
>H8X10ID02G9ZWR length=60 xy=2867_0633 region=2 run=R_2013_05_04_16_13_20_    
AGGTTCGATTCCGGAGAGCGAAGCATTTGCCAAGTCGACTGCCAAGGCACACAGGGAGTA    
>H8X10ID02IKYPK length=55 xy=3402_0550 region=2 run=R_2013_05_04_16_13_20_    
GCCAGCAGCGCGGTAACGAGCGCAACTAGTCGACTGCCAAGGCACACAGGGAGTA    
>H8X10ID02HSM36 length=44 xy=3079_2164 region=2 run=R_2013_05_04_16_13_20_    
GAGTTTTGATCCTGGCTCAGTCTCCGACGACTACACGACGACTG    
>H8X10ID02I10R9 length=70 xy=3596_1767 region=2 run=R_2013_05_04_16_13_20_    
AGGTTCGATTCCGAGAGGATGAGATGGCGAAAGCATTGCCAAAGTCGACTGCCAAAGGCA    
CACAGGGGGA

output file should contain

>H8X10ID02IKYPK length=55 xy=3402_0550 region=2 run=R_2013_05_04_16_13_20_    
GCCAGCAGCGCGGTAACGAGCGCAACTAGTCGACTGCCAAGGCACACAGGGAGTA    
>H8X10ID02HSM36 length=44 xy=3079_2164 region=2 run=R_2013_05_04_16_13_20_    
GAGTTTTGATCCTGGCTCAGTCTCCGACGACTACACGACGACTG    
>H8X10ID02I10R9 length=70 xy=3596_1767 region=2 run=R_2013_05_04_16_13_20_    
AGGTTCGATTCCGAGAGGATGAGATGGCGAAAGCATTGCCAAAGTCGACTGCCAAAGGCA    
CACAGGGGGA

thank you in advance...

samtools • 6.8k views
ADD COMMENT
1
Entering edit mode

Please specify, do you want to find such reads on basis of just 'header' mismatch or 'sequence' mismatch?

ADD REPLY
0
Entering edit mode

@Vari i have edited the question i hope it satisfies your question

ADD REPLY
1
Entering edit mode

@Vari i have shown some part of my files in the question thank you for reply...

ADD REPLY
0
Entering edit mode

In your question, try to show at least few lines of your input files from your actual data.
It is possible with perl too.

ADD REPLY
0
Entering edit mode
ADD REPLY
3
Entering edit mode
11.0 years ago
arnstrm ★ 1.8k
grep ">" a1.fna  > a1.txt #get descriptions from 1st file
grep ">" a2.fna > a2.txt #get descriptions from 2nd file
comm -13 <(sort a1.txt) <(sort a2.txt) > IDS.txt #suppress lines unique to a1 and lines common in a1.txt and a2.txt

Now you have the list of IDs that are present in a2 only (IDS.txt). This Ids can be used to extract the sequences from a2.fna through various methods (standlaone blast: blastdbcmd, QIIME: http://qiime.org/scripts/extract_seqs_by_sample_id.html, cdbyank http://nebc.nox.ac.uk/bioinformatics/docs/cdbyank.html) or using simple perl or awk scripts. Hope this helps. EDIT: my assumption was that you have fasta sequences of reasonable size(and not reads). Obviously it won't work on large files efficiently.

ADD COMMENT
1
Entering edit mode

I don't think those approaches will work with the above commands, as they are now. You have to be careful to remove the leading ">" from the IDs since most programs don't consider that as part of the identifier. So, something like grep ">" seqs.fas | sed 's/^>//' | sort > seqs_idlist is probably better.

ADD REPLY
0
Entering edit mode

It's probably also possible just to grep from file to file.

ADD REPLY
0
Entering edit mode

@arnstrm its not working IDS.txt is giving me headers present in a1.fna is there any other way of doing this thank you for reply

ADD REPLY
0
Entering edit mode
grep ">" a1.fna |sed 's/^>//g' |awk '{print $1}' > a1.txt
grep ">" a2.fna |sed 's/^>//g' |awk '{print $1}' > a2.txt
comm -13 <(sort a1.txt) <(sort a2.txt)  > IDS.txt

Try this. I hadn't seen the input files and I assumed it to have only gene IDS in the defline.

ADD REPLY
1
Entering edit mode
11.0 years ago
bow ▴ 790

Here's one way to do it, using Biopython. I'm assuming the IDs itself are enough to identify whether two sequences are the same or not (i.e. two identical sequences always share the same ID and records with different IDs always have different sequences).

from Bio import SeqIO

# get the IDs in a1.fna
a1_ids = set([rec.id for rec in SeqIO.parse('a1.fna', 'fasta')])

# get entries unique to a2.fna
a2_unique = (rec for rec in SeqIO.parse('a2.fna', 'fasta') if rec.id not in a1_ids)

# write unique entries to new file: a2_unique.fna
with open('a2_unique.fna', 'w') as target:
    SeqIO.write(a2_unique, target, 'fasta')
ADD COMMENT
1
Entering edit mode
11.0 years ago

With Biopieces www.biopieces.org) its a variant of this HowTo - the difference being the -i in grab:

read_fasta -i a1.fna | write_tab -k SEQ_NAME -o a1.tab -x
read_fasta -i a2.fna | grab -i -E a1.tab -k SEQ_NAME | write_fasta -o a2_without_a1.fna -x
ADD COMMENT
0
Entering edit mode
11.0 years ago

That doesn't sound like something SAMtools would do. You should be able to write a simple shell or perl script for that though or possibly even simply use a series of UNIX commands. The -v option in JOIN could come in handy, although you'd have to be careful of sequence lines that aren't unique to one particular part of the file.

Hope this helps.

ADD COMMENT

Login before adding your answer.

Traffic: 3081 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6