Question: Extract Fasta Sequences From File Which Are Not Present In Other Fasta File
0
gravatar for namratapatel183
7.5 years ago by
namratapatel18310 wrote:

I have two fasta files say for example a1.fna and a2.fna a1.fna file's all reads are present in a2.fna now i want to extract reads along with its header and sequence from a2.fna which are not present in a1.fna means the reads other than those who are present in a1.fna can i do this using samtools or any other linux commands??

a1.fna

>H8X10ID02GRZSV length=47 xy=2662_0365 region=2 run=R_2013_05_04_16_13_20_    
GAGTTTTGATCCTCGCGTCAGGAGGTGGTGGACGAGATATAACGACG    
>H8X10ID02G9ZWR length=60 xy=2867_0633 region=2 run=R_2013_05_04_16_13_20_    
AGGTTCGATTCCGGAGAGCGAAGCATTTGCCAAGTCGACTGCCAAGGCACACAGGGAGTA

a2.fna

>H8X10ID02GRZSV length=47 xy=2662_0365 region=2 run=R_2013_05_04_16_13_20_    
GAGTTTTGATCCTCGCGTCAGGAGGTGGTGGACGAGATATAACGACG    
>H8X10ID02G9ZWR length=60 xy=2867_0633 region=2 run=R_2013_05_04_16_13_20_    
AGGTTCGATTCCGGAGAGCGAAGCATTTGCCAAGTCGACTGCCAAGGCACACAGGGAGTA    
>H8X10ID02IKYPK length=55 xy=3402_0550 region=2 run=R_2013_05_04_16_13_20_    
GCCAGCAGCGCGGTAACGAGCGCAACTAGTCGACTGCCAAGGCACACAGGGAGTA    
>H8X10ID02HSM36 length=44 xy=3079_2164 region=2 run=R_2013_05_04_16_13_20_    
GAGTTTTGATCCTGGCTCAGTCTCCGACGACTACACGACGACTG    
>H8X10ID02I10R9 length=70 xy=3596_1767 region=2 run=R_2013_05_04_16_13_20_    
AGGTTCGATTCCGAGAGGATGAGATGGCGAAAGCATTGCCAAAGTCGACTGCCAAAGGCA    
CACAGGGGGA

output file should contain

>H8X10ID02IKYPK length=55 xy=3402_0550 region=2 run=R_2013_05_04_16_13_20_    
GCCAGCAGCGCGGTAACGAGCGCAACTAGTCGACTGCCAAGGCACACAGGGAGTA    
>H8X10ID02HSM36 length=44 xy=3079_2164 region=2 run=R_2013_05_04_16_13_20_    
GAGTTTTGATCCTGGCTCAGTCTCCGACGACTACACGACGACTG    
>H8X10ID02I10R9 length=70 xy=3596_1767 region=2 run=R_2013_05_04_16_13_20_    
AGGTTCGATTCCGAGAGGATGAGATGGCGAAAGCATTGCCAAAGTCGACTGCCAAAGGCA    
CACAGGGGGA

thank you in advance...

samtools • 4.2k views
ADD COMMENTlink modified 7.5 years ago • written 7.5 years ago by namratapatel18310
1

Please specify, do you want to find such reads on basis of just 'header' mismatch or 'sequence' mismatch?

ADD REPLYlink modified 7.5 years ago • written 7.5 years ago by Nari880

@Vari i have edited the question i hope it satisfies your question

ADD REPLYlink written 7.5 years ago by namratapatel18310
1

@Vari i have shown some part of my files in the question thank you for reply...

ADD REPLYlink written 7.5 years ago by namratapatel18310

In your question, try to show at least few lines of your input files from your actual data.
It is possible with perl too.

ADD REPLYlink modified 7.5 years ago • written 7.5 years ago by Nari880

very similar to

Compare two fasta files by headers

ADD REPLYlink written 7.5 years ago by Pierre Lindenbaum131k
3
gravatar for arnstrm
7.5 years ago by
arnstrm1.8k
Ames, IA
arnstrm1.8k wrote:
grep ">" a1.fna  > a1.txt #get descriptions from 1st file
grep ">" a2.fna > a2.txt #get descriptions from 2nd file
comm -13 <(sort a1.txt) <(sort a2.txt) > IDS.txt #suppress lines unique to a1 and lines common in a1.txt and a2.txt

Now you have the list of IDs that are present in a2 only (IDS.txt). This Ids can be used to extract the sequences from a2.fna through various methods (standlaone blast: blastdbcmd, QIIME: http://qiime.org/scripts/extract_seqs_by_sample_id.html, cdbyank http://nebc.nox.ac.uk/bioinformatics/docs/cdbyank.html) or using simple perl or awk scripts. Hope this helps. EDIT: my assumption was that you have fasta sequences of reasonable size(and not reads). Obviously it won't work on large files efficiently.

ADD COMMENTlink modified 7.5 years ago • written 7.5 years ago by arnstrm1.8k
1

I don't think those approaches will work with the above commands, as they are now. You have to be careful to remove the leading ">" from the IDs since most programs don't consider that as part of the identifier. So, something like grep ">" seqs.fas | sed 's/^>//' | sort > seqs_idlist is probably better.

ADD REPLYlink written 7.5 years ago by SES8.4k

It's probably also possible just to grep from file to file.

ADD REPLYlink written 7.5 years ago by PoGibas4.8k

@arnstrm its not working IDS.txt is giving me headers present in a1.fna is there any other way of doing this thank you for reply

ADD REPLYlink written 7.5 years ago by namratapatel18310
grep ">" a1.fna |sed 's/^>//g' |awk '{print $1}' > a1.txt
grep ">" a2.fna |sed 's/^>//g' |awk '{print $1}' > a2.txt
comm -13 <(sort a1.txt) <(sort a2.txt)  > IDS.txt

Try this. I hadn't seen the input files and I assumed it to have only gene IDS in the defline.

ADD REPLYlink modified 7.5 years ago • written 7.5 years ago by arnstrm1.8k
1
gravatar for bow
7.5 years ago by
bow790
Netherlands
bow790 wrote:

Here's one way to do it, using Biopython. I'm assuming the IDs itself are enough to identify whether two sequences are the same or not (i.e. two identical sequences always share the same ID and records with different IDs always have different sequences).

from Bio import SeqIO

# get the IDs in a1.fna
a1_ids = set([rec.id for rec in SeqIO.parse('a1.fna', 'fasta')])

# get entries unique to a2.fna
a2_unique = (rec for rec in SeqIO.parse('a2.fna', 'fasta') if rec.id not in a1_ids)

# write unique entries to new file: a2_unique.fna
with open('a2_unique.fna', 'w') as target:
    SeqIO.write(a2_unique, target, 'fasta')
ADD COMMENTlink modified 7.5 years ago • written 7.5 years ago by bow790
1
gravatar for Martin A Hansen
7.5 years ago by
Martin A Hansen3.0k
Denmark
Martin A Hansen3.0k wrote:

With Biopieces www.biopieces.org) its a variant of this HowTo - the difference being the -i in grab:

read_fasta -i a1.fna | write_tab -k SEQ_NAME -o a1.tab -x
read_fasta -i a2.fna | grab -i -E a1.tab -k SEQ_NAME | write_fasta -o a2_without_a1.fna -x
ADD COMMENTlink written 7.5 years ago by Martin A Hansen3.0k
0
gravatar for Matt Miossec
7.5 years ago by
Matt Miossec350
Universidad Andrés Bello
Matt Miossec350 wrote:

That doesn't sound like something SAMtools would do. You should be able to write a simple shell or perl script for that though or possibly even simply use a series of UNIX commands. The -v option in JOIN could come in handy, although you'd have to be careful of sequence lines that aren't unique to one particular part of the file.

Hope this helps.

ADD COMMENTlink modified 7.5 years ago • written 7.5 years ago by Matt Miossec350
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1541 users visited in the last hour