adit sequences ID
1
0
Entering edit mode
9.4 years ago
Kurban ▴ 230

Hi guys,

I have two files, a fasta file (containing a sequences and sequence IDs) and a text file containing a list of sequence ID.

I would like to extract the respective sequence for each of the sequence IDs in the text file. then I searched the net and found the following commend can do the job:

cut -c 2- EXAMPLE.TXT | xargs -n 1 samtools faidx EXAMPLE.FA

I used it, but it gives me the error message like this :

[fai_fetch] Warning - Reference omp2352_c1_seq1 not found in FASTA file, returning empty sequence
xargs: samtools: terminated by signal 11

My fasta file is like this:

>comp904_c0_seq2 len=1452 path=[4675:0-87 7837:88-89 7839:90-96 6352:97-97 6299:98-1223 5917:1224-1227 5921:1228-1451]
TCTAATAAACCATCCATTCATTCACACCACCACTCTTGCTCATTGGAGACACCATGCACT
AATCAAAAAAAAAACCTTAATGTATATTAAAAAAAAATAATAGCGGGAGGAGGAAGAGAC
GGGAGGGGCTTTTCTATCAAGCTTTTCTCATAAATAATACCGAAAGTACGATCAAATTTT
TCTCATTGTTTTTCTTGCTGTGATAGAAAAAGATGCGCGCGTGCCGAACGAGGGGAAACG
GGGAAACAAGAGAGAACGAAGAAAAGGGAACAAGGATGAAAAATAAATCGCTCATCGTAC
AACACGAGAGGACAAAAGGGTCGTAAATAGTAATAGTAGTAATAGTAAGTAAGTAAATAG
GAGGAGAGGCTCTCTCATTCATACAAATTGAGACAAACAGATACAATATAGAGTTACGGC
TAAGATAATTATAATAATACAGTATGTACTGGGATCGAGACGGTTCCCGTCTCTGATTCA
AATGTCTCTGTGGGCCTGGCCCGCACTACACAGCACTGTTAGAGTTCGTAATTTTTATTT
TATTTTTATTTTTTTTTAATAATGTGTGGAGATGATTGTGTCGTTCGCAAGTTACGACAC
GTCCTGGTCCTCTACGTCCTGGAGTTGTTCTTTGGGTTCCGCTTCTCCGTCCCCTTGCAT
ATCACTGGTCCACAGCGTCAGGTTGTCGCGCAACAGTTGCATGATCAACGTGCTGTCTTT
ATAGCTCTCCTCGGATAAAGTGTCTAGTTCTGCAATGGCGTCGTCGAAGGCGGCTTTCGC
TAGACGACAGGCCCTGTCCGGACTGTTCAATATTTCGTAATAGAATACGGAGAAATTAAG
GGCCAATCCGAGCCTGATCGGATGAGTGGGTGGCAGTTCTGTCATTGCGATATCACTGGC
TGATTTGTAAGCGACCAACGAATGTTCAGCGGCGTCTTTACGGTCGTTGCCTGTCGCGAA
TTCGGCCAGATATCGGTGGTAGTCACCCTTCATTTTGTAATAAAACACTTTAGATTCGCC
CGTGGAAGCCGCCGGGATAAGATGTTTGTCCAAAACGGCCAATATATCCGAACAGATGTC
CCTCAACTCCTTCTCCACCTGCGCCCGGTACTGCCTTATCATCTCCAACTTGTCGTCCGT
CCCTTTGCTCTCTTCTTTCTGTTCGATCGAGGAGATTATACGCCAAGAAGCCCTCCGTGC
ACCTATCACGTTCTTGTAAGCGACAGAAAGAAGGTTTCGCTCCTCCACGGTCAGCTCCAG
GTCCAGCTTGGCCACCTTCTTCATAGCGTCCACCATTTCATCGTAACGTTCGGCCTGCTC
CGCCAGTTTCGCCTTGTAGACGTTATCCTCCCGTTCAGACATATTGCGATGATTACCGAT
TGAAACTCCACAACAACACTTTATTCACTCGAGACACTCCGCGTACTGCCAATATGGCCG
CGCCCGAGATCG

ID list file like this:

comp2352_c1_seq1
comp3842_c0_seq2
comp3842_c0_seq4
comp3842_c0_seq6
comp2145_c1_seq3
comp2145_c1_seq4
comp5304_c1_seq5
comp5696_c0_seq18
comp5237_c0_seq5
comp5237_c0_seq7
comp7076_c0_seq2

I am new at kind of work. could any one give me some suggestions!

How could I make the samtools work? should I change the form of sequences ID in both file?

e.g. for the first file(fasta file) just keep the >comp904_c0_seq2 part, erase the rest from the ID?

For ID.txt file should I add > before the each sequence's ID?

If that works, how can I do that? because there r many sequences in fasta file , it is hard to do that one by one?

sequence • 2.6k views
ADD COMMENT
0
Entering edit mode

The error is because you do not have '>' in your IDs file. Instead of the cut, just use a cat.

ADD REPLY
0
Entering edit mode

thanks @RamRS, yeah, after add ">" , it worked fine.

ADD REPLY
0
Entering edit mode

You could use sed:

sed -re 's/^/>/' <EXAMPLE.TXT | xargs -n 1 samtools faidx EXAMPLE.FA
ADD REPLY
0
Entering edit mode
9.4 years ago
Ram 43k

I wrote a piece of Perl code for this some time back. Here you go: https://github.com/RamRS/myPerlScripts/blob/master/filterMulFa.pl

If you prefer Python (or would like to extract a sequence multiple times if an ID is seen multiple times in the ID file), this should help: https://github.com/RamRS/myPyScripts/blob/master/extract_fa_subset.py

ADD COMMENT
0
Entering edit mode

Thanks @RamRS.

but could u give it a look at this :

kurban@kurban-X550VC:~/Desktop$ ls
22.pl  22.pl~  gene.fasta  my.pl  my.pl~  tf.IDs
kurban@kurban-X550VC:~/Desktop$ perl my.pl gene.fasta tf.IDs
Can't locate Getopt/ArgParse.pm in @INC (you may need to install the Getopt::ArgParse module) (@INC contains: /etc/perl /usr/local/lib/perl/5.18.2 /usr/local/share/perl/5.18.2 /usr/lib/perl5 /usr/share/perl5 /usr/lib/perl/5.18 /usr/share/perl/5.18 /usr/local/lib/site_perl .) at my.pl line 5.
BEGIN failed--compilation aborted at my.pl line 5.

what is it say?

by the way I am using ubuntu 14.04

ADD REPLY
0
Entering edit mode

Ah, you need to install Getopt::Argparse using CPAN. Open up a new terminal, and type:

$ cpan
cpan> install cpan
cpan> reload cpan
cpan> install Getopt::Argparse
cpan> install CJFIELDS/BioPerl-1.6.901.tar.gz
ADD REPLY
0
Entering edit mode

thanks @RamRS.

ADD REPLY

Login before adding your answer.

Traffic: 2530 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6