Question: A complex situation to retreive subset of sequence data
0
gravatar for mforthman
3.4 years ago by
mforthman30
mforthman30 wrote:

I have a file of all the exons from a genome in fasta format, for which I only want to retrieve a subset of (based on a another file with sequence IDS). The file contents look like so:

>OFAS000001-RA-EXON01
ATGGGAAACATGAAGAGAGATCTGACCAGCATGGCTTTATCTCCAGACTCGACATACAAGATCCTTGAAGAAGTCAAAGAGGAATCCATTGAATCTAATCTTTCCTCTCTTGATGGAGTCTTAACTCTTAAAGCACCTAAAAAGGCTCTAGGGGATGAAATA
>OFAS000001-RA-EXON02
AGCATTGAATATAAGGTAGAACCAAGGAGCCTGGACAATGTACCTGGTAGTGGACGTAAGTATTTTACAAATTTCAGTCCATTCTTGGGGAGATTCTAA
>OFAS000002-RA-EXON01
ATGATCAAGCCATCTTGCCGT
>OFAS000002-RA-EXON02
TTGCAGAGGGCCTTACTGCTCCTCGTGGCAGTACTATTTTTCCAGGACCCAAATGGCATCGCATCGGCTCCACTCAGCGACGAGTCTAGTACTAAACTCAATGAAATAGACGTGACCCGCGAGGATTCACCTATAGAAAATGGATTAATTGCCAGTTCTGGGACACCCATCATCTCCTCTATAATCAGTAATAATTCGGGACTCCCAGAAGAAGCTGAGCCATCACTTTCAGAACCTTTAAATGAAGAGAGATTAGAAACGAATCAAGAAATTCAAGAACCAATCGGGAAAAACCCTCCACAAGAAAACAACATTGAAGAGCA
>OFAS000003-RA-EXON01
ATGCAGGACACAGATGTGCTCACCTGCGGTGCCTGCCAGAAGGCGTTTGCTCTTGGCGACATCGTCAAGTTCATCCAGCACAAGGTGCTCGCGTGCAACAAGGAGAACTACGGCTGCAACGAGGGCGGAGGCCCTGCAGCGAACCACCACGAGTCCGGGGGTGACTCCGACGGCGAGGTGGGAGTCCTCGCGAGGCGCCCGTCCATCTCGGCGCCGATGAAGAGGACCGGCGATGAGCTGCAGCAGTCCCAACAGCAGCAGCCCAGGTGCTCGACGCCCAAGCGTCGCCCGAGCCCCGGCTCTGCCTCGCCCACCCCTATCAAGGCCGAGCTCGACTCCACACCCCCGTCCTCCTCTCCGGAGGAAGGCCCTTGCAAGAAGCTGAGGACCGATGTAGCCGATGCCACCTCCAACACCACCAACTCAGGTGAGGATCACTTTCTTTCTCTTGGACCCTCTTACGGTCTCTCAATAGTAGTCTAA
>OFAS000004-RA-EXON01
ATGGCACTTGAAATTGCCTTATCAGATATCCTACGTGGGGCGGCCTTGGCATTGATTGCAGTCAGTCAGGGAAATGGGTCGGAAGAAATTATAAATCACATTTGGGG
>OFAS000004-RA-EXON02
AGCCCAGCAATTATGTGTGTTCGACGTGCAAGGCGAGACTGCACTCCGCTTGGAGGCTGGTCCAGCACGTTCAGAACAGCCACGGTATCAAGATTTACGTAGAGAGTAG
>OFAS000005-RA-EXON01
ATGCCTCTCGAACCGGCCCGCCATCCGACCTTCGCCCCGACCTCTCTCGCCAGCGGCGGATCTCTTTTCGCGAGGCCCTCTAGTCATAGCGACCACCACTTCAGGATGGAACACCTCGTCTCAGAACAGTTCAGGCACCACCCTTTAGGGTTGGCGGCGGCGGTGGCCGGGGCGGTGCCACCCCCGCCTTTCGGGCCTCCGGCCGCCGAACGGGCACCTCCGACTACGAGGTCCCTACCTCCACCGCCTTCGCTACCGCTCTCTCTCGAGCCACAGATAGATTTCTACTCGCAGAGGCTGCGGCAGCTAGCTGGTACCACCAGTCCAGGCGCCGCCAACGGCAACTCTTCTCCTTCTCCGAGGAAACTGACGCCTCCTTTCACTAGTCCTAATAACAGCATTCCGACGCCCGTCAGTATGGCACCTTCCTTAATGTTTAACAACAATAACAATAGCAGTACCAATAACAATAACATCCCACCCGGTACTGTCAGGTCACCGACTCCCAGGCCTCAGCAGTCTACTCCTCCTGCTGAACAGAAGCCCGCCCCTGCTAAGGATGCCCCCCAAGAGGAAACTGTTGATCTCACGTCCACTCCTAGATCTGCTTCGACGCCTCCCGCCAAACCGG

The other file that has a subset of CDS sequence IDs I'd like to use to retrieve the corresponding exonic sequences looks like such:

OFAS000003-RA
OFAS000004-RA
OFAS000005-RA
OFAS000006-RA
OFAS000007-RA
OFAS000008-RA
OFAS000009-RA
OFAS000010-RA
OFAS000013-RA

As you can see, the CDS seq IDs are very similar to the exon seq IDs (only missing "–EXON##"). Is there some way I can use the CDS seq ID file to retrieve the corresponding exon seqs?

dna data retrieval exon • 822 views
ADD COMMENTlink modified 3.4 years ago by Malcolm.Cook1.1k • written 3.4 years ago by mforthman30
0
gravatar for venu
3.4 years ago by
venu6.3k
Germany
venu6.3k wrote:

You can do something like following

  • Take the sequence IDs from fasta file
  • Match them with CDS ids file, if matched write them to new file
  • Subset the fasta with new ids file

Download faSomeRecords

 grep '^>' Exon_seq.fa | sed -e 's/>//' -e 's/\(.*\)-/\1 /' | grep -Fwf CDS_ids.txt - | awk '{print $1"-"$2}' > Required_IDS.txt

 chmod +x faSomeRecords

./faSomeRecords Exon_seq.fa Required_IDS.txt output.faa
ADD COMMENTlink modified 3.4 years ago • written 3.4 years ago by venu6.3k

Thank you! I ended up using the first set of commands you provided to get my required exon IDs outputted. I then used makeblastdb and blastdbcmd to pull out the corresponding exon sequences.

ADD REPLYlink written 3.4 years ago by mforthman30
0
gravatar for Malcolm.Cook
3.4 years ago by
Malcolm.Cook1.1k
kansas, usa
Malcolm.Cook1.1k wrote:

Is your shell bash or otherwise support <( process substitution )?

Do all your fasta files only have one line of sequence?

If yes and yes, try this one-liner:

> grep -A 1 -f <(perl -ln -e 'print ">$_-EXON"' ids.txt) exons.fa
ADD COMMENTlink written 3.4 years ago by Malcolm.Cook1.1k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1977 users visited in the last hour