A complex situation to retreive subset of sequence data
2
0
Entering edit mode
7.8 years ago
mforthman ▴ 50

I have a file of all the exons from a genome in fasta format, for which I only want to retrieve a subset of (based on a another file with sequence IDS). The file contents look like so:

>OFAS000001-RA-EXON01
ATGGGAAACATGAAGAGAGATCTGACCAGCATGGCTTTATCTCCAGACTCGACATACAAGATCCTTGAAGAAGTCAAAGAGGAATCCATTGAATCTAATCTTTCCTCTCTTGATGGAGTCTTAACTCTTAAAGCACCTAAAAAGGCTCTAGGGGATGAAATA
>OFAS000001-RA-EXON02
AGCATTGAATATAAGGTAGAACCAAGGAGCCTGGACAATGTACCTGGTAGTGGACGTAAGTATTTTACAAATTTCAGTCCATTCTTGGGGAGATTCTAA
>OFAS000002-RA-EXON01
ATGATCAAGCCATCTTGCCGT
>OFAS000002-RA-EXON02
TTGCAGAGGGCCTTACTGCTCCTCGTGGCAGTACTATTTTTCCAGGACCCAAATGGCATCGCATCGGCTCCACTCAGCGACGAGTCTAGTACTAAACTCAATGAAATAGACGTGACCCGCGAGGATTCACCTATAGAAAATGGATTAATTGCCAGTTCTGGGACACCCATCATCTCCTCTATAATCAGTAATAATTCGGGACTCCCAGAAGAAGCTGAGCCATCACTTTCAGAACCTTTAAATGAAGAGAGATTAGAAACGAATCAAGAAATTCAAGAACCAATCGGGAAAAACCCTCCACAAGAAAACAACATTGAAGAGCA
>OFAS000003-RA-EXON01
ATGCAGGACACAGATGTGCTCACCTGCGGTGCCTGCCAGAAGGCGTTTGCTCTTGGCGACATCGTCAAGTTCATCCAGCACAAGGTGCTCGCGTGCAACAAGGAGAACTACGGCTGCAACGAGGGCGGAGGCCCTGCAGCGAACCACCACGAGTCCGGGGGTGACTCCGACGGCGAGGTGGGAGTCCTCGCGAGGCGCCCGTCCATCTCGGCGCCGATGAAGAGGACCGGCGATGAGCTGCAGCAGTCCCAACAGCAGCAGCCCAGGTGCTCGACGCCCAAGCGTCGCCCGAGCCCCGGCTCTGCCTCGCCCACCCCTATCAAGGCCGAGCTCGACTCCACACCCCCGTCCTCCTCTCCGGAGGAAGGCCCTTGCAAGAAGCTGAGGACCGATGTAGCCGATGCCACCTCCAACACCACCAACTCAGGTGAGGATCACTTTCTTTCTCTTGGACCCTCTTACGGTCTCTCAATAGTAGTCTAA
>OFAS000004-RA-EXON01
ATGGCACTTGAAATTGCCTTATCAGATATCCTACGTGGGGCGGCCTTGGCATTGATTGCAGTCAGTCAGGGAAATGGGTCGGAAGAAATTATAAATCACATTTGGGG
>OFAS000004-RA-EXON02
AGCCCAGCAATTATGTGTGTTCGACGTGCAAGGCGAGACTGCACTCCGCTTGGAGGCTGGTCCAGCACGTTCAGAACAGCCACGGTATCAAGATTTACGTAGAGAGTAG
>OFAS000005-RA-EXON01
ATGCCTCTCGAACCGGCCCGCCATCCGACCTTCGCCCCGACCTCTCTCGCCAGCGGCGGATCTCTTTTCGCGAGGCCCTCTAGTCATAGCGACCACCACTTCAGGATGGAACACCTCGTCTCAGAACAGTTCAGGCACCACCCTTTAGGGTTGGCGGCGGCGGTGGCCGGGGCGGTGCCACCCCCGCCTTTCGGGCCTCCGGCCGCCGAACGGGCACCTCCGACTACGAGGTCCCTACCTCCACCGCCTTCGCTACCGCTCTCTCTCGAGCCACAGATAGATTTCTACTCGCAGAGGCTGCGGCAGCTAGCTGGTACCACCAGTCCAGGCGCCGCCAACGGCAACTCTTCTCCTTCTCCGAGGAAACTGACGCCTCCTTTCACTAGTCCTAATAACAGCATTCCGACGCCCGTCAGTATGGCACCTTCCTTAATGTTTAACAACAATAACAATAGCAGTACCAATAACAATAACATCCCACCCGGTACTGTCAGGTCACCGACTCCCAGGCCTCAGCAGTCTACTCCTCCTGCTGAACAGAAGCCCGCCCCTGCTAAGGATGCCCCCCAAGAGGAAACTGTTGATCTCACGTCCACTCCTAGATCTGCTTCGACGCCTCCCGCCAAACCGG

The other file that has a subset of CDS sequence IDs I'd like to use to retrieve the corresponding exonic sequences looks like such:

OFAS000003-RA
OFAS000004-RA
OFAS000005-RA
OFAS000006-RA
OFAS000007-RA
OFAS000008-RA
OFAS000009-RA
OFAS000010-RA
OFAS000013-RA

As you can see, the CDS seq IDs are very similar to the exon seq IDs (only missing "–EXON##"). Is there some way I can use the CDS seq ID file to retrieve the corresponding exon seqs?

dna data retrieval exon • 1.5k views
ADD COMMENT
0
Entering edit mode
7.8 years ago
venu 7.1k

You can do something like following

  • Take the sequence IDs from fasta file
  • Match them with CDS ids file, if matched write them to new file
  • Subset the fasta with new ids file

Download faSomeRecords

 grep '^>' Exon_seq.fa | sed -e 's/>//' -e 's/\(.*\)-/\1 /' | grep -Fwf CDS_ids.txt - | awk '{print $1"-"$2}' > Required_IDS.txt

 chmod +x faSomeRecords

./faSomeRecords Exon_seq.fa Required_IDS.txt output.faa
ADD COMMENT
0
Entering edit mode

Thank you! I ended up using the first set of commands you provided to get my required exon IDs outputted. I then used makeblastdb and blastdbcmd to pull out the corresponding exon sequences.

ADD REPLY
0
Entering edit mode
7.8 years ago
Malcolm.Cook ★ 1.5k

Is your shell bash or otherwise support <( process substitution )?

Do all your fasta files only have one line of sequence?

If yes and yes, try this one-liner:

> grep -A 1 -f <(perl -ln -e 'print ">$_-EXON"' ids.txt) exons.fa
ADD COMMENT

Login before adding your answer.

Traffic: 2188 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6