Question: concat fasta sequences based on some common header string
0
gravatar for mforthman
28 days ago by
mforthman30
mforthman30 wrote:

I have a single fasta file with multiple exons from various genes. What I would like to do is use the GeneID in each header to find those exon sequences that have the same GeneID and concatenate them into a single sequence. Along with this, I would like to alter the sequence header so that it includes just the GeneID and one of product descriptors (doesn't matter which one where multiple are given). I figure, for the headers some way of delimiting based on ';' and using specific strings to produce the new headers. I also suspect this would have to be a two step approach, probably using python or perl instead of linux.

For example, input fasta file:

>ID=exon-XM_014420980.1-1;GeneID:106680938;product=zinc finger protein 391
TTGAAACAGTAGTCGTCTTAGAAGCGCTTCGGCACCACGTCTCCAACTCTCCCACGTCTAGGACTCgtgttattgttttgtttattttttttatttataaacatcgttcattttttttttttttgtttacttcgaTTTTAACCGTGAACTCGGCATTTAAGTCTCCGACAGCTGTGCGTTGAAGTCTCCAACCAACTGCGGA$
>ID=exon-XM_014420980.1-2;GeneID:106680938;product=zinc finger protein 391
atcaGGCATAGGTTCGGTTGTTATGACGAAATCACAGAAGTCGACAGGCGGCGAGTACGGCACTGCAATTGGGTACGTTTCCTCAGGGTGGTACCCGCTTACACGGATCAAGTGAACCTCattggaacaaaaataaaag
>ID=exon-XM_014420980.1-3;GeneID:106680938;product=zinc finger protein 391
GCGAACCTATTTACGAAGCAGTCAAAAACATCCCAGCGAATTCAGAACTGGTAGTTTACTATCTTCCAGAGAGGCCAGAAGAAGTATTTTTCATGCCGGCAGTCCACTACCTGAGGAACTCACTTTACAGGAGGACAATGGACACGATTCTggaag
>ID=exon-XM_014420980.1-4;GeneID:106680938;product=zinc finger protein 391
ACTCCCCCTTAGACCTCTCGATGTCGCTACTATCCCGCTCGCTCATCACACCTCCTGGTTCGTCGTCGCCCGTGGACACAGACAAATCCTCAGACTCCGAAGGGCAGCGGATGACACAGGCCAGGTCACCCAAGAGGTCTGGGCGGCCGGAGAGGAGCTTGCTGCCCTGCGAGGTGTGCGGCAAGGCATTCGACAGGCCGTC$
>ID=exon-XM_014420980.1-5;GeneID:106680938;product=zinc finger protein 391
gCGAAAAACCCCACGTGTGTATGGTGTGTAACAAAGGCTTCAGCACTAGCAGCTCTCTCAACACCCACAGGCGAATCCACTCCGGCGAAAAGCCCCACCAATGTCAAGTCTGCGGGAAACGGTTCACCGCCTCCTCCAACCTCTACTACCACAGGATGACGCATATCAAG
>ID=exon-XM_014420980.1-6;GeneID:106680938;product=zinc finger protein 391
GAGAAGCCTCACAAATGCAACCTGTGCTCCAAGTCGTTCCCCACCCCCGGAGATTTGAAATCGCACATGTACGTCCACAACGGTTCGTGGCCGTTCAGATGCAACATCTGCTCGCGAGGGTTCAGCAAGCACACCAACCTCAAGAACCACCTCTTCCTTCACACCG
>ID=exon-XM_014420980.1-7;GeneID:106680938;product=zinc finger protein 391
GTGACAAGCCACACGCTTGCGAACTCTGTCACAAGAAATTCGCCCTAGCTTGCAATCTTCGTGCTCACATGAAAACTCATGAAG
>ID=exon-XM_014420980.1-8;GeneID:106680938;product=zinc finger protein 391
GAGAGAATCAGGAAGAATGCTCCAGGTGCTCCAAGAACTGCGGAGGTACCTGCGGAATGCTCAGCGAAGAATCGGCGCCGTCAGTGACAGAAGAGGATCACGAGTAGATTATTTGGAGGAGTTGCCTTTCAGATCTCATACCCCCTCTCAACATCCAAATAgacattattataaacatatagctttatatatcatatataca$
>ID=exon-XM_024358664.1-7;GeneID:106680947;product=uncharacterized LOC106680947%2C transcript variant X3,ID=exon-XM_024358308.1-6;GeneID:106680947;product=uncharacterized LOC106680947%2C transcript vari$
ttctatttttttttttttttttttattaaaaacataaatacaaattcttCATCTACAATTTACAATCcaagatttcattttttgtcgAGCTTTATTCATAGActcacaaaaaatgtttttatctttagtggtttttatttcattgatacaTTTCTTAAGgtaatttctcaaatatttccACTCAGGATCACTTAAATCACTT$
>ID=exon-XM_024358664.1-6;GeneID:106680947;product=uncharacterized LOC106680947%2C transcript variant X3,ID=exon-XM_024358308.1-5;GeneID:106680947;product=uncharacterized LOC106680947%2C transcript vari$
CTCCAAAATCTTCATTGAAAAGAAGACCTTCGGCTACTGGGATGTCCATTGATTTCTGGGCTGAAtcttcactc
>ID=exon-XM_024358664.1-5;GeneID:106680947;product=uncharacterized LOC106680947%2C transcript variant X3,ID=exon-XM_024358308.1-4;GeneID:106680947;product=uncharacterized LOC106680947%2C transcript vari$
CTTTTCTGTTCACGTAAGAATTCTGGATTTCCTAGCCATGATCGCAAGTCACTGGGTAGTTCGTGACCCGATGGAAGCCACTGATCTTTTGATTCCAAACAAGTTAAGTAAGGTATTAATTGTGAATACCATTCACAAACCTGAAGCAGCAAATATCGCGAAAGTGTC
>ID=exon-XM_024358664.1-4;GeneID:106680947;product=uncharacterized LOC106680947%2C transcript variant X3,ID=exon-XM_024358308.1-3;GeneID:106680947;product=uncharacterized LOC106680947%2C transcript vari$
CATACCCTGCTGAGGCATGCCAAGAACAATAGAGCCAAATGCCATCTATTTCCTGTAGTTATAtgattataaagataataagcTGTTTCAGTACATGTTCTTTCTAATCGATGGAACAAGCCAGCAAACCTTTGCAAGGTAATTAGGCACCACTCTAACATTTCTTTTGTAGGACCTTCACCATTATATGTGTAATTTCGGT$
>ID=exon-XM_024358664.1-3;GeneID:106680947;product=uncharacterized LOC106680947%2C transcript variant X3,ID=exon-XM_024358308.1-2;GeneID:106680947;product=uncharacterized LOC106680947%2C transcript vari$
ctttTGAAAACCTTTATATCCTTTGTCATTCCTTAAACTTTTTCTGGTACGAAATAACAAACGCGATATAAGAGCGGCATCAAAGTGGAGTAAATCTTGATCCTCTAAAACTGatgaaattttatccaaaatgtTTGCGAACTTTGCACTTGTTTTCC
>ID=exon-XM_024358505.1-1;GeneID:106680947;product=uncharacterized LOC106680947%2C transcript variant X2
CTGGATTTTTGGTTTTGATAGCCAAAGACATCAGAGATGGAGGTTccaaattccttttattcaaaaaaggtacaaaaatatcttcaaaagaCATAGTTTGAGGGAGGATTCAACCTCATACACTTATTGAGTTTAATACTAATAACCATAGGGAAGGTAGATGTTGAAAGAAACGTGACAGAAAGGttacaaataagttaaa$
>ID=exon-XM_024358664.1-2;GeneID:106680947;product=uncharacterized LOC106680947%2C transcript variant X3,ID=exon-XM_024358308.1-1;GeneID:106680947;product=uncharacterized LOC106680947%2C transcript vari$
CTGTATTATGCTCCGACAAAGCAGTTGCAGCTATTAACATCGTGTAAACTAAGACATTTGCTGTCGTACTATCAGGACTTACTTGGTGGTATACATGCTGTCTGAATGAATACATTATCTGGTTAATATGGTTACTACAATCACCTGCTATACCTctctatttatgaaataaaagctTATTCTTTCTCAATTTTTGCATTAC$

Desired outcome in a new fasta file:

>GeneID106680938_zinc_finger_protein_391
TTGAAACAGTAGTCGTCTTAGAAGCGCTTCGGCACCACGTCTCCAACTCTCCCACGTCTAGGACTCgtgttattgttttgtttattttttttatttataaacatcgttcattttttttttttttgtttacttcgaTTTTAACCGTGAACTCGGCATTTAAGTCTCCGACAGCTGTGCGTTGAAGTCTCCAACCAACTGCGGA$atcaGGCATAGGTTCGGTTGTTATGACGAAATCACAGAAGTCGACAGGCGGCGAGTACGGCACTGCAATTGGGTACGTTTCCTCAGGGTGGTACCCGCTTACACGGATCAAGTGAACCTCattggaacaaaaataaaagGCGAACCTATTTACGAAGCAGTCAAAAACATCCCAGCGAATTCAGAACTGGTAGTTTACTATCTTCCAGAGAGGCCAGAAGAAGTATTTTTCATGCCGGCAGTCCACTACCTGAGGAACTCACTTTACAGGAGGACAATGGACACGATTCTggaagACTCCCCCTTAGACCTCTCGATGTCGCTACTATCCCGCTCGCTCATCACACCTCCTGGTTCGTCGTCGCCCGTGGACACAGACAAATCCTCAGACTCCGAAGGGCAGCGGATGACACAGGCCAGGTCACCCAAGAGGTCTGGGCGGCCGGAGAGGAGCTTGCTGCCCTGCGAGGTGTGCGGCAAGGCATTCGACAGGCCGTC$gCGAAAAACCCCACGTGTGTATGGTGTGTAACAAAGGCTTCAGCACTAGCAGCTCTCTCAACACCCACAGGCGAATCCACTCCGGCGAAAAGCCCCACCAATGTCAAGTCTGCGGGAAACGGTTCACCGCCTCCTCCAACCTCTACTACCACAGGATGACGCATATCAAGGAGAAGCCTCACAAATGCAACCTGTGCTCCAAGTCGTTCCCCACCCCCGGAGATTTGAAATCGCACATGTACGTCCACAACGGTTCGTGGCCGTTCAGATGCAACATCTGCTCGCGAGGGTTCAGCAAGCACACCAACCTCAAGAACCACCTCTTCCTTCACACCGGTGACAAGCCACACGCTTGCGAACTCTGTCACAAGAAATTCGCCCTAGCTTGCAATCTTCGTGCTCACATGAAAACTCATGAAGGAGAGAATCAGGAAGAATGCTCCAGGTGCTCCAAGAACTGCGGAGGTACCTGCGGAATGCTCAGCGAAGAATCGGCGCCGTCAGTGACAGAAGAGGATCACGAGTAGATTATTTGGAGGAGTTGCCTTTCAGATCTCATACCCCCTCTCAACATCCAAATAgacattattataaacatatagctttatatatcatatataca$
>GeneID106680947_uncharacterized_LOC106680947%2C_transcript_variant_X3
ttctatttttttttttttttttttattaaaaacataaatacaaattcttCATCTACAATTTACAATCcaagatttcattttttgtcgAGCTTTATTCATAGActcacaaaaaatgtttttatctttagtggtttttatttcattgatacaTTTCTTAAGgtaatttctcaaatatttccACTCAGGATCACTTAAATCACTT$CTCCAAAATCTTCATTGAAAAGAAGACCTTCGGCTACTGGGATGTCCATTGATTTCTGGGCTGAAtcttcactcCTTTTCTGTTCACGTAAGAATTCTGGATTTCCTAGCCATGATCGCAAGTCACTGGGTAGTTCGTGACCCGATGGAAGCCACTGATCTTTTGATTCCAAACAAGTTAAGTAAGGTATTAATTGTGAATACCATTCACAAACCTGAAGCAGCAAATATCGCGAAAGTGTCCATACCCTGCTGAGGCATGCCAAGAACAATAGAGCCAAATGCCATCTATTTCCTGTAGTTATAtgattataaagataataagcTGTTTCAGTACATGTTCTTTCTAATCGATGGAACAAGCCAGCAAACCTTTGCAAGGTAATTAGGCACCACTCTAACATTTCTTTTGTAGGACCTTCACCATTATATGTGTAATTTCGGT$ctttTGAAAACCTTTATATCCTTTGTCATTCCTTAAACTTTTTCTGGTACGAAATAACAAACGCGATATAAGAGCGGCATCAAAGTGGAGTAAATCTTGATCCTCTAAAACTGatgaaattttatccaaaatgtTTGCGAACTTTGCACTTGTTTTCCCTGGATTTTTGGTTTTGATAGCCAAAGACATCAGAGATGGAGGTTccaaattccttttattcaaaaaaggtacaaaaatatcttcaaaagaCATAGTTTGAGGGAGGATTCAACCTCATACACTTATTGAGTTTAATACTAATAACCATAGGGAAGGTAGATGTTGAAAGAAACGTGACAGAAAGGttacaaataagttaaa$CTGTATTATGCTCCGACAAAGCAGTTGCAGCTATTAACATCGTGTAAACTAAGACATTTGCTGTCGTACTATCAGGACTTACTTGGTGGTATACATGCTGTCTGAATGAATACATTATCTGGTTAATATGGTTACTACAATCACCTGCTATACCTctctatttatgaaataaaagctTATTCTTTCTCAATTTTTGCATTAC$
dna sequence fasta • 126 views
ADD COMMENTlink modified 28 days ago by Pierre Lindenbaum124k • written 28 days ago by mforthman30

a good start is going through the biopython tutorial http://biopython.org/DIST/docs/tutorial/Tutorial.html it explains how to read in a fasta file . Although you would maybe also need to understand the basics of python -> split the header , look up the relevant parts in a dictionary etc ..

ADD REPLYlink written 28 days ago by Ido Tamir5.0k

Yes, python, perl or any other program languages could solve this problem.

ADD REPLYlink written 28 days ago by shoujun.gu250
3
gravatar for Pierre Lindenbaum
28 days ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum124k wrote:
cat jeter.fa  | paste - - | cut -f 1 | tr ";" "\n" | grep -F 'GeneID:' | cut -d ':' -f 2  | sort | uniq | while read G; do cat jeter.fa | paste  - -  | grep -F "GeneID:${G};" | tr "\t" "\n" | awk -v G=$G -F '[,;]' 'NR==1 {printf(">%s %s\n",G,$3);} (NR%2==0) {print}' ; done

>106680938 product=zinc finger protein 391
TTGAAACAGTAGTCGTCTTAGAAGCGCTTCGGCACCACGTCTCCAACTCTCCCACGTCTAGGACTCgtgttattgttttgtttattttttttatttataaacatcgttcattttttttttttttgtttacttcgaTTTTAACCGTGAACTCGGCATTTAAGTCTCCGACAGCTGTGCGTTGAAGTCTCCAACCAACTGCGGA$
atcaGGCATAGGTTCGGTTGTTATGACGAAATCACAGAAGTCGACAGGCGGCGAGTACGGCACTGCAATTGGGTACGTTTCCTCAGGGTGGTACCCGCTTACACGGATCAAGTGAACCTCattggaacaaaaataaaag
GCGAACCTATTTACGAAGCAGTCAAAAACATCCCAGCGAATTCAGAACTGGTAGTTTACTATCTTCCAGAGAGGCCAGAAGAAGTATTTTTCATGCCGGCAGTCCACTACCTGAGGAACTCACTTTACAGGAGGACAATGGACACGATTCTggaag
ACTCCCCCTTAGACCTCTCGATGTCGCTACTATCCCGCTCGCTCATCACACCTCCTGGTTCGTCGTCGCCCGTGGACACAGACAAATCCTCAGACTCCGAAGGGCAGCGGATGACACAGGCCAGGTCACCCAAGAGGTCTGGGCGGCCGGAGAGGAGCTTGCTGCCCTGCGAGGTGTGCGGCAAGGCATTCGACAGGCCGTC$
gCGAAAAACCCCACGTGTGTATGGTGTGTAACAAAGGCTTCAGCACTAGCAGCTCTCTCAACACCCACAGGCGAATCCACTCCGGCGAAAAGCCCCACCAATGTCAAGTCTGCGGGAAACGGTTCACCGCCTCCTCCAACCTCTACTACCACAGGATGACGCATATCAAG
GAGAAGCCTCACAAATGCAACCTGTGCTCCAAGTCGTTCCCCACCCCCGGAGATTTGAAATCGCACATGTACGTCCACAACGGTTCGTGGCCGTTCAGATGCAACATCTGCTCGCGAGGGTTCAGCAAGCACACCAACCTCAAGAACCACCTCTTCCTTCACACCG
GTGACAAGCCACACGCTTGCGAACTCTGTCACAAGAAATTCGCCCTAGCTTGCAATCTTCGTGCTCACATGAAAACTCATGAAG
GAGAGAATCAGGAAGAATGCTCCAGGTGCTCCAAGAACTGCGGAGGTACCTGCGGAATGCTCAGCGAAGAATCGGCGCCGTCAGTGACAGAAGAGGATCACGAGTAGATTATTTGGAGGAGTTGCCTTTCAGATCTCATACCCCCTCTCAACATCCAAATAgacattattataaacatatagctttatatatcatatataca$
>106680947 product=uncharacterized LOC106680947%2C transcript variant X3
ttctatttttttttttttttttttattaaaaacataaatacaaattcttCATCTACAATTTACAATCcaagatttcattttttgtcgAGCTTTATTCATAGActcacaaaaaatgtttttatctttagtggtttttatttcattgatacaTTTCTTAAGgtaatttctcaaatatttccACTCAGGATCACTTAAATCACTT$
CTCCAAAATCTTCATTGAAAAGAAGACCTTCGGCTACTGGGATGTCCATTGATTTCTGGGCTGAAtcttcactc
CTTTTCTGTTCACGTAAGAATTCTGGATTTCCTAGCCATGATCGCAAGTCACTGGGTAGTTCGTGACCCGATGGAAGCCACTGATCTTTTGATTCCAAACAAGTTAAGTAAGGTATTAATTGTGAATACCATTCACAAACCTGAAGCAGCAAATATCGCGAAAGTGTC
CATACCCTGCTGAGGCATGCCAAGAACAATAGAGCCAAATGCCATCTATTTCCTGTAGTTATAtgattataaagataataagcTGTTTCAGTACATGTTCTTTCTAATCGATGGAACAAGCCAGCAAACCTTTGCAAGGTAATTAGGCACCACTCTAACATTTCTTTTGTAGGACCTTCACCATTATATGTGTAATTTCGGT$
ctttTGAAAACCTTTATATCCTTTGTCATTCCTTAAACTTTTTCTGGTACGAAATAACAAACGCGATATAAGAGCGGCATCAAAGTGGAGTAAATCTTGATCCTCTAAAACTGatgaaattttatccaaaatgtTTGCGAACTTTGCACTTGTTTTCC
CTGGATTTTTGGTTTTGATAGCCAAAGACATCAGAGATGGAGGTTccaaattccttttattcaaaaaaggtacaaaaatatcttcaaaagaCATAGTTTGAGGGAGGATTCAACCTCATACACTTATTGAGTTTAATACTAATAACCATAGGGAAGGTAGATGTTGAAAGAAACGTGACAGAAAGGttacaaataagttaaa$
CTGTATTATGCTCCGACAAAGCAGTTGCAGCTATTAACATCGTGTAAACTAAGACATTTGCTGTCGTACTATCAGGACTTACTTGGTGGTATACATGCTGTCTGAATGAATACATTATCTGGTTAATATGGTTACTACAATCACCTGCTATACCTctctatttatgaaataaaagctTATTCTTTCTCAATTTTTGCATTAC
ADD COMMENTlink written 28 days ago by Pierre Lindenbaum124k

Thanks Pierre for the recommended commands. I tried to use it, adding a command to direct output to a new file, but it resulted in an empty file. Without that included, it prints sequences it concatenates to the screen (it does not print sequences that do not get concatenated with anything else). Perhaps I'm not providing this command in the correct place?

cat jeter.fa  | paste - - | cut -f 1 | tr ";" "\n" | grep -F 'GeneID:' | cut -d ':' -f 2  | sort | uniq | while read G; do cat jeter.fa | paste  - -  | grep -F "GeneID:${G};" | tr "\t" "\n" | awk -v G=$G -F '[,;]' 'NR==1 {printf(">%s %s\n",G,$3);} (NR%2==0) {print}'  > jeter_merged.fa; done
ADD REPLYlink written 28 days ago by mforthman30

actually, I need to make it 'append' (>>) to the output file or else it just overwrites the output file with the last set of exons it concatenated.

ADD REPLYlink written 28 days ago by mforthman30
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 946 users visited in the last hour