How to merge exons based on gene ids?
1
0
Entering edit mode
3.7 years ago
akashbala0 ▴ 10

I have multiple exons with header gene_id, I wanted to merge the multiple exons sequence to a single string based on header gene_id. The file I have,

   >"gene_id:XLOC_000656"::KB317696.1:3536054-3536229
   AGACCATTTAAATTCTAACTACCAACTGTCACATATAACTCAGCTTCACATAATGTTTGGATCTCGTCGCTCTCGTGCTCCCGCTGCTGGAATGGGTGGCGGAGGTTATGGAGCTGCCCCCCCTCGACGAAGAGGATTTGGAATGGGAGGCGGGCGTCGCGGCGCTGGCTGTGGG
   >"gene_id:XLOC_000656"::KB317696.1:3536284-3536346
   CGCAACCCATGGGAGGGGGAATGGGTGCTGGAAGAACCGGTGGCATGGGGATGGCTCCTGGT
   >"gene_id:XLOC_000656"::KB317696.1:3536407-3536551
   GGGTTGTTGCTCCTCGCGTTAGGACCCGTGATCGCATCAAGCGACTCTTTGGATTTGGCCGTTCGCGTCGTAGGACGGCTATGTACTAAACCAGCCCGTGGTGCAAGCGTGCTTTGAAAGGAAGAAGTGAATAATGTATCATCC
   >"gene_id:XLOC_001528"::KB317696.1:3526472-3526569
   GCTGATCAGAAGTTTGCAGATGATAAAACACGCGCTGCAAAGCTAGAGCAACGCTCTCGGAAGGTATGTCATCCTCGGTAGGTATCTGGTACACGGC
   >"gene_id:XLOC_001528"::KB317696.1:3526622-3526730
   CTTCGGAAATACCGTGTGCAATAGAGCGATTGTAGTAGTGAATTCATATAACATGTTGCACCTTGGTTCTTCAATCCAACGAAACCCGTCTCCTTCTTTGAGTCATAA
   >"gene_id:XLOC_001528"::KB317696.1:3526786-3526969
   TTACAAAGTTATGCCATAGTACACCAGTTGGATCTTTCAGTACTCGAACGAACACGGTTATGTCAGCCTCGTCGCCTTCAATGATCGGTCTACGTCCTTCAATGCCTGTAAAGAGCTTTCGGAGGTCATGGAATCGTGTGAAACCCCAATCGCATTCCTCGGCAATGAACCGATGGTGCGCGT
   >"gene_id:XLOC_001528"::KB317696.1:3527023-3527170
   ACTACACACAAAGTTCGTAGGATCTTGCGGGTTAGACATGACAAGCGCAAATTGAGCGCATGCGTGCCATCCCTCGGGTGCGTTCTTGGGATTCGCATAGTCAAGATATATGGACACTGTATCGTTTGGCGGCGAATTGGAGTTCCC

expected output

  >"gene_id:XLOC_000656"  
  AGACCATTTAAATTCTAACTACCAACTGTCACATATAACTCAGCTTCACATAATGTTTGGATCTCGTCGCTCTCGTGCTCCCGCTGCTGGAATGGGTGGCGGAGGTTATGGAGCTGCCCCCCCTCGACGAAGAGGATTTGGAATGGGAGGCGGGCGTCGCGGCGCTGGCTGTGGGCGCAACCCATGGGAGGGGGAATGGGTGCTGGAAGAACCGGTGGCATGGGGATGGCTCCTGGTGGGTTGTTGCTCCTCGCGTTAGGACCCGTGATCGCATCAAGCGACTCTTTGGATTTGGCCGTTCGCGTCGTAGGACGGCTATGTACTAAACCAGCCCGTGGTGCAAGCGTGCTTTGAAAGGAAGAAGTGAATAATGTATCATCC
  >"gene_id:XLOC_001528"

GCTGATCAGAAGTTTGCAGATGATAAAACACGCGCTGCAAAGCTAGAGCAACGCTCTCGGAAGGTATGTCATCCTCGGTAGGTATCTGGTACACGGCCTTCGGAAATACCGTGTGCAATAGAGCGATTGTAGTAGTGAATTCATATAACATGTTGCACCTTGGTTCTTCAATCCAACGAAACCCGTCTCCTTCTTTGAGTCATAATTACAAAGTTATGCCATAGTACACCAGTTGGATCTTTCAGTACTCGAACGAACACGGTTATGTCAGCCTCGTCGCCTTCAATGATCGGTCTACGTCCTTCAATGCCTGTAAAGAGCTTTCGGAGGTCATGGAATCGTGTGAAACCCCAATCGCATTCCTCGGCAATGAACCGATGGTGCGCGTACTACACACAAAGTTCGTAGGATCTTGCGGGTTAGACATGACAAGCGCAAATTGAGCGCATGCGTGCCATCCCTCGGGTGCGTTCTTGGGATTCGCATAGTCAAGATATATGGACACTGTATCGTTTGGCGGCGAATTGGAGTTCCC

python biopython • 979 views
ADD COMMENT
1
Entering edit mode

Please use the formatting bar (especially the code option) to present your post better.
code_formatting

Thank you!

ADD REPLY
3
Entering edit mode
3.7 years ago

Since this is still unanswered, I'll post an R solution until someone is kind enough to answer with the requested biopython solution.

Read the fasta file in.

library("Biostrings")
library("tidyverse")

fasta <- readDNAStringSet("test.fasta")

You'll end up with a Biostrings object of the fasta file.

> fasta
DNAStringSet object of length 7:
    width seq                                               names               
[1]   175 AGACCATTTAAATTCTAACTACC...GGCGTCGCGGCGCTGGCTGTGGG "gene_id:XLOC_000...
[2]    62 CGCAACCCATGGGAGGGGGAATG...GTGGCATGGGGATGGCTCCTGGT "gene_id:XLOC_000...
[3]   144 GGGTTGTTGCTCCTCGCGTTAGG...AAGAAGTGAATAATGTATCATCC "gene_id:XLOC_000...
[4]    97 GCTGATCAGAAGTTTGCAGATGA...TCGGTAGGTATCTGGTACACGGC "gene_id:XLOC_001...
[5]   108 CTTCGGAAATACCGTGTGCAATA...CCGTCTCCTTCTTTGAGTCATAA "gene_id:XLOC_001...
[6]   183 TTACAAAGTTATGCCATAGTACA...GGCAATGAACCGATGGTGCGCGT "gene_id:XLOC_001...
[7]   147 ACTACACACAAAGTTCGTAGGAT...TTTGGCGGCGAATTGGAGTTCCC "gene_id:XLOC_001...

You can then concatenate the sequences with the same gene id.

new_fasta <- fasta %>%
  {data.frame(seq=as.character(.), name=names(.))} %>%
  mutate(name=str_extract(name, "(?<=\")gene_id:[[[:alnum:]]_]+")) %>%
  group_by(name) %>%
  summarize(seq=str_c(seq, collapse="")) %>%
  {set_names(.$seq, .$name)} %>%
  DNAStringSet

And now you should have a Biostrings object of the new sequences, where the sequences for each gene_id were concatenated based on the order they appeared in the fasta file.

> new_fasta
DNAStringSet object of length 2:
    width seq                                               names               
[1]   381 AGACCATTTAAATTCTAACTACC...AAGAAGTGAATAATGTATCATCC gene_id:XLOC_000656
[2]   535 GCTGATCAGAAGTTTGCAGATGA...TTTGGCGGCGAATTGGAGTTCCC gene_id:XLOC_001528

You can then save this back to a fasta.

writeXStringSet(new_fasta, "new.fasta")
ADD COMMENT

Login before adding your answer.

Traffic: 1606 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6