Total number of isoforms in fasta file

0

Entering edit mode

20 months ago

pentro • 0

In the protein fasta files I downloaded from the human NCBI genome, I got the longest isoform for each gene from R with the help of the code below.

library(Biostrings)

fasta.s <- readDNAStringSet("homo.faa")

names.fasta <- names(fasta.s)

gene.iso <- sapply(names.fasta,function(j)
cbind(unlist(strsplit(j,'\\.'))[2:3]))

gene.iso.df <- data.frame(t(gene.iso)) colnames(gene.iso.df) <-
c('gene','isoform')

gene.iso.df$width <- width(fasta.s)

gene.iso.df.split <- split(gene.iso.df,gene.iso.df$gene)

best.id <- sapply(gene.iso.df.split,function(x)
row.names(x)[order(x$width)[1]])

fasta.s.best <- fasta.s[best.id]

writeXStringSet(fasta.s, filepath='homo_isoform.faa')

But how can I find out how many isoforms there are in total?

R fasta • 520 views

ADD COMMENT • link updated 20 months ago by Asaf 10k • written 20 months ago by pentro • 0

1

Entering edit mode

You can use sapply as you used to get the longest isoform, only here you want to get the length of each list element so something like:

 n.of.iso <- sapply(gene.iso.df.split, length)

should work

ADD REPLY • link 20 months ago by Asaf 10k

Login before adding your answer.