Question

Tools to check the length of isoforms in reference transcript

0

2.9 years ago

shinyjj ▴ 60

Hi biostars,

I want to generate a histogram of the lengths of the reference transcripts available here (https://www.ncbi.nlm.nih.gov/projects/genome/guide/human/index.shtml#:~:text=gff3-,RefSeq%20Transcripts,-Fasta).

Can anyone suggest a tool that can generate a histogram of the isoform lengths in this file? Ideally, the x-axis would be the isoform length and the y-axis would be the number of isoforms of that length.

transcript histogram isoform • 2.3k views

ADD COMMENT • link 2.9 years ago by shinyjj ▴ 60

2

Use bioawk, then pass its output to a simple hist() in R.

ADD REPLY • link 2.9 years ago by Ram 45k

0

Thank you! I am unfamiliar with bioawk. What command should I use to generate the output, and what does bioawk's output look like?

ADD REPLY • link 2.9 years ago by shinyjj ▴ 60

1

Are you familiar with awk? Bioawk is awk customized to work with common bioinformatics formats. For example, (if memory serves me right) the preset "fastx" uses @ and > as record separators instead of the usual newline. Once you understand the underlying concepts, you can use awk's functions and variables to get what you want.

See the manual: https://github.com/lh3/bioawk

Experiment with it: generate a two-column output with transcript name and transcript length (although you'd only need the second column for the histogram). In R, run ?hist to learn how to plot a histogram; it is trivial and simply needs a vector of numbers.
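To make the two-column idea concrete, here is a minimal sketch in plain Python rather than bioawk, so it is a stand-in for the suggested tool and not its actual behaviour. The function name `fasta_lengths` and the tiny in-memory example are hypothetical; it assumes an uncompressed FASTA file where each record starts with a `>` header line.

```python
import io

def fasta_lengths(handle):
    """Yield (name, length) for each record in a FASTA stream."""
    name, length = None, 0
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                yield name, length
            # Use the first whitespace-delimited token of the header as the name.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            length = 0
        elif name is not None:
            # Sequence lines may be wrapped, so lengths are summed per record.
            length += len(line.strip())
    if name is not None:
        yield name, length

# Hypothetical two-record example standing in for the real transcript file.
example = io.StringIO(">tx1 some description\nACGT\nAC\n>tx2\nGGG\n")
for name, length in fasta_lengths(example):
    print(f"{name}\t{length}")
```

For a real file, the same function could be given an open file handle instead of the `io.StringIO` example, with the printed output redirected to a text file.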

ADD REPLY • link 2.9 years ago by Ram 45k

1

Maybe the solution suggested in "How to generate sequence length distribution from Fasta file" could work? Once you have the lengths, you could plot them in R, Python, or your language of choice.

ADD REPLY • link 2.9 years ago by iraun 6.2k

0

Thanks, everyone! I now have a file with the transcript name on the left and its length on the right. It contains 177816 transcripts. What would be a good way to plot this in R?

ADD REPLY • link 2.9 years ago by shinyjj ▴ 60

1

Just read the file into R (read.table...) and plot it using hist(), as Ram suggested. It may be good to try it yourself first; see this. If you run into trouble, feel free to come back and ask.
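For readers who prefer Python to R, the same read-then-bin step might look roughly like the sketch below. It uses only the standard library and prints a text histogram instead of calling R's hist(). The helper names `read_lengths` and `bin_lengths`, the 500-base bin width, and the four-row example table are all assumptions for illustration; the input is assumed to be a whitespace-separated table with no header row and the length in the last column.

```python
from collections import Counter
import io

def read_lengths(handle):
    """Return the lengths from a (name, length) whitespace-separated table."""
    lengths = []
    for line in handle:
        fields = line.split()
        if len(fields) >= 2:
            # The length is taken from the last column.
            lengths.append(int(fields[-1]))
    return lengths

def bin_lengths(lengths, bin_width=500):
    """Count lengths per bin; each key is the lower edge of its bin."""
    return Counter((n // bin_width) * bin_width for n in lengths)

# Hypothetical table standing in for the real 177816-row file.
table = io.StringIO("tx1\t1200\ntx2\t350\ntx3\t1450\ntx4\t80\n")
width = 500
counts = bin_lengths(read_lengths(table), bin_width=width)

# Print one row per non-empty bin: the range, then one '#' per transcript.
for start in sorted(counts):
    print(f"{start:>6}-{start + width - 1:<6} {'#' * counts[start]}")
```

With a plotting library available, the list returned by `read_lengths` could instead be passed directly to a histogram function, which is the same vector-of-numbers input that hist() expects in R.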