5 months ago
Jjbox ▴ 40
I want to generate a histogram of the reference transcripts listed here (https://www.ncbi.nlm.nih.gov/projects/genome/guide/human/index.shtml#:~:text=gff3-,RefSeq%20Transcripts,-Fasta).
Can anyone suggest a tool that can generate a histogram of the isoform lengths in this file? Ideally, the x-axis would show the isoform length and the y-axis would show the number of isoforms counted.
Use bioawk and then pass the output of bioawk to a simple
Thank you! I am unfamiliar with bioawk. Do you know what kind of command line I should use to generate the output? What kind of output is it when I use bioawk?
Are you familiar with awk? Bioawk is awk customized to work with common bioinformatics formats. For example, (if memory serves me right) the preset "fastx" uses > as the record separator instead of the usual newline. You can use awk's functions/variables to get what you want once you understand the underlying concepts.
See the manual: https://github.com/lh3/bioawk
Experiment with it - generate a 2 column output with transcript name and transcript length (although you'd only need the second column for the histogram). In R, run ?hist to understand how to plot a histogram - it is trivial, it simply needs a vector of numbers.
Maybe the solution suggested in How to generate sequence length distribution from Fasta file could work? Once you have the lengths, you could plot them in python, or your language of choice.
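If you would rather stay in Python for the length step too, here is a minimal sketch using only the standard library. It assumes a plain FASTA file (optionally gzipped) where each record starts with a > header line; the function names fasta_lengths and write_lengths are made up for illustration and are not part of bioawk or any other tool mentioned in this thread.

```python
import gzip


def fasta_lengths(handle):
    """Yield (name, length) for each record in a FASTA text stream."""
    name, length = None, 0
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                yield name, length
            # Keep only the first whitespace-delimited token of the header.
            parts = line[1:].split()
            name, length = (parts[0] if parts else ""), 0
        elif name is not None:
            # Sequence lines may be wrapped, so add up every line.
            length += len(line)
    if name is not None:
        yield name, length


def write_lengths(fasta_path, out_path):
    """Write a tab-separated 'name<TAB>length' file (hypothetical helper)."""
    opener = gzip.open if fasta_path.endswith(".gz") else open
    with opener(fasta_path, "rt") as fh, open(out_path, "w") as out:
        for name, length in fasta_lengths(fh):
            out.write(f"{name}\t{length}\n")
```

Calling write_lengths with your FASTA path and an output path should give the same kind of two-column table (transcript name, length) described above, which you can then plot in any language.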
Thanks everyone! Now I have a file that looks like this, with the transcript name on the left and its length on the right. It contains 177816 transcripts. What would be a good tool to plot this in R?
Just read the file in R (read.table...) and plot it using hist(), as Ram suggested. Maybe good to try it a bit yourself first, see this. If you get into trouble, just feel free to come back and ask.
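For anyone who prefers Python over R for this last step, the same read-then-plot idea can be sketched as below. This is only an illustration under a few assumptions: the input is a whitespace-separated file with the name in column 1 and an integer length in column 2, there is no header line, and matplotlib is installed for the plotting part. The helper names (read_lengths, bin_counts, plot_histogram) are hypothetical.

```python
from collections import Counter


def read_lengths(path):
    """Read the second whitespace-separated column (length) from each line."""
    lengths = []
    with open(path) as fh:
        for line in fh:
            fields = line.split()
            if len(fields) >= 2:
                lengths.append(int(fields[1]))
    return lengths


def bin_counts(lengths, bin_width=500):
    """Count how many lengths fall in each [start, start + bin_width) bin."""
    counts = Counter((n // bin_width) * bin_width for n in lengths)
    return dict(sorted(counts.items()))


def plot_histogram(lengths, out_png, bins=100):
    """Save a histogram: x-axis is length, y-axis is number of transcripts."""
    # Assumption: matplotlib is available. It is imported here so the
    # functions above still work without it.
    import matplotlib
    matplotlib.use("Agg")  # write to a file without needing a display
    import matplotlib.pyplot as plt

    plt.hist(lengths, bins=bins)
    plt.xlabel("Transcript length (nt)")
    plt.ylabel("Number of transcripts")
    plt.savefig(out_png, dpi=150)
    plt.close()
```

bin_counts gives a quick text check of the distribution before plotting. The bin width of 500 and the 100 bins passed to plt.hist are arbitrary starting points; with a long right tail you may want to change them or plot the lengths on a log scale.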
I got the result as I wanted. I am pretty new to R. Thanks Ram and iraun :)