How to draw a coverage vs GC content plot for assembled transcriptome?
2
0
Entering edit mode
6.0 years ago
seta ★ 1.5k

Hi all,

I would like to have a plot of coverage vs GC content for a assembled transcriptome. Could you please introduce me any tool for it?

Thanks

plot transcriptome assembly coverage GC content • 5.2k views
0
Entering edit mode

Does this help Plot Coverage Vs. Gc Content ?

Also, you might want to specify in more detail how you want to plot gc vs. coverage, what do you want to use as bins?

0
Entering edit mode

Thanks for your comment. I also saw this post, but it draw a plot as scatter plot. Actually, I need the plot for quick assessment of contamination in the assembled transcriptome. Here, one of our friends suggested the quast tool for this purpose. I tried both quast and CLC genomic workbench softwares, while the CLC genomic showed a plot with multiple peaks, suggesting the contamination in the assembly, quast generated a plot with a single peak. Also, blastx of the assembly against nr database indicated the contamination. However, the CLC is not free software. So, I am looking for any free tool to this end. Thank you for any suggestion

0
Entering edit mode

Please clarify. Do you want to draw GC content and coverage as separate y values across a specific assembled transcript(s), in which case the position in the transcript is your y value. Or do you want to calculate GC content and coverage values for every single transcript and plot histograms of GC content distribution in your transcripts and Coverage distribution? They are very different things. Something tells me you want the latter, distribution histograms, if you are looking for contamination.

0
Entering edit mode

That's right Adrian. Distribution histogram is what I'm looking for it. If you remember, I posted a topic about contamination in a assembled transcriptome and you kindly suggested to use quast to the quick evaluation of contamination. However, as I mentioned in my previous comment while the CLC software and blastx against nr showed the contamination, quast generated a plot with a single peak. In the quast, I used the command of python quast.py assembly_file.fasta. Could you please let me know if there is anything required?

0
Entering edit mode

2
Entering edit mode
6.0 years ago

Hmmm, I would do the analysis myself as you are trying to. Getting GC content for transcripts is relatively easy. What about coverage? I never used CLC, is coverage part of the fasta headers? Or is there a table stating which transcript has what coverage? If CLC does not provide coverage, I would redo the assembly with something that can give you that information. Also, for transcriptomes, coverage rarely can be used to discriminate between contaminants and organism in question. For GC content histograms, I use read_fasta. You can find it here

It is part of biopieces, an interesting toolkit that I like to use. Once you have read_fasta installed, all you need is your transcripts assembly in fasta format. Assuming your transcript file is called CONTIGS.fasta, you would:

read_fasta -i CONTIGS.fasta | analyze_gc | write_tab -k SEQ_NAME,SEQ_LEN,GC% -x > CONTIGS.stats


This will generate a tab delimited file with stats in it, first row sequence name, second sequence length and third row GC content.

Next I would use R, make sure you have it installed, just type "R", if not, get it. After you start R in the same directory where you have your CONTIG.stats, do:

stats <- read.table("CONTIGS.stats", header = F)
hist(stats[,3])


The hist command should generate your histogram. You can adjust the number of breaks (bars), which will affect your resolution, this should work somewhat:

hist(stats[,3], breaks=25, main="GC content histogram of my transcripts", xlab="GC content (%)", ylab="Transcript (#)")


dev.copy2pdf(file="GC_content_histogram.pdf", useDingbats=FALSE, family="sans")


You can try plotting it as a density too:

plot(density(stats[,3]), main="Density of GC content", xlab="GC content (%)")


Or even plot length vs GC content, to see if shorter transcripts have different GC content:

plot(stats[,3], stats[,2], col="red", ylab="Length of Transcript (bp)", xlab="GC Content (bp)", main="GC Content vs Coverage", type="p")


Let me know how it goes.

0
Entering edit mode

Thank you somuch for your complete response. I try it and back to you.

1
Entering edit mode
6.0 years ago
h.mon 33k

I like blobtools-light, it plots GC vs coverage plots, and colors the plot according to blast taxonomic identification.

edit: in light of your comment to Michael Dondrup, I should reiterate blobtools-light is a very good tools for what you wish to accomplish. A similar tool (Blobology) is described on this paper, at blobology git page the authors recommend blobtools-light. It probably works better to detect contamination for genome assemblies, though.