How to draw a coverage vs GC content plot for assembled transcriptome?
2
0
Entering edit mode
8.4 years ago
seta ★ 1.9k

Hi all,

I would like to have a plot of coverage vs GC content for a assembled transcriptome. Could you please introduce me any tool for it?

Thanks

plot transcriptome assembly coverage GC content • 6.7k views
ADD COMMENT
0
Entering edit mode

Does this help Plot Coverage Vs. Gc Content ?

Also, you might want to specify in more detail how you want to plot gc vs. coverage, what do you want to use as bins?

ADD REPLY
0
Entering edit mode

Thanks for your comment. I also saw this post, but it draw a plot as scatter plot. Actually, I need the plot for quick assessment of contamination in the assembled transcriptome. Here, one of our friends suggested the quast tool for this purpose. I tried both quast and CLC genomic workbench softwares, while the CLC genomic showed a plot with multiple peaks, suggesting the contamination in the assembly, quast generated a plot with a single peak. Also, blastx of the assembly against nr database indicated the contamination. However, the CLC is not free software. So, I am looking for any free tool to this end. Thank you for any suggestion

ADD REPLY
0
Entering edit mode

Please clarify. Do you want to draw GC content and coverage as separate y values across a specific assembled transcript(s), in which case the position in the transcript is your y value. Or do you want to calculate GC content and coverage values for every single transcript and plot histograms of GC content distribution in your transcripts and Coverage distribution? They are very different things. Something tells me you want the latter, distribution histograms, if you are looking for contamination.

ADD REPLY
0
Entering edit mode

That's right Adrian. Distribution histogram is what I'm looking for it. If you remember, I posted a topic about contamination in a assembled transcriptome and you kindly suggested to use quast to the quick evaluation of contamination. However, as I mentioned in my previous comment while the CLC software and blastx against nr showed the contamination, quast generated a plot with a single peak. In the quast, I used the command of python quast.py assembly_file.fasta. Could you please let me know if there is anything required?

ADD REPLY
0
Entering edit mode

Answered fully below, good luck!

ADD REPLY
2
Entering edit mode
8.4 years ago
Adrian Pelin ★ 2.6k

Hmmm, I would do the analysis myself as you are trying to. Getting GC content for transcripts is relatively easy. What about coverage? I never used CLC, is coverage part of the fasta headers? Or is there a table stating which transcript has what coverage? If CLC does not provide coverage, I would redo the assembly with something that can give you that information. Also, for transcriptomes, coverage rarely can be used to discriminate between contaminants and organism in question. For GC content histograms, I use read_fasta. You can find it here

It is part of biopieces, an interesting toolkit that I like to use. Once you have read_fasta installed, all you need is your transcripts assembly in fasta format. Assuming your transcript file is called CONTIGS.fasta, you would:

read_fasta -i CONTIGS.fasta | analyze_gc | write_tab -k SEQ_NAME,SEQ_LEN,GC% -x > CONTIGS.stats

This will generate a tab delimited file with stats in it, first row sequence name, second sequence length and third row GC content.

Next I would use R, make sure you have it installed, just type "R", if not, get it. After you start R in the same directory where you have your CONTIG.stats, do:

stats <- read.table("CONTIGS.stats", header = F)
hist(stats[,3])

The hist command should generate your histogram. You can adjust the number of breaks (bars), which will affect your resolution, this should work somewhat:

hist(stats[,3], breaks=25, main="GC content histogram of my transcripts", xlab="GC content (%)", ylab="Transcript (#)")

When you are ready to save your graph, type:

dev.copy2pdf(file="GC_content_histogram.pdf", useDingbats=FALSE, family="sans")

You can try plotting it as a density too:

plot(density(stats[,3]), main="Density of GC content", xlab="GC content (%)")

Or even plot length vs GC content, to see if shorter transcripts have different GC content:

plot(stats[,3], stats[,2], col="red", ylab="Length of Transcript (bp)", xlab="GC Content (bp)", main="GC Content vs Coverage", type="p")

Let me know how it goes.

ADD COMMENT
0
Entering edit mode

Thank you somuch for your complete response. I try it and back to you.

ADD REPLY
1
Entering edit mode
8.4 years ago
h.mon 35k

I like blobtools-light, it plots GC vs coverage plots, and colors the plot according to blast taxonomic identification.

edit: in light of your comment to Michael Dondrup, I should reiterate blobtools-light is a very good tools for what you wish to accomplish. A similar tool (Blobology) is described on this paper, at blobology git page the authors recommend blobtools-light. It probably works better to detect contamination for genome assemblies, though.

ADD COMMENT

Login before adding your answer.

Traffic: 1503 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6