Identifying BGC co-expression (antiSMASH)
2.2 years ago

Hello community,

I'm new to working with metagenomic data. I'm trying to identify bacterial biosynthetic gene clusters (BGCs), with the aim of determining the co-expression of a certain number of these clusters once they have been identified.

For cluster identification I am using a script that runs antiSMASH, to be followed later by BiG-SCAPE. I have already run antiSMASH on my .gbk files.

I need help/guidance with analyzing the co-expression of these clusters. Is there a platform I can use, or should I write a script to determine their co-expression (and if so, how, and with what approach)?

Additionally, I would like to create a list identifying the type of each Biosynthetic Gene Cluster (BGC).
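For the list of BGC types, a minimal sketch along these lines may be a starting point. It assumes the antiSMASH GenBank output marks each cluster with a `region` feature (or `cluster` in older versions) carrying one `/product` qualifier per predicted type; check this against your own output files. The function names are made up for illustration, and it only uses the Python standard library.

```python
import re
from collections import Counter

# Assumption: antiSMASH 5+ writes "region" features, older versions "cluster".
BGC_FEATURE_KEYS = {"region", "cluster"}
FEATURE_KEY_LINE = re.compile(r"^ {5}(\S+)\s+(\S+)")
PRODUCT_LINE = re.compile(r'^\s+/product="([^"]+)"')


def list_bgc_types(genbank_text):
    """Return (record_id, location, product) tuples for each BGC feature."""
    results = []
    record_id = None
    in_features = False
    in_bgc_feature = False
    location = None
    for line in genbank_text.splitlines():
        if line.startswith("LOCUS"):
            record_id = line.split()[1]
            in_features = in_bgc_feature = False
        elif line.startswith("FEATURES"):
            in_features = True
        elif line.startswith("ORIGIN") or line.startswith("//"):
            in_features = in_bgc_feature = False
        elif in_features:
            key_line = FEATURE_KEY_LINE.match(line)
            if key_line:
                # A new feature starts; remember whether it is a BGC feature.
                key, location = key_line.groups()
                in_bgc_feature = key in BGC_FEATURE_KEYS
            elif in_bgc_feature:
                product = PRODUCT_LINE.match(line)
                if product:
                    results.append((record_id, location, product.group(1)))
    return results


def count_bgc_types(results):
    """Tally how often each BGC type occurs."""
    return Counter(product for _, _, product in results)
```

You would call `list_bgc_types` on the text of each antiSMASH `.gbk` output file and merge the results. A hybrid cluster shows up as several rows with the same location.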

Thank you so much in advance :))))

antismash RNA-Seq metagenome bacterial genome • 795 views
1
2.2 years ago
Joe 19k

You mean co-expression of the gene cluster itself?

Do you have some RNAseq data?

0

I am studying both the genomes of Strepto species and the metagenome of a soil extract. In both cases I have already run antiSMASH to identify biosynthetic gene clusters (I have the output), and yes, I believe I do have RNAseq data available.

1. In one case I have both metagenome and metatranscriptome data, in either .fasta or .faa format.

2. In the other I have just the sequenced genome, i.e. only the genome data itself (.fasta).

Hope this clarifies things.

0

I'm still not 100% on what samples you have, and what you have RNAseq for.

Have I got this right?:

1. You have Streptomyces genome(s), on its own, with associated RNAseq data (in what conditions?)

2. You have both metagenomic, and metatranscriptomic data, for the same sample(s)?

0

Sorry for the lack of clarity on my part.

1. Yes, I have the Streptomyces genome(s) on their own, with associated RNAseq data. I am using the .gbk files as input for antiSMASH (BGC prediction).

2. This is also correct: I have both metagenomic and metatranscriptomic data. [Example Image]

0

Do you have multiple conditions?

I'm not sure your Strep data and metagenomic data can be meaningfully compared very easily, if that's what you're aiming to do...

0

Hello,

To clarify, I made a schema (see attached link below). I want to identify BGCs for the Strepto data and the metagenomic data separately, and then measure whether these clusters are co-expressed, again separately for each dataset.

Thank you so much for your help!

1

Please use the instructions here to post images on biostars: How to add images to a Biostars post

0

Hey Joe, any follow-ups on this? I'm still stuck. I've tried various approaches, but to be honest I need some orientation. Thanks

0

Sorry, you caught me in a very busy couple of weeks!

I think more clarification is needed still (apologies if I’m just being dense here!). Let me spell this out fully and see if I’m on the right track:

1. Strepto

• You have one or more genomes of Streptomyces, assembled, annotated, and have predicted BGCs in these genomes from antiSMASH.

• You have some Strepto only RNAseq data, corresponding to these genomes.

• Q1) How many replicates are there for this?
• Q2) How many, and what, conditions is this RNAseq derived from?

2. Metagenomics

• You have a bunch of sequenced metagenomic data.

• Q3) Do you already have metagenome-assembled genomes (MAGs)?
• Q4) Have you already annotated these MAGs, called BGCs on them, etc.?
• You have transcriptomic data that also corresponds to these MAGs.

• Q5) What conditions does the metagenomic sample correspond to? e.g. direct extraction from soil, or cultured (if so how, etc.)
• Q6) Is your transcriptomic data already mapped on a genome-by-genome basis?

—————

So if I’m understanding right, you want to compare the expression of any given BGC in Strepto with the expression of that same BGC in the metagenomic sample?

This is where I think the issues appear. I’m no expert on metatranscriptomics, but I can’t see a way that studying the expression of those clusters between all these samples is at all meaningful. Just off the top of my head, these are some of the question marks I can see; perhaps others can weigh in:

1. Your sequencing will be radically different between the metagenomic samples and the single strepto samples, not least because there will be more mapping reads in the strepto run as the run capacity of the machine you used isn’t ‘diluted’ across many different genomes. Normalising the data would be a bit of a nightmare I think.

2. How do you classify a single cluster, when the sequence identity of a cluster that is putatively functionally the same will differ between genomes?

3. What conditions are you comparing between the data? Are these comparisons meaningful?
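On point 2, one way to decide whether two clusters from different genomes are "the same" is to compare their protein domain content rather than raw sequence identity. As far as I know, BiG-SCAPE (mentioned in the question) uses a Jaccard index over domain sets as one component of its distance. A minimal sketch of just that component follows; the Pfam accessions and the cluster names are purely illustrative.

```python
def jaccard_index(domains_a, domains_b):
    """Shared domains divided by all domains seen in either cluster."""
    a, b = set(domains_a), set(domains_b)
    if not a and not b:
        return 0.0  # two empty sets: treat as no evidence of similarity
    return len(a & b) / len(a | b)


# Illustrative domain content for two hypothetical clusters.
cluster_x = {"PF00109", "PF02801", "PF00698", "PF00550"}
cluster_y = {"PF00109", "PF02801", "PF00550", "PF00501"}

similarity = jaccard_index(cluster_x, cluster_y)  # 3 shared / 5 total
```

Whatever cutoff you then use to call two clusters the same family is a judgment call, and it would need checking against clusters you already know to be related.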

0

Thank you so much for taking the time. I see you on so many threads, and I really appreciate your help.

The general idea is to obtain BGCs for the Strepto data and the metagenome data separately; I don't actually want to compare the two. The idea is to measure co-expression of the BGCs within the Strepto data and within the metagenome data individually, not between them.

1. Strepto

Q1. There are 3 replicates per genome; I ran antiSMASH on all of these.

Q2. There are a total of 1000 genomes (each a separate .gbk file). These are genomes that we sequenced here at the lab or obtained from the EMBL database, as we want to create a big molecular network to compare all of them. They all originate from the rhizosphere of soybean plants under different treatments (resistant vs susceptible to disease).

Note: We will also create separate networks for each of the two treatments, as we want to see how Strepto behaves in each case.

2. Metagenome

Q3. I am currently running algorithms to create the MAGs (this is still in progress).

Q4. We have not annotated the BGCs on the MAGs; I'm not sure how this would be done.

Q5. This is a direct extraction from the rhizosphere of various plant types as well (we want to compare the microbiomes of different plant types).

Q6. This has not been done. Could you also offer some guidance here?

On a side note, I would like to do something similar to what is described in the thread below: count the BGCs and relate them to species in the case of the microbiome data. I don't know whether this would be possible.

C: Counting Gene Clusters from Antismash Output

For co-expression I have read about different tools such as Clust and WGCNA (an R package); I don't know how useful these would be.

Thank you very much once again for your help. As mentioned, in my research group, which has limited resources, we are all pure biologists and chemists who are new to bioinformatics tools, and you have already helped quite a lot. This is helping to push forward a research project that has been on hold for quite a while.

Best regards

John

0

Ok I think I have some ideas now. To continue to keep things clear:

Strepto

My advice for this would be to do a pretty 'normal' RNAseq analysis. I'm guessing you don't have RNAseq for all 1000 genomes? The first step is to map the RNAseq reads against the genomes they originate from and obtain counts. Someone with more transcriptomic experience than me can feel free to suggest alternatives, but it should simply be a case of setting up a DESeq2 experiment with the various conditions, to get expression profiles and quantitative differences between the genomes.
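To give a feel for what the normalisation step does with the raw counts, here is a rough Python sketch of the median-of-ratios idea that DESeq2 uses to estimate per-sample size factors. It is an illustration only, not a replacement for DESeq2 (which is an R package and does much more); the function names and the toy counts are made up.

```python
import math
from statistics import median


def size_factors(counts):
    """counts: dict of gene -> list of raw counts, one entry per sample."""
    n_samples = len(next(iter(counts.values())))
    log_ratios = [[] for _ in range(n_samples)]
    for gene_counts in counts.values():
        if any(c <= 0 for c in gene_counts):
            continue  # geometric mean is undefined with a zero; skip gene
        log_geo_mean = sum(math.log(c) for c in gene_counts) / n_samples
        for i, c in enumerate(gene_counts):
            log_ratios[i].append(math.log(c) - log_geo_mean)
    if not log_ratios[0]:
        raise ValueError("no gene has non-zero counts in every sample")
    # Size factor = median ratio of a sample to the per-gene geometric mean.
    return [math.exp(median(r)) for r in log_ratios]


def normalise(counts, factors):
    """Divide each sample's counts by that sample's size factor."""
    return {
        gene: [c / f for c, f in zip(values, factors)]
        for gene, values in counts.items()
    }


# Toy data: sample 2 was sequenced twice as deeply as sample 1.
toy_counts = {"geneA": [10, 20], "geneB": [100, 200], "geneC": [5, 10]}
factors = size_factors(toy_counts)
normalised = normalise(toy_counts, factors)
```

After normalisation the two samples give the same value for each gene in this toy case, which is the point: a difference in sequencing depth is removed before you look for a difference in expression.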

Once you have all the reads mapped and count data available, you can extract just the counts that relate to the BGCs, and then feed that expression data into a network analysis to find correlated clusters. This is where my expertise runs out entirely though, so I don't know how easy this is.
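As a very rough sketch of that last step, assuming you already have normalised counts per gene and a table saying which gene belongs to which BGC (both hypothetical inputs here), you could sum expression per BGC, log-transform it, and look for pairs of BGCs whose profiles correlate across samples. Summing per BGC and the 0.9 cutoff are arbitrary choices for illustration. Dedicated tools such as WGCNA are more careful than this, and with only a handful of samples any correlation is unreliable.

```python
import math
from itertools import combinations


def bgc_expression(gene_counts, gene_to_bgc):
    """Sum counts per BGC; genes that are not in any BGC are dropped."""
    totals = {}
    for gene, values in gene_counts.items():
        bgc = gene_to_bgc.get(gene)
        if bgc is None:
            continue
        current = totals.get(bgc, [0.0] * len(values))
        totals[bgc] = [t + v for t, v in zip(current, values)]
    return totals


def pearson(x, y):
    """Pearson correlation, or None if either profile is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    return sxy / math.sqrt(sxx * syy)


def coexpressed_pairs(bgc_totals, threshold=0.9):
    """Return (bgc_a, bgc_b, r) for pairs with |r| at or above threshold."""
    logged = {
        bgc: [math.log2(v + 1) for v in values]
        for bgc, values in bgc_totals.items()
    }
    edges = []
    for a, b in combinations(sorted(logged), 2):
        r = pearson(logged[a], logged[b])
        if r is not None and abs(r) >= threshold:
            edges.append((a, b, r))
    return edges


# Hypothetical normalised counts across four samples.
gene_counts = {
    "g1": [10, 20, 40, 80],
    "g2": [5, 10, 20, 40],
    "g3": [8, 16, 32, 64],
    "g4": [50, 5, 60, 10],
    "g5": [7, 7, 7, 7],  # not part of any BGC
}
gene_to_bgc = {"g1": "bgc_A", "g2": "bgc_A", "g3": "bgc_B", "g4": "bgc_C"}

totals = bgc_expression(gene_counts, gene_to_bgc)
edges = coexpressed_pairs(totals)
```

The resulting edge list is a simple network: BGCs are nodes, and an edge means their expression rises and falls together across the samples.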

Metagenomes

Once you have the MAGs, you should have a collection of incomplete genomes, but it will still be possible to annotate them and call BGCs for everything.

For the metatranscriptomic data, you need to figure out which transcripts originate from which organisms. I suspect the only real way to do this is to map all of the reads to all of the MAGs as you normally would, with quite stringent mapping criteria to avoid multi-mapping etc. This is uncharted territory for me though, I'm afraid. I'm still not entirely sure how useful a metatranscriptomic comparison between unequal populations is going to be. You would end up with lots of infinite values (I'd guess) for expression when one BGC is found in one sample but not in another, because the species that carries it isn't present or something.
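To illustrate the infinite-value problem: a plain log ratio blows up when a BGC has zero expression in one sample. One possible way around it, sketched below, is to treat "the BGC is not in this sample at all" as missing rather than as zero, and to add a pseudocount only when the BGC is present but silent. Where the presence/absence call comes from (for example, coverage in the metagenomic DNA data) is an assumption here, and the function names are made up.

```python
import math


def naive_log2_ratio(a, b):
    """Plain log2(a / b); becomes infinite as soon as one value is zero."""
    if b == 0:
        return math.inf if a > 0 else math.nan
    if a == 0:
        return -math.inf
    return math.log2(a / b)


def log2_fold_change(expr_a, expr_b, present_a=True, present_b=True,
                     pseudocount=1.0):
    """Return log2 fold change, or None if the BGC is absent from a sample."""
    if not (present_a and present_b):
        return None  # absent is not the same as "expressed at zero"
    return math.log2((expr_a + pseudocount) / (expr_b + pseudocount))


# BGC expressed in sample A, carrier organism missing from sample B.
unusable = naive_log2_ratio(120, 0)
missing = log2_fold_change(120, 0, present_b=False)

# BGC present in both samples but silent in sample B.
silent = log2_fold_change(7, 0)
```

Keeping the "absent" cases out of the comparison also avoids reading a difference in community composition as a difference in expression.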

I would strongly advise you try to find a local friendly neighbourhood bioinformatician to help out on this though, as this is not at all a simple endeavour.

1

Thank you very much Joe, you have been of great assistance! This has pointed me in the correct direction!

Cheers and best regards!

John

0

Hey biohacker_tobe, please do not close threads where somebody has already provided useful assistance. The information can be useful to others in the future. Thanks.

0

Sorry for the inconvenience, I have kept the thread open