Question: Libraries or packages to identify technical and biological replicates
4.5 years ago by
curious50 wrote:


I'm trying to determine if a replicate is a technical or a biological one. I have ~3000 SRR (runs) files from the Roadmap project. The file structure of the directories I downloaded is in the following format:


where SRX006235 is an experiment's accession ID and SRR018454 is a run's accession ID. An experiment can have more than one run, and my assumption is that these runs could be either technical or biological replicates. If I were to consolidate multiple replicates (for a single mark, H3K27me3 in a cell line for instance) to analyze the reads later on, I need to categorize the runs as technical or biological. Since I have about 3000 of these, I would like to automate the process.
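As a starting point for automating this, here is a minimal sketch that builds an experiment-to-runs map from file paths. It assumes only that each path contains both an SRX and an SRR accession somewhere; the exact directory layout, the `.sra` extension, and every accession other than SRX006235 and SRR018454 are hypothetical placeholders.

```python
import re
from collections import defaultdict

SRX = re.compile(r"SRX\d+")  # experiment accession
SRR = re.compile(r"SRR\d+")  # run accession


def runs_by_experiment(paths):
    """Map each experiment accession to the sorted run accessions seen with it."""
    grouped = defaultdict(set)
    for path in paths:
        exp = SRX.search(path)
        run = SRR.search(path)
        if exp and run:  # skip paths that lack either accession
            grouped[exp.group()].add(run.group())
    return {exp: sorted(runs) for exp, runs in grouped.items()}


# Hypothetical layout: <experiment>/<run>/<run>.sra
example_paths = [
    "SRX006235/SRR018454/SRR018454.sra",
    "SRX006235/SRR018455/SRR018455.sra",
    "SRX006236/SRR018456/SRR018456.sra",
]
experiments = runs_by_experiment(example_paths)

# Only experiments with more than one run need a technical/biological decision.
multi_run = {exp: runs for exp, runs in experiments.items() if len(runs) > 1}
```

On a real download, the path list could come from something like `pathlib.Path(root).rglob("*")`, converted to strings.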

I used GEOquery, GEOmetadb and SRAmetadb (R packages) to look up replicate information for a run, but haven't been able to find it. SRAmetadb and GEOmetadb use supporting SQLite3 databases from the Meltzer lab.

Does anyone know of a way to achieve this?

Thanks for reading!

R rna-seq roadmap chip-seq python • 1.6k views
modified 4.4 years ago by Biostar ♦♦ 20 • written 4.5 years ago by curious50

Assuming you want to forgo asking the Roadmap project what they consider a biological and what a technical replicate, your best option is to cluster the data by similarity, for example on a PCA plot.

I think deepTools can tell you how similar two sequencing files are, but I don't know whether that approach is sensitive enough to distinguish biological from technical replicates. I also don't know what its metric of 'similarity' is, whether it can normalize to input, etc.

Alternatively, there might be the flow-cell/assay/machine ID in the metadata of the files. If two samples come from the same library, they will be technical replicates. However, this might be the same as asking Roadmap what they think it should be, if the metadata was added based on some file structure for example. It's not conclusive. A PCA of the data is really the only way to go.
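The "same library means technical replicate" rule can be sketched as follows. This is only an illustration of the heuristic, not something Roadmap documents: the record keys (`run`, `sample`, `library`) and all values are hypothetical, and a missing library name is treated as "unknown" rather than guessed.

```python
from collections import defaultdict


def group_by_library(records):
    """Group run IDs by (sample, library); runs sharing a key were sequenced from one library."""
    groups = defaultdict(list)
    for rec in records:
        if rec.get("library"):  # ignore runs without a library name
            groups[(rec["sample"], rec["library"])].append(rec["run"])
    return dict(groups)


def relation(rec_a, rec_b):
    """Classify a pair of runs for the same mark and cell type."""
    if not rec_a.get("library") or not rec_b.get("library"):
        return "unknown"
    same = (rec_a["sample"], rec_a["library"]) == (rec_b["sample"], rec_b["library"])
    # Different libraries are only *candidate* biological replicates:
    # two libraries prepared from the same biological sample would also land here.
    return "technical" if same else "biological (candidate)"


# Hypothetical metadata for three runs of one experiment.
records = [
    {"run": "run1", "sample": "sampleA", "library": "lib1"},
    {"run": "run2", "sample": "sampleA", "library": "lib1"},
    {"run": "run3", "sample": "sampleB", "library": "lib2"},
    {"run": "run4", "sample": "sampleB", "library": ""},
]
groups = group_by_library(records)
```

As noted above, this is only as reliable as the metadata itself, so it is best treated as a first pass to be checked against the data.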

modified 4.5 years ago • written 4.5 years ago by John12k

Thanks for the reply, John.

deepTools does support PCA, although I'm not sure how to interpret the results. I ran PCA on three runs from one experiment, and then separately on the two runs with similar values. In the second plot there is a lot of variation (if I'm reading it right) between the two runs.

Regarding the metadata: I am using the SRAmetadb.sqlite file. While it has information on library name, strategy and source, an assay/flow-cell/machine ID isn't apparent from the tables. I'll explore this more to see if it yields anything.
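For reference, pulling the library fields for every run of an experiment might look like the sketch below. The table name `sra` and the column names are assumptions based on my recollection of the SRAdb schema, so check them against your copy (for example with `PRAGMA table_info(sra)`). The demo runs against an in-memory stand-in with placeholder rows, not the real database.

```python
import sqlite3

# Assumed table and column names; verify against your SRAmetadb.sqlite.
QUERY = """
SELECT run_accession, experiment_accession, sample_accession, library_name
FROM sra
WHERE experiment_accession = ?
ORDER BY run_accession
"""


def library_info(conn, experiment):
    """Return one row per run of the given experiment."""
    return conn.execute(QUERY, (experiment,)).fetchall()


# In-memory stand-in with the assumed columns and placeholder values.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sra (run_accession TEXT, experiment_accession TEXT, "
    "sample_accession TEXT, library_name TEXT)"
)
conn.executemany(
    "INSERT INTO sra VALUES (?, ?, ?, ?)",
    [
        ("RUN2", "EXP1", "SAMPLE1", "libA"),
        ("RUN1", "EXP1", "SAMPLE1", "libA"),
        ("RUN3", "EXP2", "SAMPLE2", "libB"),
    ],
)
rows = library_info(conn, "EXP1")
```

Against the real file, `sqlite3.connect("SRAmetadb.sqlite")` would replace the in-memory connection.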

I will reach out to the Roadmap project for their definition. Thanks very much for the help.

modified 4.5 years ago • written 4.5 years ago by curious50

That PCA looks pretty good to me :) I mean, you'll have to load a lot more data into it: at least something that you know is a biological replicate and something that you know is a technical replicate. But no, it shouldn't worry you that with two runs they spread apart while with three the same two cluster together. That's just how PCA works, because the axes are rescaled to whatever variation is present in the data you give it. But definitely don't mix assay types, and double-check that it's normalizing by read count and input; otherwise the samples will simply cluster on read count or coverage or something, and that's no good.
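To see why the read-count normalization matters, here is a small sketch using plain Euclidean distance between per-bin count vectors (the geometry PCA works in) rather than PCA itself. The counts are made up, and it covers only depth normalization (counts per million), not input normalization.

```python
import math


def cpm(counts):
    """Scale per-bin counts to counts per million mapped reads."""
    total = sum(counts)
    return [c * 1e6 / total for c in counts]


def distance(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# Made-up per-bin read counts:
a = [10, 20, 30, 40]    # reference run
b = [30, 60, 90, 120]   # same profile as a, sequenced 3x deeper
c = [40, 30, 20, 10]    # different profile, same depth as a

raw_ab, raw_ac = distance(a, b), distance(a, c)
norm_ab = distance(cpm(a), cpm(b))
norm_ac = distance(cpm(a), cpm(c))

# On raw counts, a looks closer to c than to its own deeper copy b;
# after CPM scaling, a and b coincide and only the profile difference remains.
print(f"raw:  d(a,b)={raw_ab:.1f}  d(a,c)={raw_ac:.1f}")
print(f"cpm:  d(a,b)={norm_ab:.1f}  d(a,c)={norm_ac:.1f}")
```

Without the scaling step, any distance-based view of the data mostly reflects sequencing depth.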

written 4.5 years ago by John12k

Okay, got it. The plots I shared earlier are for H3K14ac in the hESC H1 cell line. I was under the assumption that I could run PCA on such replicates to observe variation: if they cluster together, they're technical replicates; otherwise, biological. In the above case, SRR179713 and SRR179715 would be technical replicates and the other one a biological replicate. Is this an accurate understanding?

written 4.5 years ago by curious50