Question

Interpreting Estimate Library Complexity Output

9.9 years ago

komal.rathi ★ 4.1k

Hi everyone,

I am using the EstimateLibraryComplexity utility from Picard Tools to calculate the complexity of my paired-end RNAseq libraries.

This is my command line:

java -jar /picard/EstimateLibraryComplexity INPUT=sample.bam OUTPUT=sample_libcomp.txt VERBOSITY=ERROR VALIDATION_STRINGENCY=SILENT

This generates a sample_libcomp.txt file. This is the truncated output:

## HISTOGRAM    java.lang.Integer
duplication_group_count    P01311

1       23739815
2       3633946
3       870509
4       426481
5       202751
6       171461
7       93221
8       83632
9       58171
10      50066
11      34938
12      36788
13      24277
14      24100
15      19388
16      18345
17      13640
18      14480
...
456     1
457     1
458     1
459     1
460     2
464     3
468     1
470     2
471     2
473     1
477     2
480     1
484     1
488     1

Can anyone explain what these values mean? I couldn't find an explanation of the output anywhere. I plan to plot these values as a density histogram (possibly after converting them to log2), so I need to understand what they represent in order to interpret the plot.

Thanks!

Picard EstimateLibraryComplexity • 5.6k views

Updated 2.5 years ago by Ram 43k • written 9.9 years ago by komal.rathi ★ 4.1k

score 3 · Accepted Answer · 2014-06-13

9.9 years ago

Dan D 7.4k

The first column is the size of a duplicate group, i.e. the number of times the same read pair was observed. The second column is the number of groups of that size.

So in your output, 23,739,815 read pairs were seen exactly once (no duplicates), and there are 426,481 groups in which the same read pair was seen four times.
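
If it helps with the plotting step, here is a rough Python sketch for pulling the histogram out of the output file and summarizing it. It rests on an assumption about the format: that the first column is the size of a duplicate group and the second is the number of groups of that size. The function names (`parse_histogram`, `summarize`, `log2_counts`) are made up for this example and are not part of Picard.

```python
import math


def parse_histogram(lines):
    """Return {group_size: n_groups} from the '## HISTOGRAM' section.

    Assumption: column 1 is the duplicate-group size and column 2 is the
    number of groups of that size.
    """
    hist = {}
    in_hist = False
    for line in lines:
        line = line.strip()
        if line.startswith("## HISTOGRAM"):
            in_hist = True
            continue
        if not in_hist or not line:
            continue
        fields = line.split()
        # Skip the column-header row and anything non-numeric (e.g. "...").
        if len(fields) != 2 or not fields[0].isdigit():
            continue
        # Counts may be written as "123" or "123.0", so go through float.
        hist[int(fields[0])] = int(float(fields[1]))
    return hist


def summarize(hist):
    """Return (total read pairs, distinct read pairs, duplicate read pairs)."""
    total = sum(size * n for size, n in hist.items())
    distinct = sum(hist.values())
    return total, distinct, total - distinct


def log2_counts(hist):
    """log2 of the number of groups per group size, for plotting."""
    return {size: math.log2(n) for size, n in hist.items() if n > 0}


if __name__ == "__main__":
    # Hypothetical usage: sample_libcomp.txt is the OUTPUT file from the post.
    with open("sample_libcomp.txt") as handle:
        hist = parse_histogram(handle)
    total, distinct, dups = summarize(hist)
    print(f"total={total} distinct={distinct} duplicates={dups}")
```

Under that reading, a group of size 4 contributes four read pairs to the total but only one distinct pair, so three of them count as duplicates. The log2 values are only a convenience because the counts span several orders of magnitude.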
