Problems with zero count values and mapping % in MaGeCK
11 days ago
DragosV • 0

Hello! I apologise for what will be a longer post. I'm a beginner in bioinformatics and I'm seeing a few issues with my dataset, so I decided to turn to the community to ask whether anything can be improved and to raise some questions and worries.

Experiment Overview:

Sample Types:

    Wild-Type (WT): Untreated (x1), LD50 treated (x2), LD90 treated (x2)

    Knockout (KO): Untreated (x1), LD50 treated (x2), LD90 treated (x2)

    KO generated via CRISPR-Cas9 using an sgRNA not present in the Vienna library.

Experimental Consistency:

    All samples used the same cell line, CRISPR library, and experimental procedure, but were plated in separate dishes.

Library Preparation and Sequencing:

    The Vienna sgRNA library was reformatted to match the Yusa library format (per MaGeCK documentation) without deduplicating sgRNAs, relying on MaGeCK to handle duplicates.

    Libraries were PCR-amplified and subjected to paired-end sequencing.
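For the deduplication point above, a minimal sketch of how duplicated sgRNA sequences in the library file could be listed before running MaGeCK. It assumes a three-column CSV (sgRNA ID, sequence, gene) with no header row; the IDs, sequences, and gene names in the example are made up for illustration and are not from the Vienna library.

```python
import csv
import io
from collections import defaultdict


def find_duplicate_sequences(rows):
    """Group sgRNA IDs by sequence; return only sequences seen more than once."""
    ids_by_seq = defaultdict(list)
    for sgrna_id, sequence, _gene in rows:
        ids_by_seq[sequence.strip().upper()].append(sgrna_id)
    return {seq: ids for seq, ids in ids_by_seq.items() if len(ids) > 1}


# Hypothetical library content (id, sequence, gene), for illustration only.
example_csv = """sg1,ACCGTTAGCATGCCTAAGTC,GeneA
sg2,TTGACCTAGGCATGCTAACG,GeneA
sg3,ACCGTTAGCATGCCTAAGTC,GeneB
"""

rows = list(csv.reader(io.StringIO(example_csv)))
duplicates = find_duplicate_sequences(rows)
for seq, ids in duplicates.items():
    print(f"{seq} is shared by: {', '.join(ids)}")
```

With a real library, `rows` would come from `csv.reader(open(path, newline=""))` instead of the in-memory example. Whether duplicated sequences should be dropped or merged is a separate decision that this sketch does not make.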

Data Processing Workflow:

    Per-lane sequencing files were concatenated by sample, with R1 and R2 reads kept separate. Index files (I1 and I2) were not used.

    MaGeCK count was run in paired-end mode using the following command:
    mageck count -l (...) --fastq (...) --fastq-2 (...) --sample-label (...) -n counts --count-pair True --norm-method total --sgrna-len 20

Identified Problems and Related Questions:

Problem A: Low Mapping Rates

WT samples: ~50% mapping (drops to ~34% when using --trim 23).

KO samples: ~26% mapping (drops to ~18% with --trim 23).

Q1: Should these samples be treated as technical or biological replicates for analysis?

Since the same cell line and procedure were used across different dishes, I'm inclined to treat them as technical replicates.

Q2: Did we miss any critical steps in creating the library .csv file, such as deduplication?

Q3: Can MaGeCK analysis proceed with mapping rates below the recommended 60%?

Concern: Low mapping rates and high zero-count percentages (see problem C) could compromise data integrity.

Problem B: KO sgRNA Contamination in KO Samples

The sgRNA used to generate the KO appears in ~45-50% of KO reads across all conditions.

Q4: Can KO samples be analyzed despite high KO sgRNA presence, especially for KO-only comparisons?

I am leaning towards resequencing KO samples to exclude the KO sgRNA. Computational removal is another option, but it risks subsampling and coverage issues, potentially introducing false negatives. My main worry here is sequencing coverage (i.e. not being able to detect sgRNAs at low levels), given the abundance of the KO sgRNA.

Q5: Would computational removal of the KO sgRNA adequately address the issue, or could it introduce bias?

My worry is that removing the KO sgRNA may not fully mitigate coverage problems and could bias the analysis.
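As a rough sanity check on the numbers above, here is a small sketch of what the KO mapping rate would look like among the reads that remain after setting the KO sgRNA reads aside. It assumes the KO sgRNA reads are all counted as unmapped (which would follow from that sgRNA not being in the library file) and uses only the approximate percentages quoted in this post.

```python
def mapping_rate_excluding(overall_rate, excluded_fraction):
    """Mapping rate among the reads left after discarding a fraction of reads
    that cannot map (here: reads carrying the KO sgRNA)."""
    if not 0 <= excluded_fraction < 1:
        raise ValueError("excluded_fraction must be in [0, 1)")
    return overall_rate / (1 - excluded_fraction)


ko_overall_rate = 0.26  # ~26% mapping in KO samples (from this post)

for ko_fraction in (0.45, 0.50):  # KO sgRNA is ~45-50% of KO reads
    adjusted = mapping_rate_excluding(ko_overall_rate, ko_fraction)
    print(f"KO sgRNA fraction {ko_fraction:.0%}: "
          f"adjusted mapping rate {adjusted:.1%}")
```

Under that assumption the adjusted KO rate comes out at roughly 47-52%, which is close to the ~50% seen in WT. If this holds, the KO sgRNA may account for most of the WT/KO difference in mapping, while the remaining unmapped half is shared by both genotypes. Removing those reads would not change the number of usable reads per sgRNA, so the coverage concern remains.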

Problem C: High Zero-Count sgRNAs

~60% of sgRNAs have zero counts across all samples (recommended: 1-5%, as per documentation).

Q6: Can analysis continue with such a high zero-count rate?

My concerns here relate to data validity, given that a high proportion of sgRNAs are not present in any of the samples.

Q7: Could the high zero counts represent true dropouts, or do they indicate experimental error?

I suspect this is an experimental issue rather than genuine dropouts, given the consistency across samples.
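To separate "zero in every sample" from "zero in some samples", the count table can be summarised per sample. The sketch below assumes a tab-separated table whose first two columns are `sgRNA` and `Gene`, followed by one column per sample label, which is my understanding of the MaGeCK count output layout. The table in the example is invented for illustration.

```python
import csv
import io


def zero_count_summary(count_table_text):
    """Return (per-sample zero fraction, fraction of sgRNAs that are zero everywhere)."""
    reader = csv.reader(io.StringIO(count_table_text), delimiter="\t")
    header = next(reader)
    samples = header[2:]  # columns after sgRNA and Gene
    zeros = {sample: 0 for sample in samples}
    all_zero = 0
    total = 0
    for row in reader:
        if not row:
            continue
        counts = [float(value) for value in row[2:]]
        total += 1
        for sample, count in zip(samples, counts):
            if count == 0:
                zeros[sample] += 1
        if all(count == 0 for count in counts):
            all_zero += 1
    if total == 0:
        raise ValueError("count table has no sgRNA rows")
    per_sample = {sample: zeros[sample] / total for sample in samples}
    return per_sample, all_zero / total


# Invented example table; sample labels are placeholders.
example_rows = [
    ["sgRNA", "Gene", "WT_untreated", "KO_untreated"],
    ["sg1", "GeneA", "10", "0"],
    ["sg2", "GeneA", "0", "0"],
    ["sg3", "GeneB", "5", "7"],
    ["sg4", "GeneB", "0", "0"],
]
example_table = "\n".join("\t".join(row) for row in example_rows)

per_sample, everywhere = zero_count_summary(example_table)
print(per_sample)
print(f"zero in all samples: {everywhere:.0%}")
```

If the same sgRNAs are at zero in every sample, including the untreated ones, that would point away from treatment-driven dropout and towards something upstream, such as library representation or a mismatch between the library file and what was actually cloned.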

Problem D: Potential Library Contamination

In a random check of 25 WT untreated sequences, 9 had inserts that did not map to the library.

Reverse complementing sgRNAs and running MaGeCK with the reverse complement flag further reduced mapping rates (to just ~1-2%).

Q8: Could contamination during library preparation explain the mapping issues, and how can this be confirmed?

This pattern suggests contamination, but more investigation is needed.
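One way to extend the manual check of 25 reads is to scan a larger subset of reads for any library sgRNA at any position and in either orientation, and to tally where matches occur. The sketch below is a simplified exact-match scan, not how MaGeCK itself searches reads; the sequences are invented for illustration.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def classify_read(read, library, sgrna_len=20):
    """Return (orientation, offset) of the first exact library match, else None."""
    read = read.upper()
    candidates = (("forward", read), ("reverse", reverse_complement(read)))
    for orientation, seq in candidates:
        for offset in range(len(seq) - sgrna_len + 1):
            if seq[offset:offset + sgrna_len] in library:
                return orientation, offset
    return None


# Invented library sequence and reads, for illustration only.
library = {"ACCGTTAGCATGCCTAAGTC"}
reads = [
    "GGGGACCGTTAGCATGCCTAAGTCTTT",   # sgRNA in forward orientation
    "TTGACTTAGGCATGCTAACGGTGGG",     # sgRNA on the reverse strand
    "TTTTTTTTTTTTTTTTTTTTTTTTT",     # no library match
]

tally = Counter(classify_read(read, library) for read in reads)
for outcome, n in tally.items():
    print(outcome, n)
```

If matches cluster at one offset, a fixed trim should work; if they are spread over several offsets (for example because of staggered primers), a fixed trim such as `--trim 23` could discard valid reads, which might be one explanation for the lower mapping rate with trimming. Reads with no match in either orientation are the ones worth inspecting by hand or searching against other references.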

Q9: Given that preliminary results align with known biology, could data quality issues introduce confirmation bias or skew findings?

Although results align with expectations, I am concerned that underlying data problems (low mapping and high zero counts) may have led to biased gene enrichment.

Many thanks for your patience and thank you in advance for your input! Dragos.

CRISPR screen MaGeCK • 283 views

Please use > to format questions as quotes and then leave the sentences after this as "normal" text. As formatted now, you have put emphasis on comments rather than questions. So do something like:

> Q8: Could contamination during library preparation explain the mapping issues, and how can this be confirmed?

This pattern suggests contamination, but more investigation is needed.

or simply use a list

8: Could contamination during library preparation explain the mapping issues, and how can this be confirmed?

This pattern suggests contamination, but more investigation is needed.


Thank you for the formatting suggestions!
