mapped reads (m aligned) is higher for a subset sample?
1
0
Entering edit mode
2 days ago

Hi Biostars Community,

These are from patient rheumatoid arthritis samples.

My % aligned was at about 85% for all samples, and mapped reads (M aligned) roughly about 6-9 (million) for most of the samples. However one of the samples and all of its technical replicates had mapped reads in the 16-18 (million) range. This was also similiar before mapping to the transcriptome using salmon. What could be an explanation for this?

This picture is before mapping.

enter image description here


This picture is after mapping:

enter image description here

Thank you in advance. I will pay it forward!

salmon fastqc multiqc RNA-seq • 374 views
ADD COMMENT
2
Entering edit mode
2 days ago
GenoMax 102k

Secondary alignments or perhaps those samples simply have more data? Which aligner are you using?

ADD COMMENT
1
Entering edit mode

You know I found something interesting through the analysis process. Those 4 processed samples (SRR3350551 through SRR3350554) with double the mapped reads than the other samples are the only samples with only 4 technical replicates. While all the other samples have 8 technical replicates!! I think that's what the problem is... Somehow maybe the replicates got combined??? resulting in almost double the mapped reads!

ADD REPLY
0
Entering edit mode

As I said maybe this is just based on how these were sequenced. A HiSeq has 8 lanes, so some samples maybe got split over one entire flowcell, while others have not. This is normal, I see things like that all the time, both in public data and in our own. It is not always possible logistically to ensure very similar read depth across all samples especially if some samples come into the library prep process much later than others, or some samples failed initially and had to be repeated. Normalization will take care of that.

ADD REPLY
0
Entering edit mode

Does it matter to Salmon whether lanes are merged prior to alignment? I remember back in the day that it was better to merge the raw FastQ files for tools like TopHat, as it could affect how the initial transcript model was built..

ADD REPLY
0
Entering edit mode

I just cat the FASTQs, for example:

salmon-latest_linux_x86_64/bin/salmon quant \
  --index=reference/gencode.v37.transcripts \
  --threads=2 \
  --seqBias \
  --gcBias \
  --libType=A \
  -1 <(cat $(grep -e "${samid}" R1.list) | gunzip -c) \
  -2 <(cat $(grep -e "${samid}" R2.list) | gunzip -c) \
  --reduceGCMemory \
  --validateMappings \
  --output="out/salmon/""${samid}" ;

...or:

salmon-latest_linux_x86_64/bin/salmon quant \
  --index=reference/gencode.v37.transcripts \
  --threads=2 \
  --seqBias \
  --gcBias \
  --libType=A \
  -1 <(cat S1_R1_L001.fastq.gz S1_R1_L002.fastq.gz S1_R1_L003.fastq.gz S1_R1_L004.fastq.gz | gunzip -c) \
  -2 <(cat S1_R2_L001.fastq.gz S1_R2_L002.fastq.gz S1_R2_L003.fastq.gz S1_R2_L004.fastq.gz | gunzip -c) \
  --reduceGCMemory \
  --validateMappings \
  --output="out/salmon/S1/" ;
ADD REPLY
0
Entering edit mode

Thank you for responding so quickly GenoMax.

I am using salmon.

Hmmm... maybe, you're right... (they have more data) It could potentially be that the patient had a rheumtoid arthritis flair up (this is a knee sample), therefore there were more cells in the synovial tissues -> more mRNA?

Others are knee samples or hip samples, but maybe they weren't flairing up?

I'm curious to find out how to check for secondary alignments. Googling now.

EDIT: So there is the --validateMappings flag in salmon which improves sensitivity and specificity. This I did do, but what I didn't do was -z / --writeMapping which, I think, reports alignment scores in a SAM file? I think that's what I need to see to determine if they are secondary alignments?

I'm going to run salmon again with this flag...

ADD REPLY
1
Entering edit mode

More RNA is a possible cause but could also just be more sequencing. Balancing sequencing pools perfectly is tough - variations in extraction efficiency, concentration measurements, primer efficiencies and pipetting errors can all lead to variations in sequencing depth for subsets of samples.

This isn't usually a big problem as long as the difference is not vast (which it doesn't appear to be). Most common quantification / analysis methods should normalise for variations in sequencing depth.

I would definitely run some pre-alignment QC (eg. FastQC) to check the raw data and raw counts. If you're worried you can always subsample the data to get equal read counts (eg. using seqtk).

ADD REPLY
1
Entering edit mode

Adding on this, it is not even guaranteed that all these samples were sequenced on the same flow cell at the same time. Maybe some samples came in (for whatever reason) on a different flow cell together with other samples so the entire pooling strategy was different. There is no way to tell, and I think it does not matter too much. Raw read numbers are not really a good QC criterium unless they're unexpectedly low or vastly different. As Phil Ewels says the difference here is not large and normalization should take care of it. Be sure to perform some QC like PCA to see whether clustering is reasonable or some samples show evidence of batch effect.

ADD REPLY
0
Entering edit mode

Thank you for this guys. The -z / --writeMapping was taking forever! It's still going for like 3 hours now. I don't think it's parallelized.

Regardless, going to follow your suggestions!

ADD REPLY
1
Entering edit mode

If you are not using multiple threads with salmon then it will take longer.

ADD REPLY
0
Entering edit mode

It's weird... I was using-j28 / --cores 28, and I didn't hear all my fans going as using that many threads usually does. The server usually gets loud. I didn't check my system monitor though - so can't say for sure how many CPUs were actually being used! I guess I will check this (if multiple CPUs are being used via system montior), if I run salmon with the -z / --writeMappings flags again with --cores set.

If someone wants to transfer their question to the answer, I would accept?

ADD REPLY
1
Entering edit mode

What do you need that for? Keep in mind that the coordinates of the SAM file are in spliced-transcriptome space so you cannot easily visualize this bam on a browser.

ADD REPLY
0
Entering edit mode

Honestly, ATpoint, I thought it would be a way to see if the samples in question had more secondary alignments, since Genomax had suggested that as a potential reason for the greater number of mapped reads. Not sure if that would be accomplish-able now, but as you and Phil Ewens recommended -> don't worry about them and carry on with the analysis (normalization should take care of this).

Currently, about to transfer from tximport to DESeq2 and then going to:

Be sure to perform some QC like PCA to see whether clustering is reasonable or some samples show evidence of batch effect.

Excited to finally have results. I've been trudging through this process for like 3 days now lol

ADD REPLY

Login before adding your answer.

Traffic: 2035 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6