How to change depth of sequence in RNA-seq fastq files
6.0 years ago
statfa ▴ 680

Hi,

Is there a way to change the sequencing depth of RNA-seq fastq files in Galaxy?

I want to investigate the effect of sequencing depth on downstream analysis, so I have to change it artificially.

Thanks

depth of sequence RNA-Seq • 2.5k views

If you have an expression table, you can also down-sample at this step. In R, I use the rarefy and rrarefy functions from the vegan package. rrarefy will do a true sub-sampling (thus the result will change every time you run it, unless you fix the random seed), and rarefy will calculate a richness index, which can be used to estimate the number of genes that would be detected in a library that would only contain a given amount of read counts.
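
For readers who want to see the idea spelled out, here is a minimal Python sketch of what sub-sampling a count table means: drawing a fixed number of reads without replacement from the gene counts of one library. It is only an analogue of the sub-sampling described above, not vegan's implementation. The function name `subsample_counts` and the toy table are made up for illustration, and `random.sample(..., counts=...)` requires Python 3.9 or later.

```python
import random
from collections import Counter


def subsample_counts(counts, depth, seed=None):
    """Draw `depth` reads without replacement from a {gene: count} table."""
    total = sum(counts.values())
    if depth > total:
        raise ValueError("requested depth exceeds the library size")
    rng = random.Random(seed)  # fix the seed to get a reproducible sub-sample
    genes = list(counts)
    # Each gene is represented as many times as it has reads (Python 3.9+).
    drawn = rng.sample(genes, k=depth, counts=[counts[g] for g in genes])
    picked = Counter(drawn)
    return {g: picked.get(g, 0) for g in genes}


# Toy library of 100 reads (hypothetical numbers).
library = {"geneA": 50, "geneB": 30, "geneC": 0, "geneD": 20}
print(subsample_counts(library, 40, seed=1))
```

Without a fixed seed the result changes on every run, which matches the behaviour described for a true sub-sampling.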


Thank you very much. I will try that. But now I have another question on my mind:

I have an expression table with 4 replicates at each of four time points. That is, the RNA from each person was sequenced 4 times. Is it OK if I randomly delete one of the replicates? For example, I could delete replicate number 4 at all 4 time points. Is that the down-sampling you mentioned? Is it what I'm looking for? Could I then obtain the library sizes, normalize my data and continue the analysis? If so, why should I use the functions you mentioned? Unless I have misunderstood you.


Replicates and depth are very different things so no, it's not ok to remove one replicate if you want to investigate the effect of depth of sequencing.

How do I down sample a fastq in Galaxy?


Note that normalising by library size is not the same as down-sampling. Both operations compensate for differences in sequencing depth, but normalisation keeps all the data intact, while down-sampling discards data. Thus, one should only down-sample when normalisation is not enough. Here are two examples: to compare expression levels between two libraries, normalisation is enough; to compare the number of genes detected in two libraries, down-sampling is required.
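
A small Python sketch may make the distinction concrete. It assumes a simple counts-per-million scaling as the normalisation (other normalisations exist) and uses made-up counts and hypothetical function names. Scaling changes the values but not which genes have reads, whereas down-sampling the deep library to the depth of the shallow one can change the number of detected genes.

```python
import random
from collections import Counter


def counts_per_million(counts):
    """Scale counts so that every library sums to one million (keeps all data)."""
    total = sum(counts.values())
    return {g: c * 1_000_000 / total for g, c in counts.items()}


def detected_genes(counts):
    """Number of genes with at least some signal."""
    return sum(1 for c in counts.values() if c > 0)


def downsample(counts, depth, seed=0):
    """Keep `depth` randomly chosen reads, discarding the rest (Python 3.9+)."""
    genes = list(counts)
    drawn = random.Random(seed).sample(
        genes, k=depth, counts=[counts[g] for g in genes]
    )
    picked = Counter(drawn)
    return {g: picked.get(g, 0) for g in genes}


# Hypothetical libraries: same composition, 100-fold difference in depth.
deep = {"geneA": 9000, "geneB": 900, "geneC": 90, "geneD": 10}  # 10,000 reads
shallow = {"geneA": 90, "geneB": 9, "geneC": 1, "geneD": 0}     # 100 reads

# Normalisation makes expression levels comparable ...
print(counts_per_million(deep)["geneA"], counts_per_million(shallow)["geneA"])
# ... but does not change how many genes each library detects.
print(detected_genes(counts_per_million(deep)), detected_genes(counts_per_million(shallow)))
# Down-sampling puts the gene counts on an equal footing.
print(detected_genes(downsample(deep, 100)), detected_genes(shallow))
```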


My aim is to investigate the effect of sequencing depth on the detection of DE genes. I want to find out whether the number of detected DE genes increases or decreases when the sequencing depth is reduced. I hope my question is clear now. Sorry, I still don't follow you :( I'm not very knowledgeable. Before asking my question on Biostars, I removed replicate number 4 at each of my 4 time points, normalized the data by size factor and found the DE genes again. That is, I first had 4 replicates at 4 time points (a time-course study); after removing replicate number 4, I have 3 replicates at 4 time points. I then compared the DE genes detected under these two conditions. I still don't understand you and I don't want to bother you :(


If the question is about sequencing depth, then you should not reduce the number of sequences by removing replicates. Instead, you need to remove sequences from each replicate. You can either down-sample each replicate to a fraction (1/2, 1/4, 1/8, ...) of its original depth, or down-sample each replicate to a common number of reads (1,000,000; 100,000; 10,000; ...), and then re-run your DGE analysis to see how the number of DE genes decreases. (See the answer from Carlo for how to down-sample in Galaxy, if that is more convenient than doing it in R.)
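
The two strategies above (a fraction of the reads, or a fixed number of reads) can be sketched at the read level in plain Python. This is only an illustration under some assumptions: records are unwrapped 4-line FASTQ entries read from a text handle, the function names are hypothetical, and the fixed-number variant uses reservoir sampling, so the kept reads are not returned in file order. For paired-end data, using the same seed on both files should select the same positions, provided both files list the same reads in the same order.

```python
import random
from itertools import islice


def read_fastq(handle):
    """Yield FASTQ records as 4-line tuples (assumes unwrapped records)."""
    while True:
        record = tuple(islice(handle, 4))
        if not record:
            return
        if len(record) < 4:
            raise ValueError("truncated FASTQ record")
        yield record


def sample_fraction(records, fraction, seed=0):
    """Keep each record with probability `fraction` (1/2, 1/4, ...)."""
    rng = random.Random(seed)
    for record in records:
        if rng.random() < fraction:
            yield record


def sample_fixed(records, n, seed=0):
    """Reservoir sampling: keep exactly n records (or all, if there are fewer)."""
    rng = random.Random(seed)
    reservoir = []
    for i, record in enumerate(records):
        if i < n:
            reservoir.append(record)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < n:
                reservoir[j] = record
    return reservoir


def write_fastq(records, handle):
    """Write 4-line records back out unchanged."""
    for record in records:
        handle.writelines(record)
```

A typical use would be `write_fastq(sample_fraction(read_fastq(fin), 0.25, seed=1), fout)` with `fin` and `fout` opened in text mode; for gzipped files, `gzip.open(path, "rt")` gives a compatible handle.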


I think Charles clarified the issue in his last post. Indeed, downsampling in NGS is usually done at the read level, not the replicate level.

Note that it can be faster and more efficient to downsample the mapped reads (.bam or .sam files) rather than the raw reads, to avoid mapping multiple times. You can do that with samtools view -s, but I don't know if Galaxy supports it.
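
As a rough sketch of how the `samtools view -s` call could be scripted, the Python helper below builds the command line. The `-s` option takes a value of the form SEED.FRACTION: the integer part seeds the sampling and the decimal part is the proportion of alignments to keep. The helper names and file names are hypothetical, and actually running the command assumes samtools is installed and on the PATH.

```python
import subprocess


def samtools_subsample_cmd(in_bam, out_bam, fraction, seed=0):
    """Build a `samtools view -s SEED.FRACTION` command as an argument list."""
    frac_str = f"{fraction:.6f}"
    # Reject values that are not strictly between 0 and 1 after rounding.
    if not frac_str.startswith("0.") or frac_str == "0.000000":
        raise ValueError("fraction must be strictly between 0 and 1")
    digits = frac_str.split(".")[1]
    return [
        "samtools", "view",
        "-b",                        # write BAM output
        "-s", f"{int(seed)}.{digits}",
        "-o", out_bam,
        in_bam,
    ]


def run_subsample(in_bam, out_bam, fraction, seed=0):
    """Run the command; raises CalledProcessError if samtools fails."""
    subprocess.run(samtools_subsample_cmd(in_bam, out_bam, fraction, seed), check=True)


# Example: keep about a quarter of the alignments, with seed 42.
print(" ".join(samtools_subsample_cmd("sample.bam", "sample.quarter.bam", 0.25, seed=42)))
```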

Also, it is recommended to downsample multiple times at the same depth to account for the randomness of the sampling. For instance, if you sample 10,000 random reads once for each replicate of each condition and then run the analysis, you might get 5 DEGs. If you do it again using 10,000 other random reads, you could get, let's say, 10 DEGs. If you do it 100 times, you get an idea of the distribution of the number of DEGs obtained by sampling 10,000 random reads. This is much more robust than a single random sampling.
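
The repeated-sampling idea can be sketched in Python as a simple loop. Everything here is illustrative: the count tables are made up, the function names are hypothetical, and `count_detected` is only a stand-in for a real DE analysis, which would be run on each set of down-sampled libraries and return the number of DEGs.

```python
import random
import statistics
from collections import Counter


def downsample(counts, depth, rng):
    """Draw `depth` reads without replacement from a {gene: count} table."""
    genes = list(counts)
    drawn = rng.sample(genes, k=depth, counts=[counts[g] for g in genes])
    picked = Counter(drawn)
    return {g: picked.get(g, 0) for g in genes}


def repeated_downsampling(libraries, depth, n_repeats, analyse, seed=0):
    """Down-sample every library n_repeats times and collect analyse() results."""
    rng = random.Random(seed)  # one generator, so each repeat draws new reads
    results = []
    for _ in range(n_repeats):
        sampled = {name: downsample(c, depth, rng) for name, c in libraries.items()}
        results.append(analyse(sampled))
    return results


def count_detected(sampled):
    """Placeholder analysis: genes with at least one read in any library."""
    genes = set()
    for counts in sampled.values():
        genes.update(g for g, c in counts.items() if c > 0)
    return len(genes)


# Hypothetical replicates.
libraries = {
    "rep1": {"geneA": 500, "geneB": 50, "geneC": 5},
    "rep2": {"geneA": 400, "geneB": 40, "geneC": 4},
}
results = repeated_downsampling(libraries, depth=20, n_repeats=100, analyse=count_detected)
print("mean:", statistics.mean(results), "sd:", statistics.stdev(results))
```

Looking at the spread of `results`, rather than a single value, shows how much of the outcome is due to the random draw at that depth.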