Question: Reduce the number of reads from 100M to 50M per sample in existing RNA-Seq fastq files
mohammedtoufiq91 wrote (9 months ago):

Hi,

We have generated a set of RNA-seq samples from blood tissue. These are human paired-end samples with 100M reads per sample and a read length of 150bp. I want to know if it is possible to reduce the number of reads from 100M to 50M per sample using the existing RNA-Seq fastq files generated by the facility. Please let me know the most appropriate tools and methods for this.

We are currently doing a technical validation to check whether the RNA-seq analysis and results would be comparable or different with 100M versus 50M reads, including alignment, quantification, normalization and differential expression analysis.

Thank you, Toufiq


Updated question:

Thank you all for the suggestions and advice. Let me rephrase the question.

We have generated a set of RNA-seq samples from blood tissue (not globin-depleted). These are human paired-end samples with a read length of 150bp and 100 million read depth. After alignment against the hg19 genome, the alignment rate ranges between 84-91% across samples, and the total corresponding reads for each sample range between 19 and 28 million. Quantification gave a rough estimate of 30-62% of read counts corresponding to the globin genes (HBA1, HBA2, HBB), which would later be filtered out before normalization. In this scenario, I would like to know: if we sequence the samples to 50 million read depth instead of 100 million read depth, would we still get similar results as obtained with 100 million? We are currently doing a technical validation to check whether the RNA-seq analysis and results would be comparable or different with 100M versus 50M reads, including alignment, quantification, normalization and differential expression analysis.

As per the suggestions, sub-sampling the total reads from each sample gave similar results. Are the total reads obtained after sequencing a sample (the fastq file) and read depth the same concept?

rna-seq sequence fastq reads • 503 views
modified 9 months ago by chen • written 9 months ago by mohammedtoufiq91

From seqtk (also proposed below by genomax):

Subsample 10000 read pairs from two large paired FASTQ files (remember to use the same random seed to keep pairing):

seqtk sample -s100 read1.fq 10000 > sub1.fq 
seqtk sample -s100 read2.fq 10000 > sub2.fq
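A note on compressed input (worth verifying on your own files): seqtk reads gzip-compressed FASTQ transparently, so the .fq.gz files can be passed directly and the output piped back through gzip. The file names and the 50M read count below are placeholders:

```shell
# seqtk detects gzipped input automatically; keep the same seed (-s100)
# for both mates so read pairing is preserved, and compress via a pipe.
seqtk sample -s100 read1.fq.gz 50000000 | gzip > sub1.fq.gz
seqtk sample -s100 read2.fq.gz 50000000 | gzip > sub2.fq.gz
```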
modified 9 months ago • written 9 months ago by geek_y

Thank you. What is the -s100 parameter? Is it the read length?

My data files are compressed in fastq.gz format. Does seqtk support this? If so, what are the commands and arguments for compressed input?

written 9 months ago by mohammedtoufiq91


These are human paired-end samples with read length of 150bp with 100 million read depth.

and

total corresponding reads for each sample range between 19 to 28 million reads

Does that mean your total dataset had 100M reads spread across ~5 samples with 19-28M reads each? Or do you actually have 100M reads for each sample? Otherwise something does not add up. Note: the word "depth" is not being used appropriately here. A context-dependent explanation of depth can be found in this thread: What Is The Sequencing 'Depth' ?

I would like to know if we sequence the samples to 50 million read depth instead of 100 million read depth, would we still get similar results as obtained with 100 million?

Yes, if you are thinking about the % of globin reads. Because you would be sampling randomly, the percentage of globin reads in the 50M set should remain at the same relative level as in the 100M set.

modified 9 months ago • written 9 months ago by genomax

Does that mean your total dataset had 100M reads spread across ~5 samples with 19-28M reads each? There are 13 samples. Each sample was sequenced for 100 million read depth. However, the fastq files generated show varying total numbers of reads per sample, for instance:

Sample 1: 20M
Sample 2: 28M
Sample 3: 25M
Sample 4: 19M
...
Sample 13: 22M

Or do you actually have 100M reads for each sample? I do not have 100 million reads for each sample; however, each sample was sequenced for 100 million read depth by the facility, and the fastq files have varying total reads. Are the total reads obtained after sequencing a sample (the fastq file) and read depth, in general, the same concept or different concepts?

Yes, if you are thinking about the % of globin reads. Because you would be sampling randomly, the percentage of globin reads in the 50M set should remain at the same relative level as in the 100M set.

Generally, if we sequence the same set of samples again to 50 million read depth instead of 100 million read depth, would we still get similar results as obtained with 100 million?

written 9 months ago by mohammedtoufiq91

Are the total reads obtained after sequencing a sample (the fastq file) and read depth, in general, the same concept or different concepts?

Would you mind taking a look at the "What Is The Sequencing 'Depth'?" post from Biostars that I linked above? It should help clarify depth and the context that is needed when you refer to depth. Are you thinking in these terms: A: What Is The Sequencing 'Depth' ?

Each sample was sequenced for 100 million read depth.

This does not have a logical meaning. What are you basing the calculation on?

modified 9 months ago • written 9 months ago by genomax

Thank you. It was helpful.

Does sub-sampling an unaligned fastq file and the corresponding aligned sam/bam file produce similar outputs?

written 9 months ago by mohammedtoufiq91

Does sub-sampling an unaligned fastq file and the corresponding aligned sam/bam file produce similar outputs?

Probably not, even if you have unaligned reads in your aligned files. The whole idea of random sampling is to be just that. If you sub-sampled the original data and the aligned file, each with a specific seed, repeated runs on each file would give the same output, but the results would likely not be the same across the two file types.

If you need to do any sub-sampling, you should probably do it at the fastq read level.

modified 9 months ago • written 9 months ago by genomax

Thank you. In seqtk, what is the -s100 parameter? Is it the read length?

My data files are compressed in fastq.gz format. Does seqtk support this? If so, what are the commands and arguments for compressed input?

written 9 months ago by mohammedtoufiq91

Options: -s INT RNG seed [11]

It is the seed parameter for the random number generator. If you set it to the same number, you are guaranteed to get the same output across multiple runs.
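To illustrate why the seed matters, here is a hypothetical pure-awk analogue of seeded sampling (a sketch, not seqtk itself): each 4-line FASTQ record is kept with probability 0.5, and fixing the RNG seed makes the selection reproducible, so the same command run on both mate files keeps pairs in sync (within one awk implementation). The file names are placeholders:

```shell
# Group each 4-line FASTQ record onto one row, then keep it with
# probability 0.5; srand(seed) makes repeated runs produce identical output.
paste - - - - < read1.fq \
  | awk -F'\t' -v seed=100 'BEGIN{OFS="\n"; srand(seed)} rand() < 0.5 {print $1, $2, $3, $4}' \
  > sub1.fq
```

Unlike seqtk's exact-count sampling, this keeps a number of reads that only averages 50%, and different awk implementations may select different reads for the same seed, though each is internally reproducible.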

written 9 months ago by genomax
genomax (United States) wrote (9 months ago):

You can use reformat.sh from the BBMap suite (look at the sampling options) or seqtk (the sample sub-command) from Heng Li to sub-sample data randomly.

Data quality can be sample/library dependent, so I am not sure you can generalize this type of analysis if you expect to get more human blood samples.

modified 9 months ago • written 9 months ago by genomax
5heikki (Finland) wrote (9 months ago):

This deterministically outputs every second read (the first of each consecutive pair of records) from a fastq file:

cat in.fq | paste - - - - - - - - | awk 'BEGIN{FS="\t";OFS="\n"}{print $1,$2,$3,$4}' > out.fq
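Because the selection is deterministic rather than random, running the same command on both mate files keeps R1/R2 in sync. A sketch with placeholder file names:

```shell
# paste merges 8 lines (= 2 FASTQ records) per row; printing fields 1-4
# keeps the first record of each pair, halving both files in lockstep.
paste - - - - - - - - < read1.fq | awk 'BEGIN{FS="\t"; OFS="\n"} {print $1, $2, $3, $4}' > half1.fq
paste - - - - - - - - < read2.fq | awk 'BEGIN{FS="\t"; OFS="\n"} {print $1, $2, $3, $4}' > half2.fq
```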
modified 9 months ago • written 9 months ago by 5heikki
swbarnes2 (United States) wrote (9 months ago):

FASTQ read names contain the tile of origin for each read. In this read name, it is tile 1106:

SNL154:328:CCBTGACXX:1:1106:3995:2249

You could use awk or grep to filter for reads from only half of the tiles.
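As a hedged sketch of that idea (assuming standard Illumina read names, where the tile number is the 5th colon-separated field): keeping only odd-numbered tiles retains roughly half of them. Note this is systematic rather than random sampling, which is why random subsampling (e.g. seqtk) is usually preferred. File names are placeholders:

```shell
# Group each 4-line FASTQ record onto one row, split the read name on
# ':', and keep records whose tile number (field 5) is odd.
paste - - - - < in.fq \
  | awk -F'\t' 'BEGIN{OFS="\n"} {split($1, a, ":"); if (a[5] % 2 == 1) print $1, $2, $3, $4}' \
  > half_tiles.fq
```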

written 9 months ago by swbarnes2
chen (OpenGene) wrote (9 months ago):

I suggest you use fastp ( https://github.com/OpenGene/fastp ).

fastp -i R1.fq.gz -I R2.fq.gz -o out.R1.fq.gz -O out.R2.fq.gz -A -Q -L -G --reads_to_process=50000000

The -A -Q -L -G options disable adapter trimming, quality filtering, length filtering and polyG trimming, respectively, since fastp performs these by default.

fastp has prebuilt binaries, so it is very easy to use.

modified 9 months ago • written 9 months ago by chen

So, I understand that fastp can be used for downsampling and subsampling data from compressed fastq files.

written 9 months ago by mohammedtoufiq91

Yes, but the downsampling is not random; fastp just takes the reads from the beginning of the file.

written 9 months ago by chen
Powered by Biostar version 2.3.0