Question: Problem determining which FASTQ from paired-end RNA-seq is forward and which is reverse
3.3 years ago by
v82masae140
v82masae140 wrote:

Hi everyone, I have what seems to be a conceptual (or just strange) problem with some FASTQ files I have just received.

In theory, I assumed that when receiving FASTQ files from a paired-end NGS analysis of the small-RNA fraction, there would be some kind of identifier in both FASTQ files of each sample's pair, indicating which of them is the forward one and which is the reverse one.

I have these data:

/160712_700470R_0449_BHVHH7BCXX/ which has inside:

7005_S4_L001_R1_001.fastq.gz   7182_S26_L002_R1_001.fastq.gz
7006_S29_L002_R1_001.fastq.gz  7183_S27_L002_R1_001.fastq.gz
7007_S43_L002_R1_001.fastq.gz  7184_S17_L001_R1_001.fastq.gz
7008_S30_L002_R1_001.fastq.gz  7185_S25_L002_R1_001.fastq.gz
7087_S8_L001_R1_001.fastq.gz   7190_S12_L001_R1_001.fastq.gz
7088_S3_L001_R1_001.fastq.gz   7191_S16_L001_R1_001.fastq.gz
7089_S28_L002_R1_001.fastq.gz  7192_S14_L001_R1_001.fastq.gz
7090_S9_L001_R1_001.fastq.gz   7193_S19_L001_R1_001.fastq.gz
7139_S32_L002_R1_001.fastq.gz  7194_S22_L001_R1_001.fastq.gz
7140_S31_L002_R1_001.fastq.gz  7195_S5_L001_R1_001.fastq.gz
7141_S45_L002_R1_001.fastq.gz  7196_S1_L001_R1_001.fastq.gz
7144_S47_L002_R1_001.fastq.gz  7197_S23_L001_R1_001.fastq.gz
7145_S15_L001_R1_001.fastq.gz  7219_S39_L002_R1_001.fastq.gz
7146_S20_L001_R1_001.fastq.gz  7220_S34_L002_R1_001.fastq.gz
7147_S13_L001_R1_001.fastq.gz  7221_S41_L002_R1_001.fastq.gz
7151_S10_L001_R1_001.fastq.gz  7222_S40_L002_R1_001.fastq.gz
7152_S42_L002_R1_001.fastq.gz  7236_S21_L001_R1_001.fastq.gz
7153_S35_L002_R1_001.fastq.gz  7237_S6_L001_R1_001.fastq.gz
7154_S11_L001_R1_001.fastq.gz  7238_S24_L001_R1_001.fastq.gz
7160_S18_L001_R1_001.fastq.gz  7239_S2_L001_R1_001.fastq.gz
7178_S44_L002_R1_001.fastq.gz  7242_S7_L001_R1_001.fastq.gz
7179_S46_L002_R1_001.fastq.gz  7243_S33_L002_R1_001.fastq.gz
7180_S36_L002_R1_001.fastq.gz  7247_S37_L002_R1_001.fastq.gz
7181_S48_L002_R1_001.fastq.gz  7248_S38_L002_R1_001.fastq.gz

and

/160713_700470R_0450_BHVKV5BCXX/ which has inside:

7005_S4_L001_R1_001.fastq.gz   7182_S26_L002_R1_001.fastq.gz
7006_S29_L002_R1_001.fastq.gz  7183_S27_L002_R1_001.fastq.gz
7007_S43_L002_R1_001.fastq.gz  7184_S17_L001_R1_001.fastq.gz
7008_S30_L002_R1_001.fastq.gz  7185_S25_L002_R1_001.fastq.gz
7087_S8_L001_R1_001.fastq.gz   7190_S12_L001_R1_001.fastq.gz
7088_S3_L001_R1_001.fastq.gz   7191_S16_L001_R1_001.fastq.gz
7089_S28_L002_R1_001.fastq.gz  7192_S14_L001_R1_001.fastq.gz
7090_S9_L001_R1_001.fastq.gz   7193_S19_L001_R1_001.fastq.gz
7139_S32_L002_R1_001.fastq.gz  7194_S22_L001_R1_001.fastq.gz
7140_S31_L002_R1_001.fastq.gz  7195_S5_L001_R1_001.fastq.gz
7141_S45_L002_R1_001.fastq.gz  7196_S1_L001_R1_001.fastq.gz
7144_S47_L002_R1_001.fastq.gz  7197_S23_L001_R1_001.fastq.gz
7145_S15_L001_R1_001.fastq.gz  7219_S39_L002_R1_001.fastq.gz
7146_S20_L001_R1_001.fastq.gz  7220_S34_L002_R1_001.fastq.gz
7147_S13_L001_R1_001.fastq.gz  7221_S41_L002_R1_001.fastq.gz
7151_S10_L001_R1_001.fastq.gz  7222_S40_L002_R1_001.fastq.gz
7152_S42_L002_R1_001.fastq.gz  7236_S21_L001_R1_001.fastq.gz
7153_S35_L002_R1_001.fastq.gz  7237_S6_L001_R1_001.fastq.gz
7154_S11_L001_R1_001.fastq.gz  7238_S24_L001_R1_001.fastq.gz
7160_S18_L001_R1_001.fastq.gz  7239_S2_L001_R1_001.fastq.gz
7178_S44_L002_R1_001.fastq.gz  7242_S7_L001_R1_001.fastq.gz
7179_S46_L002_R1_001.fastq.gz  7243_S33_L002_R1_001.fastq.gz
7180_S36_L002_R1_001.fastq.gz  7247_S37_L002_R1_001.fastq.gz
7181_S48_L002_R1_001.fastq.gz  7248_S38_L002_R1_001.fastq.gz

As you can all see, both directories contain the same number of FASTQ files, one per sample, labelled with the same names. In theory, I should assume that for each sample one of the two files is the forward FASTQ and the other is the reverse.

Taking a look inside each pair of fastq, they appear to be different, as you can see:

/160712_700470R_0449_BHVHH7BCXX$ zcat 7005_S4_L001_R1_001.fastq.gz | head

@700470R:449:HVHH7BCXX:1:1107:1493:1874 1:N:0:TGACCA
NGCGCCGCGGCTGGACGAGAGATCGGAAGAGCACACGTCTGAACTCCAGTC
+
#<<DDIIIIIIHHHHIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@700470R:449:HVHH7BCXX:1:1107:1465:1915 1:N:0:TGACCA
NGCGACCTCAGATCAGACGAAGATCGGAAGAGCACACGTCTGAACTCCAGT
+
#<<DDHIHHHIFEHIHHIIIIIIHIHIIGIIIIIGIIHIIIIHIIIGIIIG
@700470R:449:HVHH7BCXX:1:1107:1971:1937 1:N:0:TGACCA

/160713_700470R_0450_BHVKV5BCXX$ zcat 7005_S4_L001_R1_001.fastq.gz | head

@700470R:450:HVKV5BCXX:1:1101:1664:1955 1:N:0:TGACCA
NTTGGTCCCCTTCAACCAGCTGTAGATCGGAAGAGCACACGTCTGAACTCC
+
#<<DDHIHIHIIIIIIHIIIHIIIIIIIIIIIIIIIIIHIIIIIIHIIIII
@700470R:450:HVKV5BCXX:1:1101:1940:1935 1:N:0:TGACCA
NGGAATGTAAAGAAGTATGTACAGATCGGAAGAGCACACGTCTGAACTCCA
+
#<DDDIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@700470R:450:HVKV5BCXX:1:1101:2588:1943 1:N:0:TGACCA
NCGTACCGTGAGTAATAATGCGAGATCGGAAGAGCACACGTCTGAACTCCA

I was expecting to see some kind of guide in each read's header, since the final section of the header (1:N:0:TGACCA) should indicate whether this is a forward read (1) or a reverse read (2), but surprisingly, there's a 1 in both of them. So I kinda freaked out...

After sleeping on it, I reached two possible conclusions:

1) The lab that provided the data should tell me which file is the forward and which is the reverse (I sent them an email about this issue but I haven't received an answer yet).

2) There's no real difference or importance in identifying the forward with a 1 and the reverse with a 2, and presumably I could arbitrarily assign a 1 or a 2 to each file of a pair and proceed with further analysis, but this second theory seems to me a very silly solution.

So, any help with this?... I can't start analyzing my samples until I solve this problem...


Normally (at least in my case) it is indicated in the file name, for example machine_lan.1.fastq and machine_lan.2.fastq, or both reads are in one file with each read marked 1 for forward and 2 for reverse. Is this an Illumina platform?

modified 3.3 years ago • written 3.3 years ago by Medhat8.5k
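
To make the "1 for forward, 2 for reverse" marker concrete, here is a minimal Python sketch that pulls the read number out of a header line. It assumes the headers follow the Illumina layout visible in the question (@instrument:run:flowcell:lane:tile:x:y followed by read:filtered:control:index); the function name read_number is made up for illustration and is not part of any tool.

```python
def read_number(header: str) -> int:
    """Return the read number (1 or 2) from an Illumina-style FASTQ header.

    Assumes a header such as
    '@700470R:449:HVHH7BCXX:1:1107:1493:1874 1:N:0:TGACCA',
    where the first field after the space is the read number.
    """
    if not header.startswith("@"):
        raise ValueError("not a FASTQ header line: %r" % header)
    parts = header[1:].split()
    if len(parts) < 2:
        raise ValueError("header has no read:filtered:control:index field")
    number = int(parts[1].split(":")[0])
    if number not in (1, 2):
        raise ValueError("unexpected read number: %d" % number)
    return number


if __name__ == "__main__":
    # One header from each run directory shown in the question.
    for line in (
        "@700470R:449:HVHH7BCXX:1:1107:1493:1874 1:N:0:TGACCA",
        "@700470R:450:HVKV5BCXX:1:1101:1664:1955 1:N:0:TGACCA",
    ):
        print(line.split(":")[1], read_number(line))
```

Both example headers give read number 1, while the run field differs (449 versus 450), which suggests the two directories hold read 1 from two separate runs rather than the two mates of a pair.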

If it's paired-end, then each _R1_ file should have an associated _R2_ file:

7006_S29_L002_R1_001.fastq.gz
7006_S29_L002_R2_001.fastq.gz

If not, some files are missing.

modified 3.3 years ago • written 3.3 years ago by Pierre Lindenbaum124k
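
The R1/R2 check described above can be sketched in a few lines of Python. This is only an illustration under the assumption that files follow the _R1_001.fastq.gz / _R2_001.fastq.gz naming seen in this thread; the helper names expected_mate and missing_mates are hypothetical.

```python
import re

# Matches the read-1 marker in names like 7006_S29_L002_R1_001.fastq.gz
R1_PATTERN = re.compile(r"_R1_(\d+\.fastq\.gz)$")


def expected_mate(r1_name: str) -> str:
    """Return the R2 file name that would accompany the given R1 file."""
    mate, count = R1_PATTERN.subn(r"_R2_\1", r1_name)
    if count != 1:
        raise ValueError("not an R1 file name: %r" % r1_name)
    return mate


def missing_mates(filenames):
    """List the R2 files that are absent for the R1 files in filenames."""
    names = set(filenames)
    return sorted(
        expected_mate(name)
        for name in names
        if R1_PATTERN.search(name) and expected_mate(name) not in names
    )


if __name__ == "__main__":
    listing = ["7005_S4_L001_R1_001.fastq.gz", "7006_S29_L002_R1_001.fastq.gz"]
    print(missing_mates(listing))
```

If every R1 file turns up in the result, as it would for the listings in the question, the data are either single-end or the R2 files were never delivered.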

As Pierre and Medhat said, paired-end reads will be named R1 and R2. As you mentioned, these are from small RNA, where paired-end data is not needed; it seems the same samples were sequenced in multiple runs. For the analysis, you can just concatenate the files of each sample and proceed.

written 3.3 years ago by Prasad1.6k

Oooook, thanks a lot to everyone. Yeah... I'm used to doing paired-end RNA-seq and I'm new to the microRNA world, so I assumed things wrongly...

So yes, they are single-end reads, now I see it XD, but they repeated the run over all the samples twice. Now I'm not sure whether to concatenate both runs for each sample into a single FASTQ file, or to run the analyses separately on both runs to compare, or just do both :S

written 3.3 years ago by v82masae140

If it is the same sample run multiple times you can concatenate the files (unless one of the replicates was deemed not suitable and the pool was re-run for that reason).

modified 3.3 years ago • written 3.3 years ago by genomax74k

If the rerun was due to an insufficient number of reads, then you can concatenate and do the analysis. If they are replicates, then analyze them individually.

written 3.3 years ago by Prasad1.6k
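
If concatenation turns out to be appropriate, a rough Python sketch of it could look like the following. It relies on the fact that gzip files can be joined byte for byte into a valid multi-member gzip file, so nothing needs to be decompressed. The function name concatenate_runs and the output file name are hypothetical, and the sketch assumes the lab has confirmed that neither run was rejected.

```python
import shutil


def concatenate_runs(input_paths, output_path):
    """Join several .fastq.gz files into one by appending their raw bytes.

    Concatenated gzip members form a valid gzip stream, so the result
    can be read with zcat or Python's gzip module without recompression.
    """
    with open(output_path, "wb") as merged:
        for path in input_paths:
            with open(path, "rb") as source:
                shutil.copyfileobj(source, merged)


if __name__ == "__main__":
    # Directory names taken from the question; output name is an assumption.
    sample = "7005_S4_L001_R1_001.fastq.gz"
    concatenate_runs(
        [
            "160712_700470R_0449_BHVHH7BCXX/" + sample,
            "160713_700470R_0450_BHVKV5BCXX/" + sample,
        ],
        "merged_" + sample,
    )
```

Because the read headers keep their run number and flow cell ID, reads from the two runs stay distinguishable after merging.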
3.3 years ago by
Devon Ryan92k
Freiburg, Germany
Devon Ryan92k wrote:

Either the samples weren't sequenced paired-end (it rarely makes sense to do so for smallRNAseq unless you're doing single-cell sequencing) or they forgot to deliver those files. The former is more likely. You might ask them why they ran the samples on a second flow cell, i.e., was there a problem with the first run that you should know about or was it just for depth?
