Question

How to process paired end data (using fastq-dump) to get Fw and Rv files

0

5.9 years ago

2405592M ▴ 140

Hi guys

I downloaded a FASTQ file with the fastq-dump command (SRA Toolkit) to get paired-end data that I want to analyse. However, the output is a single FASTQ file; I was expecting two (Fw and Rv). I want to use Trimmomatic, which needs two input files for paired-end data. How do I get around this?

New to the scene as you can tell!

Thanks in advance!

Fastq RNA-Seq SRAtoolkit trimmomatic • 3.2k views

ADD COMMENT • link updated 10 months ago by Ram 43k • written 5.9 years ago by 2405592M ▴ 140

2

Check this thread for related answers: How to split paired end SRA file into 2 correct fastq files.

ADD REPLY • link 5.9 years ago by Arup Ghosh 3.2k

0

Hi guys! Sorry this comes a few months later, but my follow-up question is related to this thread. Is it at all possible that the above SRR FASTQ file is in fact an interleaved FASTQ file? When I ran more on the SRR file, these were the first few lines:

@SRR1909108.1 1 length=151
TGCTCTGATGAAATCACTAATAGGAAGTGCCGTCAGAAGCGATAACTGACGAGGACTACTCCTGTCTGATTAGATCGGAAGAGCACACGTCTGAACTCCAGTCACACTTGAATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAATTTTTT
+SRR1909108.1 1 length=151
A1AAAFFFFFFFGGGGFG31FDB1110B11A0EA00111A00B//BF11A/////AF0BGEAG1GAG11DG211DB/////00>1F0B/B?G/@DFGHGE1F@1@@1BF111B1FD0?/F0BB/EGHGDG1DGG1BBDGFCC?########
@SRR1909108.2 2 length=151
TTCGTGATCGATGTGGTTACGTCTTTCTATTTCTTATTTTCACTCTTCTTTACTCCATTCTCTCTTTTTTCTCTTTTTCCTTCTTCTTCTTTTTTAAATTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTCTTCTCTTTTTT
+SRR1909108.2 2 length=151
AA@AAA1FFA?ADF11F1111B0003AD333BA3122222222211A1B112AB1A11222221BF111/0A22D1D1011D1FG2B12D@############################################################

To my follow-up point: if this FASTQ file is indeed interleaved, what would be the best way to trim off the 5' and 3' adapter sequences? Would I have to treat the FASTQ file as single-end reads, or is there another procedure?

ADD REPLY • link 5.7 years ago by 2405592M ▴ 140

1

SRR1909108 is a single-end sample (here is the ENA entry for it), so there is no interleaving here. Those are just reads in sequence (1, 2, 3, 4, etc.).
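If you want to check this on your own file, here is a minimal Python sketch of the idea. It assumes that in an interleaved FASTQ the two mates sit in consecutive records and share the same read name (optionally with /1 and /2 suffixes); that convention is an assumption, not something every tool guarantees. The function names are made up for illustration.

```python
import itertools

def strip_mate(name):
    """Drop a trailing /1 or /2 mate suffix, if present."""
    return name[:-2] if name.endswith(("/1", "/2")) else name

def read_names(fastq_lines):
    """Yield the read name from each FASTQ header (every 4th non-blank line)."""
    non_blank = (line for line in fastq_lines if line.strip())
    for i, line in enumerate(non_blank):
        if i % 4 == 0:
            # "@SRR1909108.1 1 length=151" -> "SRR1909108.1"
            yield line[1:].split()[0]

def looks_interleaved(fastq_lines, n_records=1000):
    """Heuristic only: True if the first records come in same-named pairs."""
    names = list(itertools.islice(read_names(fastq_lines), n_records))
    if len(names) < 2:
        return False
    pairs = zip(names[0::2], names[1::2])
    return all(strip_mate(a) == strip_mate(b) for a, b in pairs)
```

You could call it as `looks_interleaved(open("SRR1909108.fastq"))` (the file name is just an example). For the two records pasted above, the names are SRR1909108.1 and SRR1909108.2, which differ, so the check returns False, consistent with plain single-end reads in sequence.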

ADD REPLY • link 5.7 years ago by GenoMax 141k

score 3 · Answer 1 · 2018-06-13

3

5.9 years ago

Devon Ryan 104k

You need --split-3. Also, use ENA rather than SRA if you can; it's much faster.
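As a rough illustration, the sketch below wraps the fastq-dump call in Python and then checks which files came out. It assumes the SRA Toolkit is on your PATH and that --split-3 writes `<accession>_1.fastq` and `<accession>_2.fastq` for paired reads and a plain `<accession>.fastq` for single or unpaired reads. The helper names `dump_split3` and `layout_from_outputs` are hypothetical.

```python
import subprocess
from pathlib import Path

def dump_split3(accession, outdir="."):
    """Run: fastq-dump --split-3 -O <outdir> <accession> (needs the SRA Toolkit)."""
    subprocess.run(
        ["fastq-dump", "--split-3", "-O", str(outdir), accession],
        check=True,  # raise if fastq-dump exits with an error
    )

def layout_from_outputs(accession, outdir="."):
    """Guess the layout from the files that --split-3 left in outdir."""
    outdir = Path(outdir)
    mate1 = outdir / f"{accession}_1.fastq"
    mate2 = outdir / f"{accession}_2.fastq"
    single = outdir / f"{accession}.fastq"
    if mate1.exists() and mate2.exists():
        return "paired"
    if single.exists():
        return "single"  # only one read per spot was stored
    return "unknown"
```

If `layout_from_outputs` reports "single" after `dump_split3` has run, the run most likely contains only one read per spot, and no fastq-dump option will produce a second file.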

ADD COMMENT • link 5.9 years ago by Devon Ryan 104k

0

Thanks Devon! Two questions. 1) What would the command be? I've tried fastq-dump --split-3 SRR1909107 but I'm still getting one FASTQ file. 2) With regard to the ENA, can I download directly from the command line, or would I have to download these files manually from the ENA website? Appreciate the help!

ADD REPLY • link 5.9 years ago by 2405592M ▴ 140

2

SRR1909107 is indeed single-end. It is not uncommon for people to mislabel files that are uploaded to the NCBI. It is also not uncommon for some lane replicates to be paired and others single, because who cares about confounding effects and things, right :-D Anyway, for the ENA, there is good documentation for downloads here.

ADD REPLY • link 5.9 years ago by ATpoint 81k

1

The person who uploaded those samples either mislabeled them or only uploaded one of the two reads; it's unclear which. I suggest you contact whoever uploaded them and ask.

You can use wget or curl or ascp with ENA too, just like SRA. The main difference is that you will directly get fastq files and not the silly SRA files.
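To find the FASTQ locations from the command line, one option is the ENA Portal API file report. The Python sketch below builds the query and parses the tab-separated reply. It assumes the `read_run` result with the `run_accession`, `library_layout` and `fastq_ftp` fields, and that `fastq_ftp` holds semicolon-separated paths without a scheme, so it prepends `ftp://`. The function names are made up for illustration; check the ENA documentation before relying on the details.

```python
import csv
import io
import urllib.parse
import urllib.request

ENA_FILEREPORT = "https://www.ebi.ac.uk/ena/portal/api/filereport"

def filereport_url(accession):
    """Build the file report query URL for one run accession."""
    query = urllib.parse.urlencode({
        "accession": accession,
        "result": "read_run",
        "fields": "run_accession,library_layout,fastq_ftp",
    })
    return f"{ENA_FILEREPORT}?{query}"

def parse_filereport(tsv_text):
    """Turn the TSV reply into a list of dicts with layout and FASTQ URLs."""
    runs = []
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        # Assumption: paths come without a scheme, separated by ';'
        urls = [u if "://" in u else "ftp://" + u
                for u in row["fastq_ftp"].split(";") if u]
        runs.append({
            "run": row["run_accession"],
            "layout": row["library_layout"],
            "fastq_urls": urls,
        })
    return runs

def fetch_filereport(accession):
    """Query ENA (needs network access) and return the parsed runs."""
    with urllib.request.urlopen(filereport_url(accession)) as resp:
        return parse_filereport(resp.read().decode())
```

The URLs returned by `fetch_filereport("SRR1909107")` can then be handed to wget or curl. The reported layout and the number of URLs give a quick indication of whether a run was deposited as single or paired.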

ADD REPLY • link 5.9 years ago by Devon Ryan 104k