Question

Merging xxx.F and xxx.R illumina reads

0

4.5 years ago

Sbrillo ▴ 10

I'm trying to merge the F and R files from Illumina sequencing in order to use them in the redkmer pipeline (https://github.com/genome-traffic/redkmer-hpc).

I tried many options including:

-Download the reads from the SRA archive in the merged format (failed)

-Merge the reads using pandaseq (done)

-Use the fastq-dump --split function (I could not get the SRA toolkit installed correctly)

Since I'm having some problems with the reads merged using pandaseq, I want to try other strategies.

Do you have any suggestions?

Also, do you know how to install the SRA toolkit correctly? Is it possible to install it with just a sudo command in Ubuntu instead of downloading the zipped folder?

alignment Assembly sequence next-gen • 2.3k views

ADD COMMENT • link 4.5 years ago by Sbrillo ▴ 10

0

What do you mean by "merge": append the files to each other, or interleave them so the reads alternate R1/R2/R1/R2... in one file? You can download data directly as fastq from sra-explorer.info, by the way. Please add some details.

ADD REPLY • link 4.5 years ago by ATpoint 88k

0

Thank you for replying,

By 'merge' I mean appending them to each other (F and R), not putting them in alternating order R1/R2, etc.

This will be the input file for the pipeline and it's mandatory to use just two fastq files from male (m.fastq) and female (f.fastq) samples (not 4, m.F.fastq-m.R.fastq ; f.F.fastq-f.R.fastq).

I already used sra-explorer.info but it's the same as downloading the file from the ENA archive...

Have you ever used USEARCH or PEAR to combine F and R fastq files?

This is an example of what I did using pandaseq:

pandaseq -F -f SRR1509742_1.fastq.gz -r SRR1509742_2.fastq.gz -d rbfkms -u unmerged_pandaseq.fa 2> pandastat.txt 1> merged_mandaseq_pacbio.fastq

ADD REPLY • link 4.5 years ago by Sbrillo ▴ 10

1

Hi! So, I'm still a little confused about the format you want the reads in. There are two possible options:

1) Do you need the reads concatenated in a single file? For example, if you have 10,000 read pairs, you want to generate a file that will have 20,000 reads (the first half with the forward reads, the second half with the reverse reads).

If this is the case, you can concatenate the two files using cat (or zcat if they are compressed).

2) Or do you need each read pair merged? I think this is what you are trying to achieve by using Pandaseq, which takes each pair of forward and reverse reads, finds the overlap between them, and merges them. This means that if you have 10,000 forward and 10,000 reverse reads, in theory you will obtain 10,000 merged reads, assuming that all of the pairs have enough overlap to be merged. If you are doing this, you need to be sure that the reads actually overlap; if not, you will not obtain many reads after merging.

Without knowing exactly what the input for redkmer is, I think what you need in this case is option 1: concatenating your read files.
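Option 1 can be sketched roughly as follows. The file names are placeholders invented for this example, and two tiny FASTQ files are created first so the commands can be run as-is; with real data, skip that step. Since concatenated gzip streams still form a valid gzip file, the compressed files can also be joined with plain cat, which may help if file size is a concern and the downstream tool reads .gz input (I have not checked whether redkmer does).

```shell
# Create two tiny demo FASTQ files (placeholders for the real forward/reverse files).
printf '@read1/1\nACGT\n+\nIIII\n' | gzip > demo_F.fastq.gz
printf '@read1/2\nTTGA\n+\nIIII\n' | gzip > demo_R.fastq.gz

# Option A: decompress both files and append the reverse reads after the forward reads.
# gzip -dc does the same job as zcat and behaves the same on Linux and macOS.
gzip -dc demo_F.fastq.gz demo_R.fastq.gz > demo_cat.fastq

# Option B: join the compressed files directly; the result is still valid gzip.
cat demo_F.fastq.gz demo_R.fastq.gz > demo_cat.fastq.gz

# Sanity check: a FASTQ record is 4 lines, so reads = lines / 4.
echo "reads in concatenated file: $(( $(wc -l < demo_cat.fastq) / 4 ))"
```

The read count printed at the end should equal the sum of the reads in the two input files.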

ADD REPLY • link 4.5 years ago by Juan Ugalde ▴ 10

0

This is what they say in the paper:

'For the short-read libraries, data must be generated from both male and female samples independently and provided in fastq format as a single file (paired-end reads can be merged into one file for each sex).'

I will try both strategies (cat and pandaseq/SRA-toolkit).

When I merged the files using pandaseq, the input sizes were:

maleF.fastq.gz = 13 GB, maleR.fastq.gz = 13 GB

After merging:

male.merged.fastq = 34 GB, unmergedmale.fastq = 2 GB

I think that pandaseq worked properly, since most of the reads were in the merged file. Have you ever used paired-end reads concatenated with zcat (F and R) for alignments? I was just worried about the size of the input file if I use zcat (it will be more than 60 GB).

ADD REPLY • link 4.5 years ago by Sbrillo ▴ 10

score 3 · Accepted Answer · 2021-01-02

It is definitely possible to install the SRA toolkit correctly. The download page is here and installation instructions are here. It boils down to: 1) downloading the correct binaries; 2) unpacking the archive; 3) adding the archive's bin directory to the $PATH variable.
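Step 3 might look something like this in a terminal session. The directory $HOME/sratoolkit/bin is a hypothetical location for the unpacked archive, not a path taken from the toolkit's documentation; substitute wherever you actually unpacked it.

```shell
# Hypothetical location of the unpacked toolkit's bin directory; adjust as needed.
SRA_BIN="${SRA_BIN:-$HOME/sratoolkit/bin}"

# Append the directory to PATH only if it is not already listed.
case ":$PATH:" in
  *":$SRA_BIN:"*) ;;
  *) PATH="$PATH:$SRA_BIN"; export PATH ;;
esac

# Report whether the shell can now find fastq-dump.
if command -v fastq-dump >/dev/null 2>&1; then
  echo "fastq-dump found at: $(command -v fastq-dump)"
else
  echo "fastq-dump not found; check that $SRA_BIN is the right directory"
fi
```

This only affects the current session. To make the change permanent, the same PATH line can be added to a shell startup file such as ~/.bashrc.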

Alternatively, after unpacking the archive, the contents of the bin directory can be moved to another directory that is already part of $PATH for your account. Type echo $PATH to find out which directories are already included. A partial list of my $PATH directories looks like this:

/usr/bin:/home/programs/bbmap:/home/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin

You could move the binaries to any of the directories separated by colons, but not all of them are meant for random programs. For example, from the SRA toolkit's bin directory you could issue this command:

sudo mv * /usr/local/bin

After that you may need to log out and back in, or open a new terminal window; typing which fastq-dump should then output something like /usr/local/bin/fastq-dump. From that point on it is a matter of reading about the program's options and downloading the files as interleaved.
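If the two mate files are already on disk, an interleaved file can also be produced locally without downloading again. The sketch below is a common shell idiom rather than anything from the SRA toolkit. It assumes uncompressed FASTQ with exactly 4 lines per record, no tab characters in the headers, and both files listing the reads in the same order. All file names are placeholders, and the two demo files are created only so the example runs as-is.

```shell
# Tiny demo inputs (placeholders for real, uncompressed R1/R2 files).
printf '@a/1\nAAAA\n+\nIIII\n@b/1\nCCCC\n+\nIIII\n' > demo_R1.fastq
printf '@a/2\nTTTT\n+\nIIII\n@b/2\nGGGG\n+\nIIII\n' > demo_R2.fastq

# Collapse each 4-line record onto one tab-separated line.
paste - - - - < demo_R1.fastq > demo_R1.tsv
paste - - - - < demo_R2.fastq > demo_R2.tsv

# Put mate 1 and mate 2 side by side, then turn every tab back into a newline.
paste demo_R1.tsv demo_R2.tsv | tr '\t' '\n' > demo_interleaved.fastq

# Show the header order: each /1 read should be followed by its /2 mate.
awk 'NR % 4 == 1' demo_interleaved.fastq

# Remove the intermediate files.
rm demo_R1.tsv demo_R2.tsv
```

For files as large as the ones in this thread, the intermediate .tsv files need roughly as much disk space as the uncompressed inputs.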