How to resync paired-end data?
2
1
Entering edit mode
5.0 years ago
germelcar ▴ 20

I am working with SRR867646 reads, which contains 10919266 sequences for R1 and 2177589 for R2, and in fact, I am not able to do an alignment.

I found another Biostars' post where Devon Ryan suggest using the "repair.sh" script from the BBMap toolkit. I ran the script, but it generates the "r1" and "r2 files empty, and "singletons" file contains the same number of sequences as R1+R2 (10919266+2177589=13096855). My question is, how should I handle that? Should I treat "singletons" files as the reads were single-end?

~g

paired-end transcriptome hisat2 bowtie2 de novo • 4.1k views
0
Entering edit mode

Hello germelcar!

We believe that this post does not fit the main topic of this site.

I had the wrong reads. I fully recommend to use the latest version of fastq-dump and its --split-3 argument when working with paired-end data.

For this reason we have closed your question. This allows us to keep the site focused on the topics that the community can help with.

If you disagree please tell us why in a reply below, we'll be happy to talk about it.

Cheers!

0
Entering edit mode

@germelcar: Closing a post is an action used by moderators to manage posts that do not fit the topics of this site. Users do not close a post after a solution is found.

The appropriate response would be to the "accept" an answer (use the check mark against the answer).

0
Entering edit mode

Thanks for the observation @genomax2. I have opened it again and sorry for the inconvenience.

0
Entering edit mode
5.0 years ago
GenoMax 107k

I ran the "repair.sh" script and the output is only the "singletons.fq" file, the "r1" or "r2" files are empty.

That does not make sense. What was the full command you used?

Am I right that you did not do trimming on the original files that you got from SRA and the files as submitted may not be matched for number of reads in them?

0
Entering edit mode

Yes, you are right. I have not applied some trimming and the reads subitted don't match in the number of sequences. Also, it seems like they are trimmed because all the reads are with a quality above 28.

I ran the following command:

cat SRR867646_1.fastq SRR867646_2.fastq | repair.sh -Xmx15g in=stdin.fq out1=r1.fq out2=r2.fq outs=singletons.fq

The command above generates "r1" and "r2" files empty, but "singletons" is not empty (it's 2.9 GB). I don't know how to interpret that. Should I treat the reads as single-end?

0
Entering edit mode

You need to run repair.sh as follows. You need to specify two inputs and two corresponding outputs.

repair.sh in1=SRR867646_1.fastq in2=SRR867646_2.fastq out1=SRR867646_clean_1.fastq out2=SRR867646_clean_2.fastq outsingle=singleton.fastq

0
Entering edit mode

I ran the command that you instructed me, with the difference that I have added the "-Xmx12g" argument because of problem with memory. Again, the script has generate "SRR867646_clean_1.fastq" and "SRR867646_clean_2.fastq" files empty and "singleton.fastq" file with 13096855 sequences (like before). The output was the following:

Set INTERLEAVED to false Started output stream.

Input: 13096855 reads 1178716950 bases. Result:
13096855 reads (100.00%) 1178716950 bases (100.00%) Pairs:
0 reads (0.00%) 0 bases (0.00%) Singletons: 13096855 reads (100.00%) 1178716950 bases (100.00%)

Time: 82.731 seconds. Reads Processed: 13096k 158.31k reads/sec Bases Processed: 1178m 14.25m bases/sec

Does the script's output correct? How do I interpret and handle the data? Like single-end pair?

0
Entering edit mode

I believe that repair.sh requires that the reads use standard Illumina read name format. Would you post the first ~20 lines of the SRR867646 R1 and R2 data so we can see it?

0
Entering edit mode

Sure, here are the first 20 lines of the R1 and R2.

The first 20 lines of R1 (SRR867646_1.fastq)

@SRR867646.1.1 1 length=90 TGTNCGTTTCAAGCTTTGAACTGTTTGGATGATACCTTCTTGTTCTTGCAGTTCGGTTTCCATGGCTTCTTTTTCTTTTTCTAGATTTTT +SRR867646.1.1 1 length=90 @@@#4=ABFHGFHIIIIFFHGDDHI?GH>HHIIIIIIIG@BGGHIEHIIHHFHIIHFBEIGFHGGIIIIBCEHHEEFFBEEBD@CCECEE @SRR867646.2.1 2 length=90 CGTNGTACATAAAAATTATTCTGTTTTATCTTTTGGTAAGTCCTTATGATGAATAAAAACAAAACTGAACAGAAACACTCCAAGCATCAT +SRR867646.2.1 2 length=90 @BB#4=ADHHHGHJJJJJJIJJJJJJJJJJJJJJIJIIGIFIJJJJJJHJI>GIIGIIIIIGGIIIJHIHIGDHGAHHEHFFFFFEEEEC @SRR867646.3.1 3 length=90 GTGAAGTCAGAAGGATCGTGATAGACTTTGCCATCAGGGCCCTGGAATTCGTGTCTTTTCTCGCCATCGAAGGTACCGCAGATGCCATGT +SRR867646.3.1 3 length=90 CCCFFFFFHHHHHJIJJJGHEHIIIJJJJJJJJJJJJJJJJJJJJEFGIJJHIIIJJJJIJGJJJHFHHDDFFCEEEDDDDDDDDDCDDE @SRR867646.4.1 4 length=90 CTTNGAGTTTGTTCAGTTGGTCCACTTGGTCGCCCATTTCGGCAACGGTGTCAGTGTGTTTCTTGCGCAGGTTGGACAAGGTGGACTCAT +SRR867646.4.1 4 length=90 CCC#4ADDHHHHHIIIEHIIHIIIIIIIIGIIIIIIIIIIHIIIIIIIFHIIIIIIEH?EHFFFFFCCCCC?CCBCCCBCC@A@BCCCCC @SRR867646.5.1 5 length=90 CTGNAATTCGTGTCTTTTCTCGCCATCGAAGGTACCGCAGATGCCATGTTGGTTTTCTCTGTAAAAGTTGGATGCCGAGATGATACTGTT +SRR867646.5.1 5 length=90 @@@#4ABDHHDFDEHIHIIGHGGIEHIIEGHF?9BGDD?@6=CFHACGFDEF;@AGHGHGHCE7;?;;@DE@;@CAB;B8<>>C@CCEC#

The first 20 lines of R2 (SRR867646_2.fastq)

@SRR867646.1.2 1 length=90 TAAATTTGGTGAATTGTTAGATCGTCTGCTCGAAAGTGAGGCATCTGAACAGAGACTTAGAGACAGAATTTTTGAGTTGGATCAGAAAGA +SRR867646.1.2 1 length=90 @B@FFFFFGHHHHJJJHIGIIJJJGHIIC@HIEFDE*?FHFGGIJJJBGCDGFGIHGEDAGGIIJIGHI>?EEBFFFDEECEEBDCCCCC @SRR867646.2.2 2 length=90 CACAGATCTCGGTAAGGCCGTTGCAACACACGCAGTACGAACGCTACATTGTCTCTGCGTATCCATATTATGCCAGCGCTTTTTCGATGA +SRR867646.2.2 2 length=90 CCCFFFFFHHHHFHIJJIJJIJJJIGIJJJJIIJJHIIJIJJJJJJJHCHHHHHFFFFFACDDDDDDDEDDDDDDDDDDDDDDDDDDEEB @SRR867646.3.2 3 length=90 GTTACATCATCAAACCTAAGGGCTATGGCTTGAAAATCTACTTCGACGGAAACAACAGTATCATCTCGGCATCCAACTTTTACAGAGAAA +SRR867646.3.2 3 length=90 CCCFFFFFHHHHHJJJJJJJJJJJJJJJJJJ@CFHI><c?bbhbfffehb@fhhhc=c=aceeeedc?b?<a>?>AC>CDD>A@CCCCDD @SRR867646.4.2 4 length=90 TGGAGCCACAGCTGCCCAAATCGAGATGAACAAGAAGCGGGAAGCTGAAATGGCCAAACTGAGACGGGACCTCGAGGAAGCCAATGTTCA +SRR867646.4.2 4 length=90 CCCFFFFFHHHHHJJJJJJJJJJJJJIJGIJJJJJJJIJJJIJJJJJJJJJIIJHHHHFFFFFEEDDDDDDDDDDDDDDDDDDDCDDEED @SRR867646.5.2 5 length=90 CAAGGTCAACGACAAAGTTGTTGGAATCGTTTACAAATACGGTGAAGGTTACATCATCAAACCTAAGGGCTATGGCTTGAAAATCTACTT +SRR867646.5.2 5 length=90 @@?DFBDDFDHFHIIGHFEFADGD?EEGH?BGAHGG>DHHHG?F6?FHHGCEFHHJGJJDHHIIHCHHHFFB?C;?CEB?CA@-5>@###

As you can see, the identifier line looks like: @SRR867646.N.RX N length=90, where "N" is the spot number and "RX" is the read number (R1 or R2), I don't know if this is OK.

Do you see something weird with the reads? I have extracted the R1 and R2 with fastq-dump (2.7.0) using the "--split-files" and "--readids" arguments as explained by Edward's Labs

1
Entering edit mode

Get the fastq reads directly from ENA and forget about SRA. I am not sure why there are three fastq files at that link. Perhaps submitters have provided an interleaved reads file (which may be the third file). Get the _1 and _2 files to be safe.

0
Entering edit mode

Thanks you @genomax2, I have downloaded the files and I will work with them. About the third file, I really don't know about that, but maybe is of interest the answer that I made above it, where fastq-dump with "-F" or "--gzip" argument generates three files as ENA.

Thanks.

~g

1
Entering edit mode

There is something strange about these files since SRR86764.fastq appears to be non-overlapping with other two but does not contain interleaved reads.

SRR867646_1.fastq and SRR867646_2.fastq are in sync though. I am able to run repair.sh and get clean output. There are no singletons that result.

If you must use this dataset then as suggested by @Charles below an email to SRA tech support may be in order.

UPDATE

If I start with the SRA file from ENA and fastq-dump the reads then I get this (you may need to rename the file to .sra)

fastq-dump -F --split-files -v SRR867646.sra
Written 10919266 spots for SRR867646.sra


The files that result are not the same size so running repair.sh on them gets the following output

repair.sh in1=./SRR867646_1.fastq in2=./SRR867646_2.fastq out1=./SRR867646_clean_1.fastq out2=./SRR867646_clean_2.fastq outsingle=./single.fastq

Result:                     13096855 reads (100.00%)    1178716950 bases (100.00%)
Pairs:                      4355178 reads (33.25%)  391966020 bases (33.25%)
Singletons:                 8741677 reads (66.75%)  786750930 bases (66.75%)

0
Entering edit mode

Thanks @genomax2.

It seems like fastq-dump with --split-3 argument does the same as "repair.sh" script does. I say that based on the following:

I have executed the same command as you mentioned above and another one with the "--split-files" changed to "--split-3":

## with --split-files

R1 = 10919266 sequences

R2 = 2177589 sequences

## with --split-3

SRR867646.fastq = 8741677 sequences (the same as R1 - R2 with --split-files)

R1 = 2177589 sequences

R2 = 2177589 sequences

Also, checking the fastq-dump's documentation with the --help argument (because the webpage is for 2.5.X version), and it says it:

Multiple File Options Setting these options will producemore than 1 file, each of which will be suffixed according to splitting criteria.
--split-3 Legacy 3-file splitting for mate-pairs: First biological reads satisfying dumping conditions are placed in files *_1.fastq and *_2.fastq If only one biological read is present it is placed in *.fastq Biological reads and above are ignored.

I think that explains the behaviour or the three files from the ENA. So in conclusion, for paired-end, is recommended (or "forced") to use --split-3 argument.

Thanks very much to all of you for the help and recommendations.

~g

0
Entering edit mode

If your examples are displayed accurately, then the line breaks have fragmented your reads so they do not conform to the FASTQ format (see here). Each read should consist of four lines:

It may be that fastq-dump outputs a fixed wrap length by default, which can fragment the data, but it's hard to envision how that would produce the line breaks shown in your examples. Therefore, it appears that you have not copied the data format exactly in your examples (also, the examples contain only 19 and 18 lines, but the data are for five reads, which would/should be 20 lines).

I suspect that the real problem is that the read IDs are SRA rather than standard Illumina. You should download in the original format, using the -F flag.

Finally, it's good practice to post the command that you used (anonymize filenames if you prefer), rather than referring to some webpage you used for guidance. It's easier for us to help you troubleshoot.

0
Entering edit mode

Thanks for reply @harold.smith.tarheel. Thanks very much for your recommendations. I have reinstalled sra-tools, but it wasn't the problem. I have downloaded the fastq files from ENA as @genomax2 suggested and it seems weird why is a third file. Anyway, I have made some tests which I would like to share.

I have surfed the web and I found that SRA-tools (also knows as SRA toolkit) have made some changes and NCBI have not doccumented it very well (as mentiones in Edwards Labs). Well, I decided to try some combinations with fastq-dump.

First, I downloaded the two files from ENA and extracted those (in order to manipulate the raw fastq files), later, I downloaded the SRA file with prefetch tool. Finally, I ran fastq-dump with the following combinations:

## test1

fatq-dump --split-3 SRR867646

R1 = 528M, 2177589 sequences (the sames as the fastq files from ENA)

R2 = 528M, 2177589 sequences

The defline sequences are the same for both (R1 and R2).

528M means 528 megabytes (the output is from "ls -lh" command)

## test 2

fastq-dump --split-files -I -W -B --skip-technical SRR867646

R1 = 2.7G, 10919266 sequences

R2 = 536M, 2177589 sequences

Where READ_ID is 1 for R1 and 2 for R2.

## test 3

fastq-dump --split-3 -I SRR867646

R1 = 536M, 2177589 sequences

R2 = 536M, 2177589 sequences

The defline sequences are the same as test2.

## test 4

fastq-dump --split-files -I SRR867646

The same as test 2

## test 5

fastq-dump --split-files -I -W --skip-technical SRR867646

The same as test 2

## test 6

fastq-dump --split-3 -I -W -B --skip-technical SRR867646

R1 = 536M, 2177589 sequences

R2 = 536M, 2177589 sequences

The defline sequences are equals as test 2

## test 7

fastq-dump --split-3 -I -W --skip-technical SRR867646

The same as test 6

## test 8

fastq-dump --gzip --split-3 -I SRR867646

SRR867646.gz = 634M, 8741677 sequences

R1 (SRR867646_1.gz) = 536M

R2 (SRR867646_2.gz) = 536M

Defline the same as test 2

NOTE that with gzip argument, it has generated three files instead of two.

## test 9

fastq-dump --split-3 -I -F SRR867646 --> This is because the suggestion you made

SRR867646.fastq = 1.7G, 8741677 sequences

R1 = 414M, 2177589 sequences

R2 = 414M, 2177589 sequences

The defline sequence (first line of the spot) has the following format: @READ_ID

The defline quality (third line of the spot) has the following format: +READ_ID

## test10

fastq-dump --split-3 -F SRR867646

The same as test 9

Finally, I would like to comment that fastq-dump has the following arguments:

--helicos Helicos style defline
--defline-seq <fmt> Defline format specification for sequence. --defline-qual <fmt> Defline format specification for quality. <fmt> is string of characters and/or variables. The variables can be one of: $ac - accession,$si spot id, $sn spot name,$sg spot group (barcode), $sl spot length in bases,$ri read number, $rn read name,$rl read length in bases. '[]' could be used for an optional output: if all vars in [] yield empty values whole group is not printed. Empty value is empty string or for numeric variables. Ex: @$sn[_$rn]/$ri '_$rn' is omitted if name is empty

I tried to define the deflines but I did have success.

It seems that the test 3 will works fine, the difference is that the defline sequence and defline quality are different (@ACC_NUMBER.SPOT_NUMBER READ_ID SPOT_NUMBER for test3 and @ACC_NUMBER.SPOT_NUMBER SPOT_NUMBER/READ_ID for ENA reads) and defline quality (more verbose for test3), and that is what makes the difference (in size) between the files generated from test3 and ENA's files.

Altought the files generated from test3 are slightly the same as ENA's files, I'll work with ENA's file because those have the standar format (with the slash), and I am not sure if the software out there are able to handle the defline generated from test3.

Thanks to all for your help and reccommendations.

~g

0
Entering edit mode
5.0 years ago
Charles Plessy ★ 2.7k

If the R1 and R2 files were originally paired, and only R1 had reads removed, then you can try https://github.com/mmendez12/sync_paired_end_reads.

0
Entering edit mode

Hello Charles Plessy, thanks for reply.

Sorry but... how can I know if only R1 have reads removed? For me, it seems like both (R1 and R2) have reads removed since theirs number of sequences (10919266 for R1 and 2177589 for R2), and all above a score of 28, does that makes sense or am I missing something?

0
Entering edit mode

If the files that you have are not directly the sequencer's output (which we assume is synchronised), and if there is no explanation on how they were produced, then it is hard to answer the question, except by checking if there are R2 IDs that are not found in R1.

Actually, I would have expected such kind of information to be available in the SRA record, but I found nothing. I then visited http://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR867646, selected the "Reads" tab, and briefly browsed the read sequences to see where are the unpaired reads. They are all after read number 2177589. I do not know if it means that the authors deposited the files like this. Perhaps you can ask to the SRA helpdesk (since there is no information about who the authors are).

0
Entering edit mode

Thanks Charles Plessy.

It seems like the three files available from ENA's files were produced as I suggest in my answer above. It seems like fastq-dump (at least the version 2.7.0) has added the --split-3 argument which does the work for you.

Thanks.

~g