Question

SRA/ENA library layout is inconsistent with the data source

0

Entering edit mode

2.9 years ago

tomas4482 ▴ 420

project number: PRJNA505380 An example of Run accession: SRR8244780 Issue: Inconsistency between the library layout of Run and data source.

As the library layout both in ENA and SRA labeled, Runs in Bioproject PRJNA505380 should be pair-end reads data. But some of them only have a single fastq and without underscore "_1" or "_2" to indicate the pair-ended setting.

I took the example data for a closer look using following code under Ubuntu 18:

grep @SRR8244780.1000510 SRR8244780.fastq

Print some of the results here:

@SRR8244780.10005100 10005100/2
@SRR8244780.10005101 10005101/2
@SRR8244780.10005102 10005102/2
@SRR8244780.10005103 10005103/2
@SRR8244780.10005104 10005104/2
@SRR8244780.10005105 10005105/2
@SRR8244780.10005106 10005106/2
@SRR8244780.10005107 10005107/2
@SRR8244780.10005108 10005108/2
@SRR8244780.10005109 10005109/2
@SRR8244780.100051087 100051087/1
@SRR8244780.100051088 100051088/1
@SRR8244780.100051089 100051089/1
@SRR8244780.100051090 100051090/1
@SRR8244780.100051091 100051091/1
@SRR8244780.100051092 100051092/1
@SRR8244780.100051093 100051093/1
@SRR8244780.100051094 100051094/1
@SRR8244780.100051095 100051095/1
@SRR8244780.100051096 100051096/1
@SRR8244780.100051097 100051097/1

Because I can't see the original id for each read. I can only assume that all these ID I "grep"ed are unique read ID. There is no duplicated read ID showing like this:

@SRR123456789.123 123/1
@SRR123456789.123 123/2

Generally, if you have a single-end read with illumina identifier, it should look like this:

grep HWI-ST337R:419:C1NFJACXX:2:1101:13942:2686 a.fastq

Output:

@SRRxxxx.xxxx HWI-ST337R:419:C1NFJACXX:2:1101:13942:2686/1

For single-end read fastq, you should only get one read ID and no /2 tag (if I'm correct).

Clearly the read ID in my case has both /1 and /2 tags. What makes me confused is that there is no duplicated ID but contains pair-end tags in the same fastq. I'm not sure whether this data is a concatenated fastq or interleaved fastq. Someone previously use tree command to seperate interleaved fastq to two fq. I tried either, it is very time-consuming so I didn't finish it.

My question is : How to deal with this kind of data? Can I just treat it as a single-end data? Or these data cannot be used for downsteam analysis?

I used fastp for quality assessment of this data by setting it as a single-read fq. The N reads and N bases are equal as reported in ENA.

Thank you.

BTW, this is not the first time I met this problem. I don't understand why ENA and SRA both allow the submitters to mistakenly upload this kind of data as "pair-end" without simply checking how many files they uploaded. Not to mention that NCBI SRA does not allow concatenated raw fq to be uploaded.

ENA fastq illumina SRA • 1.7k views

ADD COMMENT • link 2.9 years ago by tomas4482 ▴ 420

1

Entering edit mode

Last day I don't have time to test it. I made it today. Unfortunately, this file is proved not to be interleaved.

No output stream specified.  To write to stdout, please specify 'out=stdout.fq' or similar.
Input is being processed as paired
Names do not appear to be correctly paired.
SRR8244780.1 1/2
SRR8244780.2 2/2

I checked the SRA run browser. Apparently, it is labeled as "pair-end" reads but do not have any paired reads. It is more like a single read fq.

>gnl|SRA|SRR8244780.8.1 8 (Technical)
Empty read
>gnl|SRA|SRR8244780.8.2 8 (Biological)
TTGGTAGGATATAGTCACCTTGAGATTCCAGTTGGATTTTTAAAGATTATTTCTTCTTAA
AGGAAATATTCATGTATATGACAATTTTTT

Another example was discussed here: SRR2969254

>gnl|SRA|SRR2969254.52180835.1 52180835 (Technical)
Empty read
>gnl|SRA|SRR2969254.52180835.2 52180835 (Biological)
GGCGGGGGTAACTATGACTCTCTTAAGGTAGCCAAATGCCTCGTCATCTAATTAGTGACG
CGCATGCATGGATTAACGAGATTCCCACTTTCCCTACCTAC

The same error and same pattern.

I also check the correctly labeled pair-end fq reads and single-end fq reads to ensure that I didn't misunderstand or misread anything.

For pair-end reads, it looks like:

>gnl|SRA|SRR8244781.1.1 HWI-ST458R:229:C43DDACXX:5:1101:1421:1920 forward (Biological)
NTTTGTTATTTTCCCTTGGATCCTGTTGAACACATGTGGCATCCTGCCTGAACCCCTCCC
GCAGCTGCCATGCGATCCAGAGCTCCATCC
>gnl|SRA|SRR8244781.1.2 HWI-ST458R:229:C43DDACXX:5:1101:1421:1920 reverse (Biological)
AACTTTCAGAAATGCGGACCACTAGGGGGCAGTAAGTGAACAGCGATTCCACCTCCTGGA
GCCCGGATTGTGGAGCCAGCGTCCGAAAGA

In conclusion, if you have the same issue, first thing you need to do is to check the interleaved status by reformat.sh input.fq vint from BBMap.

If true, you can directly input it to bwa mem using -p option under bwa v0.7.11 or higher.

Otherwise, it should be treated as a single-end read.

In the meanwhile, I've written a message to ENA service and they may answer this question as well. I'll update if there is anything new.

The original read ID is quite different from those correctly labeled pair-end/single-end fq as it only gives a number SRR8244780.1 1 followed by a tag /2, unlike this HWI-ST458R:229:C43DDACXX:5:1101:1421:1920 forward. I don't know and have no time digging up this data entry error. A personal guess is that the mismatched label somehow discard the original read ID but only left a numeric label with it. This kind of mistakes should be avoid. It may cause unnecessary misunderstanding and waste a plenty of time!

Case closed.

ADD REPLY • link 2.9 years ago by tomas4482 ▴ 420

0

Entering edit mode

2.9 years ago

Juke34 8.6k

I guess this is interleaved data.

https://www.ncbi.nlm.nih.gov/sra/docs/submitformats/

Paired-end data submitted in FASTQ format should be submitted in one of two formats: As separate files for forward and reverse reads, in which the reads are in the same order. As interleaved, or "8-line", FASTQ, in which forward and reverse reads alternate in the file and are in order (i.e., read "1F", followed by read "1R", then read "2F", then "2R"). SRA supports the following forward/reverse read indicators: '/1' and '/2' at the end of the read name or newer Illumina style '1:Y:18:ATCACG' and '2:Y:18:ATCACG'.
Concatenated FASTQ (in which all forward reads are followed by all reverse reads) is not supported.Concatenated

ADD COMMENT • link 2.9 years ago by Juke34 8.6k

1

Entering edit mode

I missed that part about SRA support read indicators. So it's very likely the interleaved one.

For those who may have the same question or new guys studying WES and WGS in the future, I could share some options to test this and move forward.

I checked the latest release log of bwa and recent answers in biostar. One way to test this is to use reformat.sh from BBmap

Before BWA 0.7.11, you need to split the interleaved fq. Afterward, once -p option is set, you could input interleaved fq to bwa mem.

A description about -p option

Smarter option -p. Given an interleaved FASTQ stream, old bwa-mem identifies the 2i-th and (2i+1)-th reads as a read pair. The new verion identifies adjacent reads with the same read name as a read pair. It is possible to mix single-end and paired-end reads in one FASTQ.

ADD REPLY • link 2.9 years ago by tomas4482 ▴ 420

score 2 · Accepted Answer · 2021-08-27

I come here for update.

Abovementioned "conclusion" is WRONG.

Such kind of files are truly pair-ended. But for some reasons, they lose original pairing information. When ENA or users use sratoolkits to perform a pairing detection e.g -split-3 or -split-file, it will always fail to create two correctly paired _1.fq and _2.fq.

For detailed explanation about how ENA uploading the fastq could check here.

In short, for pair-end experiment with one fq, it means all reads are unpaired.

According to a kindly staff from ENA, Eugene, he/she suggested that this file is more likely to be invalid for one of the following reasons:

1) The original FASTQ files submitted to NCBI have single-end reads but marked as paired. The files are only available on GS/S3 cloud so I can't check them.

2) The original FASTQ files were submitted to NCBI without following the submission guidelines for paired files https://www.ncbi.nlm.nih.gov/sra/docs/submitformats/#pairedend-fastq (e.g. the FASTQs were unsorted).

3) An error occurred during FASTQ to SRA conversion on NCBI side.

4) SRA toolkit does SRA to FASTQ conversion incorrectly.

5) We (ENA) use SRA toolkit with incorrect parameters.

It's unlikely that these are options (3), (4) or (5), since most SRA files in PRJNA505380 were correctly converted to FASTQ so he/she suspect edit's (2).

Because both I and Eugene can see only forward (/1) or reverse (/2) part in that file. So it seems that pairing information was lost and the forward/reverse reads can't be matched.

The ONLY solution for this situation should be

Get the valid data is to request Cloud Data Delivery of the original submitted FASTQs from NCBI.

I've downloaded the original fq from cloud to my bucket. But It may take several weeks to validate. I'll update later.