Interpretation of SRA files
4.0 years ago

I just started using the SRA toolkit but I am quite puzzled about what I have. I downloaded an SRA file with fasterq-dump. The data is indicated to be paired-end and I do get two fastq files. However, the headers of each read are not in the standard format but rather in SRA format, I suppose.

The first 2 headers of the first fastq file are shown here

@ERX2240357.19 SBS123:200:C3PFWACXX:6:1101:2133:1988 length=101

@ERX2240357.20 SBS123:200:C3PFWACXX:6:1101:2347:1955 length=101

And the first 2 headers of the second fastq file are shown here

@ERX2240357.1 SBS123:200:C3PFWACXX:6:1101:1462:1956 length=101

@ERX2240357.2 SBS123:200:C3PFWACXX:6:1101:1487:1970 length=101

Normally I can see the pairs by looking at /1 and /2 in the header, but now this is missing. How can tools like bwa mem recognize that two reads form a pair based on the headers in the above SRA format?

I looked up the documentation but it doesn't get more clear for me...

sra fastq headers paired-end reads • 2.2k views

EBI-ENA has the fastq files clearly marked as R1 and R2. You can get the data from there.

  1. This is the header format written by newer versions of the Illumina software.

As far as I understand, the name line in recent Illumina fastq output has two parts separated by a whitespace character: the read name, followed by the read number and index information. The name part is identical for read1 and read2, and only that first part is used to refer to the read in downstream analysis.

The following is what I get from a HiSeq X Ten platform.

$ zcat demo_1.fq.gz | head -n 1
@ST-E00144:1057:H5L7WCCX2:8:1101:5690:1467 1:N:0:NTAGGCAT
$ zcat demo_2.fq.gz | head -n 1
@ST-E00144:1057:H5L7WCCX2:8:1101:5690:1467 2:N:0:NTAGGCAT
  2. In my experience, these paired-end reads are compatible with aligners such as bwa and bowtie.
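For reference, here is a rough sketch of how the two parts of such a header can be pulled apart with awk. It assumes plain 4-line fastq records and the usual Illumina comment layout (read number, filter flag, control number, index); the function name illumina_mates is made up for illustration.

```shell
# Print "read name <TAB> read number" for every record of a fastq file.
# Assumes 4-line records and an Illumina-style comment such as
# "1:N:0:NTAGGCAT" (read:filtered:control:index) after the first space.
illumina_mates() {
    awk 'NR % 4 == 1 {
        name = $1
        sub(/^@/, "", name)          # drop the leading "@"
        split($2, comment, ":")      # comment[1] is the read number
        print name "\t" comment[1]
    }' "$1"
}
```

Running it on uncompressed copies of demo_1.fq.gz and demo_2.fq.gz should give the same names in both files, with 1 or 2 in the second column.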

Lastly, your "first 2 headers" do not look like the real output. The read names in read1 and read2 must be in the same order; otherwise the aligner will report an error and stop.
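A quick way to verify this before aligning is to compare the names record by record. The following is only a sketch, assuming uncompressed 4-line fastq records; check_pair_order is a hypothetical helper, not part of any toolkit. It compares the first whitespace-delimited token of each header, ignoring a trailing /1 or /2.

```shell
# Return 0 if both fastq files list the same read names in the same order,
# otherwise print the first problem found and return 1.
# $1 = read1 fastq, $2 = read2 fastq (plain text, 4 lines per record).
check_pair_order() {
    awk -v mate="$2" '
        # Reduce a header to the bare name: first token, no "@", no /1 or /2.
        function read_name(line,    parts) {
            split(line, parts, /[ \t]+/)
            sub(/^@/, "", parts[1])
            sub(/\/[12]$/, "", parts[1])
            return parts[1]
        }
        {
            # Read the corresponding line of the second file.
            if ((getline other < mate) <= 0) {
                print "second file ends early at line " NR
                bad = 1
                exit
            }
            if (NR % 4 == 1 && read_name($0) != read_name(other)) {
                printf "name mismatch at line %d: %s vs %s\n", NR, read_name($0), read_name(other)
                bad = 1
                exit
            }
        }
        END {
            # Any line left over in the second file means unequal lengths.
            if (!bad && (getline other < mate) > 0) {
                print "second file has extra lines"
                bad = 1
            }
            exit bad
        }
    ' "$1"
}
```

Usage would be something like `check_pair_order read1.fq read2.fq && echo "pairs are in sync"`.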

4.0 years ago
wm ▴ 550

The reads are compatible with aligners such as BWA and bowtie2 (I tested this), even when the /1 and /2 suffixes are missing.

Be careful, though: make sure the first part of the name is identical in read1 and read2, and that the reads are in the same order in both files.

What you pasted in the post does not look like the first two headers of each file; the order is not correct.

I checked the first two read names in read1 and read2 for the fastq files from both NCBI-SRA and EBI-ENA.

EBI-ENA version

As @genomax pointed out, the EBI-ENA files (ERR2184190_1.fastq.gz and ERR2184190_2.fastq.gz) carry the /1 and /2 suffixes at the end of each read name.

==> read1.fq <==
@ERR2184190.1 SBS123:200:C3PFWACXX:6:1101:1462:1956/1
@ERR2184190.2 SBS123:200:C3PFWACXX:6:1101:1487:1970/1

==> read2.fq <==
@ERR2184190.1 SBS123:200:C3PFWACXX:6:1101:1462:1956/2
@ERR2184190.2 SBS123:200:C3PFWACXX:6:1101:1487:1970/2

SRA-toolkit download version

The /1 and /2 suffixes are not present.

$ prefetch ERR2184190
$ fasterq-dump --threads 8 --split-3 ERR2184190.sra

==> read1.fq <==
@ERR2184190.sra.1 SBS123:200:C3PFWACXX:6:1101:1462:1956
@ERR2184190.sra.2 SBS123:200:C3PFWACXX:6:1101:1487:1970

==> read2.fq <==
@ERR2184190.sra.1 SBS123:200:C3PFWACXX:6:1101:1462:1956
@ERR2184190.sra.2 SBS123:200:C3PFWACXX:6:1101:1487:1970
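If a downstream tool does insist on /1 and /2, the suffixes can be added back afterwards. This is a minimal sketch, assuming uncompressed 4-line fastq records; add_mate_suffix is a hypothetical helper. Note that it puts the suffix on the first token (the part most tools treat as the read name) rather than at the end of the line as in the ENA files.

```shell
# Append /1 or /2 to the first token of every fastq header line.
# $1 = mate number (1 or 2), $2 = input fastq; the result goes to stdout.
add_mate_suffix() {
    awk -v mate="$1" '
        NR % 4 == 1 {
            pos = match($0, /[ \t]/)             # first space or tab, if any
            name = (pos == 0) ? $0 : substr($0, 1, pos - 1)
            rest = (pos == 0) ? "" : substr($0, pos)
            if (name !~ /\/[12]$/)               # leave existing suffixes alone
                name = name "/" mate
            print name rest
            next
        }
        { print }                                # sequence, "+", quality lines
    ' "$2"
}
```

For example, `add_mate_suffix 1 read1.fq > read1.suffixed.fq` and `add_mate_suffix 2 read2.fq > read2.suffixed.fq`.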

Read name in alignment file

You may notice that each read name is split into two parts by a whitespace character. In general, only the first part of the read name (e.g. ERR2184190.sra.1) is saved in the alignment (BAM) file.

Here is an example for your reads. The first line is read1 and the second line is read2. The read names are identical (column 1), and the two mates are distinguished by the FLAG field (column 2).

# subset 100 reads from the file
# -f 2 keeps only properly paired alignments; the output is coordinate-sorted
$ bwa mem Oaureus.fa read1.fq read2.fq | samtools view -Sub -f 2 - | samtools sort -o aln.bam -
$ samtools view aln.bam | head -n 2

ERR2184190.sra.37   99      VASH01007726.1  2202973 60      101M    =       2203373 501 ...
ERR2184190.sra.37   147     VASH01007726.1  2203373 60      101M    =       2202973 -501 ...
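
To see why 99 and 147 mark the two mates of one pair, the FLAG value can be broken into its bits. Below is a small sketch in plain shell; decode_flag and the labels it prints are my own naming, and only the pairing-related bits from the SAM specification are covered. If samtools is installed, `samtools flags 99` should give the equivalent information.

```shell
# Print the pairing-related bits that are set in a SAM FLAG value.
decode_flag() {
    flag=$1
    out=""
    for entry in 1:paired 2:proper_pair 4:unmapped 8:mate_unmapped \
                 16:reverse 32:mate_reverse 64:read1 128:read2; do
        bit=${entry%%:*}       # numeric bit value
        label=${entry#*:}      # descriptive label
        if [ $(( flag & bit )) -ne 0 ]; then
            out="${out:+$out,}$label"
        fi
    done
    printf '%s\n' "$out"
}

decode_flag 99     # paired,proper_pair,mate_reverse,read1
decode_flag 147    # paired,proper_pair,reverse,read2
```

So the aligner has recorded which line is the first mate and which is the second, even though both lines carry the same name.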

Thanks, this cleared it up for me!


"In general, only the first part of the read name (e.g. ERR2184190.sra.1) is saved in the alignment (BAM) file."

This is aligner-dependent. It is correct that some aligners truncate the names after the first space, but not all of them do (bbmap, for example, does not).
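
If you want the same short names regardless of the aligner, one option is to truncate the headers yourself before mapping. A minimal sketch, assuming uncompressed 4-line fastq records; trim_read_names is a hypothetical helper, and everything after the first whitespace in the header is discarded.

```shell
# Keep only the first whitespace-delimited token of each fastq header.
# $1 = input fastq; the result goes to stdout.
trim_read_names() {
    awk 'NR % 4 == 1 { print $1; next } { print }' "$1"
}
```

For example, `trim_read_names read1.fq > read1.trimmed.fq`, and the same for read2.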


Thanks so much, I got it.
