In my project, I have to convert several paired-end SRA files to FASTQ files. I read a previous post about how to use fastq-dump to do so, but I am still confused about the split step.
For example, after I ran
fastq-dump ERR011087.sra, I got ERR011087.fastq, which contains paired-end reads of length 88. The first read looks like
@ERR011087.1 I330_1_FC30JM6AAXX:4:1:0:199 length=88
TTCANATATGGAAAAACAGGGAGCGGAAATCACGTTACTTGCGTATCATCGGAAAAGGCAGGCTGTCCATGCTCCAACCGGTTAATGA
+ERR011087.1 I330_1_FC30JM6AAXX:4:1:0:199 length=88
IIII"9I;III<*+<-45CI13;-=93+046/0<1:-06>4.2+4:I86III0.863;GA@7I:5./2$62110='0(2(0$+++&+(
After I ran
fastq-dump --split-files ERR011087.sra, I did get 2 fastq files, ERR011087_1.fastq and ERR011087_2.fastq. The first read of ERR011087_1.fastq is
@ERR011087.1 I330_1_FC30JM6AAXX:4:1:0:199 length=44
TTCANATATGGAAAAACAGGGAGCGGAAATCACGTTACTTGCGT
+ERR011087.1 I330_1_FC30JM6AAXX:4:1:0:199 length=44
IIII"9I;III<*+<-45CI13;-=93+046/0<1:-06>4.2+
The first read of ERR011087_2.fastq is
@ERR011087.1 I330_1_FC30JM6AAXX:4:1:0:199 length=44
ATCATCGGAAAAGGCAGGCTGTCCATGCTCCAACCGGTTAATGA
+ERR011087.1 I330_1_FC30JM6AAXX:4:1:0:199 length=44
4:I86III0.863;GA@7I:5./2$62110='0(2(0$+++&+(
It seems that
fastq-dump --split-files just splits each read of length 88 in ERR011087.sra into 2 reads of length 44. Is splitting a read into its first half and its last half equivalent to splitting a paired-end read into its two fragments?
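The observation above can be checked directly on the first record. This is only a sketch: the three sequences are copied from the output shown above, and halves_match is a hypothetical helper name, not part of fastq-dump or the SRA Toolkit. It tests a single record and says nothing about the rest of the file.

```shell
# Sequences of the first record, copied from the fastq-dump output above.
full="TTCANATATGGAAAAACAGGGAGCGGAAATCACGTTACTTGCGTATCATCGGAAAAGGCAGGCTGTCCATGCTCCAACCGGTTAATGA"
read1="TTCANATATGGAAAAACAGGGAGCGGAAATCACGTTACTTGCGT"
read2="ATCATCGGAAAAGGCAGGCTGTCCATGCTCCAACCGGTTAATGA"

# Hypothetical helper: succeeds if the two split reads ($2 and $3),
# joined in that order, reproduce the unsplit read ($1).
halves_match() {
    [ "$2$3" = "$1" ]
}

if halves_match "$full" "$read1" "$read2"; then
    echo "read1 + read2 reproduces the unsplit read (${#read1} + ${#read2} = ${#full} bases)"
else
    echo "the split reads do not concatenate to the unsplit read"
fi
```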
If so, it is very strange that the number of reads in ERR011087_1.fastq and ERR011087_2.fastq is different. I ran
grep "@ERR" ERR011087.fastq |wc -l and got
grep "@ERR" ERR011087_1.fastq |wc -l and got
grep "@ERR" ERR011087_2.fastq |wc -l and got
11640358. I think these numbers represent the number of reads in each file. However, the three numbers are NOT the same. I am very confused, because if
fastq-dump --split-files just splits each read of length 88 in ERR011087.sra into 2 reads of length 44, then the number of reads in ERR011087_1.fastq and ERR011087_2.fastq should be equal. Something must be wrong.
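To rule out a counting artifact, the reads can also be counted from the line count instead of with grep. '@' is a valid quality character (it appears in the quality strings above), so grep "@ERR" could in principle match a quality line as well as a header. A minimal sketch, assuming every record occupies exactly four lines as in the output above; count_fastq_reads is a hypothetical helper name:

```shell
# Hypothetical helper: count FASTQ records in the file given as $1.
# Assumes each record is exactly four lines (header, sequence, '+' line,
# quality), so the record count is the total line count divided by 4.
count_fastq_reads() {
    awk 'END { print int(NR / 4) }' "$1"
}

# Report the count for each file that exists in the current directory.
for f in ERR011087.fastq ERR011087_1.fastq ERR011087_2.fastq; do
    if [ -f "$f" ]; then
        echo "$f: $(count_fastq_reads "$f")"
    fi
done
```

If these counts agree with the grep counts, the difference between the files is real and not a side effect of how the reads were counted.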
Could anyone explain that?