Question

How to extract an equal number of R1 and R2 reads from raw Illumina data

0

6.8 years ago

Bioinfonext ▴ 460

Dear All,

Somehow, we got different numbers of R1 and R2 reads in our Illumina paired-end sequencing data. Is there any way to extract an equal number of R1 and R2 reads from these raw files? These are just paired-end libraries, not strand-specific.

[root@psgl data_new]# grep -c '^@'  SS_5W_R1.fastq

26623063

[root@psgl data_new]# grep -c '^@' SS_5W_R2.fastq

25803102

[root@psgl data_new]# grep -c '^@' SS_7W_R1.fastq

42474961

[root@psgl data_new]# grep -c '^@' SS_7W_R2.fastq

41089376

Thanks

RNA-Seq • 2.2k views

ADD COMMENT • link updated 6.8 years ago by st.ph.n ★ 2.7k • written 6.8 years ago by Bioinfonext ▴ 460

1

If you want to use this method, always include a few of the characters that follow the @ sign in line 1 (generally the machine serial number), e.g. grep -c "^@M1023" file_name.
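A minimal sketch of that suggestion, assuming the first line of the file is a valid header and that every read comes from the same instrument. The demo file, its `M1023` headers, and the six-character prefix length are illustrative only and are not taken from the poster's data.

```shell
# Hypothetical demo file: two reads whose headers start with an instrument ID.
# The second quality string deliberately starts with "@".
tmpdir=$(mktemp -d)
demo="$tmpdir/demo.fastq"
printf '%s\n' \
  '@M1023:1:1101' 'ACGT' '+' 'IIII' \
  '@M1023:1:1102' 'TTGA' '+' '@III' > "$demo"

# Use "@" plus the next five characters of the first header as the pattern.
prefix=$(head -n 1 "$demo" | cut -c1-6)

# Count only the lines that start with that prefix.
read_count=$(grep -c "^${prefix}" "$demo")

echo "prefix: $prefix"
echo "reads:  $read_count"
```

On this demo file the quality line `@III` is not counted, because it does not start with the header prefix.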

ADD REPLY • link 6.8 years ago by GenoMax 141k

0

6.8 years ago

st.ph.n ★ 2.7k

echo $(( $(wc -l < SS_7W_R1.fastq) / 4 ))

ADD COMMENT • link 6.8 years ago by st.ph.n ★ 2.7k

score 4 · Accepted Answer · 2017-06-25

4

6.8 years ago

Brian Bushnell 20k

Quality score strings can contain or start with "@", so counting lines that begin with "@" is not a reliable method. Please use "wc" instead, or use an actual bioinformatics tool to count the reads and test formatting.
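To illustrate the point, here is a small sketch on a made-up two-read FASTQ file in which one quality string starts with "@". The file name and contents are hypothetical, and the line-count approach assumes standard four-line records with no wrapped sequence lines.

```shell
# Build a tiny FASTQ file with two reads; the second quality string starts with "@".
tmpdir=$(mktemp -d)
demo="$tmpdir/demo.fastq"
printf '%s\n' \
  '@read1' 'ACGT' '+' 'IIII' \
  '@read2' 'TTGA' '+' '@III' > "$demo"

# Counting lines that start with "@" also matches the quality line.
grep_count=$(grep -c '^@' "$demo")

# Counting all lines and dividing by 4 does not depend on the record contents.
wc_count=$(( $(wc -l < "$demo") / 4 ))

echo "grep -c '^@': $grep_count"
echo "wc -l / 4:    $wc_count"
```

Here grep reports 3 while the line count gives the true value of 2, which is the kind of overcount that can make R1 and R2 appear to differ.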

ADD COMMENT • link 6.8 years ago by Brian Bushnell 20k

0

Thanks a lot.

With this command I get the correct number:

awk '{s++} END {print s/4}' fastq_file_name
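The same command can be wrapped in a small function to compare the two files of a pair. This is only a sketch: `count_reads` and the demo file names are hypothetical, and it assumes unwrapped four-line FASTQ records.

```shell
# Count FASTQ records as total lines / 4.
# A fractional result would suggest a truncated or malformed file.
count_reads() {
  awk '{s++} END {print s/4}' "$1"
}

# Hypothetical demo pair: R1 has two reads, R2 has one.
tmpdir=$(mktemp -d)
printf '%s\n' '@r1/1' 'ACGT' '+' 'IIII' '@r2/1' 'TTGA' '+' '@III' > "$tmpdir/demo_R1.fastq"
printf '%s\n' '@r1/2' 'CCGT' '+' 'IIII' > "$tmpdir/demo_R2.fastq"

n1=$(count_reads "$tmpdir/demo_R1.fastq")
n2=$(count_reads "$tmpdir/demo_R2.fastq")

# Compare as strings so a fractional count does not break the test.
if [ "$n1" = "$n2" ]; then
  status="counts match"
else
  status="counts differ"
fi
echo "R1=$n1 R2=$n2: $status"
```

If the counts still differ after counting this way, the files really are unpaired and need a dedicated re-pairing tool rather than a different counting command.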

Thanks again!

ADD REPLY • link 6.8 years ago by Bioinfonext ▴ 460