Question

Fastq File Split Using Corresponding Id Identifier In Tag File, Not Barcode Sequence

10.7 years ago

Tonyzeng ▴ 310

Hi, I have a question that I need help solving.

Below are two FASTQ files.

File 1 contains all the forward reads (more than 400,000,000) produced on an Illumina NGS platform.

File 2 contains all the reads (more than 4,800,000) from one specific barcode (TGACCTTG).

File 1 does not carry the barcode sequence in its read IDs (they end in #0/1), whereas File 2 has the barcode in the sequence itself (the second line of each record). So File 1 cannot be split by barcode directly, but it can be split using File 2, because the two files share the same read IDs apart from the /1 or /2 suffix.

Does anybody have a script or tool to split File 1 using the corresponding read IDs in File 2? I do not have a strong bioinformatics background.

File1

@IPAR1:2:1:4029:1196:1#0/1 
ATTTTGCCACATACAAAAGAATCTACGTTCTTCTCAGCACCTCATGGAATCTTCTCTAAAATATATCATATAATAGGACACAAAAGAA 
+ 
BHGHHHHHHHHGDDFHHHGGDGHFHFHHHHGD>GEEG>GFHHHHFHBBHFHHHHEHHHHHHBAFHHBBEHHHFEHGBECEHFHHFAHF

File2

@IPAR1:2:1:4029:1196:1#0/2   
TGACCTTGATCTCGT 
+ 
HIHIIGIIIH8CCDC

fastq • 4.0k views

updated 9 months ago by Ram 43k • written 10.7 years ago by Tonyzeng ▴ 310

score 0 · Answer 1 · 2013-08-26

10.7 years ago

Tonyzeng ▴ 310

Sorry, I am afraid I did not explain my question clearly.

I need to pull the reads from File 1 (more than 400,000,000) that correspond to the reads (more than 4,800,000) in File 2.

The matching read IDs in File 1 and File 2 should make it possible to pull those reads out of File 1. Does anyone have experience with this, and a script I could use?

File1 @IPAR1:2:1:4029:1196:1#0/1

File2 @IPAR1:2:1:4029:1196:1#0/2

10.7 years ago by Tonyzeng ▴ 310

Do not add an answer to your own question!

That just makes it look like it has been answered. Edit your question to add the new information, then delete this answer!

10.7 years ago by Istvan Albert 100k

score 0 · Answer 2 · 2013-08-26

A fast way:

1 - Fetch the headers from file2, which contains only the reads for your barcode(s) of interest, and replace the "#0/2" at the end with "#0/1" so that they match the headers in file1:

awk 'NR%4==1 {sub(/#0\/2/, "#0/1"); print}' file2.fastq > file2.IDs

2 - Use one of the very fast programs described in "How to efficiently parse a huge fastq file?" to subset file1 according to the IDs, or simply use my program, which is adapted from the Python version in that thread (https://code.google.com/p/bioman/source/browse/fastqID.py). You can use it like this:

cat file1.fastq | python fastqID.py file2.IDs > file1.sub.fastq
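
In case the linked script is unavailable, the two steps can also be sketched in one short Python script. This is not fastqID.py itself, only a minimal illustration of the same idea. It assumes both files are plain 4-line-per-record FASTQ, and the function names (normalize_id, read_wanted_ids, filter_fastq) and the script name split_by_ids.py are made up for this example. Instead of rewriting "#0/2" to "#0/1", it drops the /1 or /2 suffix from both sides before comparing.

```python
import sys


def normalize_id(header):
    """Reduce a FASTQ header line to a mate-independent read ID.

    "@IPAR1:2:1:4029:1196:1#0/2   " becomes "IPAR1:2:1:4029:1196:1#0".
    """
    fields = header.split()
    name = fields[0] if fields else ""
    if name.startswith("@"):
        name = name[1:]
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name


def read_wanted_ids(path):
    """Collect the normalized IDs from every header line of a FASTQ file."""
    wanted = set()
    with open(path) as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 == 0:  # header is the first line of each record
                wanted.add(normalize_id(line))
    return wanted


def filter_fastq(records, wanted, out):
    """Copy to `out` every 4-line record whose ID is in `wanted`."""
    kept = 0
    while True:
        header = records.readline()
        if not header:
            break
        rest = [records.readline() for _ in range(3)]
        if normalize_id(header) in wanted:
            out.write(header)
            out.writelines(rest)
            kept += 1
    return kept


def main(argv):
    if len(argv) != 2:
        sys.stderr.write("usage: python split_by_ids.py file2.fastq < file1.fastq\n")
        return
    wanted = read_wanted_ids(argv[1])
    kept = filter_fastq(sys.stdin, wanted, sys.stdout)
    sys.stderr.write("kept %d reads\n" % kept)


if __name__ == "__main__":
    main(sys.argv)
```

It would be run as: python split_by_ids.py file2.fastq < file1.fastq > file1.sub.fastq. All of the File 2 IDs are held in a set in memory while File 1 is streamed record by record, so File 1 itself is never loaded whole.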