Fastq File Split Using Corresponding Id Identifier In Tag File, Not Barcode Sequence
10.7 years ago
Tonyzeng ▴ 310

Hi, I have a question I need help with.

Below are two FASTQ files.

File 1 contains all the forward reads (more than 400,000,000) produced on an Illumina NGS platform.

File 2 contains all the reads (more than 4,800,000) from one specific barcode (TGACCTTG).

File 1 does not carry the barcode in its read ID (the ID just ends in #0/1), whereas in File 2 the barcode is at the start of the sequence line. So File 1 cannot be split by barcode directly, but it can be split using File 2, because matching reads in the two files share the same ID apart from the /1 or /2 suffix.

Does anybody have a script or tool to split File 1 using the corresponding IDs in File 2? I do not have a strong bioinformatics background.

File1

@IPAR1:2:1:4029:1196:1#0/1 
ATTTTGCCACATACAAAAGAATCTACGTTCTTCTCAGCACCTCATGGAATCTTCTCTAAAATATATCATATAATAGGACACAAAAGAA 
+ 
BHGHHHHHHHHGDDFHHHGGDGHFHFHHHHGD>GEEG>GFHHHHFHBBHFHHHHEHHHHHHBAFHHBBEHHHFEHGBECEHFHHFAHF

File2

@IPAR1:2:1:4029:1196:1#0/2   
TGACCTTGATCTCGT 
+ 
HIHIIGIIIH8CCDC
10.7 years ago
Tonyzeng ▴ 310

Sorry, I am afraid I did not explain my question clearly.

I need to pull from File 1 (more than 400,000,000 reads) the reads that correspond to the reads in File 2 (more than 4,800,000).

The matching IDs in File 1 and File 2 should make it possible to pull those reads out of File 1. Does anyone have experience with this, or a script I could use?

File1 @IPAR1:2:1:4029:1196:1#0/1

File2 @IPAR1:2:1:4029:1196:1#0/2


Do not add an answer to your own question!

That just makes it look like it has been answered. Edit your question to add the new information, then delete this answer!

10.7 years ago

A fast way:

1 - Fetch the headers from file2, which contains only reads with your barcode(s) of interest, and replace the "#0/2" at the end with "#0/1" so they match the headers in file1:

awk 'NR%4==1 {sub(/#0\/2/, "#0/1"); print}' file2.fastq > file2.IDs

2 - Use one of the very fast programs described in the thread "How to efficiently parse a huge fastq file?" to subset file1 by those IDs, or simply use my program, adapted from the Python version in that thread: https://code.google.com/p/bioman/source/browse/fastqID.py , which you can use like this:

cat file1.fastq | python fastqID.py file2.IDs > file1.sub.fastq
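
For anyone who cannot get hold of fastqID.py, here is a minimal Python sketch of the same two steps. It is not the script linked above; the function names read_ids and filter_fastq are made up for illustration. It assumes both files are plain, uncompressed FASTQ with exactly four lines per record, and that mates differ only by the trailing /1 or /2 in the header.

```python
def read_ids(fastq2_lines):
    """Collect headers from file2, rewriting a trailing /2 to /1.

    Assumes strict 4-line FASTQ records (hypothetical helper,
    not part of fastqID.py).
    """
    ids = set()
    for i, line in enumerate(fastq2_lines):
        if i % 4 == 0:  # header line of each record
            header = line.strip()  # headers may carry trailing spaces
            if header.endswith("/2"):
                header = header[:-2] + "/1"
            ids.add(header)
    return ids


def filter_fastq(fastq1_lines, ids):
    """Yield the lines of every file1 record whose header is in ids."""
    record = []
    for line in fastq1_lines:
        record.append(line)
        if len(record) == 4:
            if record[0].strip() in ids:
                yield from record
            record = []


def subset_files(file2_path, file1_path, out_path):
    """Write the reads of file1 that have a mate in file2 to out_path."""
    with open(file2_path) as f2:
        ids = read_ids(f2)
    with open(file1_path) as f1, open(out_path, "w") as out:
        out.writelines(filter_fastq(f1, ids))
```

Calling subset_files("file2.fastq", "file1.fastq", "file1.sub.fastq") would do both steps in one go. The IDs from file2 are held in memory as a set, while file1 is streamed one record at a time, so the large file is never loaded whole.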

Thank you so much, Manu Prestat, my problem has been solved using your script.
