How to split fastq file contains two different header information
2
0
Entering edit mode
5.9 years ago
goh ▴ 10

Dear all,

I have a 16s NGS fastq file which comprise of two separate NGS run with the heading information as below:-

Run 1 heading : @HISEQ:791:HCGHLBCX2:1:1101:5407:3025 1:N:0:AGTGGTCA

Run 2 heading: @HISEQ:781:HCMFJBCX2:1:2101:15224:51333 1:N:0:ATTGAGGA

I would like to split the sequences based on the heading information into two separate files.

I have tried looking in the forum for any QIIME and sed script but couldn't manage to done it.

Any idea how to do it ?

Thanks.

sequencing next-gen split file • 3.6k views
ADD COMMENT
0
Entering edit mode

You can grep them out..some thing like this:

grep -A 3 -w "HISEQ:791:HCGHLBCX2" test.fastq > test.out.fastq

But these are read headers. Flow cells and run IDs seem to be different (from OP). You can grep based on them.

ADD REPLY
0
Entering edit mode

Thanks for the reply. I have tried it. However, it just give me a list without sequences or quality. I think i did not mention my question very clear initially, i will try to clarify below. Thanks.

I have a fastq file which contains a few hundreds of sequences like below. However, there are two separate runs (781 & 791) in one fastq file. I would like to separate them including the sequences and quality into separate fastq files.

@HISEQ:781:HCMFJBCX2:1:1101:16664:3823 1:N:0:ATTGAGGA
TGGGGAATTTTGGACAATGGGCGAAAGCCTGATCCAGCAATGCCGCGTGTGTGAAGAAGGCCTTCGGGTTGTAAAGCACTTTTGTCCGGAAAGAAATCCTTGGCTCTAATACAGTCGGGGGATGACGGTACCGGAAGAATAAGCACCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGTGCGAGCGTTAATCGGAATTACTGGGCGTAAAGCGTGCGCAG
+
HEHIIIIIEHIHGHHIIIIIIIHIH<EHIIIIIIIHIIIIIIIHIIGHIIIIHIIIGIIIIFHIIIGIGHHIIIIIII0FHIIIHIEGHHDHHIGGHEHHHHHHIIIIGHIIIIHHHDHDH?GGEHEHIHDHIHHIHCEGEEHHIIIIH<HDFH@G?GHGIHIHCH@.;??DDH<EHHHCBEHHEHHHHA?CDHHHHIFEHHHHIFHHCF?H<HC@-F-7-B@####
@HISEQ:791:HCGHLBCX2:1:2216:12138:99630 1:N:0:AGTGGTCA
TGGGGAATTTTGGACAATGGGCGAAAGCCTGATCCAGCAATGCCGCGTGTGTGAAGAAGGCCTTCGGGTTGTAAAGCACTTTTGTCCGGAAAGAAATCCTTGGCTCTAATACAGTCGGGGGATGACGGTACCGGAAGAATAAGCACCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGTGCGAGCGTTAATCGGAATTACTGGGCGTAAAGCGTGCGCAG
+
EH@GHHHIEHHHHHIIIHHHEEDECHHGHHIIIIHHHHCHIIIEEHHCCCHHIII?@FHHIIFEHHGGGHDHIEHHHH11CGHHH?CEFDHHHHC@HHH@GHCHHE@HEFHIFHHII<ECHDHHHIEHHH@?GH-??.FCHIHEA.;:E<-:...B.7@..;-@.-85-E-7?,7BG-6-8B-5@6--6-5,,6@8EH-@E-5-B@@-86-FD<5@@@GD-@#####
ADD REPLY
0
Entering edit mode

@OP: I guess the sequences are copy/pasted from webpage or word or some other formatted pages. Sequence qualities are affected by this. Please post as they are. Otherwise, tools fail to parse fastq proper.

input (copy/pasted from above and formatted):

cat test.fq
@HISEQ:781:HCMFJBCX2:1:1101:16664:3823 1:N:0:ATTGAGGA 
TGGGGAATTTTGGACAATGGGCGAAAGCCTGATCCAGCAATGCCGCGTGTGTGAAGAAGGCCTTCGGGTTGTAAAGCACTTTTGTCCGGAAAGAAATCCTTGGCTCTAATACAGTCGGGGGATGACGGTACCGGAAGAATAAGCACCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGTGCGAGCGTTAATCGGAATTACTGGGCGTAAAGCGTGCGCAG 
+ 
HEHIIIIIEHIHGHHIIIIIIIHIH<ehiiiiiiihiiiiiiihiighiiiihiiigiiiifhiiigighhiiiiiii0fhiiihieghhdhhigghehhhhhhiiiighiiiihhhdhdh?ggehehihdhihhihcegeehhiiiih<hdfh@g?ghgihihch@.;??ddh<ehhhcbehhehhhha?cdhhhhifehhhhifhhcf?h<hc@-f-7-b@#### 
@hiseq:791:hcghlbcx2:1:2216:12138:99630="" 1:n:0:agtggtca
tggggaattttggacaatgggcgaaagcctgatccagcaatgccgcgtgtgtgaagaaggccttcgggttgtaaagcacttttgtccggaaagaaatccttggctctaatacagtcgggggatgacggtaccggaagaataagcaccggctaactacgtgccagcagccgcggtaatacgtagggtgcgagcgttaatcggaattactgggcgtaaagcgtgcgcag 
+
eh@ghhhiehhhhhiiihhheedechhghhiiiihhhhchiiieehhccchhiii?@fhhiifehhggghdhiehhhh11cghhh?cefdhhhhc@hhh@ghchhe@hefhifhhii<echdhhhiehhh@?gh-??.fchihea.;:e

output:

 grep -A 3 -i -w  'HISEQ:781' test.fq
@HISEQ:781:HCMFJBCX2:1:1101:16664:3823 1:N:0:ATTGAGGA 
TGGGGAATTTTGGACAATGGGCGAAAGCCTGATCCAGCAATGCCGCGTGTGTGAAGAAGGCCTTCGGGTTGTAAAGCACTTTTGTCCGGAAAGAAATCCTTGGCTCTAATACAGTCGGGGGATGACGGTACCGGAAGAATAAGCACCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGTGCGAGCGTTAATCGGAATTACTGGGCGTAAAGCGTGCGCAG

+ 
HEHIIIIIEHIHGHHIIIIIIIHIH<ehiiiiiiihiiiiiiihiighiiiihiiigiiiifhiiigighhiiiiiii0fhiiihieghhdhhigghehhhhhhiiiighiiiihhhdhdh?ggehehihdhihhihcegeehhiiiih<hdfh@g?ghgihihch@.;??ddh<ehhhcbehhehhhha?cdhhhhifehhhhifhhcf?h<hc@-f-7-b@####
  
ADD REPLY
1
Entering edit mode

with seqkit:

$ seqkit grep -irp "HISEQ:781" test.fq
ADD REPLY
0
Entering edit mode

for ubuntu/parallel users, try (test.fastq - input fastq):

$ parallel --dry-run 'seqkit grep -irp "{1}" {3} -o {1}.fastq.gz; seqkit grep -irp "{2}" {3} -o {2}.fastq.gz' ::: K00193:38 ::: K00193:40 ::: test.fastq

seqkit grep -irp "K00193:38" test.fastq -o K00193:38.fastq.gz; seqkit grep -irp "K00193:40" test.fastq -o K00193:40.fastq.gz
ADD REPLY
2
Entering edit mode
5.9 years ago
Chirag Parsania ★ 2.0k

Here is the R script you can use that.

library(ShortRead)
library(data.table)

## load fastq combined of 791 and 781. I saved your data in the biostar_query.fastq
dd <- readFastq("~/Downloads/biostar_query.fastq")

### get fastq ids
ids <- id(dd)

## get matching index of 791 and 781 
index_791 <- grep("hiseq:791:" , ids)
index_781 <- grep("HISEQ:781:" , ids )

## filter reads by id 
reads_791 <- dd[index_791]
reads_781 <- dd[index_781]

## write files 
writeFastq(object = reads_791 , file = "791.fastq")
writeFastq(object = reads_781 , file = "781.fastq")
ADD COMMENT
0
Entering edit mode

Thanks for the help. However, i'm not a bioinfo based person. This is too advance.

ADD REPLY
1
Entering edit mode
5.9 years ago

One liner using seqkit

seqkit grep -inr -p "HISEQ:791" your.fq  > 791.fq && seqkit grep -inr -p "HISEQ:781" your.fq > 781.fq
ADD COMMENT
0
Entering edit mode

I have tried your approach on cmd administrator but unsuccessful. Below i attach the script and error msg.

seqkit grep -inr -p "HISEQ:791" C:\Users\BIOTECH\Desktop\Shared_Folder\CR052T_S1_L001_R1_001.fastq > C:\Users\BIOTECH\Desktop\Shared_Folder\CR052T_S1_L001_R1_001_791.fastq && seqkit -inr -p "HISEQ:781" C:\Users\BIOTECH\Desktop\Shared_Folder\CR052T_S1_L001_R1_001.fastq > C:\Users\BIOTECH\Desktop\Shared_Folder\CR052T_S1_L001_R1_001_781.fastq

FYI, i installed seqkit as instructed by the website and after unzip i copy the seqkit into windows/system32. When i type seqkit in the command line, a list of functions of seqkit do come out.

Is it the script i type is wrong ? Thanks.

ADD REPLY
0
Entering edit mode

try '-ir -p' instead of -inr -p

ADD REPLY
0
Entering edit mode

Still cant. the same error poped up.

I guess it's easier if i share one of my file with you. click here

ADD REPLY
1
Entering edit mode

Seems chaining && works in windows CMD only, not in power shell. In powershell, try running them individually.

First:

seqkit grep -inr -p "HISEQ:791" C:\Users\BIOTECH\Desktop\Shared_Folder\CR052T_S1_L001_R1_001.fastq > C:\Users\BIOTECH\Desktop\Shared_Folder\CR052T_S1_L001_R1_001_791.fastq

Next:

seqkit -inr -p "HISEQ:781" C:\Users\BIOTECH\Desktop\Shared_Folder\CR052T_S1_L001_R1_001.fastq > C:\Users\BIOTECH\Desktop\Shared_Folder\CR052T_S1_L001_R1_001_781.fastq

instead of

seqkit grep -inr -p "HISEQ:791" C:\Users\BIOTECH\Desktop\Shared_Folder\CR052T_S1_L001_R1_001.fastq > C:\Users\BIOTECH\Desktop\Shared_Folder\CR052T_S1_L001_R1_001_791.fastq && seqkit -inr -p "HISEQ:781" C:\Users\BIOTECH\Desktop\Shared_Folder\CR052T_S1_L001_R1_001.fastq > C:\Users\BIOTECH\Desktop\Shared_Folder\CR052T_S1_L001_R1_001_781.fastq

If you want to run above line, try using command line (from run dialog, type "cmd" and press enter). For me following worked (on windows 10 command line, not powershell):

seqkit grep -inr -p "HISEQ:791" C:\Users\genomics\Desktop\CR052T_S1_L001_R1_001.fastq > C:\Users\genomics\Desktop\791.fastq  && seqkit grep -inr -p "HISEQ:781" C:\Users\genomics\Desktop\CR052T_S1_L001_R1_001.fastq > C:\Users\genomics\Desktop\781.fastq
ADD REPLY
0
Entering edit mode

it works nicely. Thank you.

ADD REPLY

Login before adding your answer.

Traffic: 1694 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6