I have a .txt file with a list of sequence IDs that looks like this:
A00580:377:HMC2FDSXY:3:2251:27389:24314
A00580:377:HMC2FDSXY:3:1506:13575:27571
A00580:377:HMC2FDSXY:3:1540:25934:5509
A00580:377:HMC2FDSXY:3:1439:18276:25160
A00580:377:HMC2FDSXY:3:1366:3161:27602
A00580:377:HMC2FDSXY:3:1555:21531:3959
A00580:377:HMC2FDSXY:3:2412:24261:33301
A00580:377:HMC2FDSXY:3:2444:9317:12931
A00580:377:HMC2FDSXY:3:2223:28619:24064
A00580:377:HMC2FDSXY:3:1112:23782:17347
A00580:377:HMC2FDSXY:3:1439:17987:33082
A00580:377:HMC2FDSXY:3:1113:22797:26757
I also have multiple .fastq.gz files, each containing sequences like this:
@A00580:377:HMC2FDSXY:3:1101:1154:1016 1:N:0:TCTACCATTT+NACTCTCCCG
CAAGAGGTCTGCGGACGGGTCATTGGCC
+
:FFFFFF:F:FFFF,FF:FFFFFFFFFF
@A00580:377:HMC2FDSXY:3:1101:1280:1016 1:N:0:TCTACCATTT+NACTCTCCCG
GTGCGTGGTAGGTAGCACGTACAGCGTA
+
FFFFFFFFFFFFFFFFFFFFFFFFFFF:
@A00580:377:HMC2FDSXY:3:1101:1298:1016 1:N:0:TCTACCATTT+NACTCTCCCG
GAAACCTCATAATGAGCTTCTTGAAACA
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFF
@A00580:377:HMC2FDSXY:3:1101:1371:1016 1:N:0:TCTACCATTT+NACTCTCCCG
GGAGGATCAGGTCCCATTGTTCAATTTC
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFF
@A00580:377:HMC2FDSXY:3:1101:1479:1016 1:N:0:TCTACCATTT+NACTCTCCCG
ATACCGAAGTAAACGTGACAAGGATCTT
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFF
I can see that the sequence IDs are the first part of the sequence headers. I want to extract the sequences that match my list of sequence IDs, but I cannot figure out how to do that.
Can anyone provide some help?
Thanks so much!!
Have you tried zgrep with the option
FASTQ stores each read as 4 lines, with the ID on the first line, so matching only the header line does not give you the whole record. You might need to parse the output before storing the entries in a new FASTQ file.
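If parsing the grep output gets awkward, the same idea can be written as a short script. This is only a minimal sketch, not a tested pipeline: it assumes every record is exactly 4 lines (no wrapped sequences), that the ID list is separated by spaces or newlines, and that the IDs in the list have no leading "@". The function names `load_ids` and `filter_fastq` are made up for illustration.

```python
import gzip


def load_ids(path):
    """Read whitespace-separated read IDs from a text file into a set."""
    with open(path) as handle:
        return set(handle.read().split())


def filter_fastq(fastq_gz, wanted_ids, out_path):
    """Copy the 4-line records whose ID is in wanted_ids; return how many were kept."""
    kept = 0
    with gzip.open(fastq_gz, "rt") as reads, gzip.open(out_path, "wt") as out:
        while True:
            # Assumption: strict 4-line FASTQ records (header, sequence, "+", quality).
            record = [reads.readline() for _ in range(4)]
            if not record[0]:
                break  # end of file
            # Header is "@<ID> <comment>": drop the "@" and keep the text before the first space.
            fields = record[0][1:].split()
            if fields and fields[0] in wanted_ids:
                out.writelines(record)
                kept += 1
    return kept
```

For example, with the hypothetical file names `ids.txt`, `sample.fastq.gz` and `subset.fastq.gz`, calling `filter_fastq("sample.fastq.gz", load_ids("ids.txt"), "subset.fastq.gz")` would write the matching reads to a new gzipped FASTQ file and return the number of reads kept. Keeping the IDs in a set makes each lookup cheap, which should matter when the FASTQ files are large.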
Alternatively, you might find other solutions like here
Thanks! I have not tried zgrep yet, will try that if the current method doesn't work!