Extracting fastq files, based on their fasta counterparts
5.8 years ago
roblogan6 ▴ 30

I have two files. One is a multifasta file, the other is a multifastq. The same sequences are found in both files; the files are just in different formats. I have subsets of the multifasta file, and would like to find all those sequences in the multifastq file. The subsets are small multifasta files (~100 sequences) taken from the original (~125K sequences).
I feel like grep should be able to do this nicely, but I don't know much about grep. I do know, though, that it has finite memory, so it might not be the best choice for large files such as two 125K-sequence multifasta/q files. I need both the sequence and the Phred quality scores. A sequence in one file looks like this:

>m160505_031746_42156_c100980652550000001823221307061611_s1_p0/30/0_59
AAGAGAGAGATCCTCTTAAGACTCCCAACACGAATTCTCTATTACGCACA
TTATGTATAA


The same sequence in the other file looks like:

@m160505_031746_42156_c100980652550000001823221307061611_s1_p0/30/0_59 RQ=0.771
AAGAGAGAGATCCTCTTAAGACTCCCAACACGAATTCTCTATTACGCACATTATGTATA
+
&%,--.-)..)&$.),.*&"*'.$&(('(-'))*)-#&$(,+-($&$#%%%,*+$*++'


As you can see, the header IDs are very similar, but not identical. Thanks for the help! -Rob

fastq fasta grep perl database • 1.6k views

Two supplementary questions.

1. Are the IDs identical in the fasta and fastq files?
2. Do you need the full fastq records or just the sequence?
5.8 years ago

With the BBMap package:

filterbyname.sh in=x.fastq out=y.fastq names=z.fasta include

5.8 years ago
venu 7.0k

You can do something like the following (note: I haven't tested it):

sed '/^>/d' fasta_file.fa | while read -r fasta; do grep -A2 -B1 "$fasta" fastq.fq >> new_fastq.fq; done
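
Since the fasta in the question is wrapped across lines (50 bases, then the rest), matching on sequence lines may miss records. Matching on the header IDs instead is probably more robust. Below is a minimal sketch with awk, not a tested production tool. The function name `extract_fastq_by_fasta_ids` is made up for illustration, and it assumes that every fastq record is exactly 4 lines and that the ID is the first whitespace-delimited word of the header (so the trailing `RQ=0.771` is ignored).

```shell
# Hypothetical helper: print the FASTQ records whose ID appears in a FASTA subset.
# Usage: extract_fastq_by_fasta_ids subset.fa reads.fq > subset.fq
extract_fastq_by_fasta_ids() {
    awk '
        # First file (FASTA): store each header ID, without ">" or any description.
        NR == FNR {
            if (substr($0, 1, 1) == ">") {
                split(substr($0, 2), parts, " ")
                wanted[parts[1]] = 1
            }
            next
        }
        # Second file (FASTQ): assumes strict 4-line records, so every
        # 4th line starting from line 1 is a header.
        FNR % 4 == 1 {
            split(substr($0, 2), parts, " ")
            keep = (parts[1] in wanted)
        }
        # Print the header and the 3 lines after it while keep is set.
        keep
    ' "$1" "$2"
}
```

To run it over several subsets (file names here are placeholders), something like `for f in subset_*.fa; do extract_fastq_by_fasta_ids "$f" reads.fq > "${f%.fa}.fq"; done` should work. Only the ~100 IDs of each subset are held in memory, and the fastq is read once per subset, so 125K records should not be a problem.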