Question: Extracting fastq files, based on their fasta counterparts
roblogan6 wrote, 2.8 years ago:

I have two files: one is a multi-FASTA file, the other is a multi-FASTQ file. The same sequences are found in both files; the files are just in different formats. I have subsets of the multi-FASTA file and would like to find all of those sequences in the multi-FASTQ file. The subsets are small multi-FASTA files (~100 sequences) taken from the original (~125K sequences).
I feel like grep should be able to do this nicely, but I don't know much about grep. I do know, though, that its memory is finite, so it might not be the best choice for large files such as two 125K-sequence multi-FASTA/FASTQ files. I need the sequence and the Phred quality scores. A sequence in one file looks like this:

>m160505_031746_42156_c100980652550000001823221307061611_s1_p0/30/0_59
AAGAGAGAGATCCTCTTAAGACTCCCAACACGAATTCTCTATTACGCACA
TTATGTATAA

The same sequence in the other file looks like:

@m160505_031746_42156_c100980652550000001823221307061611_s1_p0/30/0_59 RQ=0.771
AAGAGAGAGATCCTCTTAAGACTCCCAACACGAATTCTCTATTACGCACATTATGTATA
+
&%,--.-)..)&$.),.*&"*'.$&(('(-'))*)-#&$(,+-($&$#%%%,*+$*++'

As you can see, the header IDs are very similar but not identical: the FASTQ header starts with @ instead of > and carries an extra RQ=0.771 field. Thanks for the help! -Rob


Two supplementary questions:

  1. Are the IDs identical in the FASTA and FASTQ files?
  2. Do you need the full FASTQ records or just the sequences?
written 2.8 years ago by genomax
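
The first question can be checked directly before extracting anything. Below is a minimal awk sketch, assuming four-line FASTQ records (no wrapped sequence or quality lines) and that the read ID is the first whitespace-delimited token of each header. The name missing_ids is a hypothetical helper, not part of any package. It prints every ID from the FASTA subset that has no matching FASTQ header:

```shell
# Hypothetical helper: missing_ids SUBSET.fasta READS.fastq
# Prints each FASTA ID that does not appear among the FASTQ headers.
missing_ids() {
    awk '
        NR == FNR {
            # First file read by awk (the FASTQ): remember the ID on every
            # 4th line starting at line 1, without the leading "@".
            if (FNR % 4 == 1) seen[substr($1, 2)] = 1
            next
        }
        /^>/ {
            # Second file (the FASTA subset): report IDs never seen above.
            id = substr($1, 2)
            if (!(id in seen)) print id
        }
    ' "$2" "$1"
}
```

Called as missing_ids subset.fasta reads.fastq, it prints nothing when every subset ID is present. That is the expected result for the example headers in the question, where the IDs agree once the leading > or @ and the trailing RQ= field are ignored.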
Brian Bushnell (Walnut Creek, USA) wrote, 2.8 years ago:

With the BBMap package:

filterbyname.sh in=x.fastq out=y.fastq names=z.fasta include
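
For readers without BBMap, the same kind of header-based selection can be sketched in awk. This is not how filterbyname.sh works internally; it is a minimal sketch that assumes four-line FASTQ records, a non-empty FASTA subset, and IDs that match once the leading > or @ and anything after the first space (such as RQ=0.771) are dropped. The name extract_fastq is a hypothetical helper:

```shell
# Hypothetical helper: extract_fastq SUBSET.fasta READS.fastq
# Writes to stdout the FASTQ records whose ID occurs in the FASTA subset.
extract_fastq() {
    awk '
        NR == FNR {
            # First file (the FASTA subset): collect IDs from header lines,
            # skipping the sequence lines.
            if (/^>/) ids[substr($1, 2)] = 1
            next
        }
        # Second file (the FASTQ): decide once per record, on its header line.
        FNR % 4 == 1 { keep = (substr($1, 2) in ids) }
        # Print all four lines of a record that was selected.
        keep
    ' "$1" "$2"
}
```

Usage would look like extract_fastq subset.fasta reads.fastq > subset.fastq. Only the roughly 100 subset IDs are held in memory, and the FASTQ file is streamed once, so the 125K-read input is not a concern.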
venu (Germany) wrote, 2.8 years ago:

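You can do something like the following (note: I have not tested it). Because the FASTA sequences are wrapped over several lines, each one is first joined onto a single line; grep then searches the FASTQ file for that sequence as a fixed string and prints the matching record (header, sequence, + line, and qualities):

awk '/^>/ {if (s != "") print s; s = ""; next} {s = s $0} END {if (s != "") print s}' fasta_file.fa | while read -r seq; do grep -F -B1 -A2 -- "$seq" fastq.fq; done > new_fastq.fq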