Question

Use Read ID File to parse out reads from a Fasta File

0

Entering edit mode

6.9 years ago

kalanir • 0

How can I use an file of a list of FASTA headers to pull out reads from a FASTA file (header + sequence)?

My IDs.txt file looks like this:

>NR::NT:L28960.1:NS500126:798:HWTLTAFXX:2:21207:14062:16380/1
>NR::NT::NS500126:798:HWTLTAFXX:2:21207:14062:16380/2
>NR::NT::NS500126:798:HWTLTAFXX:2:21207:20748:13870/1

My file.fasta looks like this:

>NR::NT:L28960.1:NS500126:798:HWTLTAFXX:2:21207:14062:16380/1
CCCCACTTTCCTTTACAGTACTTGTTCACTATCGGTCTTTGGTTGTATTTAGCCTTAGGTAAACTCATATCACCTATATTCATACTGCACTACCAAACAGTACTACTCAAATTTAGAGAGTATAATATCGATAACGAGACTATAACTCTC
>NR::NT::NS500126:798:HWTLTAFXX:2:21207:14062:16380/2
CTTAGATTTATTCTAAGTGTTGTATAGGGTAGTCACAAACAAATACACTAAAAATGTGACCATAGAGAGTTATAGTCTCGTTATCGATATTATACTCTCTAAATTTGAGTAGTACTGTTTGGTAGTGCAGTATGAATATAGGTGATATGA
>NR::NT::NS500126:798:HWTLTAFXX:2:21207:20748:13870/1
GTCGTATTCACACTTACAACAGGTAATGACTAACTTCCCAGCTTAGAGGCCGTCAGCTGTATCCCAGAGTTACGCCCTAAAGTCACTAGCAATAGCTGCACCTGCTAACCAAGACTTAGGTCTCCCACCCACAGTAGCTCTATAACCGCC
>NR::NT::NS500126:798:HWTLTAFXX:2:21207:20748:13870/2
CGTCACTACAAGTGCTAGCGTAACGTTAGTGTTTGTGTACGGCTAGCTGGGGCTTAGGTTGAAGACCTGTGGGGCGGTTATAGAGCTACTGTGGGTGGGAGACCTAAGTCTTGGTTAGCAGGTGCAGCTATTGCTAGTGACTTTAGGGCG

I've tried this command cut -c 2- IDs.txt | xargs -n 1 samtools faidx file.fasta but it breaks the sequence up, producing multiple lines uneccessarily.

I've also tried this command cat IDs.txt | awk '{gsub("_","\\_",$0);$0="(?s)^>"$0".*?(?=\\n(\\z|>))"}1' | pcregrep -oM -f - file.fasta but it produce an output.

Does anyone know a solution to this problem?

next-gen rna-seq sequencing • 2.5k views

ADD COMMENT • link updated 5.5 years ago by Biostar 20 • written 6.9 years ago by kalanir • 0

0

Entering edit mode

Please search in future, this is the number 1 biostars question.

ADD REPLY • link 6.9 years ago by Joe 22k

score 0 · Answer 1 · 2018-08-23

0

Entering edit mode

6.9 years ago

cpad0112 21k

Use seqtk or seqkit. You seem to have sequence in one line. You can use grep (ids.txt= text file with unique IDs one per line and input.fa is input fasta):

$ grep -A 1 -f ids.txt input.fa

or remove > from your IDs.txt and use seqkit:

seqkit grep -nf ids.txt test.fa

ADD COMMENT • link 6.9 years ago by cpad0112 21k

0

Entering edit mode

the first function adds extra rows (--) after every pull. Also to translate into pulling fastq files, if I change to -A 3, it does not do the grep properly.
- seqkit produces no output when i run it

ADD REPLY • link 6.9 years ago by kalanir • 0

0

Entering edit mode

Add --no-group-separator to grep command to avoid getting --------- after each record.

ADD REPLY • link 6.9 years ago by GenoMax 152k

0

Entering edit mode

the files are so large that the grep function fails half way through parsing the file

ADD REPLY • link 6.9 years ago by kalanir • 0

0

Entering edit mode

Please add that sort of information (number of seqs etc) to your question in future, it will influence the answers you receive such as this one.

ADD REPLY • link 6.9 years ago by Joe 22k

score 0 · Answer 2 · 2018-08-23

0

Entering edit mode

6.9 years ago

h.mon 35k

Try filterbyname.sh from BBTools / BBMap:

filterbyname.sh -Xmx100m threads=1 in=file.fasta names=IDs.txt ths=t out=out.fasta fastawrap=0

ADD COMMENT • link 6.9 years ago by h.mon 35k

0

Entering edit mode

The resulting out.fasta file is the same as the file.fasta when I run this command.

ADD REPLY • link 6.9 years ago by kalanir • 0

0

Entering edit mode

I tested with your example - deleting the second read name - and it worked as intended. Are there any differences between the example you provided and the actual files you intend to parse?

ADD REPLY • link 6.9 years ago by h.mon 35k

score 0 · Answer 3 · 2018-08-24

0

Entering edit mode

6.9 years ago

kalanir • 0

I solved my problem by:

Deleting the > on the ID list.
seqtk subseq file.fa IDs.txt > out.fasta

ADD COMMENT • link 6.9 years ago by kalanir • 0

0

Entering edit mode

You should have done the same with seqkit as mentioned in my post kratnasiri. As per extra ---, i think you are grepping against multiple files. In addition, OP is for fasta files, not for fastq. Moderate this post to reply to original comment.

ADD REPLY • link 6.9 years ago by cpad0112 21k