Use Read ID File to parse out reads from a Fasta File
3
0
Entering edit mode
5.7 years ago
kalanir • 0

How can I use an file of a list of FASTA headers to pull out reads from a FASTA file (header + sequence)?

My IDs.txt file looks like this:

>NR::NT:L28960.1:NS500126:798:HWTLTAFXX:2:21207:14062:16380/1
>NR::NT::NS500126:798:HWTLTAFXX:2:21207:14062:16380/2
>NR::NT::NS500126:798:HWTLTAFXX:2:21207:20748:13870/1

My file.fasta looks like this:

>NR::NT:L28960.1:NS500126:798:HWTLTAFXX:2:21207:14062:16380/1
CCCCACTTTCCTTTACAGTACTTGTTCACTATCGGTCTTTGGTTGTATTTAGCCTTAGGTAAACTCATATCACCTATATTCATACTGCACTACCAAACAGTACTACTCAAATTTAGAGAGTATAATATCGATAACGAGACTATAACTCTC
>NR::NT::NS500126:798:HWTLTAFXX:2:21207:14062:16380/2
CTTAGATTTATTCTAAGTGTTGTATAGGGTAGTCACAAACAAATACACTAAAAATGTGACCATAGAGAGTTATAGTCTCGTTATCGATATTATACTCTCTAAATTTGAGTAGTACTGTTTGGTAGTGCAGTATGAATATAGGTGATATGA
>NR::NT::NS500126:798:HWTLTAFXX:2:21207:20748:13870/1
GTCGTATTCACACTTACAACAGGTAATGACTAACTTCCCAGCTTAGAGGCCGTCAGCTGTATCCCAGAGTTACGCCCTAAAGTCACTAGCAATAGCTGCACCTGCTAACCAAGACTTAGGTCTCCCACCCACAGTAGCTCTATAACCGCC
>NR::NT::NS500126:798:HWTLTAFXX:2:21207:20748:13870/2
CGTCACTACAAGTGCTAGCGTAACGTTAGTGTTTGTGTACGGCTAGCTGGGGCTTAGGTTGAAGACCTGTGGGGCGGTTATAGAGCTACTGTGGGTGGGAGACCTAAGTCTTGGTTAGCAGGTGCAGCTATTGCTAGTGACTTTAGGGCG

I've tried this command cut -c 2- IDs.txt | xargs -n 1 samtools faidx file.fasta but it breaks the sequence up, producing multiple lines uneccessarily.

I've also tried this command cat IDs.txt | awk '{gsub("_","\\_",$0);$0="(?s)^>"$0".*?(?=\\n(\\z|>))"}1' | pcregrep -oM -f - file.fasta but it produce an output.

Does anyone know a solution to this problem?

next-gen rna-seq sequencing • 2.0k views
ADD COMMENT
0
Entering edit mode

Please search in future, this is the number 1 biostars question.

ADD REPLY
0
Entering edit mode
5.7 years ago

Use seqtk or seqkit. You seem to have sequence in one line. You can use grep (ids.txt= text file with unique IDs one per line and input.fa is input fasta):

$ grep -A 1 -f ids.txt input.fa

or remove > from your IDs.txt and use seqkit:

seqkit grep -nf ids.txt test.fa
ADD COMMENT
0
Entering edit mode
  • the first function adds extra rows (--) after every pull. Also to translate into pulling fastq files, if I change to -A 3, it does not do the grep properly.
    • seqkit produces no output when i run it
ADD REPLY
0
Entering edit mode

Add --no-group-separator to grep command to avoid getting --------- after each record.

ADD REPLY
0
Entering edit mode

the files are so large that the grep function fails half way through parsing the file

ADD REPLY
0
Entering edit mode

Please add that sort of information (number of seqs etc) to your question in future, it will influence the answers you receive such as this one.

ADD REPLY
0
Entering edit mode
5.7 years ago
h.mon 35k

Try filterbyname.sh from BBTools / BBMap:

filterbyname.sh -Xmx100m threads=1 in=file.fasta names=IDs.txt ths=t out=out.fasta fastawrap=0
ADD COMMENT
0
Entering edit mode
  • The resulting out.fasta file is the same as the file.fasta when I run this command.
ADD REPLY
0
Entering edit mode

I tested with your example - deleting the second read name - and it worked as intended. Are there any differences between the example you provided and the actual files you intend to parse?

ADD REPLY
0
Entering edit mode
5.7 years ago
kalanir • 0

I solved my problem by:

  • Deleting the > on the ID list.
  • seqtk subseq file.fa IDs.txt > out.fasta
ADD COMMENT
0
Entering edit mode

You should have done the same with seqkit as mentioned in my post kratnasiri. As per extra ---, i think you are grepping against multiple files. In addition, OP is for fasta files, not for fastq. Moderate this post to reply to original comment.

ADD REPLY

Login before adding your answer.

Traffic: 3040 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6