How can I use a file containing a list of FASTA headers to pull out reads (header + sequence) from a FASTA file?
My IDs.txt file looks like this:
>NR::NT:L28960.1:NS500126:798:HWTLTAFXX:2:21207:14062:16380/1
>NR::NT::NS500126:798:HWTLTAFXX:2:21207:14062:16380/2
>NR::NT::NS500126:798:HWTLTAFXX:2:21207:20748:13870/1
My file.fasta looks like this:
>NR::NT:L28960.1:NS500126:798:HWTLTAFXX:2:21207:14062:16380/1
CCCCACTTTCCTTTACAGTACTTGTTCACTATCGGTCTTTGGTTGTATTTAGCCTTAGGTAAACTCATATCACCTATATTCATACTGCACTACCAAACAGTACTACTCAAATTTAGAGAGTATAATATCGATAACGAGACTATAACTCTC
>NR::NT::NS500126:798:HWTLTAFXX:2:21207:14062:16380/2
CTTAGATTTATTCTAAGTGTTGTATAGGGTAGTCACAAACAAATACACTAAAAATGTGACCATAGAGAGTTATAGTCTCGTTATCGATATTATACTCTCTAAATTTGAGTAGTACTGTTTGGTAGTGCAGTATGAATATAGGTGATATGA
>NR::NT::NS500126:798:HWTLTAFXX:2:21207:20748:13870/1
GTCGTATTCACACTTACAACAGGTAATGACTAACTTCCCAGCTTAGAGGCCGTCAGCTGTATCCCAGAGTTACGCCCTAAAGTCACTAGCAATAGCTGCACCTGCTAACCAAGACTTAGGTCTCCCACCCACAGTAGCTCTATAACCGCC
>NR::NT::NS500126:798:HWTLTAFXX:2:21207:20748:13870/2
CGTCACTACAAGTGCTAGCGTAACGTTAGTGTTTGTGTACGGCTAGCTGGGGCTTAGGTTGAAGACCTGTGGGGCGGTTATAGAGCTACTGTGGGTGGGAGACCTAAGTCTTGGTTAGCAGGTGCAGCTATTGCTAGTGACTTTAGGGCG
I've tried this command:
cut -c 2- IDs.txt | xargs -n 1 samtools faidx file.fasta
but it breaks each sequence up, wrapping it over multiple lines unnecessarily.
I've also tried this command:
cat IDs.txt | awk '{gsub("_","\\_",$0);$0="(?s)^>"$0".*?(?=\\n(\\z|>))"}1' | pcregrep -oM -f - file.fasta
but it doesn't produce any output.
Does anyone know a solution to this problem?
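One possible approach, as a minimal sketch rather than a definitive answer: read IDs.txt into an awk lookup table, then print only the FASTA records whose header line is in that table. The function name extract_reads is made up for illustration. The sketch assumes each line of IDs.txt is an exact copy of a header line in file.fasta, including the leading ">" and with no trailing whitespace or Windows line endings.

```shell
# extract_reads IDS FASTA
# Print every FASTA record whose full header line appears verbatim in IDS
# (one header per line, including the leading ">").
# The function name is hypothetical; exact whole-line matching is an assumption.
extract_reads() {
    awk '
        NR == FNR { ids[$0]; next }        # first file: remember each header line
        /^>/      { keep = ($0 in ids) }   # header line: is this record wanted?
        keep                               # print header and sequence lines of wanted records
    ' "$1" "$2"
}
```

It could then be called as extract_reads IDs.txt file.fasta > subset.fasta. Sequence lines are printed exactly as they appear in the input, so a sequence stored on one line stays on one line.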
Please search in future; this is the number 1 Biostars question.