Parsing large sequence file into per-sample files after barcodes and adapters have been removed
5.6 years ago
caverill ▴ 40

I have a sequence file, seqs.fas, in which each sequence has a name like HKRLXUA01AIJQ5|48. The letters and digits before the | symbol are a random unique sequence code, and the number after it is the sample identifier. Here are 3 sequences from this file:

>HJUCUKY01BZHPG|01
AGATCCGTAGGTGAACCTGCGGAAGGATCATTACCGAGTATTTTTGGGTAAACCGAAAAC
TCCCACCCTTGTTTCAAGTTTTGTTGCTTCGGCAGGCCTACGGTTGATTGTAAAATGGGC
TAGTACCTGCCGAAGAACCACACACTTTTGATTTGTTGTAATATTATAAAATTAAAAACA
AAAACTTTCAACAACGGATCTCTTGGTCCTGGCATCGATGAAGAACGCAGCGAAATGCGA
TAAGTAATGTGAATTGCAGAATTTTGTGAATCATCGAATCTTTGAGCGCACATTGCGCCC
TCTGGTATTCCGGGGGGCACACCTGTTCGAGCGCCGTTGACATACTAAGGCCCAGCCTTG
TGTTGGCCCTTCCCGCTAGGGATCGGTCGAAAACTATGCAGATCCCAGAAAATCGGAAGC
ATACGCAATAGTATAGCGGAAGACGCTCTGAATCTCGATTATCAACCAGTTTGGCCTCGG
ATCAGGTGGGGATACCCGCTGAACTTAAGCATATCAATAAGCGGAGGAGTG
>HJUCUKY01BZ3TS|01
AGATCCGTAGGTGAACCTGCGGAAGGATCATTAACAAGTCCTGCTGTGCGTGCGGAGCAG
TCTGCATGCGCACATCCGTTCCATGTGCCCCGCACATGACATTTCTGCGAATCGTCTGAT
TGTCGCTCTTCAAACCATACAAACTTTTAACAATGGATCTCTAGGCTCTTGCATCGATGA
AGAACGCAGTGAAGTGCGAAAAGTAATGCGATTTGCATGCTCTGTGAGTCATCGAATCTT
TGAACGCATATGGCACCTGCCAGCCCTGCTGGAAGGTATGCCTGTTTGAGAATCACACTA
TCACTGATCCGATTGCACCACGTGCAGTTGGTCGACATGGGCCTTCAGCACTGATCGCCT
CGAATTGCTCGCGATCGTCGTGATATGACAGGGTGTGGCAGCGATGCCGTCCGGTGCTGT
CGACGATCAACCCGTG
>HKRLXUA01AIJQ5|48
AGATCCGTAGGTGAAACCCTGCGGAAGGATCATTAAAAGAGAAAAGAGCGCCTCGCGGCC
TCGCCCTCTTCAACCACTGTGTACCGAATCTCCGTCATCTTTGCGGGTCCGGCGCCCCAG
GCGCCGCCCGCGGAGAGCACCCAAAAACATTCAGTTTGATGAACGTCTGTTTACAATTTA
AAAAGAACAACTTTTAACAATGGATCTCTTGGTTCCGGCATCGATGAAGAACGCAGCGAA
ATGCGATAGCTAGTGTGAATTGCAGATTTCAGTGAATCATCGAGTCTTTGAACGCACATT
GCGCCCCTTGGTATTCCTCGGGGCATGCCTATTCGAGCGTCGTTTCGACCCTCGAGCGCA
AGCTTGGTGTTGAGGGATGCGGCGGCCCCCCGGGGCAGCGGCACCCTTCGAATCCATCGG
CGGCGGCAGCATGGCCCGGACGCAGCGAAATGCGCTCTAGCTCATGCAGCAGCCCGCCGG
AAACTCACCGCCTACGCGGCACTACGC

I would like to split this into per-sample sequence files named 1.fas, 48.fas, etc., where the file name comes from the label after the | character in the sequence name. Any idea how to do this?

demultiplexing • 748 views
5.6 years ago
$ awk -F '|' '/^>/ {if (F) close(F); F=sprintf("%s.fas",$2);} F {print >> F;}' seqs.fas
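
For anyone who wants to try this on a small input first, here is a minimal sketch of the same awk approach wrapped in a shell function. The function name split_by_sample, the output directory argument, and the three short demo records are made up for illustration. It assumes the label after | is numeric and formats it with %d, so the label 01 produces 1.fas as asked in the question (formatting with %s instead would give 01.fas). It also closes each output file before switching to the next one, which may help with awk versions that limit the number of open files.

```shell
# Hypothetical helper: split a FASTA file into one file per sample label.
# Usage: split_by_sample INPUT.fas OUTDIR
split_by_sample() {
    sbs_input=$1
    sbs_outdir=$2
    mkdir -p "$sbs_outdir"
    awk -F '|' -v dir="$sbs_outdir" '
        /^>/ {
            if (out) close(out)                  # release the previous file handle
            out = sprintf("%s/%d.fas", dir, $2)  # numeric label: "01" -> 1, "48" -> 48
        }
        out { print >> out }                     # append header and sequence lines
    ' "$sbs_input"
}

# Tiny demo on made-up records (not real reads).
demo=$(mktemp -d)
cat > "$demo/seqs.fas" <<'EOF'
>AAA|01
ACGT
TTGA
>BBB|48
GGCC
>CCC|01
TTAA
EOF
split_by_sample "$demo/seqs.fas" "$demo/out"
ls "$demo/out"
```

The demo should list 1.fas and 48.fas, with both sample-01 records in 1.fas. Because the files are opened in append mode, running it twice into the same directory duplicates the records, so start from an empty output directory.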