Split Fasta file and rename output files with contig names
2
0
Entering edit mode
3.7 years ago
awoods6511 • 0

Hello!

I am trying to split a large fasta file (19,336 lines) into individual contigs. The file set up is as follows:

 >k141_284136 flag=1 multi=3.0000 len=1875
 AGCCTACATTGGCAAGGTACTGCTTTTGTCGCCCATCGTTGGCGAATTTGCTAATGAGAACACACGGAT
 >k141_407195 flag=1 multi=5.0000 len=1723
 GCCAGTAGTTTTCAGATTTTCAATTACTTTCTTTGCTTCTTTTAACGCAGCCGCAAAGTTGTCATCAAGTTCTCCACCCTGTGCAATATGTTTATATAGAATGCTGCTTACTTTGTCAGCAA
 >k141_169332 flag=1 multi=3.0000 len=20
ATTATCCATCCTATTCATCGCTTGATGAAATGTTGCAAAATTCCAAAGATTTTCAGCGTCAAATCGTTCGTATATCCTAATTAAACACCGCTAAAAGTTATGTCTAAGCAATCTTTAA

I am able to split the file but the output files names are meaning less (xaa, xab, xac etc.).

I am trying to split the fasta file so each contig is in an individual fasta file names with the contig name following the >.

For example file one would be titled "k141_284136.fa" and include:

>k141_284136 flag=1 multi=3.0000 len=1875
 AGCCTACATTGGCAAGGTACTGCTTTTGTCGCCCATCGTTGGCGAATTTGCTAATGAGAACACACGGAT

My input file is called vDNA-S1S1.fa! Thank you for any help.

contig fasta split • 3.3k views
ADD COMMENT
2
Entering edit mode
3.7 years ago
JC 13k

Perl one liner:

$ perl -lane '$id = $1 if (/>(\w+)\s/); `echo "$_" >> "$id.fa"`' < seq.fa
$ more k141_*
::::::::::::::
k141_169332.fa
::::::::::::::
>k141_169332 flag=1 multi=3.0000 len=20
ATTATCCATCCTATTCATCGCTTGATGAAATGTTGCAAAATTCCAAAGATTTTCAGCGTCAAATCGTTCGTATATCCTAATTAAACACCGCTAAAAGTTATGTCTAAGCAATCTTTAA
::::::::::::::
k141_284136.fa
::::::::::::::
>k141_284136 flag=1 multi=3.0000 len=1875
AGCCTACATTGGCAAGGTACTGCTTTTGTCGCCCATCGTTGGCGAATTTGCTAATGAGAACACACGGAT
::::::::::::::
k141_407195.fa
::::::::::::::
>k141_407195 flag=1 multi=5.0000 len=1723
GCCAGTAGTTTTCAGATTTTCAATTACTTTCTTTGCTTCTTTTAACGCAGCCGCAAAGTTGTCATCAAGTTCTCCACCCTGTGCAATATGTTTATATAGAATGCTGCTTACTTTGTCAGCAA
ADD COMMENT
1
Entering edit mode
3.7 years ago
Joe 21k

You can use this shell approach:

numseqs=$(grep -c ">" "$1");
numlines=$(wc -l < "$1");
if (( "$numlines" > $(( 2*$numseqs )) )); then
    echo "The fasta file needs to be linearised before this function will work.";
    return 1;
fi;

while read line; do
    if [ "${line:0:1}" == ">" ]; then
        header="$line";
        filename=$(echo "${line#>}" | sed 's/\ .*//g');
        touch "$filename".fasta
        echo "$header" >> "${filename}".fasta;
    else
        seq="$line";
        echo "$seq" >> "${filename}".fasta;
    fi;
done < $1

Note the first block is optional, as it just checks that the sequences are linear which yours appear to be.

Put it in a script and just run: bash scriptname.sh contigs.fasta. The new files will be spit out in the same place.

ADD COMMENT
0
Entering edit mode

This is great! exactly what I need! thank you!

ADD REPLY

Login before adding your answer.

Traffic: 2530 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6