Editing fasta headers with bash
1
0
Entering edit mode
5.6 years ago
James ▴ 20

Hi everyone, I have been trying to edit the sequence titles within a fasta file and have come up short. Please can someone help if they can. Sorry if this is a dupication. I've tried searching around. There ar elots of similar things but i don't know how to change them to my specific situation.

I am doing illumina full genome sequencing of influenza viruses. The script that processes the data outputs a comnsensus sequence fasta file. Within that file there are 8 sequences. The file has a name based on the sample number submitted to the sequencingguuys, and the sequences inside the file have titles that are actually the reference used to map data against. I want to rename the file, and all 8 sequences, with the same name, but the 8 segments also need to say which gene they are.

I have started writing a script but am now stuck, please can anyone point me in the right direction.

This is the script i started Not sure how this will display

#!/bin/bash
set -e

#script to copy, rename, and edit fasta files
#Needs an input file
fastafile=$1

# Get path from file
DIR=$(dirname "$fastafile")
echo "$DIR"

#echo information if script not used properly
if [ $# -ne 1 ]; then
    echo "Use: FluSeqOutputRename.sh <a fasta file you want renamed>
"
    exit 1
fi

echo "
    This script will edit "$fastafile"

    Re-naming the file and editing the fasta headers
"

#check how many sequences are in the fasta file
NoSeqs=$(grep -o '>' "$fastafile" | wc -l)

echo "
There are "$NoSeqs" gene segments
"

if [ "$NoSeqs" -ne 8 ]; then
echo "
There are NOT 8 gene segments

Please Start Again
"
exit 1
fi

#i might be able to just do it with 1 input IF i can replace the / with - later
read -p "Please enter the virus name: " virname
NewFileName=$(echo "$virname" | tr '-' '_' | tr '/' '-')

echo "$virname"
echo "$NewFileName"

read -p "Your file will be named "$NewFileName" is this correct y or n?: " correct

    if [ "$correct" = n ]; then
        echo "Please Start again"
        exit 1
        fi


echo "I am continuing to copy and edit your file"

#Now need to copy the file and rename it
cp -i -v "$fastafile" "$DIR"/"$NewFileName".fasta

#fastacopy="$DIR"/"$flutype-"$species-"$country"-"$identifier"-"$year".fasta
fastacopy="$DIR"/"$NewFileName".fasta
echo "The new file is 

"$fastacopy""

#I can make a new text file which is the header lines i want but does this even help?
echo "$virname" PB2 > "$DIR"/SequenceNames.txt
echo "$virname" PB1 >> "$DIR"/SequenceNames.txt
echo "$virname" PA >> "$DIR"/SequenceNames.txt
echo "$virname" HA >> "$DIR"/SequenceNames.txt
echo "$virname" NP >> "$DIR"/SequenceNames.txt
echo "$virname" NA >> "$DIR"/SequenceNames.txt
echo "$virname" MP >> "$DIR"/SequenceNames.txt
echo "$virname" NS >> "$DIR"/SequenceNames.txt

but thats as far as I can get. Any help gratefully received! Thank you James

fasta headers fasta bash script • 2.6k views
ADD COMMENT
0
Entering edit mode

Hello James

Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.
code_formatting

Thank you!

ADD REPLY
0
Entering edit mode

Many thanks genomax!

ADD REPLY
0
Entering edit mode

Hi James,

More useful than your code at this stage, would be examples of the input files, and the desired output format, since it's not super clear to me so far.

ADD REPLY
0
Entering edit mode

Hi, My input file looks something like this Sample12-top_matches-iter4_consensus.fa

>KP288011-Influenza_seg1-iter4\n
AGAGATTTGATGTCGCAGTCTCGCACTCGCGAGATACTA\n
ATGGCCATAATCAAGAAATATACGTCAGGAAGACAGGAG\n
AAATGGATGATGGCAATGAAATATCCGATTACAGCAGAC\n
CCTGAAAGAAATGAGCAAGGTCAGACTCTTTGGAGCAAG\n
>KU143516-Influenza_seg2-iter4\n
TATGACTGGACACTGAACAGAAACCAACCAGC\n
GTGTTCAGATCGAATGGTCTGACAGCCAGTGA\n
GGATGTGATGGAATCAATGGATAAAGAAGAGAT\n
AAGAAGAGTGAGGGACAACATTACCAAGAAGAT\n
AAGCAGAGGCTGAACAAGAGGAGTTACTTAAT\n
>JX534572-Influenza_seg3-iter4\n
ACTTTGTGCGACAATGTTTCAATCCAATG\n
AATATGGGGAAGATCCGAAAATCGAAACA\n
GAAGTCTGTTTCATGTATTCAGACTTCCAC\n
TAGAATCTGGCGATCCGAACGCATTATTG\n
>KP732646-Influenza_seg4-iter4\n
TACATAGTGGAGAGGGATAAT\n
TGACTATGAAGAACTGAAACACCTG\n
TCATCCCTAAGAGTTCTTGGCCCAAT\n
TCCATACCAGGGAACGCCCTCCTTTTTC\n
>KX977633-Influenza_seg5-iter4\n
TCAGCATCATGGCGTCTCAAGGC\n
AGCGCCAGAATGCCACCGAGAT\n
AGGTTCTACATACAGATGTGCACT\n
>KX978671-Influenza_seg6-iter4\n
GAAGGTAATATGCATTTCAGCCACAGGA\n
AATTGCCAATTTGGGCCTAAACATCGGA\n
>EU735819-Influenza_seg7-iter4\n
CGAAACGTACGTTCTCTCTATTGTCCC\n
ACTTGAAGATGTCTTTGTAGGGAAGAA\n
GACAAGACCAATCCTGTCACCTCTGAC\n
GTGCCCAGTGAGCGAGGACTGCAGCG\n
TGGAGACCCAAACAATATGGACAGGGC\n
>CY028255-Influenza_seg8-iter4\n
GGTAGATTGCTTTCTTTGGT\n
CCCGTTCCTTGACCGGCTTC\n

And my desired output, because I end up in windows after this, would be

A-turkey-Scotland-123_456-2015.fasta

>A/turkey/Scotland/123-456/2015 PB2\r\n
AGAGATTTGATGTCGCAGTCTCGCACTCGCGAGATACTAATGGCCATAATCAAGAAATATACGTCAGGAAGACAGGAGAAATGGATGATGGCAATGAAATATCCGATTACAGCAGACCCTGAAAGAAATGAGCAAGGTCAGACTCTTTGGAGCAAG\r\n
>A/turkey/Scotland/123-456/2015 PB1\r\n
TATGACTGGACACTGAACAGAAACCAACCAGCGTGTTCAGATCGAATGGTCTGACAGCCAGTGAGGATGTGATGGAATCAATGGATAAAGAAGAGATAAGAAGAGTGAGGGACAACATTACCAAGAAGATAAGCAGAGGCTGAACAAGAGGAGTTACTTAAT\r\n
>A/turkey/Scotland/123-456/2015 PA\r\n
ACTTTGTGCGACAATGTTTCAATCCAATGAATATGGGGAAGATCCGAAAATCGAAACAGAAGTCTGTTTCATGTATTCAGACTTCCACTAGAATCTGGCGATCCGAACGCATTATTG\r\n
>A/turkey/Scotland/123-456/2015 HA\r\n
ACATAGTGGAGAGGGATAATTGACTATGAAGAACTGAAACACCTGTCATCCCTAAGAGTTCTTGGCCCAATTCCATACCAGGGAACGCCCTCCTTTTTC\r\n
>A/turkey/Scotland/123-456/2015 NP\r\n
TCAGCATCATGGCGTCTCAAGGCAGCGCCAGAATGCCACCGAGATCAGGTTCTACATACAGATGTGCACT\r\n\
>A/turkey/Scotland/123-456/2015 NA\r\n
GAAGGTAATATGCATTTCAGCCACAGGAAATTGCCAATTTGGGCCTAAACATCGGA\r\n
>A/turkey/Scotland/123-456/2015 MP\r\n
ACGTACGTTCTCTCTATTGTCCCACTTGAAGATGTCTTTGTAGGGAAGAAGACAAGACCAATCCTGTCACCTCTGAGTGCCCAGTGAGCGAGGACTGCAGCGTGGAGACCCAAACAATATGGACAGGGC\r\n
>A/turkey/Scotland/123-456/2015 NS\r\n
GGTAGATTGCTTTCTTTGGTCCCGTTCCTTGACCGGCTTC\r\n
ADD REPLY
0
Entering edit mode

I'd recommend quoting variables so:

#Now need to copy the file and rename it
cp -i -v "$fastafile" "${DIR}/${NewFileName}.fasta"

fastacopy="${DIR}/${NewFileName}.fasta"
ADD REPLY
4
Entering edit mode

read -p "Please enter the virus name: " virname

"everytime you ask for an interaction or everytime you print your messages to stdout, god kills a kitten"

Minimum Standards For Bioinformatics Command Line Tools

ADD REPLY
0
Entering edit mode

Maybe so, but maybe there are too many kittens in the world anyway. Can't think of any other way to tell the script what the name should be, it's not in another file. It needs to be made up by the operator. And I will remove most of the echo lines once I understand it's working. Was just using them to show me it was working how I expected :)

ADD REPLY
1
Entering edit mode

You do it as a flag to the command to start with :)

My answer shows a complicated/robust way to do it, but at its most basic it can be done using the 'special' variables" $1, $2, $3... etc

e.g:

$ bash myscript.sh arg1 arg2 arg3
           ^        ^    ^    ^
           $0       $1  $2   $3

So you could do:

bash script.sh virusname fasta.fa

and virusname == $1... etc.

ADD REPLY
0
Entering edit mode

Thank Ram I'll make those edits

ADD REPLY
2
Entering edit mode
5.6 years ago
Joe 21k

This is a little hacky because I'm missing something with sed about matching occurrence numbers rather than line numbers but I've given up!

NOTE: 2 caveats to this script functioning:

  1. Your sequences must first be linearised if they aren't already. Here's a one liner which will let you do that: awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' < myseqs.fasta
  2. That your sequences always occur in the same order your tags are listed in.

Since you seem to want a 'front-end' to the script, I've gone ahead and made it nice and 'commandline-y'.

#!/bin/bash

usage(){
cat << EOF

usage: $0  -i <input_fasta> -n <name_string>

Redirect to an output file:

$0 -i <input_fasta> -n <name_string>  >  Output.fasta

Required:
   -i | --infile      Input fasta file
   -n | --name        Viral name for prepending.

Options:
   -h | --help        Show this message

EOF
}

for arg in "$@"; do
  shift
  case "$arg" in
        "--help")     set -- "$@" "-h"   ;;
        "--infile")   set -- "$@" "-i"   ;;
        "--outfile")  set -- "$@" "-o"   ;;
        "--name")     set -- "$@" "-n"   ;;
        *)            set -- "$@" "$arg" ;;
  esac
done
while getopts "hi:o:n:" OPTION ;do
  case $OPTION in
        i) infile=$OPTARG   ;;
        o) outfile=$OPTARG  ;;
        n) name=$OPTARG     ;;
        h) usage ; exit 0   ;;
  esac
done

numseqs=$(grep -o ">" "$infile" | wc -l)
if [ "$numseqs" != 8 ] ; then
 echo >&2 "There are $numseqs in the file: $infile"
 echo >&2 "There should only be 8 sequences. Check your starting file, and try again."
 exit 1
fi

if [[ -z "$infile" ]]; then
  usage ; echo "No input file provided. Exiting." ; exit 1
fi
if [[ -z "$name" ]]; then
  usage ; echo "No virus name string provided. Exiting." ; exit 1
fi

sed -e "1,/>/ s/^>.*/>${name}_PB2/" \
    -e "3,/>/ s/^>.*/>${name}_PB1/" \
    -e "5,/>/ s/^>.*/>${name}_PA/" \
    -e "7,/>/ s/^>.*/>${name}_HA/" \
    -e "9,/>/ s/^>.*/>${name}_NP/" \
    -e "11,/>/ s/^>.*/>${name}_NA/" \
    -e "13,/>/ s/^>.*/>${name}_MP/" \
    -e "15,/>/ s/^>.*/>${name}_NS/" "$infile"

As a final note, I believe this will only work with GNU sed - i.e. sed that ships with BSD (like MacOS) needs slightly different syntax.

ADD COMMENT
1
Entering edit mode

I’ll add that a better solution would probably be to use a proper parser like Biopython to do this, unless you’ve got a really good reason to do it all with bash.

ADD REPLY
0
Entering edit mode

Wow, thanks for the reply, I will give it a go. 1 The sequences aren't linearised yet but can be, thanks for the commands. 2. The sequences do always appear in the order listed I am using a virtual machine running centos7

No special reason for using bash, it's just the only way I have a vague knowledge of.

ADD REPLY
0
Entering edit mode

Yeah play with it and let me know how you get on, I admit I didn't test it for very long.

CentOS is a RHEL distribution, and my guess would be that it will be using the GNU sed so it's probably fine.

To answer one of your comments above, about moving back to Windows:

  1. Don't ;)
  2. You can use the tool unix2dos which should automatically convert all the line endings etc. (I believe its installed by default on most distributions)

So, your workflow would become:

  1. Get input fasta and linearise it
  2. Pass linear fasta to my script above (or whatever your variant of it becomes)
  3. Output the edited fasta (it'll still be linear, but that shouldn't present any issues.
  4. Adjust the file with unix2dos
ADD REPLY

Login before adding your answer.

Traffic: 1851 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6