1
0
Entering edit mode
3.1 years ago
James ▴ 20

Hi everyone, I have been trying to edit the sequence titles within a fasta file and have come up short. Please can someone help if they can. Sorry if this is a dupication. I've tried searching around. There ar elots of similar things but i don't know how to change them to my specific situation.

I am doing illumina full genome sequencing of influenza viruses. The script that processes the data outputs a comnsensus sequence fasta file. Within that file there are 8 sequences. The file has a name based on the sample number submitted to the sequencingguuys, and the sequences inside the file have titles that are actually the reference used to map data against. I want to rename the file, and all 8 sequences, with the same name, but the 8 segments also need to say which gene they are.

I have started writing a script but am now stuck, please can anyone point me in the right direction.

This is the script i started Not sure how this will display

#!/bin/bash
set -e

#script to copy, rename, and edit fasta files
#Needs an input file
fastafile=$1 # Get path from file DIR=$(dirname "$fastafile") echo "$DIR"

#echo information if script not used properly
if [ $# -ne 1 ]; then echo "Use: FluSeqOutputRename.sh <a fasta file you want renamed> " exit 1 fi echo " This script will edit "$fastafile"

Re-naming the file and editing the fasta headers
"

#check how many sequences are in the fasta file
NoSeqs=$(grep -o '>' "$fastafile" | wc -l)

echo "
There are "$NoSeqs" gene segments " if [ "$NoSeqs" -ne 8 ]; then
echo "
There are NOT 8 gene segments

"
exit 1
fi

#i might be able to just do it with 1 input IF i can replace the / with - later
NewFileName=$(echo "$virname" | tr '-' '_' | tr '/' '-')

echo "$virname" echo "$NewFileName"

read -p "Your file will be named "$NewFileName" is this correct y or n?: " correct if [ "$correct" = n ]; then
exit 1
fi

echo "I am continuing to copy and edit your file"

#Now need to copy the file and rename it
cp -i -v "$fastafile" "$DIR"/"$NewFileName".fasta #fastacopy="$DIR"/"$flutype-"$species-"$country"-"$identifier"-"$year".fasta fastacopy="$DIR"/"$NewFileName".fasta echo "The new file is "$fastacopy""

#I can make a new text file which is the header lines i want but does this even help?
echo "$virname" PB2 > "$DIR"/SequenceNames.txt
echo "$virname" PB1 >> "$DIR"/SequenceNames.txt
echo "$virname" PA >> "$DIR"/SequenceNames.txt
echo "$virname" HA >> "$DIR"/SequenceNames.txt
echo "$virname" NP >> "$DIR"/SequenceNames.txt
echo "$virname" NA >> "$DIR"/SequenceNames.txt
echo "$virname" MP >> "$DIR"/SequenceNames.txt
echo "$virname" NS >> "$DIR"/SequenceNames.txt


but thats as far as I can get. Any help gratefully received! Thank you James

fasta headers fasta bash script • 1.3k views
0
Entering edit mode

Hello James

Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.

Thank you!

0
Entering edit mode

Many thanks genomax!

0
Entering edit mode

Hi James,

More useful than your code at this stage, would be examples of the input files, and the desired output format, since it's not super clear to me so far.

0
Entering edit mode

Hi, My input file looks something like this Sample12-top_matches-iter4_consensus.fa

>KP288011-Influenza_seg1-iter4\n
AGAGATTTGATGTCGCAGTCTCGCACTCGCGAGATACTA\n
ATGGCCATAATCAAGAAATATACGTCAGGAAGACAGGAG\n
AAATGGATGATGGCAATGAAATATCCGATTACAGCAGAC\n
CCTGAAAGAAATGAGCAAGGTCAGACTCTTTGGAGCAAG\n
>KU143516-Influenza_seg2-iter4\n
TATGACTGGACACTGAACAGAAACCAACCAGC\n
GTGTTCAGATCGAATGGTCTGACAGCCAGTGA\n
GGATGTGATGGAATCAATGGATAAAGAAGAGAT\n
AAGAAGAGTGAGGGACAACATTACCAAGAAGAT\n
AAGCAGAGGCTGAACAAGAGGAGTTACTTAAT\n
>JX534572-Influenza_seg3-iter4\n
ACTTTGTGCGACAATGTTTCAATCCAATG\n
AATATGGGGAAGATCCGAAAATCGAAACA\n
GAAGTCTGTTTCATGTATTCAGACTTCCAC\n
TAGAATCTGGCGATCCGAACGCATTATTG\n
>KP732646-Influenza_seg4-iter4\n
TACATAGTGGAGAGGGATAAT\n
TGACTATGAAGAACTGAAACACCTG\n
TCATCCCTAAGAGTTCTTGGCCCAAT\n
TCCATACCAGGGAACGCCCTCCTTTTTC\n
>KX977633-Influenza_seg5-iter4\n
TCAGCATCATGGCGTCTCAAGGC\n
AGCGCCAGAATGCCACCGAGAT\n
AGGTTCTACATACAGATGTGCACT\n
>KX978671-Influenza_seg6-iter4\n
GAAGGTAATATGCATTTCAGCCACAGGA\n
AATTGCCAATTTGGGCCTAAACATCGGA\n
>EU735819-Influenza_seg7-iter4\n
CGAAACGTACGTTCTCTCTATTGTCCC\n
ACTTGAAGATGTCTTTGTAGGGAAGAA\n
GACAAGACCAATCCTGTCACCTCTGAC\n
GTGCCCAGTGAGCGAGGACTGCAGCG\n
TGGAGACCCAAACAATATGGACAGGGC\n
>CY028255-Influenza_seg8-iter4\n
GGTAGATTGCTTTCTTTGGT\n
CCCGTTCCTTGACCGGCTTC\n


And my desired output, because I end up in windows after this, would be

A-turkey-Scotland-123_456-2015.fasta

>A/turkey/Scotland/123-456/2015 PB2\r\n
AGAGATTTGATGTCGCAGTCTCGCACTCGCGAGATACTAATGGCCATAATCAAGAAATATACGTCAGGAAGACAGGAGAAATGGATGATGGCAATGAAATATCCGATTACAGCAGACCCTGAAAGAAATGAGCAAGGTCAGACTCTTTGGAGCAAG\r\n
>A/turkey/Scotland/123-456/2015 PB1\r\n
TATGACTGGACACTGAACAGAAACCAACCAGCGTGTTCAGATCGAATGGTCTGACAGCCAGTGAGGATGTGATGGAATCAATGGATAAAGAAGAGATAAGAAGAGTGAGGGACAACATTACCAAGAAGATAAGCAGAGGCTGAACAAGAGGAGTTACTTAAT\r\n
>A/turkey/Scotland/123-456/2015 PA\r\n
ACTTTGTGCGACAATGTTTCAATCCAATGAATATGGGGAAGATCCGAAAATCGAAACAGAAGTCTGTTTCATGTATTCAGACTTCCACTAGAATCTGGCGATCCGAACGCATTATTG\r\n
>A/turkey/Scotland/123-456/2015 HA\r\n
ACATAGTGGAGAGGGATAATTGACTATGAAGAACTGAAACACCTGTCATCCCTAAGAGTTCTTGGCCCAATTCCATACCAGGGAACGCCCTCCTTTTTC\r\n
>A/turkey/Scotland/123-456/2015 NP\r\n
TCAGCATCATGGCGTCTCAAGGCAGCGCCAGAATGCCACCGAGATCAGGTTCTACATACAGATGTGCACT\r\n\
>A/turkey/Scotland/123-456/2015 NA\r\n
GAAGGTAATATGCATTTCAGCCACAGGAAATTGCCAATTTGGGCCTAAACATCGGA\r\n
>A/turkey/Scotland/123-456/2015 MP\r\n
ACGTACGTTCTCTCTATTGTCCCACTTGAAGATGTCTTTGTAGGGAAGAAGACAAGACCAATCCTGTCACCTCTGAGTGCCCAGTGAGCGAGGACTGCAGCGTGGAGACCCAAACAATATGGACAGGGC\r\n
>A/turkey/Scotland/123-456/2015 NS\r\n
GGTAGATTGCTTTCTTTGGTCCCGTTCCTTGACCGGCTTC\r\n

0
Entering edit mode

I'd recommend quoting variables so:

#Now need to copy the file and rename it
cp -i -v "$fastafile" "${DIR}/${NewFileName}.fasta" fastacopy="${DIR}/${NewFileName}.fasta"  ADD REPLY 4 Entering edit mode read -p "Please enter the virus name: " virname "everytime you ask for an interaction or everytime you print your messages to stdout, god kills a kitten" Minimum Standards For Bioinformatics Command Line Tools ADD REPLY 0 Entering edit mode Maybe so, but maybe there are too many kittens in the world anyway. Can't think of any other way to tell the script what the name should be, it's not in another file. It needs to be made up by the operator. And I will remove most of the echo lines once I understand it's working. Was just using them to show me it was working how I expected :) ADD REPLY 1 Entering edit mode You do it as a flag to the command to start with :) My answer shows a complicated/robust way to do it, but at its most basic it can be done using the 'special' variables" $1, $2,$3... etc

e.g:

$bash myscript.sh arg1 arg2 arg3 ^ ^ ^ ^$0       $1$2   $3  So you could do: bash script.sh virusname fasta.fa  and virusname ==$1... etc.

0
Entering edit mode

Thank Ram I'll make those edits

2
Entering edit mode
3.1 years ago
Joe 19k

This is a little hacky because I'm missing something with sed about matching occurrence numbers rather than line numbers but I've given up!

NOTE: 2 caveats to this script functioning:

1. Your sequences must first be linearised if they aren't already. Here's a one liner which will let you do that: awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' < myseqs.fasta
2. That your sequences always occur in the same order your tags are listed in.

Since you seem to want a 'front-end' to the script, I've gone ahead and made it nice and 'commandline-y'.

#!/bin/bash

usage(){
cat << EOF

usage: $0 -i <input_fasta> -n <name_string> Redirect to an output file:$0 -i <input_fasta> -n <name_string>  >  Output.fasta

Required:
-i | --infile      Input fasta file
-n | --name        Viral name for prepending.

Options:
-h | --help        Show this message

EOF
}

for arg in "$@"; do shift case "$arg" in
"--help")     set -- "$@" "-h" ;; "--infile") set -- "$@" "-i"   ;;
"--outfile")  set -- "$@" "-o" ;; "--name") set -- "$@" "-n"   ;;
*)            set -- "$@" "$arg" ;;
esac
done
while getopts "hi:o:n:" OPTION ;do
case $OPTION in i) infile=$OPTARG   ;;
o) outfile=$OPTARG ;; n) name=$OPTARG     ;;
h) usage ; exit 0   ;;
esac
done

numseqs=$(grep -o ">" "$infile" | wc -l)
if [ "$numseqs" != 8 ] ; then echo >&2 "There are$numseqs in the file: $infile" echo >&2 "There should only be 8 sequences. Check your starting file, and try again." exit 1 fi if [[ -z "$infile" ]]; then
usage ; echo "No input file provided. Exiting." ; exit 1
fi
if [[ -z "$name" ]]; then usage ; echo "No virus name string provided. Exiting." ; exit 1 fi sed -e "1,/>/ s/^>.*/>${name}_PB2/" \
-e "3,/>/ s/^>.*/>${name}_PB1/" \ -e "5,/>/ s/^>.*/>${name}_PA/" \
-e "7,/>/ s/^>.*/>${name}_HA/" \ -e "9,/>/ s/^>.*/>${name}_NP/" \
-e "11,/>/ s/^>.*/>${name}_NA/" \ -e "13,/>/ s/^>.*/>${name}_MP/" \
-e "15,/>/ s/^>.*/>${name}_NS/" "$infile"


As a final note, I believe this will only work with GNU sed - i.e. sed that ships with BSD (like MacOS) needs slightly different syntax.

1
Entering edit mode

I’ll add that a better solution would probably be to use a proper parser like Biopython to do this, unless you’ve got a really good reason to do it all with bash.

0
Entering edit mode

Wow, thanks for the reply, I will give it a go. 1 The sequences aren't linearised yet but can be, thanks for the commands. 2. The sequences do always appear in the order listed I am using a virtual machine running centos7

No special reason for using bash, it's just the only way I have a vague knowledge of.

0
Entering edit mode

Yeah play with it and let me know how you get on, I admit I didn't test it for very long.

CentOS is a RHEL distribution, and my guess would be that it will be using the GNU sed so it's probably fine.

1. Don't ;)
2. You can use the tool unix2dos which should automatically convert all the line endings etc. (I believe its installed by default on most distributions)

1. Get input fasta and linearise it
2. Pass linear fasta to my script above (or whatever your variant of it becomes)
3. Output the edited fasta (it'll still be linear, but that shouldn't present any issues.
4. Adjust the file with unix2dos

Traffic: 1388 users visited in the last hour
FAQ
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.