Question

Concatenation of sequences (fasta) with different headers based on reference list

0

Entering edit mode

4.4 years ago

jan • 0

Hello, I have two fasta files (trimmed alignments) and need to concatenate sequences from both of them. The headers of the sequences are NOT identical. The concatenation should be done according to a provided reference list. Any recommendations? Many thanks in advance!

Fasta file_1:

>Apple_1-1.1
AAAAAAAAAAAAACCCCCCCCCCC
>Tiger-A_a.a
ATATATATATATATATATATATAT
>Lemon
AAAAAAAAAAAAAAAAAAAAAAAA

Fasta file_2:

>Fruit_2.2-2
GGGGGGGGGGGGGTTTTTTTTTTT
>Animal-Xya.Za-1
GCGCGCGCGCGCGCGCGCGCGCGC
>Tree
TTTTTTTTTTTTTTTTTTTTTTTT

Reference list:

Apple_1-1.1 Fruit_2.2-2
Tiger-A_a.a Animal-Xya.Za-1
Lemon
            Tree

Required output:

>Apple_1-1.1_Fruit_2.2-2
AAAAAAAAAAAAACCCCCCCCCCCGGGGGGGGGGGGGTTTTTTTTTTT
>Tiger-A_a.a_Animal-Xya.Za-1
ATATATATATATATATATATATATGCGCGCGCGCGCGCGCGCGCGCGC
>Lemon
AAAAAAAAAAAAAAAAAAAAAAAA------------------------
>Tree
------------------------TTTTTTTTTTTTTTTTTTTTTTTT

sequence alignment • 1.3k views

ADD COMMENT • link updated 4.4 years ago by gabt ▴ 120 • written 4.4 years ago by jan • 0

0

Entering edit mode

how do you know how many spaces are required after the Tree sequence? Or before the Lemon sequence?

Also, you did not mention which tool, i.e. R, python, you would like to use!

ADD REPLY • link 4.4 years ago by gabt ▴ 120

0

Entering edit mode

The input files are trimmed alignments, so the length is defined. As for the tool, python, perl or awk would be perfect.

ADD REPLY • link 4.4 years ago by jan • 0

0

Entering edit mode

What components of the headers differentiate/group the sequences?

e.g. will the Tiger sequence always begin "Tiger-A"? Or is it Tiger-A different from Tiger-something else?

ADD REPLY • link 4.4 years ago by Joe 21k

0

Entering edit mode

Yes, the names differ a lot, so it needs to work with the reference list.

ADD REPLY • link 4.4 years ago by jan • 0

0

Entering edit mode

Yes, but what 'connects' the headers? Being on the same line in the key file?

ADD REPLY • link 4.4 years ago by Joe 21k

0

Entering edit mode

No, not the same line. The headers are "connected" by identical names in the reference file.

ADD REPLY • link 4.4 years ago by jan • 0

score 1 · Accepted Answer · 2019-11-12

I'm sorry to tell you that I don't know Python, Perl or Awk well enough to provide an answer. But, I have a working one written in R. You can find it with my comments, so you can understand how it works and, maybe, translate it into Python. Consider that I have three files called fasta_one, fasta_two and refList.

I'm assuming that the fasta sequences are all the same length and that the first row of each fasta file is a reference, and the second row, i.e. [2, 1] means [second row, first column], being a valid fasta sequence which I'm using to compute the length.

refList is tab separated, i.e. it has two columns for those lines that contains two correct references.

Also, notice that I didn't add a condition that tells you if no references are found. Hence, if the files are not compiled accurately, this code may results in errors!

This is the code:

#read files
#stringsAsFactors is set to False, so that R consider the strings as actual strings, the field separator is Tab, and no header is provided in the files!
fastaOne <- read.csv("fasta_one", header=F, sep="\t", stringsAsFactors=F)
fastaTwo <- read.csv("fasta_two", header=F, sep="\t", stringsAsFactors=F)
refList <- read.csv("refList", header=F, sep="\t", stringsAsFactors=F)

#get length of first available fasta sequence, assuming the fasta files are consistent (see comment above about the fasta files)
seqLength <- nchar(fastaOne[2, 1])

#now loop through the refList
for (line in 1:nrow(refList)) {

    #test wether the first line contains actual references
    if (refList[line, 1] != "" && refList[line, 2] != "") {
        #if yes, locate it in fastaOne
        fOne <- which(fastaOne==paste(">", refList[line, 1], sep=""))
        fTwo <- which(fastaTwo==paste(">", refList[line, 2], sep=""))
        print(paste(refList[line, 1], refList[line, 2], sep=" "))
        print(paste(fastaOne[fOne+1, 1], fastaTwo[fTwo+1, 1], sep=""))
    }

    #otherwise the first contains a meaningful reference and the second is empty
    else if (refList[line, 1] != "" && refList[line, 2] == ""){
        #if yes, locate them in fastaOne and fastaTwo
        fOne <- which(fastaOne==paste(">", refList[line, 1], sep=""))
        #then print the refList
        print(paste(refList[line, 1], refList[line, 2], sep=" "))
        #and paste it with dashed-line
        print(paste(fastaOne[fOne+1, 1], paste(rep("-", seqLength), collapse=""), sep=""))

    }
    #or the first is empty and the second contains a meaningful reference
    else if (refList[line, 1] == "" && refList[line, 2] != ""){
        #if yes, locate it in fastaTwo
        fTwo <- which(fastaTwo==paste(">", refList[line, 2], sep=""))
        #then print the refList
        print(paste(refList[line, 1], refList[line, 2], sep=" "))
        #and paste it with dashed-line
        print(paste(paste(rep("-", seqLength), collapse=""), fastaTwo[fTwo+1, 1], sep=""))
    }
}

Example output with a wrong reference:

[1] "Apple_1-1.1 Fruit_2.2-2"
[1] "AAAAAAAAAAAAACCCCCCCCCCCGGGGGGGGGGGGGTTTTTTTTTTT"
[1] "Tiger-A_a.a Animal-Xya.Za-1"
[1] "ATATATATATATATATATATATATGCGCGCGCGCGCGCGCGCGCGCGC"
[1] "Lemon "
[1] "AAAAAAAAAAAAAAAAAAAAAAAA------------------------"
[1] " Tree"
[1] "------------------------TTTTTTTTTTTTTTTTTTTTTTTT"
[1] "wrongRef "
[1] "------------------------"