Question

Get size information from an OTU table into a fasta file for use in USEARCH/VSEARCH

6.0 years ago

EBP91 ▴ 50

I have two files (see below for the actual format): a fasta file with more than 7000 sequences and a .txt file with two columns. The first column in the .txt file corresponds to the name in the fasta file (minus the trailing ';size=') and the second column gives the total number of sequences corresponding to that name. I would like to append this size to the matching header in the fasta file. In other words, I would like the number '6047', which corresponds to Zotu1, to end up in the fasta header as '>Zotu1;size=6047'. The Zotus in the text file are not sorted.

I have no clue how to go about this, so any pointers in the right direction would be greatly appreciated!

Thanks!

The files:

1) the fasta file looks like this:

>Zotu1;size=
AGCTCCAAAAGCGTATATTAAAGTTGTTGCAGTTAAAAAGCTCGTAGTTGAACTTCTGTTCAGGTTCATTTCGACTCGTC
GAGTGAAACTGGACATACGTTTGCAAACTAAAATCGGCCTTCACTGGTTCGTCTTAGGGAGTAAACATTTTACTGTGAAA
AAATTAGAGTGTTCCAGGCAGGTTTTAGCCCGAATACATTAGCATGGAATAATGGAATAGGACTAAGTCCATTTTATTGG
TTCTTGGATTTGGTAATGATTAATAGGGGCAGTTGGGGGCATTAGTATTTAATAGTCAGAGGTGAAATTCTTGGATTTAT
TAAGGACTAACTAATGCGAAAGCATTTGCCAAAGATGTTTTCA

>Zotu2;size=
AGCTCCAATAGCGTATATTTAAGTTGTTGCAGTTAAAAAGCTCGTAGTTGGATCTTGGGTCGGGGGCAGCGGTCCGCCCC
TTGTGGGTGTGCACTGGTCCACCCGGCCTTACTGCCGGGGACGCGCTCCTGGCCTTCGCTGGTCGGGACGCGGAGTTGGC
GATGTTACTTTGAAAAAATTAGAGTGCTCAAAGCAAGCCTATGCTCTGAATACATTAGCATGGAATAACGTGATAGGACT
...

2) the .txt file looks like this:

Zotu1   604  
Zotu566 1023
Zotu6785        31
Zotu6   111453
Zotu69  10380
Zotu223 3706 
Zotu215 2559
Zotu2697        109   
Zotu3   211288
Zotu742 697

...

fasta otu table header • 1.9k views

ADD COMMENT • link 5.9 years ago by EBP91 ▴ 50

"the second column gives the total number of sequences corresponding with that name"

So you have 6047 sequences named Zotu1 in your fasta file?

By the way, your fasta file is not really a fasta file: there is no '>' before each header.

Does your fasta file really look like what is displayed here?

ADD REPLY • link 6.0 years ago by Bastien Hervé 5.3k

1) No, I have one sequence named Zotu1 in this fasta file. However, based on my reference mapping (-usearch_global command), I know that 6047 sequences in my total dataset were mapped to the reference 'Zotu1'. These sequences might be, but do not have to be, identical to Zotu1. I now want to get that size information into the file with the reference Zotus.

2) No, there is indeed a '>' in front of the sequences, but that one somehow disappeared in the message.
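
For anyone wondering where the per-Zotu totals in point 1 come from, here is a minimal Python sketch of how such counts could be tallied. It assumes the mapping hits were written as a two-column, tab-separated file of query and target names (which, as far as I understand, is what `-userout` with `-userfields query+target` would give); the read names and the helper `count_hits` are made up for illustration.

```python
from collections import Counter
import io


def count_hits(hit_lines):
    """Count how many query sequences were mapped to each target name."""
    counts = Counter()
    for line in hit_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        counts[fields[1]] += 1  # second column is assumed to be the target
    return counts


# Hypothetical hits: query name, then the Zotu it was mapped to.
hits = io.StringIO("read1\tZotu1\nread2\tZotu1\nread3\tZotu6\n")
counts = count_hits(hits)
for name, n in sorted(counts.items()):
    print(f"{name}\t{n}")
```

This counts each mapped read once; if the reads themselves carried size annotations, those would have to be summed instead.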

ADD REPLY • link 6.0 years ago by EBP91 ▴ 50

I think you can find answers in these threads:

replace fasta headers with another name in a text file

Renaming fasta headers according to a matching name list

ADD REPLY • link 6.0 years ago by Bastien Hervé 5.3k

Thanks for the tips!

I have been playing around with the Python script I found in "replace fasta headers with another name in a text file" and managed to get it to work in a rather archaic, yet effective, way:

1) get rid of the 'Zotu' notation in the .txt file,
2) sort the .txt file using the 'sort' command in Linux,
3) add 'Zotu' and ';size=' back to the .txt file in their respective places using Excel (yes, I know),
4) get rid of the tabs, and
5) apply the Python script to the new .txt file and the original fasta file.

I thought this might be interesting for anyone out there who is also totally new to the field but needs to move along 'fast' with some new data...
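
The sorting and Excel steps can probably be avoided by looking up each header in a dictionary. The following is a minimal Python sketch of that idea, not the script from the linked thread. It assumes the table is whitespace-separated (name, then count) and that every header looks like '>ZotuN;size='. The helper names `read_sizes` and `annotate_fasta` are made up, and the sequences in the demo are short placeholders.

```python
import io


def read_sizes(table_lines):
    """Map each Zotu name to its size from a two-column table."""
    sizes = {}
    for line in table_lines:
        fields = line.split()  # handles tabs, runs of spaces, trailing blanks
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        sizes[fields[0]] = int(fields[1])
    return sizes


def annotate_fasta(fasta_lines, sizes):
    """Yield fasta lines with ';size=N' filled in on each known header."""
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            name = line[1:].split(";")[0]  # '>Zotu1;size=' -> 'Zotu1'
            if name in sizes:
                line = f">{name};size={sizes[name]}"
            # headers missing from the table are left unchanged
        yield line


# Table rows copied from the excerpt above; the order does not matter.
table = io.StringIO("Zotu566 1023\nZotu3   211288\nZotu1   604  \n")
# Placeholder sequences, not the real Zotu sequences.
fasta = io.StringIO(">Zotu1;size=\nAGCT\nCCAA\n\n>Zotu3;size=\nAGCG\n")

result = list(annotate_fasta(fasta, read_sizes(table)))
print("\n".join(result))
```

For real data, the two `io.StringIO` objects would be replaced by files opened with `open(...)`, and the result written to a new fasta file rather than printed.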

ADD REPLY • link 6.0 years ago by EBP91 ▴ 50

Did this Python code solve your issue?

ADD REPLY • link 6.0 years ago by Bastien Hervé 5.3k

Yes! Thanks again for your tip.

ADD REPLY • link 5.9 years ago by EBP91 ▴ 50