Question: Get size information from an OTU table into a FASTA file for use in USEARCH/VSEARCH
11 months ago, EBP91 wrote:

I have two files (see below for the actual format): a fasta file with more than 7000 sequences and a .txt file consisting of two columns. The first column in the .txt file corresponds to the name in the fasta file (minus the tail ';size=') and the second column gives the total number of sequences corresponding to that name. Now, I would like to append this size information to the header of the matching sequence in the fasta file. In other words: I would like the number '6047', which corresponds to Zotu1, to end up in the fasta file as '>Zotu1;size=6047'. The Zotus in the text file are not sorted.

I have no clue how to go about this so any pointing in the right direction would be extremely appreciated!

Thanks!

The files:

1) the fasta file looks like this:

>Zotu1;size=
AGCTCCAAAAGCGTATATTAAAGTTGTTGCAGTTAAAAAGCTCGTAGTTGAACTTCTGTTCAGGTTCATTTCGACTCGTC
GAGTGAAACTGGACATACGTTTGCAAACTAAAATCGGCCTTCACTGGTTCGTCTTAGGGAGTAAACATTTTACTGTGAAA
AAATTAGAGTGTTCCAGGCAGGTTTTAGCCCGAATACATTAGCATGGAATAATGGAATAGGACTAAGTCCATTTTATTGG
TTCTTGGATTTGGTAATGATTAATAGGGGCAGTTGGGGGCATTAGTATTTAATAGTCAGAGGTGAAATTCTTGGATTTAT
TAAGGACTAACTAATGCGAAAGCATTTGCCAAAGATGTTTTCA

>Zotu2;size=
AGCTCCAATAGCGTATATTTAAGTTGTTGCAGTTAAAAAGCTCGTAGTTGGATCTTGGGTCGGGGGCAGCGGTCCGCCCC
TTGTGGGTGTGCACTGGTCCACCCGGCCTTACTGCCGGGGACGCGCTCCTGGCCTTCGCTGGTCGGGACGCGGAGTTGGC
GATGTTACTTTGAAAAAATTAGAGTGCTCAAAGCAAGCCTATGCTCTGAATACATTAGCATGGAATAACGTGATAGGACT
...

2) the .txt file looks like this:

Zotu1      604
Zotu566    1023
Zotu6785   31
Zotu6      111453
Zotu69     10380
Zotu223    3706
Zotu215    2559
Zotu2697   109
Zotu3      211288
Zotu742    697

...

otu table header fasta • 498 views
modified 9 months ago • written 11 months ago by EBP91

"the second column gives the total number of sequences corresponding with that name"

So you have 6047 sequences named Zotu1 in your fasta file?

By the way, your fasta file is not really a fasta file: you do not have a '>' before each header.

Does your fasta file really look like what is displayed here?

written 11 months ago by Bastien Hervé

1) No, I have one sequence with the name Zotu1 in this fasta file. However, based on my reference mapping (-usearch_global command), I know that 6047 sequences in my total dataset have been mapped to the reference 'Zotu1'. These sequences might be, but do not necessarily have to be, identical to Zotu1. I now want to get that size information into the actual file containing the reference Zotus.

2) No, there is indeed a '>' in front of the sequences, but that one somehow disappeared in the message.

modified 11 months ago • written 11 months ago by EBP91

You can probably find answers in these threads:

replace fasta headers with another name in a text file

Renaming fasta headers according to a matching name list

written 11 months ago by Bastien Hervé

Thanks for the tips!

I have been playing around with the Python script I found in "replace fasta headers with another name in a text file" and managed to get it to work in a rather archaic, yet effective, way by 1) getting rid of the 'Zotu' notation in the .txt file, 2) sorting the .txt file using the 'sort' command in Linux, 3) adding 'Zotu' and ';size=' to the .txt file in their respective places using Excel (yes, I know), 4) getting rid of the tabs, and 5) applying the Python script to the new .txt file and the original fasta file. I thought this might be interesting for anyone out there who is also totally new to the field but needs to move along 'fast' with some new data...
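
For anyone who wants to skip the sorting and Excel steps: the same result can probably be had with a dictionary lookup, which does not care about the order of the .txt file. This is only a minimal sketch, not the script from the linked thread. It assumes the table is whitespace-separated with the name in the first column and the size in the second, and that every header starts with '>' followed by the name and ';size='. The function names and the file names in the usage comment (zotus.fa, otu_sizes.txt, zotus_sized.fa) are made up for illustration.

```python
def load_sizes(table_lines):
    """Map each name (column 1) to its size (column 2)."""
    sizes = {}
    for line in table_lines:
        fields = line.split()  # splits on tabs or spaces, ignores trailing blanks
        if len(fields) < 2:
            continue  # skip empty or incomplete lines
        sizes[fields[0]] = fields[1]
    return sizes


def annotate_headers(fasta_lines, sizes):
    """Yield fasta lines, rewriting '>name;size=' headers to '>name;size=N'."""
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # Assumption: the name is everything between '>' and the first ';'
            name = line[1:].split(";", 1)[0]
            if name in sizes:
                line = f">{name};size={sizes[name]}"
            # headers without an entry in the table are left untouched
        yield line


def add_sizes(fasta_path, table_path, out_path):
    """Read both files and write a new fasta file with sizes filled in."""
    with open(table_path) as table:
        sizes = load_sizes(table)
    with open(fasta_path) as fasta, open(out_path, "w") as out:
        for line in annotate_headers(fasta, sizes):
            out.write(line + "\n")


# Hypothetical usage (file names are placeholders):
# add_sizes("zotus.fa", "otu_sizes.txt", "zotus_sized.fa")
```

Headers whose name is missing from the table keep their empty ';size=' tail, so it is worth checking the output for those before running USEARCH/VSEARCH on it.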

modified 11 months ago • written 11 months ago by EBP91

Did you solve your issue with this Python code?

written 10 months ago by Bastien Hervé

Yes! Thanks again for your tip.

written 9 months ago by EBP91