Question: Count duplicate sequences in a FASTA file using Python
jiseon824 wrote (5 weeks ago):

Hello,

I am new to Python and bioinformatics.

I need to analyze data from a massive FASTA file.

I want to count repeated sequences using Python.

test.fasta

>1234
cagatcaccttgaagtcgtctgctcctacgctggtgaaacctacac

>456
cagatcaccttgaagtcgtctgctcctacgctggtgaaacctacac

>67
cagatcaccttgaagtcgtctgctcctacgctggtgaaacctacac

>123
cagatcaccttgaagtcgtctgctcctacgctggtgaaacctacac

>57
gccttctctgggttctcactcagcactagtggagtgggtgtgggctggatccgtaagcccccaggaaaggccctggagtggcttgcactca

>35
cagatcaccttgaagtcgtctgctcctacgctggtgaaacctacac

>123
gccttctctgggttctcactcagcactagtggagtgggtgtgggctggatccgtaagcccccaggaaaggccctggagtggcttgcactca

>222
gccttctctgggttctcactcagcactagtggagtgggtgtgggctggatccgtaagcccccaggaaaggccctggagtggcttgcactca

Because I am new to Python, I unfortunately could not write any code. I searched the web but could not find any example code that I could copy and follow.

Can someone help me count how many times each sequence is duplicated?

If a reference is needed, I can make a reference file (CSV or FASTA).

[What I want, in a CSV file:] each sequence and the number of times it is repeated

cagatcaccttgaagtcgtctgctcctacgctggtgaaacctacac    5    
gccttctctgggttctcactcagcactagtggagtgggtgtgggctggatccgtaagcccccaggaaaggccctggagtggcttgcactca    3
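
One possible way to get the table above is to read the FASTA records and tally identical sequences with `collections.Counter`. The following is a minimal sketch, not a definitive implementation: the helper names `read_fasta`, `count_sequences`, and `main` are made up for illustration, and it assumes that two records are duplicates only when their full sequences are identical (ignoring upper/lower case).

```python
from collections import Counter


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        else:
            chunks.append(line)  # sequences may span several lines
    if header is not None:
        yield header, "".join(chunks)


def count_sequences(lines):
    """Count how many records share each exact sequence (case-insensitive)."""
    return Counter(seq.lower() for _, seq in read_fasta(lines))


def main(path):
    """Print each distinct sequence and its count, most frequent first."""
    # Iterating over the open file reads one line at a time,
    # so the whole file is never loaded at once.
    with open(path) as handle:
        counts = count_sequences(handle)
    for seq, n in counts.most_common():
        print(f"{seq}\t{n}")
```

Calling `main("test.fasta")` on the example file above should list the first sequence with a count of 5 and the second with a count of 3. Memory use grows with the number of distinct sequences, not with the number of reads.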

Or display the ID from the reference file and the number of repeats:

ref#1       5
ref#2       3
.
.

Thank you in advance

rna-seq
modified 5 weeks ago by RamRS • written 5 weeks ago by jiseon824

Are these full-length sequences that you want to check for repeats, or are you interested in the number of occurrences of a specific set of subsequence patterns?

written 5 weeks ago by Joe

Hi,

I want to count the number of occurrences of each specific reference sequence listed in a reference file. For example, if I make a reference file as below:

>ref#1
cagatcaccttgaagtcgtctgctcctacgctggtgaaacctacac

>ref#2
gccttctctgggttctcactcagcactagtggagtgggtgtgggctggatccgtaagcccccaggaaaggccctggagtggcttgcactca

Then it should count the frequency based on the reference. The actual reference sequences are longer than in this example, usually more than 500 bp. I have a FASTA file, and I need to analyze it to count the number of sequence reads for each reference.
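
The reference-based count described here could be sketched as below. This is only an illustration under a strong assumption: a read is counted for a reference only when its full sequence is identical to that reference sequence (ignoring case). Reads that only partly overlap a reference would not be counted by this approach. The names `read_fasta` and `count_by_reference` are hypothetical.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def count_by_reference(ref_lines, read_lines):
    """Return ({reference ID: read count}, number of unmatched reads).

    Assumption: a read matches a reference only if the two sequences
    are exactly equal after lower-casing.
    """
    # Map each reference sequence to its ID for constant-time lookup.
    seq_to_ref = {seq.lower(): name for name, seq in read_fasta(ref_lines)}
    counts = {name: 0 for name in seq_to_ref.values()}
    unmatched = 0
    for _, seq in read_fasta(read_lines):
        name = seq_to_ref.get(seq.lower())
        if name is None:
            unmatched += 1
        else:
            counts[name] += 1
    return counts, unmatched
```

Both arguments can be open file handles, for example the reference file and `test.fasta`. With the two references above and the example reads from the question, `ref#1` should get 5 and `ref#2` should get 3.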

modified 5 weeks ago by genomax • written 5 weeks ago by jiseon824

Duplicates by sequence or by ID?

written 5 weeks ago by cpad0112

Hi,

You should go through this link: https://stackoverflow.com/questions/55226949/how-to-get-the-count-of-duplicated-sequences-in-fasta-file-using-python

You can easily redirect the output to a CSV file or whatever format you want.
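
As a rough sketch of the CSV step (this is not the code from the linked answer), counts held in a `collections.Counter` can be written with the standard `csv` module to standard output, which can then be redirected to a file. The function name `write_counts_csv` and the column names are assumptions made for this example.

```python
import csv
import sys
from collections import Counter


def write_counts_csv(counts, out=sys.stdout):
    """Write 'sequence,count' rows, most frequent sequence first."""
    writer = csv.writer(out)
    writer.writerow(["sequence", "count"])  # header row (assumed column names)
    for seq, n in counts.most_common():
        writer.writerow([seq, n])


# Toy counts for illustration only; real counts would come from the FASTA file.
example_counts = Counter({"acgt": 2, "ttgg": 1})
write_counts_csv(example_counts)
```

If this were saved as a script (say, a hypothetical `count_duplicates.py`), running `python count_duplicates.py > counts.csv` would place the table in `counts.csv`.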

written 5 weeks ago by gayachit

Thank you so much. It is working well. :) I hope it also works well with my massive data.

written 5 weeks ago by jiseon824