Question: Count duplicate sequences in a FASTA file using Python
4 months ago, jiseon824 wrote:

Hello,

I am new to Python and bioinformatics.

I need to analyze data from a very large FASTA file.

I want to count the repeated sequences using Python.

test.fasta

>1234
cagatcaccttgaagtcgtctgctcctacgctggtgaaacctacac

>456
cagatcaccttgaagtcgtctgctcctacgctggtgaaacctacac

>67
cagatcaccttgaagtcgtctgctcctacgctggtgaaacctacac

>123
cagatcaccttgaagtcgtctgctcctacgctggtgaaacctacac

>57
gccttctctgggttctcactcagcactagtggagtgggtgtgggctggatccgtaagcccccaggaaaggccctggagtggcttgcactca

>35
cagatcaccttgaagtcgtctgctcctacgctggtgaaacctacac

>123
gccttctctgggttctcactcagcactagtggagtgggtgtgggctggatccgtaagcccccaggaaaggccctggagtggcttgcactca

>222
gccttctctgggttctcactcagcactagtggagtgggtgtgggctggatccgtaagcccccaggaaaggccctggagtggcttgcactca

Because I am new to Python, I have not been able to write any code myself. I searched online but could not find any example code that I could copy and follow.

Can someone help me count how many times each sequence is duplicated?

If a reference is needed, I can make a reference file (CSV or FASTA).

What I want is a CSV file with each sequence and the number of times it is repeated:

cagatcaccttgaagtcgtctgctcctacgctggtgaaacctacac    5    
gccttctctgggttctcactcagcactagtggagtgggtgtgggctggatccgtaagcccccaggaaaggccctggagtggcttgcactca    3

Or, alternatively, the ID from the reference file and the number of times it is repeated:

ref#1       5
ref#2       3
.
.

Thank you in advance

rna-seq
written 4 months ago by jiseon824 • modified 4 months ago by RamRS30k

Do you want to know whether these full-length sequences are repeated, or are you interested in the number of occurrences of a specific set of subsequence patterns?

written 4 months ago by Joe18k

Hi

I want to count the number of occurrences of each reference sequence listed in a reference file. For example, suppose I make a reference file like the one below:

>ref#1
cagatcaccttgaagtcgtctgctcctacgctggtgaaacctacac

>ref#2
gccttctctgggttctcactcagcactagtggagtgggtgtgggctggatccgtaagcccccaggaaaggccctggagtggcttgcactca

Then it should count the frequency of each sequence based on the reference. The actual reference sequences are longer than in this example, usually more than 500 bp. I have a FASTA file and need to analyze it to count the number of sequence reads for each reference.
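The reference-based count described here could be sketched in Python roughly as follows. This is only a minimal sketch under some assumptions: a read is counted only when it is identical to a reference over its full length (compared case-insensitively), both files are plain FASTA, and the names `read_fasta` and `count_by_reference` are hypothetical helpers made up for illustration. Reads that are only part of a longer reference would need an aligner or a substring search instead.

```python
def read_fasta(path):
    """Yield (header, sequence) pairs one record at a time (low memory use)."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines between records
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:].strip(), []
            else:
                chunks.append(line)  # sequences may span several lines
    if header is not None:
        yield header, "".join(chunks)


def count_by_reference(reads_path, ref_path):
    """Return {reference ID: number of reads identical to that reference}.

    Assumption: exact, full-length, case-insensitive matches only.
    """
    ref_ids = {seq.lower(): ref_id for ref_id, seq in read_fasta(ref_path)}
    counts = {ref_id: 0 for ref_id in ref_ids.values()}
    for _, seq in read_fasta(reads_path):
        ref_id = ref_ids.get(seq.lower())
        if ref_id is not None:
            counts[ref_id] += 1
    return counts
```

With the example `test.fasta` from the question and the two-record reference file above, `count_by_reference` would be expected to report 5 for `ref#1` and 3 for `ref#2`.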

written 4 months ago by jiseon824 • modified 4 months ago by genomax91k

Do you mean duplicates by sequence or by ID?

written 4 months ago by cpad011214k

Hi

Take a look at this link: https://stackoverflow.com/questions/55226949/how-to-get-the-count-of-duplicated-sequences-in-fasta-file-using-python

You can easily redirect the output to a CSV file or any other format you want.
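For readers who want something to start from, here is a minimal sketch of one way to count identical sequences and write them to CSV using only the Python standard library. It is not taken from the linked answer. The function names (`read_fasta`, `count_sequences`, `write_counts`) are made up for illustration, and the sketch assumes that "duplicate" means an identical full-length sequence, compared case-insensitively.

```python
import csv
from collections import Counter


def read_fasta(path):
    """Yield (header, sequence) pairs one record at a time (low memory use)."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines between records
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:].strip(), []
            else:
                chunks.append(line)  # sequences may span several lines
    if header is not None:
        yield header, "".join(chunks)


def count_sequences(path):
    """Count how often each full-length sequence occurs (case-insensitive)."""
    return Counter(seq.lower() for _, seq in read_fasta(path))


def write_counts(counts, out_path):
    """Write sequence,count rows to a CSV file, most frequent first."""
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["sequence", "count"])
        writer.writerows(counts.most_common())
```

Calling `write_counts(count_sequences("test.fasta"), "counts.csv")` would then write one row per distinct sequence (`counts.csv` is just a placeholder file name). Only the distinct sequences and their counts are held in memory, which should help with a very large file as long as the number of distinct sequences stays manageable.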

written 4 months ago by gayachit200

Thank you so much. It is working well. :) I hope it also works well with my massive data.

written 4 months ago by jiseon824