Count duplicate sequences in a FASTA file using Python
3.9 years ago
jiseon824 • 0

Hello

I am new to Python and bioinformatics.

For some reason, I have to analyze data from a massive FASTA file.

I want to count repeated sequences using Python.

test.fasta

>1234
cagatcaccttgaagtcgtctgctcctacgctggtgaaacctacac

>456
cagatcaccttgaagtcgtctgctcctacgctggtgaaacctacac

>67
cagatcaccttgaagtcgtctgctcctacgctggtgaaacctacac

>123
cagatcaccttgaagtcgtctgctcctacgctggtgaaacctacac

>57
gccttctctgggttctcactcagcactagtggagtgggtgtgggctggatccgtaagcccccaggaaaggccctggagtggcttgcactca

>35
cagatcaccttgaagtcgtctgctcctacgctggtgaaacctacac

>123
gccttctctgggttctcactcagcactagtggagtgggtgtgggctggatccgtaagcccccaggaaaggccctggagtggcttgcactca

>222
gccttctctgggttctcactcagcactagtggagtgggtgtgggctggatccgtaagcccccaggaaaggccctggagtggcttgcactca

Because I am new to Python, I unfortunately couldn't write any code myself. I searched online but couldn't find any example code that I could copy and follow.

Can someone help me count how many times each sequence is duplicated?

If a reference is needed, I can make a reference file (CSV or FASTA).

[What I want in the CSV file is] the sequence and the number of times it is repeated:

cagatcaccttgaagtcgtctgctcctacgctggtgaaacctacac    5    
gccttctctgggttctcactcagcactagtggagtgggtgtgggctggatccgtaagcccccaggaaaggccctggagtggcttgcactca    3

or display the ID from the reference file and the number of times it is repeated:

ref#1       5
ref#2       3
.
.

Thank you in advance

rna-seq • 2.9k views

Are these full-length sequences that you want to check for repeats, or are you interested in the number of occurrences of a specific set of subsequence patterns?


Hi

I want to count the number of occurrences of specific reference sequences listed in a reference file. For example, if I make a reference file as below:

>ref#1

cagatcaccttgaagtcgtctgctcctacgctggtgaaacctacac

>ref#2

gccttctctgggttctcactcagcactagtggagtgggtgtgggctggatccgtaagcccccaggaaaggccctggagtggcttgcactca

Then it would count the frequency of each reference sequence. The actual reference sequences are longer than in the example, usually more than 500 bp. I have a FASTA file and need to analyze it to count the number of sequence reads for each reference.
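The reference-based counting described here can be sketched in plain Python. This is only a minimal sketch under a strong assumption: a read is counted for a reference only when its full sequence is identical to the reference sequence (ignoring upper/lower case). The function names `parse_fasta` and `count_by_reference` are made up for illustration and are not from any library. If reads only partially match or contain sequencing errors, exact comparison will miss them and an alignment tool would be needed instead.

```python
from collections import Counter


def parse_fasta(lines):
    """Yield (record ID, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines between records
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)  # sequences may span several lines
    if header is not None:
        yield header, "".join(chunks)


def count_by_reference(ref_lines, read_lines):
    """Map each reference ID to the number of reads with an identical sequence.

    Assumption: exact, full-length, case-insensitive matches only.
    """
    read_counts = Counter(seq.lower() for _, seq in parse_fasta(read_lines))
    return {ref_id: read_counts[seq.lower()]
            for ref_id, seq in parse_fasta(ref_lines)}


# Hypothetical usage (file names are placeholders):
# with open("reference.fasta") as refs, open("test.fasta") as reads:
#     for ref_id, n in count_by_reference(refs, reads).items():
#         print(ref_id, n, sep="\t")
```

The reads are tallied once into a `Counter`, so each reference lookup is a single dictionary access and the large file is read line by line rather than loaded whole.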


Duplicates by sequence or by ID?


Hi

You should go through this link: https://stackoverflow.com/questions/55226949/how-to-get-the-count-of-duplicated-sequences-in-fasta-file-using-python

You can easily redirect the output to a CSV file or any other format you want.
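For readers who want something self-contained, counting identical sequences and writing them to CSV might look like the sketch below. It uses only the standard library; the helper names (`read_fasta`, `count_sequences`, `write_counts`) are hypothetical and this is not the code from the linked answer. It assumes that "duplicate" means the whole sequence is identical, ignoring upper/lower case, and that the record IDs do not matter.

```python
import csv
from collections import Counter


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)  # join multi-line sequences
    if header is not None:
        yield header, "".join(chunks)


def count_sequences(lines):
    """Count how many records carry each exact sequence (case-insensitive)."""
    return Counter(seq.lower() for _, seq in read_fasta(lines))


def write_counts(counts, out):
    """Write sequence,count rows as CSV, most frequent first."""
    writer = csv.writer(out)
    writer.writerow(["sequence", "count"])
    writer.writerows(counts.most_common())


# Hypothetical usage (file names are placeholders):
# with open("test.fasta") as handle, open("counts.csv", "w", newline="") as out:
#     write_counts(count_sequences(handle), out)
```

Because the file is processed one line at a time, memory use grows with the number of distinct sequences rather than with the total file size, which should help with a large input.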


Thank you so much. It is working well. :) I hope it also works well with my massive data.
