Question

How can I eliminate duplication in a FASTA file?

0

4.0 years ago

Bioinfo ▴ 20

Hello Biostars! I hope you're doing well. I have a question: I am trying to build a database for a bacterial genus from all of the published sequences, so that I can calculate the coverage of my reads against this database using bowtie2 for mapping. To do that, I merged all the genome sequences I downloaded from NCBI into one FASTA library (74 files merged into one FASTA file). The problem is that this library contains a lot of duplicated sequences, and that strongly affects the coverage. Is there any way to eliminate the duplication in my library file, or to merge the sequences without introducing the duplication? Alternatively, is there another way to calculate the coverage of my reads against reference sequences?

Assembly alignment sequencing assembly • 831 views

ADD COMMENT • link updated 4.0 years ago by GenoMax 141k • written 4.0 years ago by Bioinfo ▴ 20

0

Hello Bioinfo!

This topic has been addressed multiple times on the site. Please see posts here: https://www.biostars.org/local/search/page/?q=fasta+remove+duplicates

For this reason we have closed your question. This allows us to keep the site focused on the topics that the community can help with.

If you disagree please tell us why in a reply below, we'll be happy to talk about it.

Cheers!

ADD REPLY • link 4.0 years ago by Ram 43k

1

The original poster actually means something else entirely: they are calling the similar regions in their reference genomes "duplicated" regions.

ADD REPLY • link 4.0 years ago by Istvan Albert 100k

score 0 · Answer 1 · 2020-04-22

Regions of high similarity between genomes are not called duplicated sequences.

Those are just that: similar regions.

In your case, don't worry about this at all. Your mapper will flag the multiply mapped reads in some way (for example, bwa will set the mapping quality to zero and will produce an XA tag, I think), so from the alignments you can always tell what the coverage of the unique regions is. Just filter on mapping quality to keep only the uniquely mapped reads.
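As a rough illustration of that filtering step, here is a minimal Python sketch that keeps only alignments with a non-zero mapping quality. It assumes plain-text SAM input and a threshold of MAPQ >= 1; the function name `uniquely_mapped` and the example records are made up for the illustration, and the right threshold depends on how your mapper assigns mapping quality.

```python
def uniquely_mapped(sam_lines, min_mapq=1):
    """Yield SAM alignment lines that are mapped with MAPQ >= min_mapq.

    min_mapq=1 is an assumption: it drops reads whose mapping quality
    was set to zero, which is how multiply mapped reads are described above.
    """
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line, not an alignment
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete SAM record
        flag = int(fields[1])
        mapq = int(fields[4])
        if flag & 4:
            continue  # flag bit 0x4 means the read is unmapped
        if mapq >= min_mapq:
            yield line


# Hypothetical records: one unique hit, one multi-mapped hit, one unmapped read.
seq, qual = "A" * 50, "I" * 50
example = [
    "@SQ\tSN:genomeA\tLN:1000",
    f"read1\t0\tgenomeA\t100\t60\t50M\t*\t0\t0\t{seq}\t{qual}",
    f"read2\t0\tgenomeA\t300\t0\t50M\t*\t0\t0\t{seq}\t{qual}\tXA:Z:genomeB,+120,50M,0;",
    f"read3\t4\t*\t0\t0\t*\t*\t0\t0\t{seq}\t{qual}",
]

kept = [line.split("\t")[0] for line in uniquely_mapped(example)]
print(kept)
```

On real data the same filter is usually applied with `samtools view -q 1` on the alignment file before computing coverage.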

score 0 · Answer 2 · 2020-04-22

This is precisely why I added this comment to your last post: C: how can i determine the couverture (coverage ) of my reads on the whole genome u

There is no way for you to remove "duplicates" (similar sequences) in reference genomes (unless you have multiple strains of one organism and can cut those down some).

Since your aim seems to be calculating "coverage", with an aligner like BBMap you can handle the multi-mapping reads in several ways. The relevant option is the following.

ambiguous=best          (ambig) Set behavior on ambiguously-mapped reads (with 
                        multiple top-scoring mapping locations).
                            best    (use the first best site)
                            toss    (consider unmapped)
                            random  (select one top-scoring site randomly)
                            all     (retain all top-scoring sites)

You could choose random to conservatively place each read at one random location. This may be the best option (not in the sense of best above). If you choose to place them in all locations, you would be seriously overestimating your coverage (since each read likely came from one genome). The other two options would just result in odd placements or the loss of a lot of data.
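To make the trade-off concrete, here is a toy Python model of the four behaviours. It is not BBMap's implementation: the function `place_reads`, the genome names, and the read counts are all invented for the illustration, and "best" is modelled simply as "take the first listed site".

```python
import random
from collections import Counter


def place_reads(candidates_per_read, policy, rng):
    """Count read placements per genome under one ambiguity policy (toy model)."""
    counts = Counter()
    for sites in candidates_per_read:
        if len(sites) == 1:
            counts[sites[0]] += 1          # unambiguous read
        elif policy == "best":
            counts[sites[0]] += 1          # first top-scoring site
        elif policy == "toss":
            continue                       # treated as unmapped
        elif policy == "random":
            counts[rng.choice(sites)] += 1 # one top-scoring site at random
        elif policy == "all":
            for site in sites:
                counts[site] += 1          # every top-scoring site
        else:
            raise ValueError(f"unknown policy: {policy}")
    return counts


# Hypothetical data: 100 reads unique to each genome, 200 shared by both.
reads = (
    [["genomeA"]] * 100
    + [["genomeB"]] * 100
    + [["genomeA", "genomeB"]] * 200
)

rng = random.Random(0)
results = {p: place_reads(reads, p, rng) for p in ("best", "toss", "random", "all")}
for policy, counts in results.items():
    print(policy, dict(counts), "total placements:", sum(counts.values()))
```

With 400 input reads, "all" records 600 placements (the shared reads are counted twice), "toss" keeps only 200, "best" piles every shared read onto genomeA, and "random" keeps 400 placements while spreading the shared reads across both genomes.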

Note: I don't think bowtie2 has a similar set of options. It may simply stop aligning reads to multi-mapping locations after a certain number is reached.