The mitochondrial gene sequence contains a large number of Ns
3.4 years ago
Daier ▴ 20

Hi. We performed 10x whole-genome resequencing of bird blood samples. When extracting the mitochondrial genes, we found a lot of Ns in the gene sequences. Part of one sequence is shown below:

acaagcaatccacgctcttaccctaacaatccttctaggattctacttcacaggcctcca
aggcatagaatactacgaagcaccattctccatcgcagatagcgtctacggctctacctt
ctttgtcgcNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNttacataactatcta
ctgatgaggatcatactcttctagtatattcattacaatcgacttccaatccttaaaatc
tggtttaaccccagagaagagtaatgaacataattacattcataattaccctatccctaa
ccttaagcctcatcctaaccgcactgaacttctgaatcgcccaaatgaaccccgatgcag
aaaaactatccccctNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNccaccaccacactcacctgagcat
ccatcctaatcctcctcctcactctgggactagtatacgaatgaatccaaggaggactag
aatgagcagaataaaaaggcaagaaagttagtctaattaagacagttgatttcggctcaa
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNcaataatccttgcccaaccatccatcctcctagccttctcagcctc
agaacttacacacttttacatggcatttgaagccactctaatccctaccctaattctcat
cctactactaaaactcggaggctatggcattatacgattcacaaccctagtaaacccaac
attaaacaaccttcactacccattcatcaccttagccctatgaggagcactaataaccag
cgccatctgcttacgacaaatcgacctaaaatNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNgcctagtcatcgctgcaaccataatccagacccaatgagcattctcaggagcaat

Sequences like this cannot be compared in MEGA X unless the Ns are deleted, but opening each sample to delete a large number of Ns by hand takes a long time. Is there a command or script that can quickly delete the Ns from each sample's sequence? Thank you!

mitochondrial genes • 680 views
3.4 years ago

Simply deleting the Ns from the sequence is a bad idea. They are not there for funzies; they mean that there are most likely real bases at those positions that could not be determined (for now). If you remove them, you will end up with incorrect sequences.

I have the impression that the problem here lies with the software you intend to use. Perhaps you can use a different one?


Do you have any suggestions or solutions for this result? Thank you!


The most obvious one (but perhaps not the easiest or most feasible) is to do additional sequencing to resolve these unknown positions.

You could in theory also infer them using a reference, but in your specific case that is not feasible, because those positions are exactly what you want to analyse.

3.4 years ago
Mensur Dlakic ★ 27k

The Ns indicate that the assembler has some information about the length of that sequence stretch, but can't unambiguously determine what bases are in those positions. That can be because of poor coverage or high mutation rate, and possibly because the region is polymorphic.

Regardless of the reason for the Ns, no software will give you reliable information in that region of sequence when that many bases are ambiguous. It is a guarantee that the result will be even less reliable if you remove the Ns. That will change not only the gene lengths, but it will likely introduce frame shifts and artificial stop codons.
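To illustrate the frame-shift point, here is a minimal Python sketch that locates every run of Ns in a sequence and flags the runs whose length is not a multiple of three. The function names (`n_runs`, `frame_shifting_runs`) and the toy sequence are made up for this illustration and are not taken from the samples in this thread.

```python
import re

def n_runs(seq):
    """Return (start, length) for every run of N/n in seq (0-based start)."""
    return [(m.start(), m.end() - m.start()) for m in re.finditer(r"[Nn]+", seq)]

def frame_shifting_runs(seq):
    """Return the N runs whose deletion would shift the downstream reading frame."""
    return [(start, length) for start, length in n_runs(seq) if length % 3 != 0]

# Hypothetical toy sequence, only for illustration.
toy = "atggccNNNNaaatga"
stripped = re.sub(r"[Nn]+", "", toy)

print("N runs:", n_runs(toy))
print("runs that would shift the frame if deleted:", frame_shifting_runs(toy))
print("length before/after deleting Ns:", len(toy), len(stripped))
```

Even a run whose length is a multiple of three still stands for codons of unknown content, so deleting it shortens the gene even when the frame happens to be preserved.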


Is there any way I can deal with this? I want to get the mitochondrial gene sequences of the different samples and then build a tree, so as to understand the differences between samples or trace their origin. Or is my data useless, so that I have to re-extract and sequence it again?


Please take my comment with an appropriate dose of skepticism, as it is easy to be cavalier with other people's data.

I don't see how data of this quality can be useful. More than 1/3 of the positions in your sequence are ambiguous, which severely reduces the useful signal in it. Phylogenetic trees capture the distance (dissimilarity) between sequences, and those distances can be very small. The amount of missing information in your case (> 1/3 of sequence positions) is already greater than the potential differences between related sequences, so the signal-to-noise ratio is likely to be low.
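Before deciding whether a sample is usable, it may help to measure how much of each sequence is actually ambiguous. The sketch below is one way to do that in plain Python, assuming the sequences are stored in FASTA format. The helper names (`read_fasta`, `n_fraction`) and the two example records are hypothetical and only serve as an illustration.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def n_fraction(seq):
    """Fraction of positions in seq that are N or n (0.0 for an empty sequence)."""
    if not seq:
        return 0.0
    return sum(1 for base in seq if base in "Nn") / len(seq)

# Hypothetical records, not real data from this thread.
example = """>sample_A
acgtNNNNacgt
>sample_B
acgtacgtacgt
"""

for name, seq in read_fasta(example.splitlines()):
    print(name, round(n_fraction(seq), 3))
```

For real data, the same loop could be run over an open file handle instead of `example.splitlines()`. Samples or genes with a high N fraction can then be set aside rather than having their Ns deleted.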
