String of 'N's at same location in Fasta file
2.1 years ago
zdiazmar ▴ 30

Hi all,

I am relatively new to bioinformatics and to working with genomic data. I generated ddRAD data using the Peterson et al. 2012 protocol and called SNPs with Stacks (default parameters for -m, -M, and -n). I filtered the sequences to retain 'high-quality' reads. My average read depth per locus is 29X and the average amount of missing data per locus is 7%. Looking at my FASTA file, I notice that most sequences contain a string of ~10 'N's at around the same location. Here is a subset of sequences.

Locus1: AATTCTTGCAGTGAGAAGTACTGGTGAGTTTCATCCTCATGTTCTGTTCCTATAGTACTGTGGTATACTTATTTTGGATCTTCTTATATGATTAATGGATCNNNNNNNNNNTATTACATACTATTCTGATATCTGTCTGTCGAAATTACAGGCATTTCTGATGAGAAGGTGCATAATCATGAACTCAACCTTAAGGAAGCCGCTTCCACCGG

Locus2: AATTCCTTCGAGTATTGAAGGGATGGCATCTTCATCACCATCATAGAATTTCTGAATCCTCTTCTTCAGCTCTATACGAAAAGCAAAAATCAGTGCAATGNNNNNNNNNNAGTCGAGATTCAACTCCCCAAACAGCCTCATAAGAGAACTAGTAATAAACATTACTTACTGAACTACACTATCGGAAGATTTGTGGCACCGCATAAACCGGACCGTATTTTTCAAAATGGACTTTGACTTATACCGAACCG

Locus3: AATTCAACAACAATTAACGTAGTACTCCATCAAGGTTCCAATCAAAACATCTCTTCTATGTCACTCCATAATAATATAACCTCACTATGCATATCCACTANNNNNNNNNNTCCACCTTGTTATTGCTTGATCCGATTCGGGTCATGAAGAACGGGTTCGGGTCGATATTGATGGCCACTTCTGGTGCTACTTTCGCGGAATGCGGGCCCG

Locus4: AATTCACCTCATTCTTGCGGGTGTTGGGGAGCTATCCTATGGATATGACCCCGTGGTGTCCTTCTAGAGGAGACTAGTAATTAATTAATTTAAAAGTAAANNNNNNNNNNTGAAACTGTGTTTCTTTATTCTTGATTTCACGTTCTTTACCTAAAATACCTACCTACCTGAACTTCTTTCCTCGTGCGGGATTCGAATTGGTGGTGGCCGGAACCAGTACCATAGTCGGTCTGATTCAGTCAAAGTTTCAAGATGTACGCGAACCCGATTTAACCTCCCG

Locus5: AATTCTGATAATCCATTTGTTTGCTCATCTAGTTCTTATGATACAATTATCTGCATCTTTTTCCTTTATACCCGCCGACTTGTTTTCTGCACTAGTAGTGNNNNNNNNNNTAGGTTGGCCATACCCGCACATGCCATAACGACAACGAACATGCCTTGGACATCGATTGAAGCATTTCCCACAATGTCTTCGGTTTCGATTGATGTTCCGGGGGGGCCCTATGAAAATTTGGAAATACGATGCTTTGATAGAGAAAAGGATCCG

Locus6: AATTCAACATAGTTCATGAATCGGGTTATCTATTTTTTACCTGCATGTACCTGGCCAGAACTAAAAAGTCGGTTTCTTGAACCAACTCCAATATCTCTCCTNNNNNNNNNNGGGTCGAGAAGAGGGACTCGGACTACGGATGTCAGGGCATTGAACCTCTTTGCAAGATTAAGGAAGTCCTCATTCATTTGCTCTCGTGGCCATCTAGCCGGTTCGACCCATTTAGTATTCTTGGTCCGACCCGACCCGAATAGGTCGGACCCGATTGAGAAAAACCG

Locus7: AATTCACTGAATGTGTCACGGACTAATAGCATCTATCCAATGATTAGAGGAAATTATTTTAGTTTTTTGGGCAGTGGAAAACTAAAAAATATGGTTTAAANNNNNNNNNNGAAAAGGGGAAAATCTAAATACGAACTTAACGAAAACCCAATACTCTGACAAGGATACCCAATAACCACACATTGTAAAGCAAAAACATGAAGTCAACCGG

Locus8: AATTCTGACAGGAATCATAATGGGATGTGCGCTTTTATTCATGACTCCGTTATTTGAGTATATACCATTGGTATGCTGCAGGCATTTATTTCTTTGATATNNNNNNNNNNAAATTACTTTCCCCCATAATTACGTGTGAGGTGATGCTTACCTTCTAAGACCATAAATAAAATCATCCAAAACCCCTCAACCTGGGATTGGCTGTGGCCGGTCCGGTTCGGTTTTCATTTAAAAATCTGTTCTGTAAATTTCTGTCCGGTCCG

Locus9: AATTCAATTATTGATAGGTTCCAGTAAACTGTATTATTAGTAAGCTAACAGAAGCAGTTGGCGTCAAGATCCCATGAAATAAGTTAGAAGAATCATCATTNNNNNNNNNNTTAGTTCTAGCTACGAATGAATGGAAAAGAGCAGATGGATCAAGAAAATTAAACATTTCCTGGGAAATCCCCATCTGTTAATAGGAGGAGGAGAAGGTTACCGGAGGGAGCTGATTATTCATCTCCTTCTCACCTTCCGTGAATAGCCG

Locus10: AATTCAGAAAAGGAGAGGGACAAATGCTGAAATCCAAACCTCAAGTCCCACAAAAGTGATTGACCATTACACTGGAGATGGCTCTCCCAAGATGACGTCTNNNNNNNNNNAATGGATATGTTAAACATAGACCAAATAACTATAACCTCACAAAGAAATGTGTACATTATGTAGATCTTTCATGAACAAAAAGCAAAATAATACAGCCCGGAGACCGAAGCTCCG

Locus11: AATTCTTTTCACCACCCACAAACCATACCTTGATTTGTTGATTCAACTTGCAGGTATCTATGTTGGAGGAAGCAAGATTGTTCATTTCAGACCTGACCCANNNNNNNNNNGATTCGAGTACGGGGTCCGACCTTCACTCTTCCTAGCCAAAGTCAGAGGCGGCACATGCACCACCGCACCCTCTGACCCGCCCGAAACAGTCATCGACCGGTGGTATTTCGTTATAGTTTCCG

Locus12: AATTCATGATCGGTTCCTTTTTAAGTCACTTCTTATTCACATCATGTACAAAATGAGACCAGACCGATTGATCCGAGTGCCCATACAAAAGACGATTAATNNNNNNNNNNATACCAACTACTAACCTGTAAGATTGGTTCCATTGGGGATGCTCACCGTAGAATTGAGAAACCATGAGCAAACTTTCCGACGTCGGATTGTCGACAACCGG
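
One way to check where the N runs fall in each sequence is sketched below. This is only an illustration: the function name `find_n_runs` and the `min_len` threshold are made up for this example and are not part of Stacks.

```python
import re

def find_n_runs(seq, min_len=5):
    """Return (start, length) for every run of at least min_len N's.

    Start positions are 0-based. min_len is an arbitrary threshold
    chosen for this sketch so that isolated N's are ignored.
    """
    pattern = re.compile(r"N{%d,}" % min_len)
    return [(m.start(), m.end() - m.start())
            for m in pattern.finditer(seq.upper())]

# Toy sequence (not one of the loci above): 12 bases, 10 N's, 5 bases.
toy = "ACGT" * 3 + "N" * 10 + "TTGCA"
print(find_n_runs(toy))
```

Running this over every record in the FASTA file and tabulating the start positions would show whether the runs begin at the same offset in every locus.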

Does anyone have any ideas of why this might be happening? Thank you for any input!

fasta sequence • 445 views
2.1 years ago
zdiazmar ▴ 30

In case anyone out there also runs into this, it seems that this is due to non-overlapping paired-end reads. Here is the response from a collaborating bioinformatician:

Stacks added the Ns as part of its analysis process. Here's a description (from the Stacks manual) of how Stacks handles paired-end data:

Stacks directly supports paired-end reads, for both single and double digest protocols. For double-digest RAD, both the paired-end reads are anchored by a contig and Stacks will assemble them into two loci. In both cases, the paired-end contig/locus will be merged with the single-end locus. If the loci do not overlap, they will be merged with a small buffer of Ns in between them.

So the Ns indicate that the paired-end reads at a locus do not overlap. The run of Ns is only a placeholder buffer, so the actual length of the gap between the two reads is unknown, although if you know the length distribution of the library (e.g. from a BioAnalyzer trace) you'll have a rough idea of how many bases are missing.
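
The rough estimate described above can be sketched as follows. This is a back-of-the-envelope illustration, not anything Stacks computes: `estimate_gap` and `fragment_len` are hypothetical names, and `fragment_len` is assumed to be the genomic insert length with adapter sequence already subtracted from the library size.

```python
def estimate_gap(seq, fragment_len):
    """Estimate how many bases are missing between two non-overlapping reads.

    seq          -- merged locus sequence containing a single run of N's
    fragment_len -- assumed insert length (adapters excluded); hypothetical input
    """
    left, _, rest = seq.upper().partition("N")  # bases before the N buffer
    right = rest.lstrip("N")                    # bases after the N buffer
    sequenced = len(left) + len(right)
    # A fragment shorter than the sequenced bases would mean the reads
    # overlap, so clamp the estimate at zero.
    return max(fragment_len - sequenced, 0)

# Toy example: 8 + 5 sequenced bases in an assumed 20 bp fragment.
toy = "ACGTACGT" + "N" * 10 + "TTGCA"
print(estimate_gap(toy, 20))  # 7
```

Because the library has a distribution of fragment lengths rather than a single value, this gives only an approximate figure for any individual locus.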
