I used the UCSC Table Browser to generate a FASTA file in which each entry is a CDS within a given region. I used the following parameters:
- group: Genes and gene predictions
- track: Ensembl genes
- table: ensGene
- region: position chr20:250000-1000000
- output format: sequence
- output file: myfile.fasta
- sequence type for Ensembl Genes: genomic
And in "sequence retrieval region options": CDS Exons (only) and "One FASTA record per region (exon, intron, etc.)".
In the resulting FASTA, some identical sequences occur several times, with the same range but a different ID, for instance:

    >hg19_ensGene_ENST00000217233_0 range=chr20:368655-368945 5'pad=0 3'pad=0 strand=+ repeatMasking=none
    ATGCGAGCC......
    (...)
    >hg19_ensGene_ENST00000449710_0 range=chr20:368655-368945 5'pad=0 3'pad=0 strand=+ repeatMasking=none
    ATGCGAGCC......

Why does this occur and how can I obtain a fasta where each entry is a unique range?
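
One possible workaround, independent of the Table Browser settings, is to filter the downloaded file and keep only the first record for each `range=` value. The sketch below is a minimal illustration of that idea in Python. It assumes every header carries `range=` and `strand=` fields formatted as in the example above; the function names `unique_ranges` and `dedupe_file` and the output file name are hypothetical. It does not explain why the duplicates appear, and it keeps whichever ID happens to come first.

```python
import re

def unique_ranges(lines):
    """Yield FASTA lines, keeping only the first record per (range, strand)."""
    seen = set()
    keep = False  # lines before the first header are dropped
    for line in lines:
        if line.startswith(">"):
            rng = re.search(r"range=(\S+)", line)
            strand = re.search(r"strand=(\S+)", line)
            if rng:
                key = (rng.group(1), strand.group(1) if strand else None)
            else:
                # Assumption: headers without range= are compared as whole lines.
                key = line.strip()
            keep = key not in seen
            seen.add(key)
        if keep:
            yield line

def dedupe_file(src, dst):
    """Write the records of src with unique ranges to dst."""
    with open(src) as fin, open(dst, "w") as fout:
        fout.writelines(unique_ranges(fin))
```

For example, `dedupe_file("myfile.fasta", "myfile.unique.fasta")` would write a copy of the file in which each range appears once. The strand is included in the key so that records on opposite strands with the same coordinates are not merged.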