GL000201,GL000202... etc What are these?
5.5 years ago
morovatunc ▴ 470

Hi,

I have two BAM files from ICGC. During mutation calling, VarDict, MuTect, FreeBayes and VarScan split the BAM by chromosome, and the GL00... entries show up alongside chr1, chr2, chr3 and so on. I looked them up and found that they are contigs (forgive my ignorance). How should I treat them? Will they cause problems when I interpret the variant calling output? I read one previous thread about this, but it had no information on how to handle them. I need your help.

Best,

Tunc.

WGS ICGC BAM • 3.1k views
5.5 years ago
igor 12k

You can check http://genome.ucsc.edu/cgi-bin/hgGateway where they provide details for every assembly.

chrN_*_random sequences are called "unlocalized": the chromosome is known, but not the location on the chromosome.
chrUn_* sequences are called "unplaced": they probably belong to the sequenced genome, but their placement is unknown at this time.

The GL numbers (or other types of numbers) in these names are GenBank accession numbers, which can be used in a nucleotide search at Entrez. For example:

chr1_GL456211_random - unlocalized sequence belonging to chr1, NCBI accession GL456211
chrUn_JH584304 - unplaced sequence, NCBI accession JH584304
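As a rough illustration of the naming patterns above, here is a small shell sketch that labels a sequence name by its pattern. The function name `classify_contig` is hypothetical, and the third branch assumes that bare names such as GL000201 (as in the question) are accessions without a chromosome prefix, so the name alone does not say whether the sequence is unlocalized or unplaced.

```shell
# Hypothetical helper: label a sequence name by its naming pattern.
# Assumption: names follow the UCSC-style patterns described above,
# or are bare GL accessions as in the question.
classify_contig() {
    case "$1" in
        *_random) echo "unlocalized" ;;      # chromosome known, position unknown
        chrUn_*)  echo "unplaced" ;;         # chromosome unknown
        GL*)      echo "accession-only" ;;   # name carries no chromosome information
        *)        echo "other" ;;            # e.g. chr1, chrX, chrM
    esac
}

classify_contig chr1_GL456211_random   # prints: unlocalized
classify_contig chrUn_JH584304         # prints: unplaced
```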

Thank you very much !

Igor,

I have a question. Can I remove those contigs from my BAM file? Would that be feasible? Since their location is unknown, would they be of any value for obtaining data?

Thank you for the information again,

Best,

Tunc.

You may then lose multi-mapping reads or reads that would have aligned to regular chromosomes if the random ones were not present.

If you really want to do that, you should be able to do something like samtools view -h file.bam | grep -v "_random" ...
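A slightly fuller sketch of that idea is below. It filters on the reference-name column rather than on the whole line, and it also drops the matching @SQ header lines. The function name `drop_extra_contigs` and the name pattern are assumptions for illustration. The pattern covers bare GL names like the ones in the question as well as `_random` and `chrUn_` names, so check it against the names in your own header before using it.

```shell
# Hypothetical helper: reads SAM text on stdin, writes filtered SAM text on stdout.
# Assumption: the contigs to drop are named GL<digits>..., *_random, or chrUn_*;
# adjust the pattern to the names actually present in the BAM header.
drop_extra_contigs() {
    awk -F '\t' '
        BEGIN { extra = "^GL[0-9]|_random$|^chrUn_" }
        # @SQ header lines: skip those that describe a contig being dropped.
        /^@SQ/ {
            for (i = 2; i <= NF; i++)
                if ($i ~ /^SN:/ && substr($i, 4) ~ extra) next
            print; next
        }
        # All other header lines pass through unchanged.
        /^@/ { print; next }
        # Alignment records: column 3 (RNAME) holds the reference name.
        $3 !~ extra { print }
    '
}

# Possible usage (file names are placeholders):
#   samtools view -h file.bam | drop_extra_contigs | samtools view -b -o filtered.bam -
```

Note that this only looks at where each read itself aligned. A kept read whose mate was on a removed contig will still name that contig in its mate-reference column, so some downstream tools may complain about it.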