Chromosome accession numbers correspond to which chromosome?
0
0
Entering edit mode
21 months ago
amy__ ▴ 160

Hello,

I am aware this question has been asked similarly before but I cannot find a straighforward answer.

The headers for my bam look like this (just a snippet):

enter image description here

I have lifted my bed file to hg38 from hg19 and would like to now have these accession numbers like chr1, chr2, chr3....

I have looked at this https://github.com/dpryan79/ChromosomeMappings/blob/master/GRCh38_RefSeq2UCSC.txt but find it confusing. I have previously had to recode another bed file so know how to do it, its just which accession code is associated to which chromosome which is confusing me.

Thanks! Amy

bam chromosome bed • 2.1k views
ADD COMMENT
0
Entering edit mode

The file from dpryan79 is pretty clear to me. What about it confuses you?

ADD REPLY
0
Entering edit mode

save the planet: don't post screenshots when you can just copy-n-paste the text.

ADD REPLY
0
Entering edit mode

Thanks! I was going to copy and paste it but the lines went all skew

ADD REPLY
0
Entering edit mode

See my guide on how to format plain text so it looks pretty: How to Use Biostars Part-3: Formatting Text and Using GitHub Gists

ADD REPLY
0
Entering edit mode

Thank you!!

ADD REPLY
0
Entering edit mode

Hi! Thanks for your reply. So previously when I've recoded the bed file to match the bam accession its looked like this (for a different dataset):

 novogene_bed_recode <- agilent_human_region.hg38 %>% mutate(V1=recode(V1, 
                                                   'chr1'="NC_000001.11",
                                                   'chr2'="NC_000002.12",
                                                   'chr3'="NC_000003.12",
                                                   'chr4'="NC_000004.12",
                                                   'chr5'="NC_000005.10",
                                                   'chr6'="NC_000006.12",
                                                   'chr7'="NC_000007.14",
                                                   'chr8'="NC_000008.11",
                                                   'chr9'="NC_000009.12",
                                                   'chr10'="NC_000010.11", 
                                                   'chr11'="NC_000011.10",
                                                   'chr12'="NC_000012.12",
                                                   'chr13'="NC_000013.11",
                                                   'chr14'="NC_000014.9",
                                                   'chr15'="NC_000015.10", 
                                                   'chr16'="NC_000016.10",
                                                   'chr17'="NC_000017.11",
                                                   'chr18'="NC_000018.10",
                                                   'chr19'="NC_000019.10", 
                                                   'chr20'="NC_000020.11",
                                                   'chr21'="NC_000021.9",
                                                   'chr22'="NC_000022.11", 
                                                   'chrX'="NC_000023.11",
                                                   'chrY'="NC_000024.10"))

In the dpryan79 file, it has a lot more options below and I'm not quite sure which would related to the specific chromosomes - and like also NT_113901.1 chrUn_GL000195v1 what would that mean?

Basically, I want to be able to do the code above, but I'm not sure which bits to use.

Thanks! Amy

ADD REPLY
0
Entering edit mode

Google "chrUn" to understand what those contigs are. dpryan's file is just a mapping table so you should be able to use it unless you have contigs not mapped in that table.

ADD REPLY
0
Entering edit mode

Hi,

So my bed file currently only has chr1, chr2 ect. In the mapping file for example it has NT_187361.1 chr1_KI270706v1_random

So if I am changing the IDs in the bed file to match the bam, would it be appropriate to say chr1 should now be NT_187361.1 in the bed file.

In the bam file there is also NC_000001.11 too, so multiple IDs in the bed file will be changed from chr1 to these new IDs in the bam file?

ADD REPLY
0
Entering edit mode

What is "your bed file"? There is the BED file from the gist and your BAM file. Do not alter the BED. By what logic does NT_187361.1 match chr1 when it matches chr1_K..._random? chr1 is not the same as chr1_K..._random.

ADD REPLY
0
Entering edit mode

Hi,

My bed file is the exome panel kit used by the sequencing company, it contains the exome regions. I am using deepvariant which needs the chromosome IDs to match, currently they do not because the bed file is chr1, chr2 ... and the bam file has all the chromosome IDs shown in the header.

If the bed file does not match deepvariant will produce this error:

raise ValueError('The regions to call is empty. Check your --regions and '
ValueError: The regions to call is empty. Check your --regions and --exclude_regions flags to make sure they are not resulting in set of empty region to process. This also happens if you use "chr20" for a BAM where contig names don't have "chr"s (or vice versa).

Thus I need my contig names in the bed to match the bam. There are so many headers that I am not sure which contigs to use. I have had a similar problem before and used the mutate code above to recode the bed file and then deepvariant worked but I don't know if that is the correct way.

I just want to make sure that this is the correct way, and if not, how would you do it?

ADD REPLY
0
Entering edit mode

Your mutate should work fine. Just make sure the mapping dpryan BED file matches your BAM contig names in one column and the exome panel BED contigs in the other column.

ADD REPLY
0
Entering edit mode

Hello,

Thank you for your comment, I checked with a senior bioinf I know, and I was correct to do what I did.

For anyone in the future: The NC_000011.1 etc is the RefSeq accession numbers, you can change the bed file chromosome ID column from chr1 ect to NC_000011.1 etc using my mutate code above in R. As Ram stated, the other headers relate to pseudo-contigs or other regions of the chromosome that aren't explicitly just chr1 ect.. these won't appear in your bed file as the bed file just shows the exome regions.

Please correct me if I am wrong.

ADD REPLY
0
Entering edit mode

I was correct to do what I did.

and

would it be appropriate to say chr1 should now be NT_187361.1 in the bed file

clash with each other. Just be sure to map chromosomes to NC IDs and not any other N* (NT/NM/NR) IDs.

ADD REPLY
0
Entering edit mode

Hey,

In the end I did, originally my bed file was hg19 so used liftover to get hg38, then got the GRCh38_RefSeq2UCSC.txt:

#!/bin/bash

# Deal with MacOS zsh comment errors
setopt interactive_comments

# Set files
INFO=GRCh38_RefSeq2UCSC.txt
BED=hglft_genome_24c82_a688d0.bed
NEWBED=truseq-dna-exome-v1-2-hg38.bed

# Copy bed to output file
cp ${BED} ${NEWBED}

# With each line in translation file
for LINE in `cat ${INFO} | tr "\t" "|"`; do

  # Extract hg19 and hg38 chromosomes
  HG19CHR=`echo $LINE | cut -f 2 -d "|"`
  HG38CHR=`echo $LINE | cut -f 1 -d "|"`

  # Replace hg19 chromosomes with hg38 versions
  sed -i "" -e "s/^$HG19CHR\t/$HG38CHR\t/g" ${NEWBED}

done

# Check output
cat ${NEWBED}
ADD REPLY
0
Entering edit mode

Also for anyone curious, I actually used the wrong HG38 file which is why I started with the refeq accession numbers first - oops! Make sure you use the correct reformatted reference genome and you won't have to do any of the above.

ADD REPLY

Login before adding your answer.

Traffic: 1968 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6