Question

Get chromosome name from RefSeq format for C.elegans

0

Entering edit mode

8.8 years ago

rioualen ▴ 710

Hello,

I'm currently processing ChIP-seq data from C.elegans.

I've used data from GEO which I processed this way:

sra -> fastq (fastq-dump) -> sam (bwa) -> bam (samtools view) -> bed (bedtools bamtobed)

However I noticed chromosome names were in RefSeq format:

NC_003282.8    9767072    9767122
NC_003284.9    7940872    7940922
NC_003284.9    112717     112746

I would like to get "standard" chromosome names (eg chr1, chr2...). Does one of the tools I used so far allows that? If not, is there a way to convert bedfiles, or at least some kind of conversion table?

Thanks a lot!

CR

ChIP-Seq Refseq • 2.0k views

ADD COMMENT • link 8.8 years ago by rioualen ▴ 710

score 1 · Accepted Answer · 2015-07-29

OK, so I found the correspondence there: http://www.ncbi.nlm.nih.gov/assembly/GCF_000002985.6 and I wrote it to a 2-column tab-delimited file.

I wrote this python script:

out = open('test_converted.bed','w')
with open('test.bed') as bedfile:
    for line1 in bedfile:
        RefSeq_id_1 = line1.split()[0]
        other_info = '\t'.join(line1.split()[1:len(line1.split())])
        with open('RefSeq_to_std.tab') as conversion_table:
            for line2 in conversion_table:
                RefSeq_id_2 = line2.split()[0]
                chr_id = line2.split()[1]
                if RefSeq_id_1 == RefSeq_id_2:
                    towrite = chr_id + '\t' + other_info + '\n'
                    out.write(towrite)

However I'm still wondering if there's a more efficient way to do this, as I've got pretty big datasets... Either with python, shell language or R? (I'm using the 3 of them in my workflow).

Thanks again!

Ram · Accepted Answer · 2015-07-29

1

Entering edit mode

8.8 years ago

PoGibas 5.1k

I am not an expert, but it seems that "standard" chromosome names for C. elegans are: chrI/II/III/IV/V/M/X.

I strongly recommend not to create your own names.

NC_003282 = chrIV
NC_003284 = chrX

For more efficient solution use sed or awk one-liner.

awk '{gsub("NC_003282","chrIV"); gsub("NC_003284","chrX"); print}' INPUT.bed

ADD COMMENT • link updated 4.5 years ago by Ram 43k • written 8.8 years ago by PoGibas 5.1k

0

Entering edit mode

1) Indeed ;)

2) Thanks! That's exactly what I was looking for, but I'm not very familiar with awk.

ADD REPLY • link 8.8 years ago by rioualen ▴ 710