Question: Get chromosome name from RefSeq format for C.elegans
0
gravatar for rioualen
2.3 years ago by
rioualen210
France
rioualen210 wrote:

Hello,

I'm currently processing ChIP-seq data from C.elegans.

I've used data from GEO which I processed this way:

sra -> fastq (fastq-dump) -> sam (bwa) -> bam (samtools view) -> bed (bedtools bamtobed)

However I noticed chromosome names were in RefSeq format:

NC_003282.8    9767072    9767122
NC_003284.9    7940872    7940922
NC_003284.9    112717     112746 

I would like to get "standard" chromosome names (eg chr1, chr2...). Does one of the tools I used so far allows that? If not, is there a way to convert bedfiles, or at least some kind of conversion table?

Thanks a lot!

CR

refseq chip-seq • 662 views
ADD COMMENTlink modified 2.3 years ago • written 2.3 years ago by rioualen210
1
gravatar for rioualen
2.3 years ago by
rioualen210
France
rioualen210 wrote:

OK, so I found the correspondence there: http://www.ncbi.nlm.nih.gov/assembly/GCF_000002985.6 and I wrote it to a 2-column tab-delimited file.

I wrote this python script:

out = open('test_converted.bed','w')
with open('test.bed') as bedfile:
    for line1 in bedfile:
        RefSeq_id_1 = line1.split()[0]
        other_info = '\t'.join(line1.split()[1:len(line1.split())])
        with open('RefSeq_to_std.tab') as conversion_table:
            for line2 in conversion_table:
                RefSeq_id_2 = line2.split()[0]
                chr_id = line2.split()[1]
                if RefSeq_id_1 == RefSeq_id_2:
                    towrite = chr_id + '\t' + other_info + '\n'
                    out.write(towrite)

However I'm still wondering if there's a more efficient way to do this, as I've got pretty big datasets... Either with python, shell language or R? (I'm using the 3 of them in my workflow).

Thanks again!

ADD COMMENTlink modified 2.3 years ago • written 2.3 years ago by rioualen210
1
gravatar for PoGibas
2.3 years ago by
PoGibas4.7k
Vilnius
PoGibas4.7k wrote:

I am not an expert, but it seems that "standard" chromosome names for C. elegans are: chrI/II/III/IV/V/M/X.

I strongly recommend not to create your own names.

NC_003282 = chrIV
NC_003284 = chrX

For more efficient solution use sed or awk one-liner. 

awk '{gsub("NC_003282","chrIV"); gsub("NC_003284","chrX"); print}' INPUT.bed
ADD COMMENTlink modified 2.3 years ago • written 2.3 years ago by PoGibas4.7k

1) Indeed ;)

2) Thanks! That's exactly what I was looking for, but I'm not very familiar with awk.

ADD REPLYlink written 2.3 years ago by rioualen210
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1011 users visited in the last hour