How to obtain the chromosome number from a given scaffold number
6.0 years ago

Hi,

I am working on the Saccharomyces cerevisiae Y55 strain. I obtained my reference sequence from here. This is how the Y55_Stanford_2014_JRIF00000000.fsa sequence looks; its headers don't include the chromosome number.
Reference

>gi|696435221|gb|JRIF01000001.1| Saccharomyces cerevisiae Y55 scaffold-0, whole genome shotgun sequence [length=107844]
TTAAGCCTTCAAAGAAGAAGCTCTTCTCTTTCTGATTTCGGCCTTTTCAGCCTTTCTTTCAGACAATCTCTTAGCCAACA
ATTGAGCGTATTCGGCAGCAGCTTCTCTTTGAGCTTGAGCGTTTCTGACCTTCAAAGCTCTTTGGTGTCTCTTTCTTTGC


There's also a gatk-snv VCF included, which does have the chromosome number.
gatk.vcf

#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT    Y55
chrI    111    .    C    T    155.12    PASS    AC=2;AF=1.00;AN=2;BaseQRankSum=3.331;DP=28;Dels=0.00;FS=0.000;HaplotypeScore=12.7288;MLEAC=2;MLEAF=1.00;MQ=28.02;MQ0=0;MQRankSum=0.666;QD=5.54;ReadPosRankSum=2.596;SB=-1.912e+01    GT:AD:DP:GQ:PL    1/1:14,14:28:5:186,5,0
chrI    136    .    G    A    336    PASS    AC=2;AF=1.00;AN=2;BaseQRankSum=1.625;DP=32;Dels=0.00;FS=7.270;HaplotypeScore=30.3526;MLEAC=2;MLEAF=1.00;MQ=28.13;MQ0=0;MQRankSum=-1.733;QD=10.50;ReadPosRankSum=1.083;SB=-9.901e+01    GT:AD:DP:GQ:PL    1/1:1,31:32:45:369,45,0
chrI    156    .    C    G    18.07    LowQual;SnpCluster;filter    AC=1;AF=0.500;AN=2;BaseQRankSum=1.146;DP=44;Dels=0.00;FS=7.776;HaplotypeScore=20.3077;MLEAC=1;MLEAF=0.500;MQ=30.62;MQ0=0;MQRankSum=2.856;QD=0.41;ReadPosRankSum=0.838;SB=-6.519e-03    GT:AD:DP:GQ:PL    0/1:38,6:44:48:48,0,469


Is there any way I can obtain the chromosome numbers so that I can call variants for my own data? I want to analyse SNPs and INDELs on each chromosome.

Thank you,

Parimala 

6.0 years ago
Steven Lakin ★ 1.5k

All of the information about the reference file is stored in the .gff file:

http://downloads.yeastgenome.org/sequence/strains/Y55/Y55_Stanford_2014_JRIF00000000/Y55_JRIF00000000.gff.gz

You'll have to find a way to parse it to your liking, but here is a quick example bash command (run on the decompressed file, e.g. after gunzip) that prints only the reference entries, each followed by its chromosome:

cat Y55_JRIF00000000.gff | sed 's/,/\t/g' | awk '{print $1,$11}' | grep "^g" > RefChromosomes.txt

 

Take a look at the format of the .gff file to parse it in a different way; you can then link the reference entries to their respective chromosomes in your own code (e.g. add the chromosome as another field in the reference line), split the fasta file by chromosome and call each individually, etc.  I'm not sure if there is a way to incorporate gff directly into the variant calling pipeline, but there may be.
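As a rough sketch of the "add the chromosome as another field in the reference line" idea, something like the following could work. It assumes you have already produced a two-column, whitespace-separated mapping file (scaffold accession, then chromosome name, e.g. "JRIF01000001.1 chrI"), and that the FASTA headers follow the ">gi|...|gb|ACCESSION| ..." layout shown in the question. The function name annotate_fasta and the "[chromosome=...]" tag are made up for illustration, and the output of the command above may need reshaping before it matches this mapping format.

```shell
# Hypothetical helper: annotate_fasta MAP FASTA
# MAP (assumed format): two whitespace-separated columns,
#   scaffold accession and chromosome name, e.g. "JRIF01000001.1 chrI".
# FASTA: headers assumed to look like ">gi|696435221|gb|JRIF01000001.1| ...".
annotate_fasta() {
    awk '
        # First file (MAP): remember accession -> chromosome.
        NR == FNR { chrom[$1] = $2; next }

        # FASTA header lines: the accession is the 4th "|"-separated field.
        /^>/ {
            split($0, parts, "|")
            acc = parts[4]
            if (acc in chrom) {
                # Append the chromosome as an extra field on the header.
                print $0 " [chromosome=" chrom[acc] "]"
                next
            }
        }

        # Sequence lines and unmapped headers pass through unchanged.
        { print }
    ' "$1" "$2"
}
```

Usage would be along the lines of: annotate_fasta scaffold_to_chrom.txt Y55_Stanford_2014_JRIF00000000.fsa > Y55_annotated.fsa (scaffold_to_chrom.txt being the assumed mapping file). The original header is kept intact, so sequence names stay unique even when several scaffolds belong to the same chromosome.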
