Hello biostars,
I want to find all the # of CpG sites in Chr3.fa from UCSC.
Opening the chr3.fa in microsoft Word and using the CTRL + F, Find and Replace crashed the program. I tried to make a C++ script to count the CGs in a test file.
#include <istream>
#include <cstdlib>
#include <iostream>
#include <fstream>
using namespace std;
int main()
{
        //Open the file:
        ifstream inputfile;
        inputfile.open("/mnt/i/home/Desktop/test.txt");
        //Get length of the file:
        inputfile.seekg (0, inputfile.end);
        int length = inputfile.tellg();
        inputfile.seekg (0, inputfile.beg);
        // read each base in the file for C
        int i;
        int cgcounter;
        char firstbuffer[1]; // Check if that is C
        char secondbuffer[1]; // Check if this is G
        for (i=0; i<= length; i++)
                {
                        inputfile.seekg(i, ios::beg);
                        inputfile.read(firstbuffer,1);
                        if (firstbuffer[1] == 'C' || firstbuffer[1] == 'c') 
                                {
                                     inputfile.seekg(i+1, ios::beg);
                                     inputfile.read(secondbuffer,1); 
                                                if (secondbuffer[1] == 'G' || secondbuffer[1] == 'g') 
                                                        {
                                                                cgcounter++;
                                                                i++;
                                                        }
                                }
                }
        cout << "The number of CGs in your file is " << cgcounter << "\n" ;
        return 0;
}
The test.txt is a file with just AAAAAAAAAAACGAAAAAAAAAAAACGAAAAAAAAAAACG (three CGs), but when I run the script, I get different outputs each time:
 labyu@DESKTOP-U037B9F:/mnt/i/home/Workspac$ g++ cgcounter.cpp
labyu@DESKTOP-U037B9F:/mnt/i/home/Workspac$ ./a.out
The number of CGs in your file is 635585824
labyu@DESKTOP-U037B9F:/mnt/i/home/Workspac$ ./a.out
The number of CGs in your file is 711083296
The code compiles fine, but I am not sure where these random numbers are coming from. I am very new to C++ and C. Alternatively, is their an easier way to find the # of CGs in a fasta/txt file?
Thanks!
Thanks Pierre!
I am trying to learn C as well. This is easy to understand. Could you explain a few parts of the program? How come we can store a character 'C' or 'G' in the prev variable if we declared it as an integer earlier? Is there any significance to initializing it as -1.
the size of an integer is greater than the size of a char
sizeof(char)<sizeof(int), so you can always store a char into a 'int'. Furthermore, fgetc returns a 'int', not a 'char' http://www.cplusplus.com/reference/cstdio/fgetc/-1 is arbitrarty but I'm sure it is initialized as a non interesting value like 'A', 'T', '>', '\n' ...