How many A's, T's, C's and G's are in the homo sapiens3 reads data set?
6.4 years ago
saranpons3 ▴ 70

Hello, I would like to know how many A's, T's, C's and G's there are in the homo sapiens3 reads data set (a 1.8 TB data set). Is there a tool to find this? In general, which nucleotides are most frequent in the human genome? Thanks.

genome • 2.5k views

Could you clarify what you want to do or find out? Are you asking how to count bases in a certain file type? Or is your question just a general one about AT/GC content in human?


I am asking generally: how many A's, T's, C's and G's are there in the human genome? Which nucleotides are more common in the human genome?


Hello all, when I looked at the numbers in every experiment's output, I could see that A and T appear more frequently than C and G. So, can I say that the human genome has more A's and T's than C's and G's?


Yes, this is a rather well-known fact. Search for GC-content for more info. There are also Chargaff's rules, which state that the number of As ~= the number of Ts and the number of Gs ~= the number of Cs.


What about the masked N bases? Many of those are centromeric 'difficult to sequence' regions, and are extremely high in GC content.

The reference genome is always just our best representation of the genome. The stats comparing AT to GC from the reference build may not therefore be accurate (?)


While that may be true, we have to work with what is available now.


May I know what you mean by "we have to work with what is available now"?


@Kevin was indicating that many of the N's currently used as placeholders in centromeric/telomeric regions may turn out to be G's and C's in reality (current sequencing technologies are not able to sequence those regions fully). If and when we are able to sequence these regions, the proportion of AT/GC will change (as it will with each major genome build).

6.4 years ago

how many A's, T's, C's and G's are there in the human genome?

grep -v '^>' in.fasta | grep -o '.' | sort | uniq -c

(and wait...)
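
A single-pass alternative avoids sorting roughly three billion one-character lines. The following is only a minimal sketch, not a benchmarked replacement: it relies on awk's `gsub` returning the number of substitutions it made, and the function name `count_bases` and the tiny demo file are made up for illustration.

```shell
#!/bin/sh
# count_bases FILE: tally A/C/G/T/N (case-insensitive) in a FASTA file in one pass.
count_bases() {
  awk '
    /^>/ { next }                 # skip FASTA header lines
    {
      a += gsub(/[Aa]/, "")       # gsub returns how many matches it replaced
      c += gsub(/[Cc]/, "")
      g += gsub(/[Gg]/, "")
      t += gsub(/[Tt]/, "")
      n += gsub(/[Nn]/, "")
      other += length($0)         # whatever is left on the line (IUPAC codes etc.)
    }
    END {
      printf "A %d\nC %d\nG %d\nT %d\nN %d\nother %d\n", a, c, g, t, n, other
    }
  ' "$1"
}

# Demo on a throwaway two-record FASTA file (hypothetical data).
demo=$(mktemp)
printf '>seq1\nACGTNacgt\n>seq2\nAARY\n' > "$demo"
count_bases "$demo"
rm -f "$demo"
```

Unlike the `sort | uniq -c` pipeline, this folds upper and lower case together and lumps every other symbol into a single `other` count, so it trades detail for speed.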


(and wait...)

Until cows come home?

Edit: Took more like 30 minutes.

Output (GRCh38 from NCBI):

866420001 A
      2 B
598683433 C
600854940 G
      8 K
      8 M
165045996 N
     26 R
      4 S
868918077 T
     13 W
     33 Y
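
These counts can be used to check the two claims made earlier in the thread (AT-richness and Chargaff-like symmetry). A minimal sketch, assuming `uniq -c` style lines (count, then symbol) arrive on standard input; `base_stats` is a hypothetical helper name.

```shell
#!/bin/sh
# base_stats: read "count symbol" lines and print GC content plus A/T and G/C ratios.
base_stats() {
  awk '
    { count[$2] = $1 }
    END {
      if (!count["A"] || !count["C"] || !count["G"] || !count["T"]) {
        print "need non-zero A, C, G and T counts"
        exit 1
      }
      acgt = count["A"] + count["C"] + count["G"] + count["T"]
      printf "GC%% %.2f\n", 100 * (count["G"] + count["C"]) / acgt   # Ns and IUPAC codes excluded
      printf "A/T %.4f\n", count["A"] / count["T"]
      printf "G/C %.4f\n", count["G"] / count["C"]
    }
  '
}

# Counts copied from the GRCh38 output above.
base_stats <<'EOF'
866420001 A
598683433 C
600854940 G
165045996 N
868918077 T
EOF
```

On these counts the GC content comes out at roughly 41%, and both ratios are within about half a percent of 1.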

I just mentioned today in class that some non-trivial percent of bases in the human genome are Ns. Now I can compute it. Let's see what it adds up to (surprisingly lengthy construct):

cat output | grep -v N | awk ' {print $1}'  | datamash sum 1 | xargs  echo 100 \* 165045996 / | bc -l

Looks like

5.62

close to 6% of bases are labeled as Ns
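
Note that the pipeline above removes the N row with `grep -v N` before summing, so 5.62 is the number of Ns relative to the non-N bases. Dividing by all bases instead gives a slightly smaller share. A minimal sketch, assuming `uniq -c` style lines (count, then symbol) arrive on standard input; `n_percent` is a hypothetical helper name.

```shell
#!/bin/sh
# n_percent: read "count symbol" lines and print N as a percentage of all symbols.
n_percent() {
  awk '
    { total += $1 }               # denominator includes the N row
    $2 == "N" { n += $1 }
    END {
      if (total == 0) { print "no counts read"; exit 1 }
      printf "%.2f\n", 100 * n / total
    }
  '
}

# Counts copied from the GRCh38 output above.
n_percent <<'EOF'
866420001 A
2 B
598683433 C
600854940 G
8 K
8 M
165045996 N
26 R
4 S
868918077 T
13 W
33 Y
EOF
```

On these counts the sketch prints 5.32, so a little over 5% of positions are N when measured against the whole sequence.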


surprisingly lengthy construct

Because bash is not the fastest or best way to handle (and sort) 3E9 bases. A fast answer would use a C program with an array indexed by character code:

long count[UCHAR_MAX + 1]; ...

EMBOSS compseq does it in under 1 minute. But for some reason, the output doesn't contain lower-case bases. (Working on the code)

$ time(compseq -word 1  hg38.fa result.cmp)
Calculate the composition of unique words in sequences

real    0m58.809s
user    0m58.172s
sys 0m0.620s

output:

#
# Output from 'compseq'
#
# The Expected frequencies are calculated on the (false) assumption that every
# word has equal frequency.
#
# The input sequences are:
#   chr1
#   chr10
#   chr11
#   chr11_KI270721v1_random
#   chr12
#   chr13
#   chr14
#   chr14_GL000009v2_random
#   chr14_GL000225v1_random
#   chr14_KI270722v1_random
# ... et al.


Word size   1
Total count 3209286105

#
# Word  Obs Count   Obs Frequency   Exp Frequency   Obs/Exp Frequency
#
A   898285419       0.2799019   0.2500000   1.1196078
C   623727342       0.1943508   0.2500000   0.7774032
G   626335137       0.1951634   0.2500000   0.7806535
T   900967885       0.2807378   0.2500000   1.1229512

Other   159970322       0.0498461   0.0000000   10000000000.0000000

Amazingly fast indeed. Also exhibits that endemic and tragic shortsightedness that permeates every EMBOSS tool - an implementation with tacit flaws in just about every tool.

See how it ignores the Ns? As if those did not exist. But they do...

EMBOSS is a toolset that was invented too soon, too far ahead of its time, and with that comes the awkwardness of its interface that keeps it from being successful.


I guess the EMBOSS team tried to emulate GCG (successfully) and then went nowhere from there.


I meant the surprisingly lengthy construct as a qualifier of my own solution of getting the 5.62 number. It felt silly long just to take the fraction relative to a sum. Getting the 6% turned out to be longer than computing all the rest.


I know that A, T, C, G and N can appear in the reads of FASTQ and FASTA files. But why does your result have B, K, M, N, R, S, W and Y? Do reads in a reads data set contain letters other than A, T, C, G and N?


But why does your result have B, K, M, N, R, S, W and Y?

These letters belong to the degenerate (IUPAC ambiguity) nucleotide alphabet, e.g. S = 'Strong' = C or G.

see also http://plindenbaum.blogspot.fr/2013/07/g1kv37-vs-hg19.html
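
For reference, the standard IUPAC ambiguity codes can be written down as a small lookup. This is an illustrative sketch only; `iupac_expand` is a hypothetical helper name, not part of any tool mentioned in this thread.

```shell
#!/bin/sh
# iupac_expand CODE: print the bases an IUPAC nucleotide code stands for.
iupac_expand() {
  case "$1" in
    A|C|G|T) echo "$1" ;;
    R) echo "A or G" ;;        # puRine
    Y) echo "C or T" ;;        # pYrimidine
    S) echo "C or G" ;;        # Strong pairing
    W) echo "A or T" ;;        # Weak pairing
    K) echo "G or T" ;;        # Keto
    M) echo "A or C" ;;        # aMino
    B) echo "C, G or T" ;;     # not A
    D) echo "A, G or T" ;;     # not C
    H) echo "A, C or T" ;;     # not G
    V) echo "A, C or G" ;;     # not T
    N) echo "any base" ;;
    *) echo "unknown code"; return 1 ;;
  esac
}

# Explain the extra symbols seen in the GRCh38 counts above.
for code in B K M R S W Y N; do
  printf '%s = %s\n' "$code" "$(iupac_expand "$code")"
done
```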


Until cows come home?

They're always home in Ireland - never leave.


Thanks for your answer


Thanks for the reply

6.4 years ago
time(zgrep -v ">" ../reference/hg38/hg38.fa.gz  | fold -w1 | sort | uniq -c)

463840423 a
434444996 A
328257999 c
295469343 C
330651380 g
295683757 G
   3144 n
159967178 N
465881183 t
435086702 T

real    52m44.104s
user    45m5.512s
sys 0m49.220s

sum: 3209286105 and 455 fasta headers.
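
The upper- and lower-case rows can be merged to get per-base totals, and the lower-case share can be computed along the way. A minimal sketch, assuming `uniq -c` style lines on standard input and assuming the file follows the UCSC convention of writing soft-masked repeats in lower case; `merge_case` and `lowercase_percent` are hypothetical helper names.

```shell
#!/bin/sh
# merge_case: sum counts of upper- and lower-case versions of each symbol.
merge_case() {
  awk '
    { merged[toupper($2)] += $1 }
    END { for (b in merged) printf "%s %d\n", b, merged[b] }
  ' | sort
}

# lowercase_percent: share of all symbols that are lower-case a/c/g/t/n.
lowercase_percent() {
  awk '
    { total += $1 }
    $2 ~ /^[acgtn]$/ { lower += $1 }
    END { if (total > 0) printf "%.2f\n", 100 * lower / total }
  '
}

# Counts copied from the hg38.fa.gz output above.
counts='463840423 a
434444996 A
328257999 c
295469343 C
330651380 g
295683757 G
3144 n
159967178 N
465881183 t
435086702 T'

printf '%s\n' "$counts" | merge_case
printf '%s\n' "$counts" | lowercase_percent
```

On these counts the merged totals agree with the compseq table earlier in the thread (for example A = 898285419), and about half of the bases are lower case.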


This appears to be a genome build with alternate loci etc.
