Question: How to check size of genome?
arianc0 wrote:

Hey! Is it possible to check the size of a genome (number of bp) from a BAM (or SAM or FASTQ) file?

Thanks in advance.

bp bam • 129 views
modified 12 days ago by Alex Reynolds31k • written 12 days ago by arianc0

You are not checking the size of the genome if you are simply looking at your raw or aligned data. You are just counting the number of bases sequenced.

If you want to estimate the size of the genome, then you need to do something like this.

written 12 days ago by GenoMax95k

I want to know how to check the number of base pairs in a BAM file.

written 12 days ago by arianc0

You could simply run reformat.sh -Xmx10g in=your.bam from the BBMap suite. Its output will contain this information:

Input is being processed as unpaired
Input:                          9207996 reads           684032432 bases
Output:                         9207996 reads (100.00%)         684032432 bases (100.00%)

Options such as mappedonly=t should change the output, for example by restricting the counts to mapped reads. Test it out.
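
If BBMap is not at hand, roughly the same totals can be sketched with awk on SAM text. This is only a minimal sketch under stated assumptions, not what reformat.sh does internally: toy.sam is a made-up three-read file and count_bases is a hypothetical helper name. For a real BAM you would presumably pipe the output of samtools view your.bam into the same awk program (assuming samtools is installed).

```shell
# Build a tiny, made-up SAM file: one header line and three reads (tab-separated).
# Read r3 has FLAG 4, meaning it is unmapped.
printf '@SQ\tSN:chr1\tLN:1000\n' > toy.sam
printf 'r1\t0\tchr1\t1\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\n' >> toy.sam
printf 'r2\t16\tchr1\t5\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII\n' >> toy.sam
printf 'r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGTAC\tIIIIII\n' >> toy.sam

# Hypothetical helper: count reads and bases in toy.sam.
# Argument 1: set to 1 to skip unmapped reads, 0 to count everything.
count_bases() {
  awk -F '\t' -v mapped_only="$1" '
    /^@/ { next }                       # skip SAM header lines
    # FLAG bit 0x4 set means the read is unmapped
    mapped_only == 1 && int($2 / 4) % 2 == 1 { next }
    { reads++; bases += length($10) }   # column 10 is the read sequence
    END { printf "%d reads\t%d bases\n", reads, bases }
  ' toy.sam
}

all_counts=$(count_bases 0)
mapped_counts=$(count_bases 1)
echo "all:    $all_counts"
echo "mapped: $mapped_counts"
```

Note that this sketch counts every alignment record, so secondary or supplementary alignments in a real file would inflate the totals unless they are filtered out first.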

modified 12 days ago • written 12 days ago by GenoMax95k

I want to know how to check number of base pairs in bam file

That's not the same as the size of the genome though. What exactly do you want?

written 12 days ago by WouterDeCoster45k
Alex Reynolds31k wrote:

Reads will overlap, so you can't simply add up the lengths of reads. For example, to count the number of unique bases covered by reads on each chromosome of assembly hg38, using BEDOPS, bash, awk, and the UCSC Kent utilities:

$ ASSEMBLY=hg38
$ bam2bed --reduced < reads.bam | bedmap --echo --bases-uniq --delim '\t' <( fetchChromSizes ${ASSEMBLY} | grep -v "_" | awk -v FS="\t" -v OFS="\t" '{ print $1, "0", $2 }' | sort-bed - ) - > unique_bases_per_chromosome.bed

To get the total unique bases over the genome, sum up the last column in the result:

$ awk -v FS="\t" '{ s += $4 } END { print s }' unique_bases_per_chromosome.bed > total_unique_bases_over_assembly.txt

For SAM files, use sam2bed in place of bam2bed. For FASTQ, you would first use a mapping tool to align the raw reads to the genome, producing a BAM file.
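
To see why overlapping reads matter, here is a toy sketch in plain awk that contrasts the naive sum of read lengths with the number of unique bases covered. It is an illustration only, not how BEDOPS computes its result: toy_reads.bed is a made-up file, and the merge step assumes the intervals are already sorted by chromosome and start.

```shell
# Three made-up reads on chr1 as sorted BED intervals (0-based, half-open).
# The first two overlap by 50 bases.
printf 'chr1\t0\t100\nchr1\t50\t150\nchr1\t300\t400\n' > toy_reads.bed

# Naive total: sum of read lengths, which counts overlapping bases twice.
naive=$(awk -F '\t' '{ n += $3 - $2 } END { print n }' toy_reads.bed)

# Unique total: merge overlapping intervals first, then sum the merged lengths.
unique=$(awk -F '\t' '
  NR == 1 || $1 != chrom || $2 > end {
    if (NR > 1) n += end - start   # close the previous merged interval
    chrom = $1; start = $2; end = $3
    next
  }
  $3 > end { end = $3 }            # extend the current merged interval
  END { if (NR > 0) n += end - start; print n }
' toy_reads.bed)

echo "naive=$naive unique=$unique"
```

Here the naive sum is 300 bases, while only 250 distinct positions are covered, because the 50 bases shared by the first two reads are counted once.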

modified 11 days ago • written 12 days ago by Alex Reynolds31k