Question

Tutorial:Calculation of TSS Scores for ATAC-seq Data

3

Entering edit mode

6.0 years ago

Sael ▴ 30

Hello Everyone,

I want to share with you guys my method for the calculation of TSS scores for ATAC-seq data. I am new to bioinformatics and I learned so much from this place. You can do the calculation in R (ATACqc package) but it is very fast doing it in UNIX command line. I used ENCODE definition of TSS regions and the corresponding 100 bp flanks.

The method does the following:

1.Download hg38 annotation file. 
2.Define and extract TSS regions (1000 bp +/- TSS)
3.Check that the defined TSS regions do not extend outside chromosomal bounds. 
4.Convert the file to SAF file format.
5.Define a 100 bp +/- flanking regions at the edges of the TSS regions you made in step 3.
6.Convert the file to SAF file format.
5.Use featureCounts to counts reads aligned to the defined TSS regions (Signal).
6.Use featureCounts to counts reads aligned to the defined 100 bp +/- flanking regions at the edges of the TSS regions (Background/Noise).
7. Divide reads from signal by reads from (Background/Noise) and this the fold change

I used this method to check TSS score from ChiP input (no expected enrichment at all) and got a value below 5 (poor) consistent with ENCODE guidelines. https://www.encodeproject.org/atac-seq/

-Please guys correct me if I am missing something.
-Please guys have try this method in your own data to see if it works.

First get hg38 refSeq annotations (Credit to Alex Reynolds , Table browser +/- 2Kb of TSS export)

wget -qO- http://hgdownload.cse.ucsc.edu/goldenPath/hg38/database/refGene.txt.gz | gunzip -c - | awk 'BEGIN{ OFS="\t" }{ print $3, $5, $6, $13, $10, $4  }' - | sort-bed - > refGene.bed

Then split them by strand and pad around the stranded-start position of the annotation (Take areas around TSS -/+ 1000)

awk '($6 == "+") { print $0 }' refGene.bed | awk 'BEGIN{ OFS="\t" }($2 > 1000){ print $1, ($2 - 1000), ($2 + 1000), $4, $5, $6  }' > refGene.tss.for.padded.bed
awk '($6 == "-") { print $0 }' refGene.bed | awk 'BEGIN{ OFS="\t" }($3 > 1000){ print $1, ($3 - 1000), ($3 + 1000), $4, $5, $6  }' > refGene.tss.rev.padded.bed


bedops --everything refGene.tss.for.padded.bed refGene.tss.rev.padded.bed > refGene.tss.padded.bed

Keep only TSS regions within chromosomal bounds.

fetchChromSizes hg38 | awk '{ print $1"\t0\t"$2; }' | sort-bed - > hg38.bounds.bed
bedops --element-of 100% refGene.tss.padded.bed hg38.bounds.bed > refGene.tss.padded.filtered.bed

Convert to SAF

awk 'BEGIN{FS=OFS="\t"; print "GeneID\tChr\tStart\tEnd\tStrand"}{print $4, $1, $2+1, $3, "."}' refGene.tss.padded.filtered.bed > refGene.tss.padded.filtered.bed.saf

Count reads using FeatureCounts.

featureCounts -T 6 -a refGene.tss.padded.filtered.bed.saf -F SAF -o readCountInPeaks.txt atac.bam

Record number of successful alignments to the TSS

Second part is to do the same thing but on different regions (100bp -/+ of TSS regions)

Use the same file for the TSS regions Use flank from bedtools to get these 100bp -/+ regions (with the exclusion of the TSS region) Flank tool requires a genome file (different from the one you used to get the TSS) to get the 100bp -/+ First you need a genome file defining the length of each chromosome or contig.

samtools faidx genome.fa
cut -f 1,2 genome.fa.fai > chrom.sizes (Credit to igor, https://www.biostars.org/p/206140/#206169)

Use flank from bedtools

bedtools flank -i refGene.tss.padded.filtered.bed -g hg38_size_chrom.genome -b 100 > Flanks_100_up_down.bed

Convert to SAF file

awk 'BEGIN{FS=OFS="\t"; print "GeneID\tChr\tStart\tEnd\tStrand"}{print $4, $1, $2+1, $3, "."}' Flanks_100_up_down.bed > Flanks_100_up_down.bed.saf

Run with featurecounts

featureCounts -T 6 -a Flanks_100_up_down.bed.saf -F SAF -o readCountInPeaks.txt atac.bam

Record number of successful alignments to the 100bp -/+

To get TSS Enrichemnt score Divide total number of successful alignments to the TSS by total number of successful alignments to the 100bp -/+

Transcription start site (TSS) enrichment values are dependent on the reference files used; cutoff values for high quality data are listed below.

GRCh38 Refseq TSS annotation

below 5 Concerning
5 - 7 Acceptable
Above 7 Ideal

https://www.encodeproject.org/atac-seq/

next-gen-sequencing • 5.5k views

ADD COMMENT • link updated 2.3 years ago by Ram 45k • written 6.0 years ago by Sael ▴ 30