Question

How can I count reads aligned to features and extract poly-T stretch information from the Read ID?


6.4 years ago

biplab ▴ 110

I have modified the R1 read IDs by adding the length of the poly-T stretch found in the corresponding R2 read. I then aligned the R1 reads to the genome, so I now have SAM files. I can use htseq-count to count the reads aligned to each feature. In addition to the per-feature read counts, I would like to get the average poly-T length per feature. Here is an example read from one of the SAM files:

J00113:322:0001:30:0:GT 16  IV  1450285 22  150M    *   0   0 ACCAGTAGTGTGTCTTCTCTTTGCCTTGGCAGCCCAGTTGTGAGATCTAGTCTTAGCGGATGGGTAACCACAAGAGGAACAGGTCTTCTTTTGAACATGGAAAGAACGACGACCAC
ATCTGTTACACAAGGTGTGAGATTTGATTCTCCG  7-<-FJFJJFJJFFJJ<JJFJAF7JFFAAFJFF<FAJAJFFA-7JJ7F<AAFJJ<JFFFJJFJJFJF7<7-F77<-7JF7A-AJFJFJJJJJF<FJFJJFA7FF<--7-<--F-F77J<JFJF-<<<FFJJJJFF<JFJJJJFJA-AA-A  AS:i:-23    X

The 30 in the read ID field is the length of the poly-T stretch in the R2 read, which was not used for alignment. Thanks in advance for any ideas on how to extract the average poly-T length along with the counts for each feature.

next-gen rna-seq • 1.3k views


score 3 · Accepted Answer · 2017-12-21


6.4 years ago

Devon Ryan 104k

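Have htseq-count (or featureCounts, which is much faster) output a BAM/SAM file in which each alignment is labeled with the gene (if any) it was assigned to. In htseq-count this is the -o option; featureCounts uses -R for per-read assignment output, since its -o names the count table. Then parse that file for the auxiliary tag the tool appends (XF for htseq-count, XT for featureCounts). The tag holds the gene to which the read was assigned, so after parsing the poly-T length out of the read ID you can increment a per-gene read count and a per-gene poly-T total in a hash, then divide the total by the count to get the average.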
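A minimal Python sketch of that parsing step, in case it helps. It assumes the annotated SAM came from htseq-count -o, so the gene is in an XF:Z: tag and unassigned reads carry values starting with "__" (such as __no_feature). It also assumes the poly-T length is the fourth colon-separated field of the read ID, as in the example read in the question. The function names are made up for illustration, and multi-mapping or secondary alignments are not treated specially.

```python
from collections import defaultdict


def polyt_length(read_id, field=3):
    """Return the poly-T length stored in the read ID.

    Assumption: the length is the colon-separated field at index 3,
    e.g. the 30 in "J00113:322:0001:30:0:GT".
    """
    return int(read_id.split(":")[field])


def summarize(sam_lines, tag="XF"):
    """Return {gene: (read_count, mean_polyT_length)} from annotated SAM lines."""
    counts = defaultdict(int)   # reads assigned to each gene
    totals = defaultdict(int)   # summed poly-T lengths for each gene
    prefix = tag + ":Z:"
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip SAM header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        # Optional tags start at column 12; look for the assignment tag.
        gene = next((f[len(prefix):] for f in fields[11:]
                     if f.startswith(prefix)), None)
        if gene is None or gene.startswith("__"):
            continue  # unassigned, e.g. __no_feature or __ambiguous[...]
        counts[gene] += 1
        totals[gene] += polyt_length(fields[0])
    return {gene: (counts[gene], totals[gene] / counts[gene]) for gene in counts}
```

To use it on a file, something like `with open("annotated.sam") as fh: result = summarize(fh)` should work, where annotated.sam is a placeholder name. For featureCounts output you would pass `tag="XT"` and check how unassigned reads are marked there, since the "__" convention is specific to htseq-count.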