Question

Annotating reads with the name of the feature to which they map

4.4 years ago

lechu ▴ 20

Hi,

I have a long list of chromosome positions with strand information, all produced by sam2tsv (http://lindenb.github.io/jvarkit/Sam2Tsv.html). I am looking for a way to annotate this list with feature names from a BED file (or a GTF file, whichever works better). The desired minimal output is a table like the one below, but with an extra column, e.g. the transcript name:

#Read-Name  Flag  MAPQ  CHROM  READ-POS0  READ-BASE  READ-QUAL  REF-POS1  REF-BASE  CIGAR-OP
r001        163   30    ref    0          T          .          7         T         M
r001        163   30    ref    1          T          .          8         T         M
r001        163   30    ref    2          A          .          9         A         M

Essentially it boils down to checking whether a position falls within a BED interval and whether the strand matches, so maybe something like a two-file lookup in awk would do? I have never done this, so I am not sure whether awk is anywhere near suitable. What would be the most efficient way to do it for a very long table? I'm hoping for some tips to get me started so that I don't get stuck with an inefficient approach.

An alternative could be to use bedtools intersect to annotate reads with a feature name, but in this case I need the QNAME::feature match for each read individually. Is it possible to get this with bedtools?

Cheers, Lech

RNA-Seq • 1.6k views

updated 4.4 years ago by A. Domingues ★ 2.7k • written 4.4 years ago by lechu ▴ 20

score 2 · Accepted Answer · 2021-03-03

If I understood the goal correctly, I have used two different strategies in the past:

bedtools intersect

bamToBed -i sample.bam | bedtools intersect -a - -b features.bed -s -wa -wb -bed -f 1.0 > $output1

You can play with the flags to make the intersection same-strand (-s), opposite-strand (-S), or unstranded (neither flag), and to set the required overlap (-f; the 1.0 in the command above requires the whole read to lie within the feature). The output is a text file with the read information plus the feature information (start, end, name, ...). It can be cleaned up a bit with command-line tools such as awk or cut. The features can also be given as a GFF file. The caveat with this approach is that some reads will have more than one entry when they overlap more than one feature. How to deal with these depends on the goal of the analysis.

HTSeq (> v0.9, I think)

samtools view -h ${aln} | htseq-count --samout=${exp}.test1.sam -s yes -f sam -m intersection-nonempty - ${ESSENTIAL_GENES} > ${exp}.counts

This will generate a classical table of counts, which you can discard, but also a new SAM file with an extra tag (XF) holding the gene ID (or whatever ID you have):

NB501946:201:H3HMGAFXY:1:21112:24049:15895 16 I 3115 0 22M * 0 0 TGATGTTCTACGCTTAAATTTT EEEEEEEEEEEEEEEEEEEEEA XA:i:0 MD:Z:22 NM:i:0 XM:i:2 XF:Z:__too_low_aQual
NB501946:201:H3HMGAFXY:2:11202:12920:18180 16 I 3685 255 21M * 0 0 TATCTACTAGGAATAACTCGA EEEEEEEEEEEEEEEEEEEEA XA:i:0 MD:Z:21 NM:i:0 XF:Z:__no_feature
NB501946:201:H3HMGAFXY:2:21105:7328:11769 0 I 3738 255 21M * 0 0 TGTAAAATAGAGGATCAGACC AAEEEEEEEAAEE6EEEEEEE XA:i:0 MD:Z:21 NM:i:0 XF:Z:__no_feature
NB501946:201:H3HMGAFXY:2:11112:5811:6494 16 I 3746 255 21M * 0 0 AGAGGATCAGACCCAAAATTC EEEEEEEEEEEEEEEEEEEEA XA:i:0 MD:Z:21 NM:i:0 XF:Z:WBGene00023193
NB501946:201:H3HMGAFXY:2:21204:5691:10822 16 I 3746 255 22M * 0 0 AGAGGATCAGACCCAAAATTCA EEEEEEEEEEEEEAEEEEEEEA XA:i:0 MD:Z:22 NM:i:0 XF:Z:WBGene00023193
NB501946:201:H3HMGAFXY:1:21209:12787:3406 0 I 3747 255 21M * 0 0 GAGGATCAGACCCAAAATTCA AEEEEEEEEEEEEEEEEEEEE XA:i:0 MD:Z:21 NM:i:0 XF:Z:__no_feature
NB501946:201:H3HMGAFXY:4:11512:6158:17507 16 I 3747 255 21M * 0 0 GAGGATCAGACCCAAAATTCA EEEEEEEEEEEEEEEEEEEEA XA:i:0 MD:Z:21 NM:i:0 XF:Z:WBGene00023193
NB501946:201:H3HMGAFXY:3:11503:17322:16736 16 I 3747 255 20M * 0 0 GAGGATCAGACCCAAAATTC EEEEEEEEEEEEEEEEEEEA XA:i:0 MD:Z:20 NM:i:0 XF:Z:WBGene00023193

This can then be converted into a table like the one in your example, with the feature as an added column. See also https://github.com/simon-anders/htseq/issues/65
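One possible way to do that conversion, sketched with awk: first pull the read name and the XF tag out of the annotated SAM file, then use the classic two-file lookup (NR == FNR) to append the feature to every row of the sam2tsv table. The function names xf_lookup and annotate_tsv and the file names in the usage note are made up for illustration. The sketch assumes tab-separated input, a sam2tsv header line starting with #, and read names that are unique per alignment (for paired-end data both mates share a name, so the last XF value seen would win).

```shell
#!/usr/bin/env bash
# Step 1: extract "read name <TAB> feature" from an htseq-count --samout file.
xf_lookup() {   # usage: xf_lookup annotated.sam
  awk 'BEGIN { FS = OFS = "\t" }
       /^@/ { next }                        # skip SAM header lines, if any
       {
         for (i = 12; i <= NF; i++)         # optional tags start at field 12
           if ($i ~ /^XF:Z:/) { print $1, substr($i, 6); break }
       }' "$1"
}

# Step 2: two-file lookup; append the feature to each sam2tsv row.
# Assumption: the lookup file is not empty (NR == FNR relies on that).
annotate_tsv() {   # usage: annotate_tsv lookup.tsv sam2tsv_output.tsv
  awk 'BEGIN { FS = OFS = "\t" }
       NR == FNR { feat[$1] = $2; next }    # first file: read name -> feature
       /^#/ { print $0, "FEATURE"; next }   # extend the sam2tsv header line
       { print $0, (($1 in feat) ? feat[$1] : "NA") }' "$1" "$2"
}
```

With hypothetical file names, this would be run as xf_lookup sample.test1.sam > lookup.tsv followed by annotate_tsv lookup.tsv sample.sam2tsv.tsv > annotated.tsv. The lookup table is held in memory while the long per-base table is streamed once, which should scale reasonably as long as the number of reads fits in RAM.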