Question: Managing data in R
2.9 years ago, ioannis wrote:

Hello community,

I am new to R. I understand the basics of how objects work, but to analyse the statistics of my data I need to know much more than that. I have 75 bp reads that include hydroxymethylated cytosines (5hmC). From these reads I extracted the ones that start with CCGG, because the protocol is based on restriction digestion with MspI.

cat bsz_S2_R1_001.fastq | paste - - - - | awk -F '\t'  '(substr($2,1,4)=="CCGG")' | tr "\t" "\n" > bsz_tagged.fastq

I then trimmed the adapters with trim_galore and aligned the reads to the reference genome with Bowtie2 (version 2.3.1). From the SAM file, using grep, I produced a text file with the scaffold and genomic position of each read.
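For reference, the extraction step described above might look like the following minimal sketch. It uses awk rather than grep, and the file names and the toy SAM records are hypothetical, written out here only so the example runs on its own. It relies on the SAM layout, where column 3 is the reference (scaffold) name and column 4 is the 1-based leftmost mapping position.

```shell
# Hypothetical toy SAM file: one header line, two mapped reads, one unmapped read.
printf '@SQ\tSN:scaffold_1\tLN:1000\n' > example.sam
printf 'r1\t0\tscaffold_1\t100\t42\t4M\t*\t0\t0\tCCGG\tIIII\n' >> example.sam
printf 'r2\t4\t*\t0\t0\t*\t*\t0\t0\tCCGG\tIIII\n' >> example.sam
printf 'r3\t16\tscaffold_2\t100\t42\t4M\t*\t0\t0\tCCGG\tIIII\n' >> example.sam

# Skip header lines (they start with @) and reads with no reference name (*),
# then print the scaffold (column 3) and the position (column 4), tab-separated.
awk -F '\t' '!/^@/ && $3 != "*" { print $3 "\t" $4 }' example.sam > example_positions.txt

cat example_positions.txt
```

Note that the position in column 4 is the leftmost aligned base regardless of strand, so for reverse-strand reads it is not the end of the read that carried the CCGG site.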

Now I need to plot the distribution of the reads within the scaffolds. However, the same position value can occur on several scaffolds, where it refers to a different location in the genome. So I need to order my data by scaffold, then by genomic position within each scaffold, and then count how often each position value is duplicated within each scaffold.
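The ordering and counting described above can be sketched on the command line before the data ever reaches R. This is only one possible approach, assuming a tab-separated file with the scaffold in column 1 and the position in column 2; the file names and toy rows are hypothetical.

```shell
# Hypothetical toy input: position 100 occurs on both scaffolds.
printf 'scaffold_2\t100\nscaffold_1\t250\nscaffold_1\t100\n' > toy_positions.txt
printf 'scaffold_1\t100\nscaffold_2\t100\nscaffold_1\t90\n' >> toy_positions.txt

# Sort by scaffold name (column 1), then numerically by position (column 2).
# uniq -c collapses identical adjacent lines and prefixes each with its count.
# The final awk reorders the fields to: scaffold, position, count.
LC_ALL=C sort -t "$(printf '\t')" -k1,1 -k2,2n toy_positions.txt \
  | uniq -c \
  | awk '{ print $2 "\t" $3 "\t" $1 }' > toy_counts.txt

cat toy_counts.txt
```

Because the scaffold is part of every line, position 100 on scaffold_1 and position 100 on scaffold_2 are counted separately. The resulting three-column table has one row per distinct position, so it is much smaller than the original and easier to load into R for plotting.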

Basically, I have a huge table with two columns and millions of rows. I wanted to attach two tables as an image at the end of this post to show how I want to handle the data, but I could not make the image work, sorry.

I would appreciate it if anyone could tell me which commands are essential, or point me to a package that can handle data in this way.

Cheers, Ioannis

Tags: sequencing, next-gen, R

Sounds like you wanna take a look at the GenomicRanges package and Rsamtools.

written 2.9 years ago by Benn

I will have a look at the instructions! Thank you!

written 2.9 years ago by ioannis