Managing data in R
7.1 years ago
ioannis ▴ 50

Hello community,

I am new to R. I understand the basics of how objects work, but to analyse my data statistically I need to know much more than that. I have 75 bp reads that include hydroxymethylated cytosines (5hmC). From these reads I extracted the ones that start with CCGG, because the protocol is based on restriction digestion with MspI.

cat bsz_S2_R1_001.fastq | paste - - - - | awk -F '\t'  '(substr($2,1,4)=="CCGG")' | tr "\t" "\n" > bsz_tagged.fastq

I then trimmed the adapters with trim_galore and aligned the reads to the reference genome with Bowtie2 (version 2.3.1). From the SAM file, using grep, I got a text file with the scaffold and the genomic position of each read.
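
The exact grep command is not given in the post. As a rough sketch of that extraction step, assuming a standard SAM file in which column 3 is the reference (scaffold) name and column 4 is the 1-based leftmost mapping position, something like the following would produce the two-column text file. The file names and the three demo reads are made up for illustration.

```shell
# Hypothetical file names; the post does not give the real ones.
sam=aligned_demo.sam
positions=positions.txt

# Tiny stand-in SAM file so the sketch runs on its own:
# one header line, two mapped reads, one unmapped read.
printf '@SQ\tSN:scaffold_1\tLN:1000\n' > "$sam"
printf 'r1\t0\tscaffold_1\t100\t42\t4M\t*\t0\t0\tCCGG\tIIII\n' >> "$sam"
printf 'r2\t4\t*\t0\t0\t*\t*\t0\t0\tCCGG\tIIII\n' >> "$sam"
printf 'r3\t16\tscaffold_2\t100\t42\t4M\t*\t0\t0\tCCGG\tIIII\n' >> "$sam"

# Skip header lines (starting with @) and unmapped reads (RNAME is "*"),
# then keep column 3 (scaffold) and column 4 (position), tab-separated.
awk -F '\t' '!/^@/ && $3 != "*" { print $3 "\t" $4 }' "$sam" > "$positions"

cat "$positions"
```

Using awk on the column values rather than grep on the whole line avoids accidentally matching text in the sequence or quality fields.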

Now I need to plot the distribution of the reads within the scaffolds. The same genomic position value can occur in more than one scaffold, but it refers to a different place in the genome each time. So I need to order my data by scaffold, then by genomic position within each scaffold, and then find the frequency of each duplicated position value within each scaffold.
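
To make the ordering and counting concrete, here is a minimal sketch of the same operation done with standard command-line tools before the data ever reach R. It assumes a tab-separated file with the scaffold in column 1 and the position in column 2, and scaffold names without spaces. The file names and demo values are hypothetical.

```shell
# Hypothetical input: tab-separated scaffold and position, one line per read.
reads=positions_demo.txt
counts=position_counts.txt

printf 'scaffold_2\t100\nscaffold_1\t250\nscaffold_1\t100\n' > "$reads"
printf 'scaffold_2\t100\nscaffold_1\t100\nscaffold_1\t100\n' >> "$reads"

# Sort by scaffold (column 1, as text), then by position (column 2, numerically),
# so identical scaffold/position pairs end up on adjacent lines.
# uniq -c counts each run of identical lines; awk moves the count to column 3.
LC_ALL=C sort -t "$(printf '\t')" -k1,1 -k2,2n "$reads" \
  | uniq -c \
  | awk '{ print $2 "\t" $3 "\t" $1 }' > "$counts"

cat "$counts"
```

Position 100 is counted separately for scaffold_1 and scaffold_2 because the whole line (scaffold plus position) is compared, not the position alone. The resulting three-column table is much smaller than the original and can be loaded into R with read.table for plotting.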

Basically I have a huge table with two columns and millions of rows. The image linked below shows two tables as an example of how I want to handle the data:

https://ibb.co/fBEitv (could not make the image work, sorry)

I would appreciate it if anyone could tell me which commands are essential, or which package can handle data in this way.

Cheers, Ioannis

R next-gen sequencing • 1.2k views

Sounds like you wanna take a look at the GenomicRanges package and Rsamtools.

I will have a look at the instructions! Thank you!
