Generating Group/Cluster Lists from fastq files
4 weeks ago
niruf • 0

Hello Everyone,

I currently have raw R1 and R2 fastq.gz sequencing files. I would like to do the following with the files:

  1. Generate clusters of sequences that are 100% identical, then create a list containing the sequence and total number of hits for each of the top 20 clusters (similar to what I have at the end of my post).
  2. Generate a group plot like the image below. This is secondary to generating the list mentioned above.

I was hoping to just do this in Galaxy or R, but I cannot find a guide or tutorial for this type of analysis.

  • Do I have to combine the R1 and R2 files first?
  • What type of QC needs to be done on the raw reads?
  • After QC, do I need to convert the data from fastq to something else?
  • Do I need to do some sort of alignment?

Does anyone have a guide or instructions that someone with very little experience could follow to do this kind of analysis?

Best,

BD

>A00904:376:H5MVTDSXC:3:1108:14118:11224
TCCATGACTCATGAACAAGAGACCCTATAGTGAGTCGTATTAGGCACCCATCTCTCTCCTTCTAGCCTCCGCTAGTCAAAAATTGGCGTACTCACCAGTCGCCGCTCTCGCCTCTTGCTGTGTGCACCTCAGCAAGCCGAGTCCTGCGTCGAGAGAACTTCTCTGGTCTCCCAGAGTCACACAACAGATGGGCACC

>A00904:376:H5MVTDSXC:3:1108:11731:11851
CTGGGAGACCAGAGAAGTTCTCTCGACGCAGGACTCGGCTTGCTGAGGTGCACACAGCAAGAGGCGAGAGCGGCGACTGGTGAGTACGCCAATTTTTGACTAGCGGAGGCTAGAAGGAGAGAGATGGGTGCCTAATACGACTCACTATAGGGTCTCTTGTTCATGAGTCATGGA

>A00904:376:H5MVTDSXC:3:1108:22010:11334
GGTGCCCATCTGTTGTGTGACTCTGGGAGACCAGAGAAGTTCTCTCGACGCAGGACTCGGCTTGCTGAGGTGCACACAGCAAGAGGCGAGAGCGGCGACTGGTGAGTACGCCAATTTTTGACTAGCGGAGGCTAGAAGGAGAGAGATGGGTGCCTAATACGACTCACTATAGGGTCTCTTGTTCATGAGTCATGGA

>A00904:376:H5MVTDSXC:3:1108:15573:11271
ACTCTGGGAGACCAGAGAAGTTCTCTCGACGCAGGACTCGGCTTGCTGAGGTGCACACAGCAAGAGGCGAGAGCGGCGACTGGTGAGTACGCCAATTTTTGACTAGCGGAGGCTAGAAGGAGAGAGATGGGTGCCCTCACTATAGGGTCTCTTGTTCATGAGTCATGGA

>A00904:376:H5MVTDSXC:3:1108:14597:11741
GTGTGACTCTGGGAGACCAGAGAAGTTCTCTCGACGCAGGACTCGGCTTGCTGAGGTGCACACAGCAAGAGGCGAGAGCGGCGACTGGTGAGTACGCCAATTTTTGACTAGCGGAGGCTAGAAGGAGAGAGATGGGTGCCTAATACGACTCACTATAGGGTCTCTTGTTCATGAGTCATGGA

>A00904:376:H5MVTDSXC:3:1108:22037:11443
GTTGTGTGACTCTGGGAGACCAGAGAAGTTCTCTCGACGCAGGACTCGGCTTGCTGAGGTGCACACAGCAAGAGGCGAGAGCGGCGACTGGTGAGTACGCCAATTTTTGACTAGCGGAGGCTAGAAGGAGAGAGATGGGTGCCTAATACGACTCACTATAGGGTCTCTTGTTCATGAGTCATGGA

>A00904:376:H5MVTDSXC:3:1108:24623:12915
CTCTCGACGCAGGACTCGGCTTGCTGAGGTGCACACAGCAAGAGGCGAGAGCGGCGACTGGTGAGTACGCCAATTTTTGACTAGCGGAGGCTAGAAGGAGAGAGATGGGTGCCATACGACTCACTATAGGGTCTCTTGTTCATGAGTCATGGATGATGGGTGCCTAATACGCTCACTATAGGGTCTCTTGTTCATGAGTCATGGA

Top 7 sequences from our data

Sequencing RNA-seq DNA-Seq • 165 views
4 weeks ago
GenoMax 143k

I was hoping to just do this in galaxy

This is not available in Galaxy, but clumpify.sh from the BBMap suite can cluster sequences and dedupe them. For more info, see: Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files. And remove duplicates.
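To make "clusters of 100% identical sequences" concrete, here is a minimal plain-Python sketch that counts exact-match sequences in one FASTQ file and reports the most frequent ones. It is not how clumpify.sh works internally and is not part of BBMap. The function names (count_identical_reads, top_clusters) are made up for this example, and it assumes standard 4-line FASTQ records and enough memory to hold every unique sequence.

```python
import gzip
from collections import Counter
from itertools import islice


def open_maybe_gzip(path):
    """Open a plain or gzip-compressed text file for reading."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "rt")


def count_identical_reads(fastq_path):
    """Count how many reads share exactly the same sequence.

    Assumes 4-line FASTQ records (header, sequence, '+', qualities),
    so the sequence is every 4th line starting from the 2nd line.
    """
    counts = Counter()
    with open_maybe_gzip(fastq_path) as handle:
        for seq in islice(handle, 1, None, 4):
            counts[seq.strip()] += 1
    return counts


def top_clusters(counts, n=20):
    """Return the n most frequent sequences as (sequence, count) pairs."""
    return counts.most_common(n)
```

Calling top_clusters(count_identical_reads("sample_R1.fastq.gz")) would return up to 20 (sequence, count) pairs; the file name is hypothetical. Reads that are reverse complements of each other, or that differ only in length, are counted as separate clusters in this sketch.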

Similar to what I have at the end of my post

These sequences are in FASTA format, not FASTQ. If that is what you need, reformat.sh from the BBMap suite can convert FASTQ to FASTA. The bbmerge.sh tool can also merge the paired reads into a single representation of each library fragment.
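For readers who want to see what the FASTQ to FASTA conversion involves, below is a small plain-Python sketch. It is an illustration only, not reformat.sh and not a BBMap API; the function name fastq_to_fasta is hypothetical, and it assumes 4-line FASTQ records. It keeps the read name and sequence and drops the quality line.

```python
import gzip


def fastq_to_fasta(fastq_path, fasta_path):
    """Write each FASTQ record as a two-line FASTA record.

    Returns the number of records written. Assumes 4-line FASTQ
    records; blank lines between records are skipped.
    """
    opener = gzip.open if str(fastq_path).endswith(".gz") else open
    written = 0
    with opener(fastq_path, "rt") as fin, open(fasta_path, "w") as fout:
        lines = iter(fin)
        for header in lines:
            header = header.strip()
            if not header:
                continue  # tolerate stray blank lines
            seq = next(lines, "").strip()
            next(lines, "")  # '+' separator line, not needed
            next(lines, "")  # quality string, dropped in FASTA
            # Replace the leading '@' of the FASTQ header with '>'
            fout.write(">" + header[1:] + "\n" + seq + "\n")
            written += 1
    return written
```

The output has the same shape as the sequences pasted in the question: a ">" line with the read name followed by the sequence.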
