Hello Everyone,
I currently have raw R1 and R2 fastq.gz sequencing files. I would like to do the following with the files:
- Generate clusters for sequences that are 100% identical. Create a list containing the sequences and total number of hits for the top 20 clusters. (Similar to what I have at the end of my post)
- Generate a group plot like the image below. This is secondary to generating the list mentioned above.
I was hoping to just do this in Galaxy or R, but I cannot find a guide/tutorial for this type of analysis.
- Do I have to combine the R1 and R2 files first?
- What type of QC needs to be done on the raw reads?
- After QC, do I need to convert the data from fastq to something else?
- Do I need to do some sort of alignment?
Does anyone have a guide or instructions that someone with very little experience could follow to do this kind of analysis?
Best,
BD
>A00904:376:H5MVTDSXC:3:1108:14118:11224
TCCATGACTCATGAACAAGAGACCCTATAGTGAGTCGTATTAGGCACCCATCTCTCTCCTTCTAGCCTCCGCTAGTCAAAAATTGGCGTACTCACCAGTCGCCGCTCTCGCCTCTTGCTGTGTGCACCTCAGCAAGCCGAGTCCTGCGTCGAGAGAACTTCTCTGGTCTCCCAGAGTCACACAACAGATGGGCACC
>A00904:376:H5MVTDSXC:3:1108:11731:11851
CTGGGAGACCAGAGAAGTTCTCTCGACGCAGGACTCGGCTTGCTGAGGTGCACACAGCAAGAGGCGAGAGCGGCGACTGGTGAGTACGCCAATTTTTGACTAGCGGAGGCTAGAAGGAGAGAGATGGGTGCCTAATACGACTCACTATAGGGTCTCTTGTTCATGAGTCATGGA
>A00904:376:H5MVTDSXC:3:1108:22010:11334
GGTGCCCATCTGTTGTGTGACTCTGGGAGACCAGAGAAGTTCTCTCGACGCAGGACTCGGCTTGCTGAGGTGCACACAGCAAGAGGCGAGAGCGGCGACTGGTGAGTACGCCAATTTTTGACTAGCGGAGGCTAGAAGGAGAGAGATGGGTGCCTAATACGACTCACTATAGGGTCTCTTGTTCATGAGTCATGGA
>A00904:376:H5MVTDSXC:3:1108:15573:11271
ACTCTGGGAGACCAGAGAAGTTCTCTCGACGCAGGACTCGGCTTGCTGAGGTGCACACAGCAAGAGGCGAGAGCGGCGACTGGTGAGTACGCCAATTTTTGACTAGCGGAGGCTAGAAGGAGAGAGATGGGTGCCCTCACTATAGGGTCTCTTGTTCATGAGTCATGGA
>A00904:376:H5MVTDSXC:3:1108:14597:11741
GTGTGACTCTGGGAGACCAGAGAAGTTCTCTCGACGCAGGACTCGGCTTGCTGAGGTGCACACAGCAAGAGGCGAGAGCGGCGACTGGTGAGTACGCCAATTTTTGACTAGCGGAGGCTAGAAGGAGAGAGATGGGTGCCTAATACGACTCACTATAGGGTCTCTTGTTCATGAGTCATGGA
>A00904:376:H5MVTDSXC:3:1108:22037:11443
GTTGTGTGACTCTGGGAGACCAGAGAAGTTCTCTCGACGCAGGACTCGGCTTGCTGAGGTGCACACAGCAAGAGGCGAGAGCGGCGACTGGTGAGTACGCCAATTTTTGACTAGCGGAGGCTAGAAGGAGAGAGATGGGTGCCTAATACGACTCACTATAGGGTCTCTTGTTCATGAGTCATGGA
>A00904:376:H5MVTDSXC:3:1108:24623:12915
CTCTCGACGCAGGACTCGGCTTGCTGAGGTGCACACAGCAAGAGGCGAGAGCGGCGACTGGTGAGTACGCCAATTTTTGACTAGCGGAGGCTAGAAGGAGAGAGATGGGTGCCATACGACTCACTATAGGGTCTCTTGTTCATGAGTCATGGATGATGGGTGCCTAATACGCTCACTATAGGGTCTCTTGTTCATGAGTCATGGA
Top 7 sequences from our data
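
To make the first goal concrete, here is a minimal sketch of what counting 100% identical sequences and listing the top 20 could look like in Python. It is only an illustration under some assumptions: the input is a standard 4-line-per-record FASTQ file (optionally gzipped), the function names and the file name in the usage comment are hypothetical, and it does not address merging R1 and R2, QC, or alignment. It also treats a read and its reverse complement as two different sequences.

```python
import gzip
from collections import Counter
from itertools import islice


def read_fastq_sequences(path):
    """Yield the sequence line of each record in a 4-line FASTQ file."""
    # Assumption: records are exactly 4 lines (header, sequence, '+', quality).
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        # Lines 2, 6, 10, ... (0-based index 1, 5, 9, ...) hold the sequences.
        for line in islice(handle, 1, None, 4):
            seq = line.strip()
            if seq:
                yield seq


def top_identical_clusters(path, n=20):
    """Return the n most frequent exact sequences as (sequence, hits) pairs."""
    counts = Counter(read_fastq_sequences(path))
    return counts.most_common(n)


# Hypothetical usage (the file name is a placeholder):
# for seq, hits in top_identical_clusters("sample_R1.fastq.gz"):
#     print(hits, seq, sep="\t")
```

Each "cluster" here is simply one distinct sequence string and the number of reads that match it exactly, so reads that differ by even one base or in length end up in separate clusters.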