Question

Preprocessing the genomic data

6.9 years ago

DL ▴ 50

Hello everyone. I have mate-pair sequencing data with 150 bp reads. I checked the read quality with FastQC and it looks good.

#Base   Mean    Median  Lower Quartile  Upper Quartile  10th Percentile 90th Percentile
1   33.93390648005911   35.0    35.0    35.0    31.0    35.0
2   33.95907029321767   35.0    35.0    35.0    31.0    35.0
3   33.98077552343845   35.0    35.0    35.0    31.0    35.0
4   33.98753499777637   35.0    35.0    35.0    31.0    35.0
5   33.984543748770804  35.0    35.0    35.0    31.0    35.0
6   38.591631930233014  40.0    39.0    40.0    36.0    40.0
7   38.56701433221178   40.0    39.0    40.0    36.0    40.0
8   38.55072589489948   40.0    39.0    40.0    36.0    40.0
9   38.53976502189077   40.0    39.0    40.0    36.0    40.0
10-14   38.50712970571924   40.0    39.0    40.0    36.0    40.0
15-19   38.4177180935979    40.0    39.0    40.0    36.0    40.0
20-24   38.38606423538208   40.0    39.0    40.0    36.0    40.0
25-29   38.29027050300057   40.0    39.0    40.0    35.2    40.0
30-34   38.20144945809037   40.0    39.0    40.0    35.0    40.0
35-39   38.11811796390104   40.0    39.0    40.0    34.0    40.0
40-44   38.0175315710861    40.0    39.0    40.0    34.0    40.0
45-49   37.935398773549636  40.0    39.0    40.0    34.0    40.0
50-54   37.84706168928388   40.0    39.0    40.0    34.0    40.0
55-59   37.74265376076599   40.0    38.4    40.0    34.0    40.0
60-64   37.64332664650341   40.0    38.0    40.0    34.0    40.0
65-69   37.52005951881047   39.8    38.0    40.0    34.0    40.0
70-74   37.38430764378722   39.0    38.0    40.0    31.6    40.0
75-79   37.26014222004879   39.0    38.0    40.0    31.0    40.0
80-84   37.12272526413333   39.0    37.2    40.0    31.0    40.0
85-89   36.964182674129546  39.0    37.0    40.0    30.2    40.0
90-94   36.81203894449435   39.0    37.0    40.0    30.0    40.0
95-99   36.65255631448299   39.0    36.4    40.0    29.2    40.0
100-104 35.66682387627061   38.0    34.6    39.2    27.0    39.6
105-109 36.83309679602688   39.0    36.8    40.0    30.0    40.0
110-114 36.80732298238993   39.0    37.0    40.0    30.0    40.0
115-119 36.60025361051608   39.0    36.4    40.0    28.2    40.0
120-124 36.388457503902416  39.0    36.0    40.0    27.0    40.0
125-129 36.12575631519171   39.0    36.0    40.0    27.0    40.0
130-134 35.90247099450205   39.0    36.0    40.0    27.0    40.0
135-139 35.6381126201069    39.0    35.2    40.0    27.0    40.0
140-144 35.37785752835347   39.0    34.6    40.0    26.8    40.0
145-149 35.04320800576903   39.0    34.0    40.0    19.2    40.0
150-151 33.066291537988604  37.0    30.5    39.5    16.0    40.0

Now I want to assemble the data, but I am confused because I do not know whether the adapters have been removed from this dataset. Can anyone tell me the steps for preprocessing the sequencing data? I would also like to know whether the reads in the R1 and R2 files will have the same length after preprocessing.

Thanks

genome Assembly next-gen sequencing • 2.3k views

ADD COMMENT • link updated 6.9 years ago by GenoMax 141k • written 6.9 years ago by DL ▴ 50

Here is the complete list of adapters commonly used on Illumina platforms: illumina adapters. You can then trim them with any of several Linux command-line tools, such as Biopieces.

ADD REPLY • link 6.9 years ago by Buffo ★ 2.4k

OK, I will check it. Can you please tell me the steps for preprocessing genomic data? Thanks

ADD REPLY • link 6.9 years ago by DL ▴ 50

It depends on what you are looking for. De novo assembly? Reference-guided assembly? Finding SNPs? Something else?

ADD REPLY • link 6.9 years ago by Buffo ★ 2.4k

I am looking for de novo assembly and SNP analysis.

ADD REPLY • link 6.9 years ago by DL ▴ 50

.- Adapter, length, and quality trimming (prinseq-lite)
.- QC visualization (FastQC)
.- Assembly (SPAdes)
.- Assembly stats (Biopieces)
.- Annotation (online tools are available, or you can write your own scripts)
.- Whatever else you want: comparison with related species, etc.
.- For SNPs, I think you can use VCFtools

ADD REPLY • link 6.9 years ago by Buffo ★ 2.4k

I have mate pair sequencing data

Mate-pair =/= paired-end sequencing. Mate-pair data requires different handling.

Just wanted to make sure.

Use bbduk.sh from the BBMap suite for trimming. Adapter sequences for the most common commercial kits are included in the adapters.fa file in the resources directory of the software bundle. See: How to use BBduk.
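
For reference, here is a minimal Python sketch that only builds a paired-end bbduk.sh adapter-trimming command. The parameter values (ktrim=r k=23 mink=11 hdist=1 tpe tbo) follow the adapter-trimming example in the BBDuk usage guide. The file names and the adapters.fa path are placeholders, and nothing here is tuned for mate-pair libraries, so treat it as a starting point rather than a recipe.

```python
import shlex

def build_bbduk_command(r1, r2, out1, out2, adapters="adapters.fa"):
    """Return an argument list for paired adapter trimming with bbduk.sh.

    All paths are placeholders; `adapters` should point to the adapters.fa
    file in the resources directory of your BBMap installation.
    """
    return [
        "bbduk.sh",
        f"in1={r1}", f"in2={r2}",        # R1/R2 are processed together
        f"out1={out1}", f"out2={out2}",
        f"ref={adapters}",
        "ktrim=r",   # trim the adapter match and everything to its right (3' side)
        "k=23",      # k-mer length used for matching
        "mink=11",   # allow shorter k-mers at the read ends
        "hdist=1",   # allow one mismatch
        "tpe",       # trim both reads of a pair to the same length
        "tbo",       # also trim adapters detected from pair overlap
    ]

cmd = build_bbduk_command(
    "sample_R1.fastq.gz", "sample_R2.fastq.gz",
    "clean_R1.fastq.gz", "clean_R2.fastq.gz",
)
print(shlex.join(cmd))
# To actually run it (requires BBMap on your PATH):
#   import subprocess; subprocess.run(cmd, check=True)
```

Printing the command first lets you check the paths before anything is executed.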

i want to know also after the preprocessing length of reads are same in R1 or R2 file ???

No, the reads do not need to be (and may not be) the same length after they are scanned and trimmed. You do want to trim the R1/R2 reads together, since the order of the reads in the two files is important. If you lose a read from one file, its mate needs to be removed from the other. bbduk is paired-end aware and will take care of this for you.
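
To illustrate what "paired-end aware" means, here is a small Python sketch (not how bbduk is implemented). It walks the R1 and R2 records in lockstep and keeps a pair only when both mates pass a length cutoff, so the two outputs stay in the same order. The function names and the `min_len` cutoff are made up for the example, and plain 4-line FASTQ records are assumed.

```python
from itertools import islice

def read_fastq(lines):
    """Yield (header, sequence, quality) from an iterable of FASTQ lines.

    Assumes plain 4-line records (no wrapped sequences).
    """
    it = iter(lines)
    while True:
        record = list(islice(it, 4))
        if len(record) < 4:
            return
        header, seq, _plus, qual = (x.rstrip("\n") for x in record)
        yield header, seq, qual

def filter_pairs(r1_lines, r2_lines, min_len=50):
    """Keep a pair only if BOTH mates are at least min_len bases long."""
    kept1, kept2 = [], []
    for rec1, rec2 in zip(read_fastq(r1_lines), read_fastq(r2_lines)):
        if len(rec1[1]) >= min_len and len(rec2[1]) >= min_len:
            kept1.append(rec1)
            kept2.append(rec2)
    return kept1, kept2

# Tiny in-memory example: read2's R1 mate is too short, so the whole pair goes.
r1 = ["@read1/1", "ACGTACGT", "+", "IIIIIIII",
      "@read2/1", "ACG", "+", "III"]
r2 = ["@read1/2", "TTGCA", "+", "IIIII",
      "@read2/2", "TTGCATGCA", "+", "IIIIIIIII"]

kept1, kept2 = filter_pairs(r1, r2, min_len=5)
print([rec[0] for rec in kept1], [rec[0] for rec in kept2])
```

Note that the surviving mates have different lengths (8 and 5 bases), which is fine. What matters is that record N in one file is still the mate of record N in the other.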

ADD REPLY • link 6.9 years ago by GenoMax 141k

Thank you for your reply. Here is the job ID from PRINSEQ: 31343934383336313632. Could you please look at this data and tell me whether it contains adapter sequences? I am surprised by the results of preprocessing the data with different tools: every time, I find that every base of the reads has good quality and that there are no tag sequences or duplication, so I do not understand what I should do.

Thanks

ADD REPLY • link 6.9 years ago by DL ▴ 50

Check whether your reads are all the same length as the sequencing run. If they are not, there is a good chance that they have been pre-trimmed. If your reads have already been scanned and trimmed, you can move on to the next step in the analysis.
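
A quick way to check this is to tabulate the read lengths yourself. The sketch below is a minimal example, assuming plain 4-line FASTQ records. The helper names are made up, and `open_fastq` is only a convenience for reading gzipped or uncompressed files.

```python
from collections import Counter
import gzip

def open_fastq(path):
    """Open a FASTQ file as text, transparently handling .gz files."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path)

def length_histogram(lines):
    """Count read lengths; the sequence is the 2nd line of each 4-line record."""
    counts = Counter()
    for i, line in enumerate(lines):
        if i % 4 == 1:
            counts[len(line.strip())] += 1
    return counts

# In-memory example; for a real file use:
#   with open_fastq("sample_R1.fastq.gz") as fh:
#       hist = length_histogram(fh)
example = ["@a", "ACGT", "+", "IIII",
           "@b", "ACG", "+", "III",
           "@c", "ACGT", "+", "IIII"]
hist = length_histogram(example)
for length, n in sorted(hist.items()):
    print(length, n)
```

A single length equal to the run length suggests untrimmed reads. A spread of shorter lengths suggests the data were trimmed before you received them.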

Tag sequences are relocated to the fastq header when Illumina reads are demultiplexed.

ADD REPLY • link 6.9 years ago by GenoMax 141k

Yes, all reads have the same length (150 bp), so I think the adapters have not been removed from the sequencing data. So the first step should be to remove the adapter sequences, followed by quality filtering. Am I right?

ADD REPLY • link 6.9 years ago by DL ▴ 50

Your data do not necessarily have adapter contamination (for example, if your libraries are exceptionally well made).

You will only see adapter sequence at the 3'-end of reads if (some of) your library inserts happen to be shorter than the length of sequencing. Can you run bbduk.sh based on the link I posted in my first response above?
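
To make the read-through idea concrete, here is a rough Python sketch that looks for an adapter, or a partial adapter at the 3'-end, in a read. It uses exact matching only, whereas real trimmers also tolerate mismatches. The adapter shown is the start of the common Illumina TruSeq adapter and is used purely as an example. The right sequence depends on your library kit, and `min_overlap` is an arbitrary choice.

```python
ADAPTER = "AGATCGGAAGAGC"  # example only; use the adapter for your own kit

def adapter_start(read, adapter, min_overlap=8):
    """Return the 0-based position where the adapter begins, or -1.

    First looks for the full adapter anywhere in the read, then for a
    prefix of the adapter (at least min_overlap bases) at the 3'-end.
    """
    pos = read.find(adapter)
    if pos != -1:
        return pos
    longest = min(len(adapter), len(read)) - 1
    for k in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return len(read) - k
    return -1

# A 20 bp insert sequenced with a longer read runs into the adapter at position 20.
insert = "ACGT" * 5
short_insert_read = insert + ADAPTER + "AAAA"
# A read that ends after only 10 adapter bases still shows a partial match.
partial_read = "ACGTACGTACGTACG" + ADAPTER[:10]
# An insert longer than the read contains no adapter at all.
long_insert_read = "ACGT" * 10

print(adapter_start(short_insert_read, ADAPTER))
print(adapter_start(partial_read, ADAPTER))
print(adapter_start(long_insert_read, ADAPTER))
```

The position returned for a read-through read is the insert length, which is why adapters only appear when the insert is shorter than the read.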

ADD REPLY • link 6.9 years ago by GenoMax 141k

score 0 · Answer 1 · 2017-05-12

6.9 years ago

BioinfGuru ★ 1.6k

Run FastQC, then Trimmomatic, then FastQC again. If all has gone well, you can start aligning to a reference genome.

ADD COMMENT • link 6.9 years ago by BioinfGuru ★ 1.6k

But I don't know the adapter sequence, so what should I do?

ADD REPLY • link 6.9 years ago by DL ▴ 50

The FastQC report will tell you which adapters are in your files.

ADD REPLY • link 6.9 years ago by BioinfGuru ★ 1.6k

I checked the FastQC results and found no adapters. But how is it possible that all the reads are 150 bp long, and can they be used for assembly as they are?

ADD REPLY • link 6.9 years ago by DL ▴ 50

Also look at the "Overrepresented sequences" section of the FastQC report. If it is blank, you are good to go. If there is a sequence in there, it is likely to be an adapter and should be removed by your trimmer.
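
If you want to see roughly what that section is doing, here is a simplified Python sketch. It counts exact duplicate sequences and reports any that exceed a fraction of the total. The 0.1% default mirrors the FastQC warning threshold, but FastQC itself only tracks a subset of reads and truncates long ones, so this is an approximation and not a reimplementation.

```python
from collections import Counter

def overrepresented(seqs, min_fraction=0.001):
    """Return (sequence, fraction) for sequences above min_fraction of all reads."""
    counts = Counter(seqs)
    total = sum(counts.values())
    return [
        (seq, n / total)
        for seq, n in counts.most_common()
        if n / total > min_fraction
    ]

# Toy example with a high cutoff so that a 10-read list is meaningful.
seqs = ["GATTACA"] * 3 + [
    "ACGTACG", "TTGACCA", "CCATGGA", "GGCATTC",
    "TACGGAT", "CATTGCA", "AGGCTTA",
]
hits = overrepresented(seqs, min_fraction=0.2)
for seq, frac in hits:
    print(seq, f"{frac:.1%}")
```

An empty result corresponds to a blank section in the report. Anything that does show up is worth comparing against the known adapter sequences.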

ADD REPLY • link 6.9 years ago by BioinfGuru ★ 1.6k

Unless you post your FastQC results there is not much we can do to further help. It is possible that your data is of good quality and you are worrying for no reason.

ADD REPLY • link 6.9 years ago by GenoMax 141k

With no over represented sequences and no adapters showing up in the fastqc report, am I good to go? Or do I still need to check for adapters?

ADD REPLY • link 6.6 years ago by deepti1rao ▴ 50

Please post the FastQC results. Better still, if you have multiple samples, use MultiQC.

ADD REPLY • link 6.6 years ago by BioinfGuru ★ 1.6k