I am trying to figure out the necessary pre-processing steps before using sequencing data retrieved from online databases. I work with metagenomes from the human gut microbiome. I figured out that the three main steps for this type of data are:
1) Identify and mask human reads
2) Remove duplicate reads
3) Trim low quality bases
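To make sure I understand steps 2 and 3 correctly, here is a minimal sketch of their logic on in-memory reads (real pipelines would use dedicated tools such as fastp or Trimmomatic; the quality threshold and example reads below are just assumptions for illustration):

```python
def trim_low_quality(seq, quals, threshold=20):
    """Trim bases from the 3' end while their Phred quality is below threshold."""
    end = len(seq)
    while end > 0 and quals[end - 1] < threshold:
        end -= 1
    return seq[:end], quals[:end]

def deduplicate(reads):
    """Keep only the first occurrence of each identical sequence."""
    seen = set()
    unique = []
    for seq, quals in reads:
        if seq not in seen:
            seen.add(seq)
            unique.append((seq, quals))
    return unique

# Toy reads: (sequence, per-base Phred scores); second read is an exact duplicate.
reads = [
    ("ACGTACGT", [30, 32, 31, 30, 28, 25, 10, 5]),
    ("ACGTACGT", [30, 32, 31, 30, 28, 25, 10, 5]),
    ("TTGGCCAA", [35, 34, 33, 32, 31, 30, 29, 28]),
]

deduped = deduplicate(reads)                        # step 2: 2 reads remain
trimmed = [trim_low_quality(s, q) for s, q in deduped]
print(len(deduped))     # → 2
print(trimmed[0][0])    # → ACGTAC (two low-quality 3' bases trimmed)
```

Step 1 (host read removal) is different in kind: it requires aligning reads against a human reference (e.g. with Bowtie2 or BWA) and discarding the mapped ones, so it is not sketched here.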
Here is an example of a study from which I would like to use data.
I can't figure out at what stage of processing those data are. I believe human-read masking should have been performed already, since it relates to subject privacy and ethics, but I can't find clear information confirming whether this is the case. Are sequencing data deposited in online repositories always already cleared of human reads?
Thank you, Camille