Deconseq to remove human sequences
6.8 years ago

Hello,

I am using DeconSeq to remove human sequences from FASTQ files generated on an Illumina MiSeq. I created the database of human sequences using:

bwa64 index -p hs_ref_GRCh38_p2 -a bwtsw hs_ref_GRCh38_p2_split_PS.fa.fasta > bwa.log 2>&1

Now I don't know how to remove these sequences from my actual paired files; let's call them myfile_1.fq and myfile_2.fq.

Could you give me some hints?

Thank you.

deconseq cleaning • 2.9k views
6.8 years ago
satshil.r ▴ 50
perl deconseq.pl -f myfile_1.fq -dbs hs_ref_GRCh38_p2 -i 90 -c 90 -out_dir <directory>


The -i 90 refers to the identity threshold:

Alignment identity threshold in percentage. The identity is calculated for the part of the query sequence that is aligned to a reference sequence. For example, a query sequence of 100 bp that aligns to a reference sequence over the first 50 bp with 40 matching positions has an identity value of 80%.

The -c 90 refers to the coverage threshold:

Alignment coverage threshold in percentage. The coverage is calculated for the part of the query sequence that is aligned to a reference sequence. For example, a query sequence of 100 bp that aligns to a reference sequence over the first 50 bp with 40 matching positions has a coverage value of 50%.
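As a quick check of the two definitions above, here is a minimal shell sketch that reproduces the worked example (100 bp query, 50 bp aligned, 40 matches). The variable names are made up for illustration, and the comparison against the thresholds assumes that a read is only treated as a hit when both identity and coverage reach the -i and -c values.

```shell
#!/bin/sh
# Worked example from the text: a 100 bp query aligned over its first 50 bp
# with 40 matching positions. All variable names are illustrative only.
query_len=100
aligned_len=50
matches=40

# Identity: matching positions as a percentage of the aligned part.
identity=$(( 100 * matches / aligned_len ))
# Coverage: aligned part as a percentage of the whole query.
coverage=$(( 100 * aligned_len / query_len ))

echo "identity=${identity}% coverage=${coverage}%"

# Assumption: a hit needs BOTH values to reach the thresholds (-i 90 -c 90).
threshold_i=90
threshold_c=90
if [ "$identity" -ge "$threshold_i" ] && [ "$coverage" -ge "$threshold_c" ]; then
    verdict="hit"
else
    verdict="no hit"
fi
echo "verdict: $verdict"
```

With -i 90 -c 90, this example alignment falls short on both counts, so under the assumption above the read would be kept.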

You have to make sure you define your deconseq databases in the configuration file.

hs_ref_GRCh38_p2 => {name => 'hs_ref_GRCh38_p2',
db => 'hs_ref_GRCh38_p2'},


and make sure you define the database location:

use constant DB_DIR => "<DIR_WITH_BWA_DB_OUTPUT>";


Of course you have to adjust the settings, specifically the -c and -i thresholds, to what you see fit.
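Since the command above takes a single input file with -f, one way to handle a paired run, assuming each mate file is simply processed on its own, is sketched below. It is a dry run that only prints the commands. The output directory names are hypothetical, and the -dbs value is assumed to be the key defined in the configuration file (as in the snippet above) rather than a path to the index files.

```shell
#!/bin/sh
# Dry run: print one deconseq command per mate file instead of executing it.
# Only flags that appear in the command above are used.
db_key="hs_ref_GRCh38_p2"   # key from the DeconSeq configuration file, not a path
identity=90
coverage=90

build_cmd() {
    # $1 = input FASTQ file, $2 = output directory (hypothetical naming)
    echo "perl deconseq.pl -f $1 -dbs $db_key -i $identity -c $coverage -out_dir $2"
}

# One command per mate, each writing to its own output directory.
cmds=$(for mate in 1 2; do
    build_cmd "myfile_${mate}.fq" "deconseq_out_${mate}"
done)

echo "$cmds"
```

If the two files are filtered independently like this, reads removed from one file may leave their mates behind in the other, so the pairs would probably need to be re-synchronised afterwards.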


Thank you very much, but it is still a bit beyond me. First, if I have two paired files, why is there only one in the command? Second, which configuration file should I modify? Third, should the database location go in the same config file? Should these modifications be made verbatim? Cheers


I created the database with the human sequences using:

bwa64 index -p hs_ref_GRCh38_p2 -a bwtsw hs_ref_GRCh38_p2_split_PS.fa.fasta > bwa.log 2>&1


This created a series of files that I placed in a subfolder named refChr. The list of files is:

hs_ref_GRCh38_p2.amb hs_ref_GRCh38_p2.pac hs_ref_GRCh38_p2.sa
hs_ref_GRCh38_p2.ann hs_ref_GRCh38_p2.rbwt hs_ref_GRCh38_p2_split.fa
hs_ref_GRCh38_p2.bwt hs_ref_GRCh38_p2.rpac hs_ref_GRCh38_p2_split.fa.log
hs_ref_GRCh38_p2.fa hs_ref_GRCh38_p2.rsa hs_ref_GRCh38_p2_split_PS.fa.fasta


I then ran the following command to use Deconseq:

~$ perl /usr/bin/deconseq.pl -f fu_1.fq -dbs ./refChr/hs_ref_GRCh38_p2 -i 90 -c 90 -out_dir DECONSEQ

But I got the following error:

ERROR: database "./refChr/hs_ref_GRCh38_p2" does not exist in config file.

Exit program.


I tried with '/refChr/...' and 'refChr/...', and also with '...hs_ref_GRCh38_p2.fa' and '...hs_ref_GRCh38_p2.sa', but got the same error.

What would be the correct use of Deconseq with the human library to remove the human contaminants?

Thank you