Step 1

Question

comparing multiple call files to a baseline file with bcftools sec

0

Entering edit mode

6.4 years ago

drowl1 ▴ 30

Hello all,

I am using bcftools i sec command to compare my calls against a variant baseline file but I have multiple call files(>400) which i want to individually compare against the baseline.Is there a way to do this without having to manually run the command each time for every call file? I am a newbie in programming and bioinformatics and would appreciate a point in the right direction on this one.

Thank you!

next-gen software error • 2.4k views

ADD COMMENT • link 6.4 years ago by drowl1 ▴ 30

2

Entering edit mode

It depends on how your data is organized. Are all your files on the same folder? Are they scattered throughout several subfolders under a common folder?

If all your files are on a single folder, the simplest would be a bash loop (I am using a mock command as I don't know the command you are executing):

for i in *.bcf
do
  bcftools whatever -i $i > $i.out
done

In case your files are scattered, you may have to use find with -exec. When you feel more confident (and if you have the computing resources) you can move to GNU parallel.

ADD REPLY • link 6.4 years ago by h.mon 35k

0

Entering edit mode

Thank you.The for loop didn't work for me with this particular bcftools isec option but i learned about find with -exec options!

ADD REPLY • link 6.4 years ago by drowl1 ▴ 30

2

Entering edit mode

6.4 years ago

Pierre Lindenbaum 161k

Using a Makefile. Name the following file "Makefile", there is a 'tab' before bcftools.

BCFS=$(shell find . -type f -name "*.bcf")
REFBCF=your.bcf
%.txt: %.bcf
     bcftools isec ${REFBCF} $< --output $@
all:$(patsubst %.bcf,%.txt,${BCFS}) ${REFBCF}

invoke this with

$ make

to run 10 jobs in parallel:

$ make -j 10

ADD COMMENT • link 6.4 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

Thank you!New stuff learned !

ADD REPLY • link 6.4 years ago by drowl1 ▴ 30

2

Entering edit mode

6.4 years ago

cpad0112 21k

ls *.bcf | parallel 'bcftools isec params baseline.bcf {} -p {.}'

You may get one extra because of baseline.bcf

ADD COMMENT • link 6.4 years ago by cpad0112 21k

0

Entering edit mode

Thank you so much!I also tried with this and it worked!

ADD REPLY • link 6.4 years ago by drowl1 ▴ 30

0

Entering edit mode

6.4 years ago

drowl1 ▴ 30

Thank you Bioinformaticians for your swift and appropriate response to my SOS call!Continue to educate and sharing the knowledge!

ADD COMMENT • link 6.4 years ago by drowl1 ▴ 30

Kevin Blighe · Accepted Answer · 2017-12-01

2

Entering edit mode

6.4 years ago

Kevin Blighe 87k

Step 1

You should first get your BCFs in a file as a listing, like this:

find /home/drowl1/Project1/ -name "*.bcf" > BCF.list
cat BCF.list
/home/drowl1/Project1/Sam1.bcf
/home/drowl1/Project1/Sam2.bcf
/home/drowl1/Project1/Sam3.bcf
/home/drowl1/Project1/Sam4.bcf
/home/drowl1/Project1/Sam5.bcf
...

You can also use ls but here I use find

Step 2

Then, create a loop that will go through each file in the listing and perform the analysis with BCFtools:

paste BCF.list | while read BCF ;
do
        echo "--Processing ""${BCF}""..."

        #Create results filename
        resultsfile=$(echo "${BCF}" | sed 's/.bcf$/.txt/g') ;
        echo -e "--Results file will be ""${resultsfile}\n" ;

        bcftools isec [options] Baseline.bcf "${BCF}" --output "${resultsfile}" ;
done ;

You can probably follow what's going on here. The results will be output to a file with the same name for each BCF, but with the .txt extension. You can change this (with the inner sed command) to anything that you want.

Also be careful because, when you list the BCFs, it may include your baseline reference (in my example above, I've named this Baseline.bcf

This should work in bash and sh. I have not tested other shells.

Kevin

ADD COMMENT • link 4.0 years ago by Kevin Blighe 87k

0

Entering edit mode

Thank you Kevin! The loop you provided here is really easy to follow.

One last follow-up question though. bcftools isec command appears to require a -p argument which provides a directory for the evaluation output (four files for intersection,complement and union of the baseline file and the call file) per every run(i.e per every input call file).

How would I go around this with the loop?

(Newbie apologies if this happens to be an easy question.I am learning)

Thanks!

ADD REPLY • link 6.4 years ago by drowl1 ▴ 30

1

Entering edit mode

If you want to output the files to different directories, then use:

paste BCF.list | while read BCF ;
do
        echo "Processing ""${BCF}""..."

        #Create results filename
        resultsfile=$(echo "${BCF}" | sed 's/.bcf$//g') ;
        echo -e "Results file will be ""${resultsfile}\n" ;

        #Add security to ensure that your input files are not overwritten
        if [ "${BCF}" == "${resultsfile}" ]
        then
             echo "Error! - input filename is the same as output file/directory!"
             exit 1 ;
        fi

        bcftools isec Baseline.bcf "${BCF}" -p "${resultsfile}" ;
done ;

In my example, if the input files are:

/home/drowl1/Project1/Sam1.bcf

/home/drowl1/Project1/Sam2.bcf

Then the output directries will be:

/home/drowl1/Project1/Sam1

/home/drowl1/Project1/Sam2

ADD REPLY • link 6.4 years ago by Kevin Blighe 87k

1

Entering edit mode

I updated my initial response to your comment and believe that it now addresses what you were aiming to do.

ADD REPLY • link 6.4 years ago by Kevin Blighe 87k

0

Entering edit mode

Hi Kevin, Apologies for the late reply,I'm back to work now and I have not yet resolved this issue.The loop you provided above looks like this on my end:

paste VCF.list | while read VCF ;
do
        echo "Processing “”${VCF}””…”

        #Create results filename
        resultsfile=$(echo “${VCF}” | sed ’s/.vcf$//g') ;
        echo -e "Results file will be ""${resultsfile}\n" ;

        #Add security to ensure that your input files are not overwritten
        if [ “${VCF}” == "${resultsfile}" ]
        then
             echo "Error! - input filename is the same as output file/directory!"
             exit 1 ;
        fi

        bcftools isec -c all NHGRI1.danRer10.variant.withID.Tab5.2F.vcf.gz  “${VCF}” -p "${resultsfile}" ;
done ;

and my VCF.list has multiple files like this

.............................................................. ./P7354_1382_S382_pcsk9T1_Aligned.out.sorted.bam.vcf.gz.withIDS.gz ./P7354_1383_S383_pcsk9T1_Aligned.out.sorted.bam.vcf.gz.withIDS.gz ./P7354_1384_S384_pcsk9T1_Aligned.out.sorted.bam.vcf.gz.withIDS.gz

......... and so on.

I am running an interactive session on an ssh in the same working directory as the input VCF files .

The loop seems to be running but not quite successfully ,looking like this:

paste VCF.list | while read VCF ; do         echo Processing “”./P7354_1383_S383_pcsk9T1_Aligned.out.sorted.bam.vcf.gz.withIDS.gz””…”
sed: -e expression #1, char 1: unknown command: `?'
Results file will be 

[E::hts_open_format] Failed to open file “./P7354_1383_S383_pcsk9T1_Aligned.out.sorted.bam.vcf.gz.withIDS.gz”
Failed to open “./P7354_1383_S383_pcsk9T1_Aligned.out.sorted.bam.vcf.gz.withIDS.gz”: No such file or directory
sed: -e expression #1, char 1: unknown command: `?'
Processing “”./P7354_1384_S384_pcsk9T1_Aligned.out.sorted.bam.vcf.gz.withIDS.gz””…”

        #Create results filename
        resultsfile= ;
        echo -e Results file will be n ;

        #Add security to ensure that your input files are not overwritten
        if [ “./P7354_1384_S384_pcsk9T1_Aligned.out.sorted.bam.vcf.gz.withIDS.gz” ==  ]
        then
             exit 1 ;
        fi

        bcftools isec -c all NHGRI1.danRer10.variant.withID.Tab5.2F.vcf.gz  “./P7354_1384_S384_pcsk9T1_Aligned.out.sorted.bam.vcf.gz.withIDS.gz” -p  ;
done ;


paste VCF.list | while read VCF ; do         echo Processing “”./P7354_1384_S384_pcsk9T1_Aligned.out.sorted.bam.vcf.gz.withIDS.gz””…”
sed: -e expression #1, char 1: unknown command: `?'
Results file will be 

[E::hts_open_format] Failed to open file “./P7354_1384_S384_pcsk9T1_Aligned.out.sorted.bam.vcf.gz.withIDS.gz”
Failed to open “./P7354_1384_S384_pcsk9T1_Aligned.out.sorted.bam.vcf.gz.withIDS.gz”: No such file or directory

And it goes on like this through the whole VCF.list.

I have sampled the files individually and the program works fine when I supply it with the baseline file and one of the call vcf files,so the input call vcf files are okay.

Are you able to see and point out what I could be missing here? I'll appreciate it much.

Some additional information for you:- Since the program compares the VCF file to the baseline file and outputs 4 files(0000.vcf,0001.vcf,0002.vcf and 0003.vcf),my idea was to have it compare all the call VCF files singly and ; either give me these 4 output files for every input call VCF file evaluated(without re-writting them again and again as it loops to the next input call VCF file) , or compare all of them singly but give me 1 combined output such that I get the four output VCF files containing summary info of the intersection ,complement and union of all the call files and the baseline file.

Kindly let me know.Thanks again!

ADD REPLY • link updated 6.4 years ago by Kevin Blighe 87k • written 6.4 years ago by drowl1 ▴ 30

0

Entering edit mode

Hi, just to get the first part done first, I'm not sure what your files are. An example if your filenames is:

./P7354_1382_S382_pcsk9T1_Aligned.out.sorted.bam.vcf.gz.withIDS.gz

Is this a gzipped VCF? - is it gzipped twice?

In the script that I provided, it assumes that your files just have the .vcf extension. Note the line:

resultsfile=$(echo “${VCF}” | sed ’s/.vcf$//g') ;

This eliminates the .vcf extension at the end of each file, and the result then becomes the output directory. If we have ./P7354_1382_S382_pcsk9T1_Aligned.out.sorted.bam.vcf, then the output directory will be ./P7354_1382_S382_pcsk9T1_Aligned.out.sorted.bam

So, based on what your file extensions are, and what you want your output directories to be, modify that line, i.e., resultsfile=$(echo “${VCF}” | sed ’s/.vcf$//g'), specifically the .vcf characters. For example, you could use:

resultsfile=$(echo “${VCF}” | sed ’s/_Aligned.out.sorted.bam.vcf.gz.withIDS.gz$//g')

Then your output directories would be:

./P7354_1382_S382_pcsk9T1

./P7354_1383_S383_pcsk9T1

./P7354_1384_S384_pcsk9T1

------------------------

On the second part, not sure exactly what you want to do. Take a look at bcftools merge and bcftool concat to see if either of those is what you want?

In that case, you'd just need to add a single line to my script (after bcftools isec), such as:

bcftools merge -Ov "${resultsfile}"/0000.vcf "${resultsfile}"/0001.vcf "${resultsfile}"/0002.vcf "${resultsfile}"/0003.vcf > "${resultsfile}"/MergedResult.vcf

Kevin

ADD REPLY • link 6.4 years ago by Kevin Blighe 87k

1

Entering edit mode

Thank you so much Kevin.I resolved!

ADD REPLY • link 6.4 years ago by drowl1 ▴ 30