Question: Bash to match and store numerical prefix to update files
0
gravatar for bioguy24
2.2 years ago by
bioguy24180
Chicago
bioguy24180 wrote:

In the bash below the unique headers of each vcf.gz are stored in a text file with the same name. There can be multiple vcf.gz in a directory, usually 3 files, that I need to fix the header in each file before further processing it. My question is how can I match each text file (using only the numerical prefix) with its vcf.gz and pass the stored variables of each to the reheader code (# edit the header) ? Thank you :). I apologize for the long post just trying to include all the details.

The steps for the bash are below:
 1. vcf.gz created (#gunzip files) - works as expected
 2. unique undefined annotations for each vcf.gz - creates the _header.txt file but doesn't strip off the prefix
 3. match and store vcf.gz as $file1 and .txt as $file2
 4. run the bcftools reheader
 5. repeat for other files

There are always 3 vcf.gz files in the following format below in the directory:

 Files in /home/cmccabe/Desktop/NGS/test
 16-0000_File-A_variant_strandbias_readcount.vcf.gz
 16-0002_File-A_variant_strandbias_readcount.vcf.gz
 16-0005_File-A_variant_strandbias_readcount.vcf.gz

The numeric prefix of each is always 7 characters xx-xxxx and unique. The portion of code in the portion of code below up until the # match files uses each vcf.gz file to create a corresponding .txt file with only the unique numeric prefix followed by _header.

Files in /home/cmccabe/Desktop/NGS/test (if code works as expected)

16-0000_File-A_variant_strandbias_readcount.vcf.gz
16-0002_File-B_variant_strandbias_readcount.vcf.gz
16-0005_File-A_variant_strandbias_readcount.vcf.gz
16-0000_header.txt
16-0002_header.txt
16-0005_header.txt

I am trying to match the 16-0000.vcf.gz with the 16-0000.txt and read 16-0000.vcf.gz into $file1 and 16-0000.txt into $file2.

These two variables $file1 and $file2 would be passed to the reheader command (# edit the header), below which is part of a loop.

After the $file1 and $file2 are processed for 16-0000, the process is repeated for the remaining two files, 16-0002 and 16-0005.

There is only one directory that contains the files and that is /home/cmccabe/Desktop/NGS/test. The numerical prefixes are not being stripped off and used to perform the match between files nor are the variables being passed to reheader. I hope this is a start, but there probably is a better way. Thank you :).

    #!/bin/bash

# gunzip files
logfile=/home/cmccabe/Desktop/NGS/test/process.log
for f in /home/cmccabe/Desktop/NGS/test/*.vcf ; do
 echo "Start vcf.gz creation: $(date) - file: $f"
 bname=`basename $f`
 gzip $f
 echo "End vcf.gz creation: $(date) - file: $f"
done >> "$logfile"

# find undefined annotations
logfile=/home/cmccabe/Desktop/NGS/test/process.log
for f in /home/cmccabe/Desktop/NGS/test/*.vcf.gz ; do
 echo "Start vcf missing header creation: $(date) - file: $f"
 bname=`basename $f`
 pref=${bname%%.vcf.gz}
 bcftools view -h $f > /home/cmccabe/Desktop/NGS/test/${pref}_header.txt
 echo "End missing header creation: $(date) - file: $f"
done >> "$logfile"

# match files
IAm=${0##*/}

InDir1='/home/cmccabe/Desktop/NGS/test'
InDir2='/home/cmccabe/Desktop/NGS/test'
OutDir='/home/cmccabe/Desktop/NGS/test'

cd "$InDir1"
for file1 in *.txt
do  # Grab file prefix.
p=${file1%%_*}

# Find matching file2.
file2=$(printf '%s' "$InDir2/$p"_*.vcf)
if [ ! -f "$file2" ]
then    printf '%s: No single file matching %s found.\n' "$IAm" \
        "$file1" >&2
    continue
fi

# store matches
out=${file1##*/} && ${file2##*/}
done

# edit the header
logfile=/home/cmccabe/Desktop/NGS/test/process.log
for f in /home/cmccabe/Desktop/NGS/test/*.vcf.gz ; do
 echo "Start vcf edit header creation: $(date) - file: $f"
 bname=`basename $f`
 pref=${bname%%.vcf.gz}
  bcftools reheader -h $file1 $file2 > ${pref}_fixed.vcf.gz
 echo "End edit header creation: $(date) - file: $f"
done >> "$logfile"
bash ngs • 847 views
ADD COMMENTlink modified 2.2 years ago • written 2.2 years ago by bioguy24180

Hi cmccabe! Were you able to solve this yourself? If not, what is your question exactly? How we can help you?

ADD REPLYlink written 2.1 years ago by Petr Ponomarenko2.6k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1851 users visited in the last hour