Variant Annotation
To upload a vcf file, first it must be annotated using a custom perl pipeline.
Docker
The following documentation is specific to the Docker version of EVAdb. When using the bare-metal installation your mileage may vary, the general steps remain the same however.
VEP
It is intended to switch from the custom annotation pipeline to VEP. This documentation will be update accordingly.
Annotation is performed at the same time as the actual data upload. For this,
we use a set of intermittent annotation and insert steps. Each step performs
annotation for things like genes and transcripts based on data from the
annotation tables present in the database. For ease of use, the annotation
container uses the main script externalPipelineImport.pl
as entrypoint
interface. To start the import process, the script needs access to a vcf file,
the sample id of the sample in the database and the settings name
(hg19_plus
).
References with/without chr tags
Depending on the Reference that you use for variant calling and alignment,
the contigs will be either have the chr
prefix (e.g. chr1
), or not
(e.g. 1
). EVAdb uses the UCSC versions of all contigs, so the prefix
must be present.
Data Paths
The paths to your data most likely differ between the container and the
outside world. Make sure to adjust the paths for your mountpoint (
the DATA_DIR
configuration variable). Data is mounted to /data
inside the container.
To import a single or multi-sample vcf file, the following command line is sufficient.
docker-compose run annotation -vcf <VCF_FILE> \
-sample "<SAMPLE>" \
-se hg19_plus
Inside the container
If you want to open a shell inside the annotation container first, the command can be run as:
docker-compose run annotation bash # or the corresponding docker exec
perl /pipeline/externalPipelineImport.pl -vcf <VCF_FILE> \
-sample "<SAMPLE>" \
-se hg19_plus
The externalPipelineImport.pl
draws most of its runtime information off the
current.config.xml
file. If you are running via docker, all necessary
configuration parameters are set by the src/make_annotation.sh
script.
If you do run on bare-metal make sure to set the paths in this file such
that tool directories and data locations are what is required by the tool.
Data Locality
With the current iteration of EVAdb and the Docker images all data must be local to the EVAdb server. As noticed in the configuration part, all data directories are entered as docker volume and put through to the container. As such, the data paths differ between container and host. Nevertheless, it is not currently possible to upload samples or data from remote hosts to the machine in excess of what is possible through the use of the web interface.
We recommend a data partition or disk with enough storage for your NGS
experiments to host this data. This brings two advantages, first you spare
the database disk from additional stress (it will be under heavy load on sample
import) and additionally you can use cheaper mass storage media to host the
bulk (e.g. .bam
or .fq.gz
) of your data.
Post-Import
After variants have been imported into the database they can be queried using the standard search tools (e.g. autosomal dominant, autosomal recessive). However, a big part of the filtering capability is derived from in-house frequencies built into the database. Since the system does not compute these frequencies after every import, this has to be done manually.
To update variant frequencies in the database, i.e. count the number of occurences of a snp per disease group, use the following snippet.
# Execute from project root
# Drop to the annotation container
docker-compose run annotation bash
# Perform counting of snv and inserting into the proper tables
/pipeline/doAfterImport.pl -se hg19_plus -s 40
Cron Job
Since this job has to be run regularly, it is also a good fit to run as a
cron job. You can add it as script via crontab -e
to run once every week,
or more often.
Post Import init-Container
In principle, the process above can also be started from the
init-Container. If you intend to do so, make sure to modify the
config.xml
file accordingly.
Scripts
importVCF.sh
To upload many samples in quick succession, we use the following script.
Breakage - Adjust before use
This script is provided as an example for a complete import process. As it is written with the specific configuration of our site in mind it will most likely not work without changes on other systems.
#! /bin/bash
# vim: ts=4 sw=4 expandtab
EVADB_DIR="/home/evadb"
INPUT_DIR="/gluster/gluster01/share/exome"
function usage {
echo -e "importVCF.sh\n\nImport all VCFs of a given Flow Cell into the EVAdb running in $EVADB_DIR as\ndocker version. Files are searched for in $INPUT_DIR.\n\nParameters:\n\n\t-f\t\tFlow Cell to import.\n\t-h\t\tPrint help."
}
PW_FILE="/home/evadb/.env"
ROOT_PW=$(grep MYSQL_ROOT_PASSWORD $PW_FILE | cut -d"=" -f2)
while getopts "f:h" arg; do
case $arg in
f)
FLOW_CELL="${OPTARG}"
;;
h)
usage
exit 0
;;
esac
done
if [ -z "$FLOW_CELL" ]; then
echo -e "Error: Please supply a flow cell id."
usage
exit 1
fi
echo -e "Searching $INPUT_DIR/$FLOW_CELL/"
ESC_INPUT=$(echo "${INPUT_DIR}" | sed -e 's/[](&|$|\|{|}).*[\^]/\\&/g')
cd $EVADB_DIR
for f in $(find "$INPUT_DIR/$FLOW_CELL/" -name "DE*vcf.gz")
do
BASENAME=$(basename $f)
DIRNAME=$(dirname $f)
SAMPLE=${BASENAME%%.*}
VCF="$DIRNAME/$SAMPLE.chr.vcf"
QUERY="select * from sample where name LIKE '%$SAMPLE%' or pedigree LIKE '%$SAMPLE%' or foreignid LIKE '%$SAMPLE%';"
SAMPLE_IN_DB=$(docker-compose exec db mysql -u root -p$ROOT_PW -e "$QUERY" exomehg19 | grep "$SAMPLE")
if [[ -n "$SAMPLE_IN_DB" ]];
then
echo "Annotating and Importing $SAMPLE.."
echo -e "\tAdding chr prefix..."
zcat $f | awk '{ if ( $1 ~ "#" ) { print $0 } else { print "chr"$0 } }' > $VCF
echo -e "\tImporting $SAMPLE into evadb..."
docker-compose run annotation -vcf "${VCF/$ESC_INPUT/\/data}" -sample "$SAMPLE" -se hg19_plus
echo -e "\tCleanup..."
rm $VCF
else
echo -e "Could not find $SAMPLE in database. Have you create the sample using \"Import external samples\"?"
fi
done
cd -