Generate Feature-Enriched ΔΔG Data
This guide shows how to use DDGWizard to process raw ΔΔG data and output feature-enriched data. It helps users obtain more diverse feature information for their own ΔΔG datasets, facilitating further analysis, feature selection, and machine learning.
1. Prepare a Blast database
A Blast database is required for the program to run. The program will use the path to the Blast database to invoke it and perform sequence alignment.
To construct a Blast database, you first need a fasta file containing a protein sequence database. The sequence database should be large and diverse. You can use your own fasta database file, but we recommend downloading one from the UniRef databases. Our program was tested using the UniRef50 database. If you download the UniRef50 database, you need to unzip it (taking UniRef50 as an example):
$ gzip -d uniref50.fasta.gz
This produces a fasta file. Then use the Blast suite to create a Blast database from that fasta file.
The downloaded blast+ 2.13.0 should be located at the path DDGWizard/bin/ncbi_blast_2_13_0+/. Use the following commands:
$ cd </path/to/DDGWizard/>bin/ncbi_blast_2_13_0+/bin/
$ ./makeblastdb -in </path/to/fasta file> -dbtype prot -out </folder/to/save/Blast_database/><the_name_to_assign_for_Blast_database> -parse_seqids
This step will take some time, depending on the size of the database file and the performance of your computer system.
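The two preparation steps above can also be scripted. The following is a minimal Python sketch, not part of DDGWizard: the helper names `decompress_fasta` and `makeblastdb_command` are hypothetical, and the only makeblastdb options used are the ones shown in the command above. Unlike `gzip -d`, the decompression helper keeps the original .gz file.

```python
import gzip
import shutil
from pathlib import Path


def decompress_fasta(gz_path):
    """Decompress e.g. uniref50.fasta.gz to uniref50.fasta next to it.

    Hypothetical helper; unlike `gzip -d` it leaves the .gz file in place.
    """
    gz_path = Path(gz_path)
    fasta_path = gz_path.with_suffix("")  # strips only the trailing ".gz"
    with gzip.open(gz_path, "rb") as src, open(fasta_path, "wb") as dst:
        shutil.copyfileobj(src, dst)  # streams the data, so large files are fine
    return fasta_path


def makeblastdb_command(ddgwizard_dir, fasta_path, db_folder, db_name):
    """Build the makeblastdb argument list used in this guide."""
    exe = Path(ddgwizard_dir) / "bin" / "ncbi_blast_2_13_0+" / "bin" / "makeblastdb"
    return [
        str(exe),
        "-in", str(fasta_path),
        "-dbtype", "prot",
        "-out", str(Path(db_folder) / db_name),
        "-parse_seqids",
    ]
```

The returned list can be passed to `subprocess.run(command, check=True)` to actually build the database.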
2. Running template
We first provide a template for running DDGWizard's feature calculation pipeline, and then explain each parameter in detail.
You can run the program with:
$ conda activate DDGWizard
$ cd </path/to/DDGWizard/>
$ python Generate_Dataset_Executable.py \
--raw_dataset_path src/Sample.csv \
--db_folder_path </folder/to/save/Blast_database/> \
--db_name <the_name_to_assign_for_Blast_database> \
--if_reversed_data 1 \
--blast_process_num 4 \
--mode whole \
--process_num 4 \
--container_type -
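If you prefer to launch the pipeline from a Python script, the template above can be assembled as an argument list. This is a sketch under stated assumptions: `build_command`, `run_pipeline`, and `TEMPLATE_OPTIONS` are hypothetical names, the Blast database values are placeholders to replace with your own, and the script is assumed to run inside the activated DDGWizard conda environment so that `sys.executable` is the right interpreter.

```python
import subprocess
import sys

# Values mirror the running template; the two Blast entries are placeholders.
TEMPLATE_OPTIONS = {
    "raw_dataset_path": "src/Sample.csv",
    "db_folder_path": "/folder/to/save/Blast_database/",
    "db_name": "my_blast_db",
    "if_reversed_data": 1,
    "blast_process_num": 4,
    "mode": "whole",
    "process_num": 4,
    "container_type": "-",
}


def build_command(options, script="Generate_Dataset_Executable.py"):
    """Turn an options dict into the command line shown in the template."""
    command = [sys.executable, script]
    for name, value in options.items():
        command += [f"--{name}", str(value)]
    return command


def run_pipeline(ddgwizard_dir, options):
    """Run the script with the DDGWizard top-level folder as working directory."""
    return subprocess.run(build_command(options), cwd=ddgwizard_dir, check=True)
```

Passing `cwd=ddgwizard_dir` plays the role of the `cd` step in the template.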
3. Parameter details
Below are the details of the parameters the program uses to generate the complete ΔΔG feature set:
(1). --raw_dataset_path
This parameter is the path to a csv file containing the raw data from which you want to generate the ΔΔG feature set. In the path DDGWizard/src, there is a sample file Sample.csv that you can use directly for testing and as a reference. We list some of the contents of this file here, followed by detailed descriptions of each column's attributes:
| PDB | Amino Acid Substitution | Chain ID | ddG | pH | T |
|---|---|---|---|---|---|
| 1AAR | K6E | A | 0.53 | 5 | 25 |
| 1AAR | K6Q | A | 0.26 | 5 | 25 |
| 1AAR | H68E | A | 0.77 | 5 | 25 |
| … | … | … | … | … | … |
Description of attributes for each column in the table file:
a. PDB: A PDB identifier from the RCSB database. The program uses this identifier to download the PDB file automatically.
b. Amino Acid Substitution: Describes the amino acid substitution caused by the mutation. It consists of the one-letter code of the wild-type amino acid, the sequence number of the mutation site, and the one-letter code of the mutant amino acid. For example, K6Q represents a substitution where the lysine at position 6 of the protein sequence is replaced with glutamine.
c. Chain ID: The protein chain where the mutation site is located.
d. ddG: The ΔΔG value from your own raw dataset. For users with machine learning needs, this value can serve as the regression target. If you only need to generate features, this attribute can be set to any numerical value without affecting the generation of other features.
e. pH: The pH at which the mutation occurs.
f. T: The temperature at which the mutation occurs.
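Before a long run, it can be useful to check that a raw dataset follows the column layout described above. The sketch below is not part of DDGWizard. It assumes the csv header row uses exactly the column names shown in the table, that substitutions use the 20 standard one-letter codes with a plain integer position, and that ddG, pH, and T are numeric; `parse_substitution` and `check_rows` are hypothetical helper names.

```python
import csv
import re

REQUIRED_COLUMNS = ["PDB", "Amino Acid Substitution", "Chain ID", "ddG", "pH", "T"]
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# wild-type letter, site number, mutant letter, e.g. K6Q
SUBSTITUTION_RE = re.compile(rf"^([{AMINO_ACIDS}])(\d+)([{AMINO_ACIDS}])$")


def parse_substitution(text):
    """Split 'K6Q' into ('K', 6, 'Q'); raise ValueError if malformed."""
    match = SUBSTITUTION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"malformed substitution: {text!r}")
    wild_type, position, mutant = match.groups()
    return wild_type, int(position), mutant


def check_rows(handle):
    """Return a list of (line_number, message) for rows that fail the checks."""
    reader = csv.DictReader(handle)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    problems = []
    # Line 1 is the header, so data rows start at line 2.
    for line_number, row in enumerate(reader, start=2):
        try:
            parse_substitution(row["Amino Acid Substitution"] or "")
            for column in ("ddG", "pH", "T"):
                float(row[column] or "")
        except ValueError as exc:
            problems.append((line_number, str(exc)))
    return problems
```

For example, `check_rows(open("src/Sample.csv", newline=""))` would return an empty list for a file that passes these checks.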
(2). --db_folder_path
This parameter is the folder path of the Blast database that you have prepared.
(3). --db_name
This parameter is the name of the Blast database that you have prepared.
(4). --if_reversed_data
This parameter takes a value of 0 or 1. A value of 0 generates features only for the direct mutations, while a value of 1 also generates features for the reverse mutations of the mutations provided.
(5). --blast_process_num
This parameter takes an integer greater than 0 and less than 200. It is the number of processes (multiprocessing) DDGWizard will use for sequence alignment.
(6). --mode
Please provide the default value whole.
(7). --process_num
This parameter takes an integer greater than 0 and less than 200. It is the number of processes (multiprocessing) DDGWizard will use for generating features.
(8). --container_type
This parameter takes a value of D, S, or - (default). D uses Docker as the container system, S uses Singularity as the container system, and - skips running PROFbval.
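The value constraints listed above can be checked before launching a run. The following Python sketch simply restates those constraints; `check_options` is a hypothetical helper, not part of DDGWizard, and it does not verify that the paths or the Blast database actually exist.

```python
def check_options(options):
    """Return a list of messages for values outside the documented ranges."""
    errors = []

    if options.get("if_reversed_data") not in (0, 1):
        errors.append("if_reversed_data must be 0 or 1")

    # Both process counts must be integers greater than 0 and less than 200.
    for name in ("blast_process_num", "process_num"):
        value = options.get(name)
        if type(value) is not int or not 0 < value < 200:
            errors.append(f"{name} must be an integer between 1 and 199")

    if options.get("mode") != "whole":
        errors.append("mode must be 'whole'")

    if options.get("container_type") not in ("D", "S", "-"):
        errors.append("container_type must be D, S, or -")

    return errors
```

An empty list means every checked value is within the documented range.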
4. Output
There will be an output csv file features_table.csv located in DDGWizard/src/Feature_Res/, which records the complete set of generated features.
5. Notes
(1). When running DDGWizard, you need to cd to the top-level directory of the program before executing it.
(2). DDGWizard supports multi-process handling itself. To fully utilize your computer's resources, we recommend using the multi-process parameters provided by DDGWizard rather than running multiple instances of DDGWizard yourself.
If you do need to run multiple instances of DDGWizard at the same time, do not run them from the same folder: the program synchronizes files within the folder, so concurrent instances can cause synchronization errors. Instead, make multiple copies of the DDGWizard folder and run each instance separately in its own folder.
(3). Do not place your files in the top-level folder of DDGWizard. DDGWizard will automatically clean files in the top-level folder to maintain multi-process synchronization.
(4). The complete log file is saved at the path DDGWizard/src/log.txt.
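For users who do choose to run several instances despite note (2), the folder-per-instance setup can be sketched as follows. This is an illustration under assumptions, not a DDGWizard feature: `split_csv` and `copy_instances` are hypothetical helpers, the chunk files are written to a separate directory outside every DDGWizard folder (in line with note (3)), and the rows are distributed round-robin.

```python
import csv
import shutil
from pathlib import Path


def split_csv(csv_path, out_dir, parts):
    """Split a raw dataset into `parts` csv files that all keep the header row."""
    with open(csv_path, newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader)
        rows = list(reader)

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    chunk_paths = []
    for index in range(parts):
        chunk_path = out_dir / f"chunk_{index}.csv"
        with open(chunk_path, "w", newline="") as handle:
            writer = csv.writer(handle)
            writer.writerow(header)
            writer.writerows(rows[index::parts])  # round-robin assignment
        chunk_paths.append(chunk_path)
    return chunk_paths


def copy_instances(ddgwizard_dir, work_dir, parts):
    """Make one independent copy of the DDGWizard folder per instance."""
    copies = []
    for index in range(parts):
        target = Path(work_dir) / f"DDGWizard_{index}"
        shutil.copytree(ddgwizard_dir, target)
        copies.append(target)
    return copies
```

Each copy can then be run from its own top-level directory with `--raw_dataset_path` pointing at one chunk file; the per-instance features_table.csv outputs would need to be merged afterwards.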