A simple program to analyze protein-compound complex RapidFire data at Vicinitas using either UniDec or OpenMS FlashDeconv.
Currently, the program is in the development stage and runs as a pipeline, driven mainly by the main.py script. The program takes in a folder of either raw MS data or mzML files and runs them through the pipeline. If the data is raw MS data, a conversion Docker container is called via a REST API; more details below.
- The program takes in a folder of data
- Uploads a meta-data file that contains protein masses, compound masses, file identifications, and other information. Note that if IC50 values are needed, concentration values must also be included in the meta-data file.
- Either UniDec or FlashDeconv is called to process the data.
- If UniDec is called, the program runs the data through the Python API for UniDec.
- If FlashDeconv is called, the program runs the data through a CLI call (see the sketch after this list).
- Results from the process are then uploaded to a database.
- Compound complex modifications are then calculated and matched per each well.
- Within each well a percentage intensity is calculated for each protein-compound complex.
- These matches are uploaded separately to the database.
- Using each protein-compound modification number, i.e. Mod0 or Mod1, the IC50 values are calculated and uploaded to the database.
- Plots of each curve are generated and written to ... PNG files.
- TODO create a web UI to display the results.
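
For the FlashDeconv path, the CLI call is conceptually a subprocess invocation like the sketch below. This is illustrative only: `run_flashdeconv` is a hypothetical helper, the flags assume the standard OpenMS TOPP-style `-in`/`-out` interface, and the exact options depend on your FLASHDeconv version.

```python
# Illustrative sketch of the FlashDeconv CLI step; not the actual
# pyRapidFire code. Verify the flags against `FLASHDeconv --help`.
import subprocess

def run_flashdeconv(mzml_path: str, out_path: str) -> None:
    """Deconvolute one mzML file with the FLASHDeconv command-line tool."""
    subprocess.run(
        ["FLASHDeconv", "-in", mzml_path, "-out", out_path],
        check=True,  # raise CalledProcessError if the tool exits non-zero
    )
```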
Both installation via pip and poetry are supported. The program is designed to run in a Docker container, and the folder layout allows a package to be built. To build the package, run the following commands:

```bash
which poetry || pip install poetry
poetry build
```

This will create a whl file that can be installed via pip:

```bash
pip install dist/pyRapidFire-VERSION_NUMBER-py3-none-any.whl
```

Additionally, a full docker_compose file is provided to run the program. The docker_compose file will start the program, a database, and the needed converter API functions.
- `main.py` - the main file that runs the program. Has a pipeline function that calls most of the other functions.
- `database.py` - contains the database class that is used to upload data to the database.
- `protein_deconvolution.py` - contains the functions that are used to process the data. It has two classes, `protein_well` and `protein_decon_unidec`. The `protein_decon_unidec` class is used to aggregate the wells by a single compound/VCNT-ID. The `protein_well` class is used to store the data for each well; within this class is also the matching function `simple_match` that is used to match the protein.
  - When UniDec is used, the method needs to know the estimated mass of the protein and the range of masses to search. Additionally, it is helpful for it to know the charge state of the protein.
  - FlashDeconv does not need to know the estimated mass of the protein or the range of masses to search, and it has improved resolution/mass accuracy.
- `helper.py` - contains the helper functions that are used to process the data: mainly functions to find the files and a function to help fit the IC50 curves (see the sketch after this list).
- `analysis.py` - contains the functions that are used to analyze the data, mainly functions that calculate the IC50 values. The IC50 values are processed in the `IC50_Curves` class.
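
For orientation, dose-response IC50 fits of this kind are commonly done with a four-parameter logistic curve. The sketch below illustrates that idea with scipy; it is not the actual `helper.py`/`IC50_Curves` implementation, and all names and data are made up.

```python
# Minimal four-parameter logistic (4PL) IC50 fit, assuming numpy and
# scipy are installed. Illustrative only; not the pyRapidFire code.
import numpy as np
from scipy.optimize import curve_fit

def four_pl(conc, bottom, top, ic50, hill):
    """Response as a function of concentration (4PL model)."""
    return bottom + (top - bottom) / (1.0 + (conc / ic50) ** hill)

conc = np.array([0.01, 0.1, 1.0, 10.0, 100.0])      # compound concentrations (made-up)
response = np.array([97.0, 88.0, 52.0, 14.0, 4.0])  # e.g. percentage intensity per well

params, _ = curve_fit(four_pl, conc, response, p0=[0.0, 100.0, 1.0, 1.0])
print(f"Fitted IC50 = {params[2]:.2f}")
```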
The system is designed with a database in mind. The database is used to store the data and the results of the analysis, and most of the methods and functions are designed around it. Additionally, there is a custom logger that logs to both a file and the database. If a logger object is not passed to the database class, a default logger is made. The caveat here is that the logger needs a database connection. == This means that environment variables are needed ==. These are:

- `DB_USER` - the username for the database
- `DB_PASS` - the password for the database
- `DB_HOST` - the host for the database
- `DB_NAME` - the name of the database
- `DB_CERT_PATH` - the path to the certificate for the database
- `DB_CERT_NAME` - the name of the certificate file
- `DATA_PATH` - the path to the data folder for data to be processed from
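
These typically live in a `.env` file that python-dotenv reads. As a quick illustration (not part of the package), the following checks that all of the variables listed above are set after loading:

```python
# Illustrative sanity check for the required variables listed above;
# not part of pyRapidFire itself.
import os
from dotenv import load_dotenv

load_dotenv()
required = ["DB_USER", "DB_PASS", "DB_HOST", "DB_NAME",
            "DB_CERT_PATH", "DB_CERT_NAME", "DATA_PATH"]
missing = [name for name in required if not os.getenv(name)]
if missing:
    raise RuntimeError(f"Missing environment variables: {missing}")
```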
Due to the logger and the need for the database connection, the modules need to be loaded in a specific order. If you are creating a new run script/program, you will need to load the dotenv module prior to loading the pyrapidfire.RapidfireDB and logging_db modules, because the database connection is needed for the logger. An example would be as follows:
```python
import os
from dotenv import load_dotenv
from pyrapidfire import database
from pyrapidfire import logging_db

load_dotenv()
logger = logging_db.get_logger()
logger.name = "pyRapidFire"  # Set the name of the logger; can also be __name__
obj = database.RapidFireDB(sqlalchemy=True, direct_connect=True, logger=logger)
obj.get_experiments()
```

The logger works by creating a custom logger that logs to both a file and the database. It is created by the `logging_db.get_logger()` function, which returns a logger object that can be used to log messages. The logger object has a custom handler that logs to the database; additional handlers can be added to the logger object with `logger.addHandler()` to log to the console or another file. The logger object also has two custom attributes: `expid`, which can be set to the experiment id so that it is logged to the database, and `name`, which can be set to the name of the logger and is logged to the database as well.
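
For example, adding a console handler uses the standard library `logging` API (a generic pattern, not a pyRapidFire-specific call; `logger` is the object returned by `logging_db.get_logger()` above):

```python
import logging

# Attach a console handler alongside the existing file/database handlers.
console = logging.StreamHandler()
console.setFormatter(logging.Formatter("%(asctime)s %(name)s %(levelname)s: %(message)s"))
logger.addHandler(console)
```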
Setting the experiment id on the database handler looks like this:

```python
import os
from dotenv import load_dotenv
from pyrapidfire import database
from pyrapidfire import logging_db

load_dotenv()
logger = logging_db.get_logger()
logger.handlers[0].db.expid = 1  # Set the experiment id for the logger
```

TODO:

- Add a docker container for running either UniDec or FlashDeconv
- Change to a class-based processing system, removing main.py from the run
- Add a logger to the program
- Add a web UI for the program
- Move code around into a better file and folder structure