MOLECULAR PROPERTIES OPTIMIZATION PROJECT
Discrete-Continuous QSAR Methodology on the Basis of Physico-Chemical Descriptors
QSAR modelling usually proceeds in the following principal steps. First, a set of descriptors that adequately characterizes the properties of a set of compounds with known activity (the so-called "training set") is estimated. Second, a correlation between the selected descriptors and the property under consideration is developed using statistical methods. LSER (Least Squares Error Reduction), a Multiple Regression Analysis (MRA) technique, is the common basis of QSAR. This approach describes the dependent vector (in our case, property or biological activity)
$Y = \{y_i\}$, $i = 1, \dots, M$, as a function

$$y_i = a_0 + \sum_{j=1}^{N} a_j x_{ij} + e_i \qquad (1)$$

of a number of independent variables $X = \{x_{ij}\}$, $i = 1, \dots, M$; $j = 1, \dots, N$, of the training set (here, $M$ is the number of chemical compounds, each described by $N$ structural descriptors, and $e_i$ are the residuals).
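As an illustration of eq. (1), the following minimal sketch fits the coefficients by ordinary least squares through the normal equations. The function names `fit_mra` and `residuals` and the toy training set are assumptions made for this example only; in practice a library routine such as `numpy.linalg.lstsq` would normally be used instead of a hand-written solver.

```python
def fit_mra(X, y):
    """Least-squares fit of y_i = a_0 + sum_j a_j * x_ij + e_i.

    X: list of M rows, each with N descriptor values; y: list of M activities.
    Returns the coefficients [a_0, a_1, ..., a_N].
    """
    rows = [[1.0] + [float(v) for v in r] for r in X]  # prepend intercept column
    n = len(rows[0])
    # Normal equations (A^T A) a = A^T y, stored as an augmented matrix.
    aug = [[sum(r[p] * r[q] for r in rows) for q in range(n)]
           + [sum(r[p] * yi for r, yi in zip(rows, y))] for p in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda k: abs(aug[k][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("descriptors are collinear; fit is not unique")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for k in range(col + 1, n):
            f = aug[k][col] / aug[col][col]
            for q in range(col, n + 1):
                aug[k][q] -= f * aug[col][q]
    # Back substitution.
    coef = [0.0] * n
    for p in reversed(range(n)):
        s = sum(aug[p][q] * coef[q] for q in range(p + 1, n))
        coef[p] = (aug[p][n] - s) / aug[p][p]
    return coef


def residuals(X, y, coef):
    """Residuals e_i = y_i - (a_0 + sum_j a_j * x_ij)."""
    return [yi - (coef[0] + sum(a * x for a, x in zip(coef[1:], r)))
            for r, yi in zip(X, y)]


# Hypothetical training set: M = 5 compounds, N = 2 descriptors.
X_train = [[1.0, 0.0], [2.0, 1.0], [3.0, 0.5], [4.0, 2.0], [5.0, 1.5]]
y_train = [0.5 + 2.0 * x1 - 1.0 * x2 for x1, x2 in X_train]  # noise-free toy data
coef = fit_mra(X_train, y_train)  # [a_0, a_1, a_2]
```

The `ValueError` branch reflects the requirement that the descriptors be independent of each other: with collinear descriptors the normal equations have no unique solution.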
In such models, it is tacitly assumed that all compounds in the training set have the same mechanism of action and the same biological target. In practice, the diversity and complexity of chemical structures may not allow a complete characterization of the compounds by physico-chemical descriptors.
A few approaches have been used to overcome this problem. For example, the use of indicator variables permits QSAR models to operate on sets of non-homogeneous compounds. However, the MRA technique is limited by the a priori requirement that all structural descriptors be independent of each other, error-free, and relevant to the problem, and that all compounds belong to the same group (or cluster). The latter requirement is important because, while MRA can lead to fairly good explanations of intraclass structure, it is unable to recognize the existence of clusters of compounds.
Another approach is the SIMCA/PLS method [1,2], which is based on the philosophy of applying disjoint principal component (PC) models to each set of homogeneous compounds.
In 1984 Raevsky et al. proposed the QSAR Discriminant-Regression Model (DIREM) [3], which resembles the SIMCA/PLS method but differs from it in some important aspects:
The key idea of DIREM is, while forming clusters, to take into account information about the distribution of compounds with respect to property (biological activity). Thus, DIREM generates homogeneous clusters with respect to similarity in (i) chemical structure, and (ii) property (biological activity).
The application of the new multidimensional scaling methodology makes it possible to use a priori expert knowledge at the step of cluster formation according to the similarity of property (biological action) among the compounds.
Attention is focused on easily interpretable QSAR models, which are based directly on the original explanatory variables. Thus, in the initial stages of drug design problems, stepwise Linear Discriminant Analysis (LDA) and MRA are used.
QSAR DIREM can be used as a platform for knowledge-based design.
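The discriminate-then-regress idea behind the aspects listed above can be illustrated with a toy sketch. This is not the DIREM algorithm of [3]: a nearest-centroid rule stands in for stepwise LDA, the cluster labels are assumed to be already supplied by the cluster-formation step, each cluster gets a one-descriptor regression line, and all names and data are hypothetical.

```python
def centroid(rows):
    """Mean descriptor vector of a group of compounds."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def assign_cluster(x, centroids):
    """Discriminant step (stand-in for LDA): pick the closest cluster centroid."""
    def dist2(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: dist2(centroids[label]))


def fit_line(xs, ys):
    """Regression step: one-descriptor least squares, y = a0 + a1 * x."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    a1 = sxy / sxx
    return my - a1 * mx, a1


def fit_discriminant_regression(X, y, labels):
    """Fit one centroid and one regression line (on descriptor 0) per cluster."""
    model = {}
    for label in set(labels):
        idx = [i for i, l in enumerate(labels) if l == label]
        rows = [X[i] for i in idx]
        line = fit_line([r[0] for r in rows], [y[i] for i in idx])
        model[label] = (centroid(rows), line)
    return model


def predict(model, x):
    """Assign x to a cluster, then apply that cluster's regression line."""
    centroids = {label: c for label, (c, _) in model.items()}
    label = assign_cluster(x, centroids)
    a0, a1 = model[label][1]
    return label, a0 + a1 * x[0]


# Hypothetical data: two clusters that follow opposite structure-activity trends.
X = [[1.0, 0.1], [2.0, 0.2], [3.0, 0.1], [1.0, 5.0], [2.0, 5.2], [3.0, 5.1]]
y = [1.0, 2.0, 3.0, 6.0, 5.0, 4.0]
labels = ["A", "A", "A", "B", "B", "B"]  # assumed output of cluster formation
model = fit_discriminant_regression(X, y, labels)
```

In this toy data set a single line on descriptor 0 fitted to the pooled compounds has zero slope, because the two clusters follow opposite trends. Fitting each cluster separately recovers both trends, which is the motivation for forming homogeneous clusters before regression.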
During the 1990s, the authors used DIREM to create stable, predictive QSAR models of various properties and activities.
An original combination of similarity and QSAR for creating stable, predictive models of properties (activity) was recently proposed by Raevsky [4]. In this work, four approaches were considered for calculating logP of drugs containing a few chemical functional groups:
A regression equation based on physicochemical descriptors (polarizability and H-bond acceptor factor) was applied.
The logP value of the nearest neighbor in a large training set was used as the calculated value for the compound-of-interest.
The mean logP value of the three nearest neighbors was used as the calculated value for the compound-of-interest.
logP values of the nearest neighbors were used only in the first step. In addition, the contributions to lipophilicity arising from differences in polarizabilities and H-bond acceptor factors between the compound-of-interest and its nearest neighbors were also taken into consideration. In this case, eq. (2) was used, employing coefficient values from the direct regression equation based on polarizability ($\alpha$) and H-bond acceptor factor ($C_a$):

$$\log P_i = \frac{1}{N} \sum_{j=1}^{N} \left[ \log P_j + 0.267\,(\alpha_i - \alpha_j) - 1.00 \left( \sum C_{a,i} - \sum C_{a,j} \right) \right] \qquad (2)$$

where index $i$ indicates the compound-of-interest, index $j$ indicates a near neighbor, and $N$ is the number of closely related structures used.
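The neighbor-based approaches, including the descriptor-corrected estimate, can be sketched as follows. The source does not state how structural similarity is measured, so this example assumes Tanimoto similarity on sets of made-up fragment labels. The compound records, field names, and numerical values are likewise hypothetical; only the coefficients 0.267 and 1.00 come from the text.

```python
def nearest_neighbors(query_fp, training, k):
    """Return the k training compounds most similar to the query.

    Similarity is Tanimoto on sets of fragment labels (an assumption of
    this sketch, not a measure specified in the source).
    """
    def tanimoto(a, b):
        union = len(a | b)
        return len(a & b) / union if union else 0.0
    ranked = sorted(training, key=lambda c: tanimoto(query_fp, c["fp"]),
                    reverse=True)
    return ranked[:k]


def logp_mean_of_neighbors(neighbors):
    """Plain mean of neighbor logP values (k = 1 or k = 3 in the text)."""
    return sum(c["logP"] for c in neighbors) / len(neighbors)


def logp_corrected(alpha_i, sum_ca_i, neighbors):
    """Neighbor logP corrected for polarizability and H-bond acceptor
    differences, averaged over the N neighbors used."""
    total = 0.0
    for c in neighbors:
        total += (c["logP"]
                  + 0.267 * (alpha_i - c["alpha"])
                  - 1.00 * (sum_ca_i - c["sum_Ca"]))
    return total / len(neighbors)


# Hypothetical training compounds and query (all values invented).
training = [
    {"name": "n1", "fp": {"a", "b", "c"}, "logP": 2.0, "alpha": 10.0, "sum_Ca": 1.0},
    {"name": "n2", "fp": {"a", "b", "d"}, "logP": 3.0, "alpha": 12.0, "sum_Ca": 2.0},
    {"name": "far", "fp": {"x", "y"}, "logP": -1.0, "alpha": 5.0, "sum_Ca": 4.0},
]
query = {"fp": {"a", "b", "c", "d"}, "alpha": 12.0, "sum_Ca": 1.5}

nn = nearest_neighbors(query["fp"], training, k=2)
estimate = logp_corrected(query["alpha"], query["sum_Ca"], nn)
```

With these invented numbers the plain neighbor mean is 2.5 and the corrected estimate is 2.767, because the query is more polarizable than one neighbor and sits between the two neighbors in H-bond acceptor strength.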
Later, this approach was applied to construct stable, predictive models of lipophilicity, water solubility, and human intestinal absorption [5-7].
References:
[1] Clementi, S., in Jolles, G. and Wooldridge, K.R.H. (Eds.), Drug Design: Fact or Fantasy?, Academic Press, London, 1984, pp. 73-94.
[2] Wold, S., Dunn, W.J. and Hellberg, S., ibid., pp. 95-117.
[3] Raevsky, O.A., Sapegin, A. and Zefirov, N.N., The QSAR Discriminant-Regression Model, Quant. Struct.-Act. Relat., 13, 412-418 (1994).
[4] Raevsky, O.A., Molecular Lipophilicity Calculations of Chemically Heterogeneous Chemicals and Drugs on the Basis of Structural Similarity and Physicochemical Parameters, SAR and QSAR in Environ. Res., 12, 367-381 (2001).
[5] Raevsky, O.A., Trepalin, S.V., Trepalina, E.P., Gerasimenko, V.A. and Raevskaja, O.E., SLIPPER-2001 – Software for Predicting Molecular Properties on the Basis of Physicochemical Descriptors and Structural Similarity, J. Chem. Inf. Comput. Sci., 42, 540-549 (2002).
[6] Raevsky, O.A., Schaper, K.-J., Artursson, P. and McFarland, J.W., A New Approach for Quant. Struct.-Act. Relat., 21, 402-410 (2002).