**INFORMATION ABOUT PROJECT,
SUPPORTED BY RUSSIAN SCIENCE FOUNDATION**

*The information is prepared on the basis of data from the RSF information-analytical system; the informative part is presented as edited by the authors. All rights belong to the authors; use or reprinting of the materials is permitted only with the prior consent of the authors.*

COMMON PART

Project Number: 18-11-00078

Project title: Development of new machine learning models based on compositions of deep forests and neural networks for solving the medical diagnostics problems

Project Lead: Utkin Lev

Affiliation: Peter the Great St. Petersburg Polytechnic University

Implementation period: 2018 - 2020

Research area: 01 - MATHEMATICS, INFORMATICS, AND SYSTEM SCIENCES, 01-202 - Intellectual data analysis and image recognition

Keywords: machine learning, classification, decision strategy, neural networks, deep forests, pattern recognition, oncology

PROJECT CONTENT

Annotation

The project aims to develop new classes of machine learning models and algorithms based on deep forests, which can be regarded as an alternative to deep neural networks and as a supplement to them in applied problems where the size of the training set does not allow neural networks to be used. These problems include the diagnostics of oncological diseases from X-ray scans, ultrasound examinations, computed tomography and other forms of diagnostic investigation of patients. The new classes of models are controlled deep forests and their compositions with neural networks. Deep forests have a cascade structure in which each level contains a number of random forests combined by means of a stacking algorithm. The main idea underlying controlled deep forests is to introduce tree weights as training parameters and to optimize an objective loss function corresponding to the machine learning problem being solved. The idea of optimal tuning of controlled deep forests is to reduce the number and the space of weights by assigning them not to each tree, but to subsets of class probability distributions of decision trees that are close to each other within a certain grid, which divides the unit simplex of probabilities into a set of small simplexes whose size can be a tuning parameter. The weight space is reduced on the basis of imprecise statistical models (the contamination model, the imprecise Dirichlet model, Kolmogorov-Smirnov bounds, etc.). The project proposes deep forest modifications that address domain adaptation or knowledge transfer problems, where a large amount of source domain data is available together with target domain data that have to be classified. The project also proposes new algorithms for implementing robust distance metric learning models based on deep forests for various kinds of training samples (with known class labels, or with only comparative information available).
As a generalization of controlled deep forests, a deep neural forest is proposed where the processing of the class probability distributions at the output of trees by means of the weights is replaced by small neural networks. The deep neural forests provide a completely new type of machine learning models and allow using both the advantages of deep forests and neural networks.
Anomalous behavior detection based on deep forests is also studied in the project. The main idea is to use a Siamese deep forest instead of a Siamese neural network.
The combination of deep forests and scanning small neural networks is the basis for the implementation of a completely new type of autoencoders, including the “denoising” autoencoder, contractive autoencoder, etc., which are needed for the primary processing of various forms of the diagnostic investigation of patients to remove the natural “noise”.
The main application result of the developed models and algorithms is an intelligent system for processing diagnostic medical information, which is a composition of deep neural networks and deep neural forests. One of the ideas underlying such a composition is to replace the intermediate layers in a deep neural network of the ResNet type based on the ideas of stacking by cascades of the deep forest.
The relevance of the project results is determined by the fact that efficient machine learning methods are becoming one of the main components of the intellectualization of areas such as medical diagnostics. The project is implemented at Peter the Great St. Petersburg Polytechnic University in cooperation with the St. Petersburg Clinical Scientific-Practical Center of Specialized Types of Medical Care (Oncology).

Expected results

1. Development of a general approach to controlled modifications of the deep forest based on the introduction of additional training parameters, rules or training algorithms, which improve the efficiency of deep forests and the accuracy of classification, as well as solve specific tasks of machine learning. A main peculiarity of the modifications is the control of the class vector structure at the random forest outputs with the use of new parameters whose calculation is performed by solving additional optimization problems or by training small (shallow) neural networks.
2. Development and research of new classification models based on deep forest modifications and on reducing both the set of trained weights and their number in order to improve classification accuracy. The first basic idea underlying the reduction is to assign weights not to decision trees, but to subsets of “close” class probability distributions on the unit simplex by splitting it into a number of small simplexes. The number of small simplexes is a tuning parameter. The second basic idea is to reduce the unit simplex of weights, whose dimension is equal to the number of decision trees or small simplexes, by means of imprecise statistical models (the imprecise contamination model, the imprecise Dirichlet model, the imprecise pari-mutuel model, Kolmogorov-Smirnov bounds). The optimal model and its parameters are to be selected for medical applications. The reduction of the unit simplex of weights can be regarded as a “fine” tuning of the model.
3. Development and research of new transfer learning models based on deep forest modifications by using a common feature representation for the two domains of data by means of class vectors at each level of the forest cascade. The use of the deep forest cascade structure peculiarity for the efficient implementation of self-labeling models by iteratively updating target data labels due to several levels of the data processing. A generalization of the obtained models on the case of the multi-view source data and multi-task classification problems.
4. Development of new algorithms for implementing robust distance metric learning on the basis of deep forests for different kinds of training samples (with known class labels, with only comparative information). Development of efficient alternatives to Siamese neural networks and triplet neural networks.
5. Development of robust models for cases when the training set is small and the class probability distribution at the output of the decision trees cannot be determined with sufficient accuracy, which leads to a bias of the class vectors at the outputs of the random forests. The basic idea of the robust models is to use imprecise statistical models (the imprecise contamination model, the imprecise Dirichlet model, the imprecise pari-mutuel model, Kolmogorov-Smirnov bounds) not for “fine-tuning” the set of weights, but for determining sets of probability distributions and for applying a robust decision strategy that selects optimal distributions within the framework of random forests. This leads to minimax optimization problems over weights and class probability distributions, whose solution is also one of the tasks of the project.
6. Development of the denoising autoencoder, the contractive autoencoder, the split-brain autoencoder based on a combination of random forests and scanning small neural networks that provide a significant reduction of training sets. The main idea underlying new autoencoders is that the first stage of data processing in the encoder and the last stage of processing in the decoder are carried out by means of the random forest. All other stages use the scanning neural networks. This combination significantly reduces the number of learning parameters and reduces the risk of overfitting by a small training set. The preprocessing of X-ray images, ultrasound imaging and other forms of the diagnostic investigation of patients to remove natural “noise” can be carried out by means of the autoencoders. Training of the proposed autoencoders is based on existing non-intelligent image processing algorithms for isolating the required elements from the “noise”.
7. Development of the deep neural forest that contains neural networks used in place of modules for training the decision tree weights. The neural network has its own weights that are independent of trees and they are training parameters. This allows us to get rid of the linearity of the weighted averages and to use the class probability distribution functions at the output of each decision tree implemented by the neural network.
8. Development and investigation of a new forest-based Siamese autoencoder that preserves the approximate data structure (distances between data examples) for detecting anomalous behavior during patient monitoring. The main peculiarity of the proposed autoencoders is that the distance between pairs of objects is preserved in spite of the data transformation. The objective function for training minimizes the difference between the Euclidean distances between pairs of objects at the input and at the output of the network. The structure is needed to provide on-line patient monitoring using the Mahalanobis distance.
9. Development of compositions of deep neural networks and deep neural forests to improve the efficiency of machine learning tasks. The implementation of sequential processing of data with respect to the neural network layers and forest cascade levels. The main idea is to replace the intermediate layers in a network such as ResNet, based also on the ideas of stacking, by cascades of deep forests. The search for optimal structures for solving the problems of processing the various forms of patient diagnostic investigation.
10. The development of software implementing new algorithms. The study of the algorithm performance using real medical data. Defining sets of optimal parameters of imprecise statistical models for various algorithms, for example, the contamination parameter in robust models. Comparing the proposed algorithms with the well-known standard machine learning methods. The demonstration of the efficiency and applicability of new models to problems of processing various forms of the patient diagnostic investigation.
11. Development of methods for parallelizing the models of deep forests. Extension of the Python parallelization library for supercomputer implementation of deep forests. Development of software for visualization of computed tomography imaging in DICOM format by outlining pathological formations in the lungs. Preparing images for solving the segmentation and classification problems. Development of image segmentation algorithms for segmenting the required objects on the computed tomography images.
12. Development of an artificial intelligence system for the diagnosis of cancer patients by using new deep learning algorithms.
The results cover many elements and problems of machine learning. Their practical implementation allows us to develop new efficient approaches to solving deep learning problems. From an applied point of view, the implementation of the new approaches and models, as well as of the artificial intelligence system for the diagnosis of cancer patients, will increase the effectiveness of diagnostic investigation and make it less dependent on the doctor's professional experience.
The given list of tasks is rather wide. However, even the solution of a part of them may lead to outstanding results that can be submitted to journals indexed by Web of Science and Scopus.
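The distance-preserving training objective mentioned for the forest-based Siamese autoencoder can be illustrated with a minimal sketch. The function name, the squared penalty on the difference of distances and the toy data are assumptions made for illustration only; the project description does not fix the exact form of the loss.

```python
import math
from itertools import combinations


def distance_preservation_loss(inputs, outputs):
    """Sum of squared differences between pairwise Euclidean distances
    measured at the input and at the output of a transformation.

    The squared penalty is an assumption; the source only states that the
    difference of distances is minimised.
    """
    if len(inputs) != len(outputs):
        raise ValueError("inputs and outputs must contain the same number of objects")
    loss = 0.0
    for i, j in combinations(range(len(inputs)), 2):
        d_in = math.dist(inputs[i], inputs[j])     # distance before the transformation
        d_out = math.dist(outputs[i], outputs[j])  # distance after the transformation
        loss += (d_in - d_out) ** 2
    return loss


# Hypothetical toy data: a rotation preserves distances, a scaling does not.
points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
rotated = [(0.0, 0.0), (-4.0, 3.0), (-8.0, 6.0)]
scaled = [(0.0, 0.0), (6.0, 8.0), (12.0, 16.0)]
print(distance_preservation_loss(points, rotated))
print(distance_preservation_loss(points, scaled))
```

A transformation with a loss close to zero keeps the neighbourhood structure of the data, which is what a distance-based monitoring rule relies on.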

REPORTS

Annotation of the results obtained in 2020

1. New architectures of the segmentation and classification subsystems have been developed, which make it possible to reduce the number of false-positive cases and to increase the accuracy of differential diagnostics through the implementation of more complex compositions of Siamese neural networks. As a new segmentation architecture, it is proposed to use a combination of 3D detection and segmentation inside bounding boxes, which can significantly improve the segmentation accuracy. As a new classification architecture, it is proposed to use a system consisting of three parallel classification channels, followed by the calculation of accuracy measures for the entire algorithm. The second channel uses an ensemble of 80 triplet neural networks, each of which can be viewed as a generalization of the Siamese network. A special ensemble training procedure has been developed, which allows us to construct triplets when the training data are significantly imbalanced.
2. A novel approach to interpreting and explaining diagnostic results in natural language is proposed. The main ideas behind the approach are: 1) the natural language explanation is presented in the form of a hierarchy of primitives and phrases that describe the form, structure, inclusions, contours and other peculiarities of nodules; 2) simple classifiers are constructed and trained to classify suspicious objects or their low-dimensional representations into classes corresponding to primitives, such that each classifier corresponds to a peculiarity of the nodule; 3) the whole explanation algorithm is implemented in two parts: the first part is an explanation model (LIME or SHAP) for selecting meaningful features from the object or its representation, and the second, key part is a set of classifiers whose purpose is to link the selected meaningful features to sentences in natural language.
3. Two new modifications of the adaptive deep forest are proposed. The first one implements transfer learning for the classification problem. A solution to the problem of unsupervised domain adaptation is proposed in which, instead of transforming the feature space, the representation of the initial vectors at each level of the deep forest cascade with adaptive weights is used to achieve a minimum of the domain proximity measure. The second modification is based on the imprecise representation of the weights of training examples using imprecise statistical models, for example, the imprecise Dirichlet model or the contamination model. This leads to a set of weight distributions forming a part of the unit simplex and to a set of random forests, from which the forest giving the maximum training classification error is selected for testing. The parameter of the imprecise model makes it possible to adaptively control the robustness of deep forest cascades.
4. New models have been developed for evaluating the heterogeneous treatment effect in small patient groups. The main idea is to use analogs of local interpretation models and a linearization of the “treatment group” model in a local area around the analyzed patient. It is proposed to use a distance metric between examples defined as the distance between the leaves of decision trees into which these examples fall, averaged over the entire random forest. It is proposed to perform a local linearization of the response function estimate based on several close observations, using a similarity matrix constructed for the random forest. The LASSO model is used to mitigate the effect of data redundancy.
5. The concept of functioning and transformation of the radiology departments in oncological medical centers in the context of the use of intelligent systems for diagnosing diseases has been developed. It has been shown that the activity of a radiologist is transformed into a cyclic process, which implies, in addition to analyzing and interpreting images, monitoring the verification of pathology, assigning class labels for machine learning, labeling the pathology and forming a database of the studied pathology. Procedures have been developed for introducing the practice of constantly updating the database with each new detected case when a doctor receives a prediction produced by the intelligent system and compares the result with his own interpretation of the pathology.
Most of the results are published in various journals, including journals indexed by Scopus and Web of Science.
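The forest-based distance used for the treatment effect models can be illustrated with a minimal sketch. It assumes the simplest leaf distance, namely 0 when two examples fall into the same leaf of a tree and 1 otherwise, which corresponds to one minus the usual random forest proximity; the project may use a different leaf distance. The leaf indices below are hypothetical; in practice they could be obtained, for instance, from the `apply` method of a scikit-learn random forest.

```python
def forest_distance(leaves_a, leaves_b):
    """Average over trees of a 0/1 leaf distance between two examples.

    leaves_a[t] and leaves_b[t] are the indices of the leaves of tree t
    into which the two examples fall. The 0/1 leaf distance is an assumption.
    """
    if len(leaves_a) != len(leaves_b):
        raise ValueError("both examples must be described by the same number of trees")
    mismatches = sum(a != b for a, b in zip(leaves_a, leaves_b))
    return mismatches / len(leaves_a)


def distance_matrix(leaf_ids):
    """Pairwise forest distances; leaf_ids[i][t] is the leaf of example i in tree t."""
    return [[forest_distance(a, b) for b in leaf_ids] for a in leaf_ids]


# Hypothetical leaf indices of three patients in a forest of three trees.
leaf_ids = [
    [1, 4, 2],
    [1, 4, 3],
    [5, 6, 7],
]
for row in distance_matrix(leaf_ids):
    print(row)
```

The examples closest to an analyzed patient under this distance could then serve as the local neighbourhood for the linearization.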

Publications

**1.** *Meldo A.A., Utkin L.V., Trofimova T.N.* **Искусственный интеллект в медицине: современное состояние и основные направления развития интеллектуальной диагностики** Лучевая диагностика и терапия, 1(11), c. 9-17 (year - 2020) https://doi.org/10.22328/2079-5343-2020-11-1-9-17

**2.** *Utkin L.V. and Zhuk K.D.* **Improvement of the Deep Forest Classifier by a Set of Neural Networks** Informatica, 44, 1-13 (year - 2020) https://doi.org/10.31449/inf.v44i1.2740

**3.** *Utkin L.V., Konstantinov A.V., Chukanov V.S., Meldo A.A.* **A new adaptive weighted deep forest and its modifications** International Journal of Information Technology & Decision Making, Vol. 19, No. 04, pp. 963-986 (year - 2020) https://doi.org/10.1142/S0219622020500236

**4.** *Utkin L.V., Kots M.V., Chukanov V.S., Konstantinov A.V., Meldo A.A.* **Estimation of personalized heterogeneous treatment effects using concatenation and augmentation of feature vectors** International Journal on Artificial Intelligence Tools, Vol. 29, No. 05, Article 2050005, pp. 1-23 (year - 2020) https://doi.org/10.1142/S0218213020500050

**5.** *Utkin L.V., Kovalev M.S., Coolen F.* **Imprecise weighted extensions of random forests for classification and regression** Applied Soft Computing Journal, vol. 92, Article 106324, 2020, pp. 1-14 (year - 2020) https://doi.org/10.1016/j.asoc.2020.106324

**6.** *Utkin L.V., Kovalev M.S., Kasimov E.M.* **An explanation method for black-box machine learning survival models using the Chebyshev distance** Artificial Intelligence and Natural Language. AINL 2020, Communications in Computer and Information Science, Springer, Cham, vol. 1292, 2020 (year - 2020) https://doi.org/10.1007/978-3-030-59082-6_5

**7.** *Utkin L.V., Meldo A.A., Kovalev M.S., Kasimov E.M.* **A simple general algorithm for the diagnosis explanation of computer-aided diagnosis systems in terms of natural language primitives** 2020 XXIII International Conference on Soft Computing and Measurements (SCM), IEEE, pp. 202-205 (year - 2020) https://doi.org/10.1109/SCM50615.2020.9198764

**8.** *Utkin L.V., Meldo A.A., Kovalev M.S., Kasimov E.M.* **Простой общий алгоритм объяснения диагноза на выходе интеллектуальной системы диагностики в терминах примитивов естественного языка** XXIII Международная конференция по мягким вычислениям и измерениям (SCM-2020). Сборник докладов. СПб.: СПбГЭТУ «ЛЭТИ», с. 242-245 (year - 2020)

Annotation of the results obtained in 2018

A novel approach was proposed for controlling deep forests as well as random forests, which allows us to achieve two main goals:
1) to develop a mechanism for the forest control in terms of its orientation to a solved machine learning problem, which will bring deep forests closer to the universality and flexibility of deep neural networks;
2) to improve the classification or regression accuracy of deep forests by introducing additional elements of control over the process of classification or regression.
The main idea of the approach proposed in the project is to build a weighting function defined on the set of decision trees in each random forest so as to minimize the classification error or, more generally, some predefined loss function. In this case, the output of each decision tree is a class probability distribution.
Weights are trained using the same training set. In contrast to the original deep forest, which solves only the standard classification problem, the proposed approach allows solving various problems by defining a required loss function for outputs of decision trees. This brings deep forests closer to the generality and flexibility of deep neural networks without leading to the overfitting problem under a small amount of training data.
The first weighting function is the linear weighted averaging of the class probability distributions at the outputs of decision trees. The resulting optimization problem for computing optimal weights is quadratic with linear constraints. To solve the problem efficiently, a modification of the Frank-Wolfe algorithm is proposed, which takes into account that the weights are restricted to the unit simplex.
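The quadratic problem over the unit simplex can be illustrated with a minimal sketch of the textbook Frank-Wolfe iteration. This is not the modification developed in the project: the loss (squared deviation of the weighted true-class probability from 1), the step size 2/(k + 2), the function names and the toy matrix are all assumptions made for illustration.

```python
def squared_loss(P, w):
    """Sum over examples of (1 - weighted true-class probability)^2.

    P[i][t] is the probability that tree t assigns to the true class of
    example i (hypothetical input format).
    """
    return sum((1.0 - sum(wt * pt for wt, pt in zip(w, row))) ** 2 for row in P)


def frank_wolfe_simplex(P, iterations=200):
    """Minimise squared_loss(P, w) over the unit simplex of tree weights."""
    n_trees = len(P[0])
    w = [1.0 / n_trees] * n_trees  # start from plain averaging of trees
    for k in range(iterations):
        residuals = [1.0 - sum(wt * pt for wt, pt in zip(w, row)) for row in P]
        grad = [-2.0 * sum(r * row[t] for r, row in zip(residuals, P))
                for t in range(n_trees)]
        # A linear function attains its minimum over the simplex at a vertex,
        # so the linear subproblem reduces to picking one tree.
        best = min(range(n_trees), key=grad.__getitem__)
        gamma = 2.0 / (k + 2.0)  # standard diminishing step size
        w = [(1.0 - gamma) * wt for wt in w]
        w[best] += gamma
    return w


# Hypothetical true-class probabilities of three trees on three examples.
P = [[0.9, 0.5, 0.2],
     [0.8, 0.4, 0.3],
     [0.7, 0.6, 0.1]]
weights = frank_wolfe_simplex(P)
print(weights, squared_loss(P, weights))
```

Every iterate is a convex combination of simplex vertices, so the weights stay non-negative and sum to one without any projection step.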
The proposed approach is implemented also for random survival forests which can be viewed as regression models. A weighted random survival forest is for the first time developed as a modification of the standard survival forest. According to the modification, the averaging of hazard functions at the output of each decision tree, which is used to calculate the random forest hazard function, is replaced by a weighted sum of these functions. The calculation of the optimal weights is also reduced to solving a quadratic optimization problem with linear constraints, which maximizes the concordance index or the C-index. For the well-known Primary Biliary Cirrhosis (PBC) Dataset, the C-index increment was almost 9%.
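The concordance index that serves as the quality measure here can be illustrated with a minimal sketch of Harrell's C-index together with a weighted combination of per-tree risk scores. The function names, the data layout and the toy values are assumptions; the optimization of the weights itself is not shown.

```python
def weighted_risk(tree_risks, weights):
    """Risk score of each patient as a weighted sum of per-tree risk scores.

    tree_risks[i][t] is the risk that tree t assigns to patient i
    (hypothetical input format).
    """
    return [sum(w * r for w, r in zip(weights, row)) for row in tree_risks]


def concordance_index(times, events, risks):
    """Harrell's C-index: share of comparable pairs ordered correctly.

    A pair (i, j) is comparable when patient i has an observed event
    (events[i] is truthy) and times[i] < times[j]. The pair is concordant
    when the patient with the earlier event has the higher risk; ties in
    risk count as one half.
    """
    concordant = 0.0
    comparable = 0
    for i in range(len(times)):
        if not events[i]:
            continue  # censored patients cannot anchor a comparable pair
        for j in range(len(times)):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical data: four patients, two trees, equal tree weights.
times = [2.0, 4.0, 6.0, 8.0]
events = [1, 1, 0, 1]
tree_risks = [[0.9, 0.9], [0.6, 0.8], [0.5, 0.3], [0.2, 0.0]]
risks = weighted_risk(tree_risks, [0.5, 0.5])
print(concordance_index(times, events, risks))
```

Changing the tree weights changes the risk ordering of the patients and therefore the C-index, which is what makes the weights a useful set of training parameters.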
For the first time, the project proposed new classification models based on deep forests with a reduced number of weights, which rely on subsets of “closely” located class probability distributions defined on the unit simplex. Weights are assigned neither to trees nor to examples, but to subsets of the class probability distributions obtained at the output of each decision tree for each example. Walley's imprecise pari-mutuel model was chosen as the model for partitioning the unit simplex. According to this model, the unit simplex is divided into many subsets. The weights of the subsets of probability distributions can be viewed as second-order probabilities over the subsets of the simplex.
The idea of assigning the weights not to decision trees, not to examples of training data, but to subsets of the class probability distributions, which are simultaneously defined by the classification “ability” of trees and by how an example is typical for its class, is proposed for the first time.
New classification models based on reducing the weight set have been developed in the project. Moreover, the attempt to obtain new effective models led to an unexpected scientific result: reducing the weight set amounts to smoothing and restricting the weights, which is the basis of regularization and a reason for using regularization terms in the objective function. The proposed modification of the Frank-Wolfe algorithm made it possible to construct a variety of deep forest models using various restrictions on the unit simplex of weights. Several well-known imprecise statistical models were used, including the imprecise contamination model, the imprecise Dirichlet model, the imprecise pari-mutuel model, Kolmogorov-Smirnov bounds and the constant odds ratio model.
New robust distance metric models have been developed, which, on the one hand, use the general approach to deep forest modification by means of trained weights and, on the other hand, propose completely new ideas for solving metric learning tasks on trees and random forests. The algorithms of Siamese and triplet neural networks have been implemented on random and deep forest models for the first time.
The first model compares pairs of objects and changes their relative location in the feature space depending on whether they belong to the same class. To ensure the convexity of the loss function, it was proposed for the first time to combine the Euclidean distance and the Manhattan distance in a single loss function.
The second model is used when the class labels are unknown. It analyses semantically close and semantically distant objects. This is the first complete analog of Siamese neural networks implemented on random forests. The model proposes to control the proximity of objects using the weights of trees and to form a new training sample consisting of concatenated pairs of objects. The third model is an alternative to triplet neural networks. The idea of its training is quite similar.
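How an imprecise model restricts the unit simplex of weights can be illustrated with a minimal sketch for the contamination model. The restricted set is taken here as all mixtures (1 - eps) * center + eps * q with q in the unit simplex; the choice of the center, the value of eps and the function names are assumptions, and the project's modified Frank-Wolfe algorithm is not reproduced.

```python
def contamination_vertices(center, eps):
    """Extreme points of {(1 - eps) * center + eps * q : q in the unit simplex}."""
    if not 0.0 <= eps <= 1.0:
        raise ValueError("eps must lie in [0, 1]")
    n = len(center)
    return [
        [(1.0 - eps) * c + (eps if t == k else 0.0) for t, c in enumerate(center)]
        for k in range(n)
    ]


def linear_minimizer(gradient, center, eps):
    """Vertex of the restricted set that minimises the linear form <gradient, w>.

    The linear form differs between vertices only by eps * gradient[k],
    so the minimiser is the vertex with the smallest gradient component.
    """
    k = min(range(len(gradient)), key=gradient.__getitem__)
    return contamination_vertices(center, eps)[k]


# Hypothetical setting: four trees, uniform center, contamination 0.2.
center = [0.25, 0.25, 0.25, 0.25]
print(contamination_vertices(center, 0.2))
print(linear_minimizer([0.3, -1.0, 0.5, 0.0], center, 0.2))
```

With eps = 1 the set is the whole unit simplex, and with eps = 0 it shrinks to the center, so eps plays the role of a regularization strength: every weight is bounded below by (1 - eps) times its central value.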
A new algorithm for contouring pathological lung formations based on the multiplanar reconstruction of computed tomography images in the DICOM format has been developed and implemented. The idea underlying the proposed algorithm is to consider the planar computed tomography images as a single three-dimensional object. The task is reduced to finding the object boundaries in three-dimensional space. For the segmentation of lungs, a method based on the threshold inclusion algorithm has been developed. A new approach has been applied to the segmentation of objects in lung CT images, which takes into account various types of lung nodule location. A frame-by-frame fill algorithm, a root contour algorithm using dilation, and density filtering for lung CT images have also been developed and implemented.
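Two of the elementary operations mentioned above, density thresholding and dilation, can be illustrated with a minimal sketch on a single slice. It does not reproduce the project's algorithms: the density window, the 4-neighbour structuring element, the function names and the toy slice are assumptions made for illustration.

```python
def threshold_mask(slice_values, low, high):
    """Binary mask of pixels whose density lies in the window [low, high]."""
    return [[1 if low <= v <= high else 0 for v in row] for row in slice_values]


def dilate(mask):
    """One step of binary dilation with a 4-neighbour structuring element."""
    rows, cols = len(mask), len(mask[0])
    out = [row[:] for row in mask]
    for r in range(rows):
        for c in range(cols):
            if mask[r][c]:
                for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < rows and 0 <= cc < cols:
                        out[rr][cc] = 1
    return out


# Hypothetical 3x3 slice: air-like background with one dense pixel in the centre.
ct_slice = [
    [-1000, -1000, -1000],
    [-1000, 30, -1000],
    [-1000, -1000, -1000],
]
mask = threshold_mask(ct_slice, -100, 200)  # hypothetical soft-tissue window
print(dilate(mask))
```

Applied slice by slice, or extended to six neighbours in three dimensions, the same operations work on a volume treated as a single three-dimensional object.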

Publications

**1.** *Meldo A.A., Utkin L.V., Moiseenko V.M.* **Алгоритмы диагностики XXl века. Искусственный интеллект в распознавании рака лёгкого** Практическая онкология, Т.19. - №3. - С. 292 - 298 (year - 2018) https://doi.org/10.31917/1903292

**2.** *Moiseenko B.M., Meldo A.A., Utkin L.V., Prokhorov I.Y., Ryabinin M.A., Bogdanov A.A.* **Автоматизированная система обнаружения объемных образований в легких как этап развития искусственного интеллекта в диагностике рака легкого** Лучевая диагностика и терапия, №3 –С. 62-68 (year - 2018) https://doi.org/10.22328/2079-5343-2018-9-3-62-68

**3.** *Utkin L.V., Meldo A.A., Konstantinov A.V.* **Deep Forest as a framework for a new class of machine learning models** National Science Review, - (year - 2018) https://doi.org/10.1093/nsr/nwy151

**4.** *Utkin L.V., Ryabinin M.A., Meldo A.A.* **Интеллектуальная система выбора лечения на основе каскада случайных лесов в рамках анализа выживаемости** Труды Международной научной конференции «IEEE Northwest Russia Conference On Mathematical Methods In Engineering And Technology: ММEТ NW 2018», СПб.: СПбГЭТУ «ЛЭТИ», C. 534-537 (year - 2018)

**5.** *Utkin L.V., Ryabinin M.A., Meldo A.A.* **Случайные леса и метод хорд для интеллектуальной диагностики рака легких** XXI Международная конференция по мягким вычислениям и измерениям (SCM-2018), Т.2, - СПб.: СПбГЭТУ «ЛЭТИ», С. 11-14. (year - 2018)

**6.** *Utkin L.V., Ryabinin M.A., Zhuk K.D., Zhuk Y.A.* **Классификация на основе композиции случайных лесов и параллельных нейронных сетей** XXI Международная конференция по мягким вычислениям и измерениям (SCM-2018), Т.1, - СПб.: СПбГЭТУ «ЛЭТИ», С. 662-665. (year - 2018)

**7.** *Meldo A.A., Utkin L.V.* **Обзор методов машинного обучения в диагностике рака легкого** Искусственный интеллект и принятие решений, №3. – С. 28-38. (year - 2018) https://doi.org/10.14357/20718594180313

**8.** *Ipatov O.S., Utkin L.V., Meldo A.A.* **Интеллектуальные системы диагностики и выбора лечения онкологических заболеваний** Труды VII Международной научно-технической конференции «Информационные технологии в науке, образовании и производстве» (ИТНОП-2018), Белгород: Издательство ООО «ГиК», С. 245-247 (year - 2018)

**9.** *Meldo A.A., Utkin A.A., Prohorov I.Y., Ryabinin M.A., Bogdanov A.A., Lukashin A.A., Moiseenko V.M., Zhuk K.D.* **Эволюция искусственного интеллекта в диагностике рака легкого** Конгресс Российского общества рентгенологов и радиологов. Сборник тезисов, СПб. c. 102-103 (year - 2018)

**10.** *Meldo A.A., Utkin L.V.* **A computer-aided system for differential diagnosis of lung diseases** Intelligent Data Processing: Theory and Applications. Book of abstracts of the 12th International Conference (Moscow, Russia – Gaeta, Italy, 2018), Moscow: TORUS PRESS, 2018. – p. 35 (year - 2018) https://doi.org/10.30826/IDP201812

**11.** *Prohorov I.Y., Ryabinin M.A., Meldo A.A., Utkin L.V.* **Формирование баз данных с целью машинного обучения в диагностике рака легкого** Конгресс Российского общества рентгенологов и радиологов. Сборник тезисов, СПб. c. 124-125 (year - 2018)

**12.** *Utkin L.V., Ipatov O.S., Meldo A.A.* **Медицинские системы искусственного интеллекта на примере диагностики рака легкого** Материалы 5-й Всероссийской научно-технической конференции "Суперкомпьютерные технологии (СКТ-2018)", Дивноморское, Геленджик, Издательство Южного федерального университета, - Т.2, - С. 127-131 (year - 2018)

**13.** *Utkin L.V., Meldo A.A.* **A weighted random survival forest for constructing controllable models** Intelligent Data Processing: Theory and Applications. Book of abstracts of the 12th International Conference (Moscow, Russia – Gaeta, Italy, 2018), Moscow: TORUS PRESS, 2018. – p. 33. (year - 2018) https://doi.org/10.30826/IDP201811

**14.** *-* **Интеллектуальный способ диагностики и обнаружения новообразований в легких** -, 2668699 (year - )

Annotation of the results obtained in 2019

New modifications of the deep forest were proposed to implement the data discrimination, taking into account their classes in a feature space. The problem of implementing distance metric learning methods on the basis of the deep forests was solved. It was proposed to assign weights to decision trees in all random forests in order to reduce distances between pairs of examples from the same class and to increase distances between pairs of examples from different classes. A special contrastive loss function, including two different distance metrics, was introduced to obtain the standard quadratic optimization problem. Modifications of the deep forest are also proposed for implementing the transfer learning task, where a consensus measure based on the Shannon entropy and an average distance (mean discrepancy) between the source and target domains are used to calculate the tree weights.
New models of deep survival forests have been developed to solve problems of survival analysis. Within the framework of the models, it was proposed to change the averaging procedure used to estimate a forest survival function based on the survival function at the output of decision trees, to use the concordance index (C-index) as an objective measure for constructing the optimization problem, and to replace the C-index with its approximate representation which is based on the well-known hinge-loss function application.
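Replacing the C-index with a hinge-loss approximation can be illustrated with a minimal sketch. Each comparable pair contributes max(0, margin - (risk_i - risk_j)) instead of a 0/1 indicator, which is convex in the risks and hence in the tree weights when the risk is a weighted sum over trees. The exact surrogate used in the project is not given in the text; the margin, the averaging and the names are assumptions.

```python
def hinge_cindex_loss(times, events, risks, margin=1.0):
    """Average hinge penalty over comparable pairs.

    A pair (i, j) is comparable when events[i] is truthy and
    times[i] < times[j]; the patient with the earlier event should have a
    risk exceeding the other one by at least `margin`.
    """
    total = 0.0
    pairs = 0
    for i in range(len(times)):
        if not events[i]:
            continue  # censored patients do not anchor comparable pairs
        for j in range(len(times)):
            if times[i] < times[j]:
                pairs += 1
                total += max(0.0, margin - (risks[i] - risks[j]))
    if pairs == 0:
        raise ValueError("no comparable pairs")
    return total / pairs


# Hypothetical data: a well-ordered and a reversed risk assignment.
times = [1.0, 2.0, 3.0]
events = [1, 1, 1]
print(hinge_cindex_loss(times, events, [3.0, 1.5, 0.0]))
print(hinge_cindex_loss(times, events, [0.0, 1.5, 3.0]))
```

A loss of zero means that every comparable pair is ordered correctly with the required margin, which corresponds to a C-index of one.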
A new approach was proposed for using sets of class probability distributions at the next level (layer) of the deep forest. Decision trees of the next level of the forest cascade are trained on an expanded training set, which is augmented with class probability distributions newly generated from these sets. The increased number of training examples is compensated by updating the hyperparameters in accordance with the imprecise statistical model used, for example, the imprecise Dirichlet model or the Kolmogorov-Smirnov bounds.
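The sets of class probability distributions produced by the imprecise Dirichlet model can be illustrated with a minimal sketch based on Walley's standard bounds: for class counts n_k with total N and hyperparameter s, the lower probability is n_k / (N + s) and the upper probability is (n_k + s) / (N + s). Generating new distributions from the extreme points of the set is only one possible choice and is an assumption here, as are the function names and the counts.

```python
def idm_bounds(counts, s=1.0):
    """Lower and upper class probabilities under the imprecise Dirichlet model."""
    total = sum(counts) + s
    return [(n / total, (n + s) / total) for n in counts]


def idm_extreme_distributions(counts, s=1.0):
    """Extreme points of the set: the whole mass s is assigned to one class."""
    total = sum(counts) + s
    return [
        [(n + (s if j == k else 0.0)) / total for j, n in enumerate(counts)]
        for k in range(len(counts))
    ]


# Hypothetical class counts in a leaf of a decision tree.
counts = [3, 1, 0]
print(idm_bounds(counts, s=1.0))
print(idm_extreme_distributions(counts, s=1.0))
```

A larger s widens the intervals and therefore the set of candidate distributions, which is why s acts as a hyperparameter controlling how cautious the model is on small samples.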
To deal with sets of class probability distributions at the outputs of decision trees and to ensure robustness, a meta-model was proposed that determines optimal weights of the decision trees. As part of the approach, new loss functions for classification and regression problems were introduced, which made it possible to reduce the complex minimax optimization problems for calculating optimal tree weights to quadratic optimization problems. The approach was also implemented for survival analysis, where confidence intervals for the Nelson-Aalen estimates were used.
A new approach based on sets of distributions was proposed in which weights are assigned not to trees but, in a special way, to subsets of class probability distributions. This makes the solution more flexible and reduces the number of weights as training parameters.
New random forest and deep forest architectures were proposed that use many neural networks to improve classification accuracy. In effect, the neural networks perform a non-linear transformation of the class probability distributions so as to maximize classification accuracy at the output of the random forest or deep forest. These networks can be regarded as an extension of Siamese neural networks, since they implement the same functions. Such an architecture opens up new possibilities for a wide variety of machine learning tasks, including transfer learning and anomaly detection.
It was proposed to treat the differential diagnosis of oncological diseases, especially atypical cases of cancer, as a one-shot or few-shot learning task because of the small number of training examples for atypical cases. Siamese neural networks and Siamese deep forests were proposed as tools for one-shot and few-shot learning. To increase diagnostic accuracy, a three-channel architecture was developed for classifying lung nodules on lung computed tomography scans and supporting the decision about a patient's diagnosis. The first channel is a classifier based on deep forests. The other two channels are Siamese neural networks (a fully connected neural network and a convolutional neural network).
A new modification of the deep forest, called the Adaptive Weighted Deep Forest, was proposed. In this modification, each example is assigned a weight for the next level of the forest cascade, depending on how correctly it was classified at the current level. Larger weights are assigned to “bad” examples so that the classifiers at the next levels try to classify them correctly. Two strategies for using the weights were considered: (1) training examples for the decision trees are chosen randomly in accordance with the weight distribution, and (2) the weights are used in the splitting procedure when training the decision trees.
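Both strategies can be sketched in a few lines. The weighting rule below (weight proportional to one minus the probability assigned to the true class) is an assumption chosen for illustration, as are the function names, the `eps` smoothing term, and the use of a weighted Gini impurity for the splitting criterion. The source does not specify these details.

```python
import random
from typing import Dict, Hashable, List, Sequence


def example_weights(true_class_probs: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalized weights that grow as the true-class probability falls.

    true_class_probs[i] is the probability the current level assigned to
    the correct class of example i (assumed weighting rule).
    """
    raw = [1.0 - p + eps for p in true_class_probs]
    total = sum(raw)
    return [r / total for r in raw]


def sample_indices(weights: Sequence[float], size: int, seed: int = 0) -> List[int]:
    """Strategy 1: draw training examples according to the weight distribution."""
    rng = random.Random(seed)
    return rng.choices(range(len(weights)), weights=weights, k=size)


def weighted_gini(labels: Sequence[Hashable], weights: Sequence[float]) -> float:
    """Strategy 2: impurity where each example counts with its weight."""
    total = sum(weights)
    mass: Dict[Hashable, float] = {}
    for label, w in zip(labels, weights):
        mass[label] = mass.get(label, 0.0) + w
    return 1.0 - sum((m / total) ** 2 for m in mass.values())


# Three examples: well classified, uncertain, badly classified (illustrative).
w = example_weights([0.9, 0.5, 0.1])
drawn = sample_indices(w, size=10)
impurity = weighted_gini([0, 0, 1], [1.0, 1.0, 2.0])
```

With sampling, hard examples simply appear more often in the training subsets. With weighted splitting, they shift the impurity of candidate splits without changing the size of the training set.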
A three-channel lung segmentation system was proposed. The first channel is a conventional image processing procedure. The second channel is a deep learning procedure using a 3D U-Net segmentation neural network, which effectively duplicates the first channel. The third channel is a deep learning procedure using a 2D U-Net segmentation neural network for special segmentation cases that are difficult for the first two channels. This architecture achieves the main goal: to avoid cases of missed tumors.
In addition, new models for evaluating the heterogeneous treatment effect were proposed to implement the concept of personalized medicine. They are based on random and deep forests and target cases where the number of elements in the treatment group is small. An efficient meta-algorithm, called Co-learner, was proposed for evaluating the conditional average treatment effect. It is based on concatenating feature vectors from the control and treatment groups and generating additional concatenated vectors.
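The concatenation idea can be illustrated as follows. The sketch assumes that every control example is paired with every treatment example and that the regression target of a pair is the difference of the two outcomes. Both choices, along with the function name, are assumptions made for illustration and are not taken from the description of Co-learner above.

```python
from typing import List, Sequence, Tuple

Example = Tuple[Sequence[float], float]  # (feature vector, outcome)


def concatenated_pairs(
    control: Sequence[Example], treatment: Sequence[Example]
) -> List[Tuple[List[float], float]]:
    """Build one training example per (control, treatment) pair.

    Features are the two vectors joined end to end; the target is the
    outcome difference y_treatment - y_control (assumed target).
    """
    pairs = []
    for x_c, y_c in control:
        for x_t, y_t in treatment:
            pairs.append((list(x_c) + list(x_t), y_t - y_c))
    return pairs


# Three control patients and two treated patients (illustrative values).
control_group = [([0.1, 1.0], 2.0), ([0.4, 0.0], 3.0), ([0.9, 1.0], 5.0)]
treatment_group = [([0.2, 1.0], 4.5), ([0.8, 0.0], 6.0)]
training_set = concatenated_pairs(control_group, treatment_group)
```

Under these assumptions a treatment group of size m and a control group of size n yield m * n concatenated examples, which shows how pairing can enlarge the training set when the treatment group is small.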
Most of the results are published in various journals, including journals indexed by Scopus and Web of Science.
Certificates of registration of 2 software programs, 1 database and 2 patents for invention were obtained.
The results of the project were widely covered in the press, with an indication that the project is supported by the Russian Science Foundation. Examples of articles on the project:
https://tass.ru/obschestvo/5995816
https://minobrnauki.gov.ru/ru/press-center/card/?id_4=901
https://www.popmech.ru/science/news-458242-uchyonye-nashli-novyy-sposob-diagnostiki-opuholey/#part0
https://lenta.ru/news/2019/01/26/20_seconds/?utm_source=yxnews&utm_medium=desktop
https://www.technologynetworks.com/tn/news/ai-for-lung-cancer-diagnostics-314929
https://ecmiindmath.org/2019/03/20/an-intelligent-system-for-lung-cancer-diagnostics/

Publications

**1.** *Meldo A.A., Utkin L.V.* **An innovative strategy for developing a diagnostic radiology department** Медицина: целевые проекты, 34, pp. 52-53. (year - 2019)

**2.** *Meldo A.A., Utkin L.V.* **Radiomics and the multidisciplinary approach in the development of CAD system in lung cancer diagnostics** Extreme Robotics, 1(1), pp. 504-510. (year - 2019)

**3.** *Meldo A.A., Utkin L.V., Ryabinin M.A.* **A combined automated system for segmentation and detection of neoplasms for lung cancer diagnostics** Робототехника и техническая кибернетика, 7(2), pp. 145-153 (year - 2019) https://doi.org/10.31776/RTCJ.7209

**4.** *Meldo A.A., Utkin L.V., Trofimova T.N., Ryabinin M.A., Moiseenko V.M., Shelekhova K.V.* **New approaches to the development of artificial intelligence algorithms in lung cancer diagnostics** Лучевая диагностика и терапия, 1 (10), pp. 8-18 (year - 2019) https://doi.org/10.22328/2079-5343-2019-10-1-8-18

**5.** *Utkin L.V.* **An Imprecise Deep Forest for Classification** Expert Systems with Applications, Vol. 141, Article 112978, pp. 1-11 (year - 2019) https://doi.org/10.1016/j.eswa.2019.112978

**6.** *Utkin L.V., Konstantinov A.V., Chukanov V.S., Kots M.V., Ryabinin M.A., Meldo A.A.* **A weighted random survival forest** Knowledge-Based Systems, Vol. 177, Pp. 136-144 (year - 2019) https://doi.org/10.1016/j.knosys.2019.04.015

**7.** *Utkin L.V., Kovalev M.S., Meldo A.A.* **A deep forest classifier with weights of class probability distribution subsets** Knowledge-Based Systems, Vol. 173, Pp. 15-27 (year - 2019) https://doi.org/10.1016/j.knosys.2019.02.022

**8.** *Utkin L.V., Kovalev M.S., Meldo A.A., Coolen F.P.A.* **Imprecise extensions of random forests and random survival forests** Proceedings of Machine Learning Research, vol. 103, pp. 404-413 (year - 2019)

**9.** *Utkin L.V., Meldo A.A., Ipatov O.S., Ryabinin M.A.* **Medical intelligent systems illustrated by lung cancer diagnostics** Известия ЮФУ. Технические науки, 8, pp. 241-249 (year - 2018) https://doi.org/10.23683/2311-3103-2018-8-241-249

**10.** *Utkin L.V., Meldo A.A., Kryshtapovich V.S., Tiulpin V.A., Kasimov E.M., Kovalev M.S.* **A three-channel intelligent system for classification of neoplasms for lung cancer diagnostics** Робототехника и техническая кибернетика, 7(3), pp. 196-207. (year - 2019) https://doi.org/10.31776/RTCJ.7304

**11.** *Utkin L.V., Ryabinin M.A.* **Discriminative Metric Learning with Deep Forest** International Journal on Artificial Intelligence Tools, Vol. 28(2), pp. 1950007-1 to 1950007-19 (year - 2019) https://doi.org/10.1142/S0218213019500076

**12.** *Meldo A.A., Utkin L.V.* **Radiomics as a basis for transformation of radiologists skills and partnership** IOP Conf. Series: Journal of Physics: Conference Series, 1236 (2019) 012063 (year - 2019) https://doi.org/10.1088/1742-6596/1236/1/012063

**13.** *Meldo A.A., Utkin L.V.* **A new approach to differential lung diagnosis with CT scans based on the Siamese neural network** IOP Conf. Series: Journal of Physics: Conference Series, 1236 (2019) 012058 (year - 2019) https://doi.org/10.1088/1742-6596/1236/1/012058

**14.** *Meldo A.A., Utkin L.V., Trofimova T.N., Lukashin A.A., Ryabinin M.A.* **Implementation of an artificial intelligence system in lung cancer diagnostics** Международный конгресс и школа для врачей “Кардиоторакальная радиология”, СПб. pp. 127-129 (year - 2019)

**15.** *Utkin L., Konstantinov A., Meldo A., Ryabinin M., Chukanov V.* **A Deep Forest Improvement by Using Weighted Schemes** Proceedings of the 24th Conference of Open Innovations Association FRUCT, pp.451-456 (year - 2019) https://doi.org/10.23919/FRUCT.2019.8711886

**16.** *Utkin L.V., Kovalev M.S., Coolen F.* **Robust regression random forests for small and noisy training data** XXII Международная конференция по мягким вычислениям и измерениям (SCM-2019), СПб.: СПбГЭТУ «ЛЭТИ» pp. 200-204 (year - 2019)

**17.** *-* **Program for contouring pathological formations in the lungs based on multiplanar reconstructions of CT images** -, 2018666100 (year - )

**18.** *-* **Program for classifying lung neoplasms using the chord method** -, 2018666379 (year - )

**19.** *-* **Database of chest computed tomography scans with delineated and labeled regions of lung pathology: LIRA (Lung Image Resource Annotated)** -, 2019620232 (year - )

**20.** *-* **Method for diagnosing lung cancer based on intelligent analysis of the shape and the internal and external structures of neoplasms** -, 2694476 (year - )

**21.** *-* **St. Petersburg Polytech has taught artificial intelligence to detect lung cancer in 20 seconds** Доктор Питер, - (year - )

**22.** *-* **Scientists at St. Petersburg Polytech have created an intelligent system for diagnosing lung tumors** ТАСС, - (year - )

**23.** *-* **Scientists at St. Petersburg Polytech have created an intelligent system for diagnosing lung tumors** Министерство науки и высшего образования РФ, - (year - )

**24.** *-* **Scientists have found a new way to diagnose tumors** Популярная Механика, - (year - )

**25.** *-* **Polytech scientists have created an intelligent system for diagnosing lung tumors** Медиа-центр СПбПУ, - (year - )

**26.** *-* **Polytechnic University has developed an intelligent system for diagnosing lung tumors** Луна Инфо, - (year - )

**27.** *-* **Consider it detected. St. Petersburg scientists have created a lung tumor recognition system** Деловой Петербург, - (year - )

**28.** *-* **Russia has learned to diagnose lung cancer in 20 seconds** LENTA.RU, - (year - )

**29.** *-* **Russian scientists have created an intelligent software system for lung cancer diagnostics** Медицина и учеба, - (year - )

**30.** *-* **Russian researchers create intelligent software system for lung cancer diagnostics** News Medical, - (year - )

**31.** *-* **Researchers developed an intelligent system for lung cancer diagnostics** EurekAlert, - (year - )

**32.** *-* **AI for Lung Cancer Diagnostics** Technology Networks, - (year - )

**33.** *-* **Researchers developed an intelligent system for lung cancer diagnostics** Hale Plus Hearty, - (year - )

**34.** *-* **Researchers developed an intelligent system for lung cancer diagnostics** Technology.Org, - (year - )

**35.** *-* **New intelligent system for lung cancer diagnostics** MEDICA, - (year - )

**36.** *-* **An intelligent system for lung cancer diagnostics** European Consortium for Mathematics in Industry, - (year - )

**37.** *-* **Russian researchers create intelligent software system for lung cancer diagnostics** EURASIA DIARY, - (year - )