brief gbt documentation added. some sample fixes made. code updated.

author P. Druzhkov <no@email>

Wed, 15 Jun 2011 21:54:25 +0000 (21:54 +0000)

committer P. Druzhkov <no@email>

Wed, 15 Jun 2011 21:54:25 +0000 (21:54 +0000)
diff --git a/modules/ml/doc/gradient_boosted_trees.rst b/modules/ml/doc/gradient_boosted_trees.rst
new file mode 100644
index 0000000..8cef701
--- /dev/null
+++ b/modules/ml/doc/gradient_boosted_trees.rst
@@ -0,0 +1,371 @@
+.. _Gradient Boosted Trees:\r
+\r
+Gradient Boosted Trees\r
+======================\r
+\r
+Gradient Boosted Trees (GBT) is a generalized boosting algorithm introduced by\r
+Jerome Friedman: http://www.salfordsystems.com/doc/GreedyFuncApproxSS.pdf .\r
+In contrast to the AdaBoost.M1 algorithm, GBT can deal with both multiclass\r
+classification and regression problems. Moreover, it can use any\r
+differentiable loss function; some popular ones are implemented.\r
+Using decision trees (:ref:`CvDTree`) as base learners allows GBT to process\r
+both ordered and categorical variables.\r
+\r
+\r
+.. _Training the GBT model:\r
+\r
+Training the GBT model\r
+----------------------\r
+\r
+A Gradient Boosted Trees model represents an ensemble of single regression trees\r
+that are built in a greedy fashion. The training procedure is an iterative process\r
+similar to numerical optimization via the gradient descent method. The total loss\r
+on the training set depends only on the current model predictions for the\r
+training samples, in other words\r
+:math:`\sum^N_{i=1}L(y_i, F(x_i)) \equiv \mathcal{L}(F(x_1), F(x_2), ... , F(x_N))\r
+\equiv \mathcal{L}(F)`. The gradient of :math:`\mathcal{L}(F)`\r
+can be computed as follows:\r
+\r
+.. math::\r
+    grad(\mathcal{L}(F)) = \left( \dfrac{\partial{L(y_1, F(x_1))}}{\partial{F(x_1)}},\r
+    \dfrac{\partial{L(y_2, F(x_2))}}{\partial{F(x_2)}}, ... ,\r
+    \dfrac{\partial{L(y_N, F(x_N))}}{\partial{F(x_N)}} \right) .\r
+\r
+At every training step a single regression tree is built to predict the antigradient\r
+vector components. The step length is computed according to the loss function, separately\r
+for every region determined by a tree leaf, and can be folded into the tree by changing the leaf values directly.\r
+\r
+The main scheme of the training process is shown below.\r
+\r
+#.\r
+    Find the best constant model.\r
+#.\r
+    For :math:`i` in :math:`[1,M]`:\r
+\r
+    #.\r
+        Compute the antigradient.\r
+    #.\r
+        Grow a regression tree to predict antigradient components.\r
+    #.\r
+        Change values in the tree leaves.\r
+    #.\r
+        Add the tree to the model.\r
+\r
+\r
+The following loss functions are implemented:\r
+\r
+*for regression problems:*\r
+\r
+#.\r
+    Squared loss (``CvGBTrees::SQUARED_LOSS``):\r
+    :math:`L(y,f(x))=\dfrac{1}{2}(y-f(x))^2`\r
+#.\r
+    Absolute loss (``CvGBTrees::ABSOLUTE_LOSS``):\r
+    :math:`L(y,f(x))=|y-f(x)|`\r
+#.\r
+    Huber loss (``CvGBTrees::HUBER_LOSS``):\r
+    :math:`L(y,f(x)) = \left\{ \begin{array}{lr}\r
+    \delta\cdot\left(|y-f(x)|-\dfrac{\delta}{2}\right) & : |y-f(x)|>\delta\\\r
+    \dfrac{1}{2}\cdot(y-f(x))^2 & : |y-f(x)|\leq\delta \end{array} \right.`,\r
+    where :math:`\delta` is the :math:`\alpha`-quantile estimation of the\r
+    :math:`|y-f(x)|`. In the current implementation :math:`\alpha=0.2`.\r
+\r
+*for classification problems:*\r
+\r
+4.\r
+    Deviance or cross-entropy loss (``CvGBTrees::DEVIANCE_LOSS``):\r
+    :math:`K` functions are built, one function for each output class, and\r
+    :math:`L(y,f_1(x),...,f_K(x)) = -\sum^K_{k=1}1(y=k)\ln{p_k(x)}`,\r
+    where :math:`p_k(x)=\dfrac{\exp{f_k(x)}}{\sum^K_{i=1}\exp{f_i(x)}}`\r
+    is the estimation of the probability that :math:`y=k`.\r
+\r
+In the end we get the model in the following form:\r
+\r
+.. math:: f(x) = f_0 + \nu\cdot\sum^M_{i=1}T_i(x) ,\r
+\r
+where :math:`f_0` is the initial guess (the best constant model) and :math:`\nu` is a\r
+regularization parameter from the interval :math:`(0,1]`, further called *shrinkage*.\r
+\r
+\r
+.. _Predicting with GBT model:\r
+\r
+Predicting with GBT model\r
+-------------------------\r
+\r
+To get the GBT model prediction, compute the sum of the responses of\r
+all the trees in the ensemble. For regression problems this sum is the answer;\r
+for classification problems the result is :math:`\arg\max_{i=1..K}(f_i(x))`.\r
+\r
+\r
+.. highlight:: cpp\r
+\r
+\r
+.. index:: CvGBTreesParams\r
+.. _CvGBTreesParams:\r
+\r
+CvGBTreesParams\r
+---------------\r
+.. c:type:: CvGBTreesParams\r
+\r
+GBT training parameters ::\r
+\r
+    struct CvGBTreesParams : public CvDTreeParams\r
+    {\r
+        int weak_count;\r
+        int loss_function_type;\r
+        float subsample_portion;\r
+        float shrinkage;\r
+\r
+        CvGBTreesParams();\r
+        CvGBTreesParams( int loss_function_type, int weak_count, float shrinkage,\r
+            float subsample_portion, int max_depth, bool use_surrogates );\r
+    };\r
+\r
+The structure contains parameters for each single decision tree in the ensemble,\r
+as well as the characteristics of the whole model. The structure is derived from\r
+:ref:`CvDTreeParams`, but not all of the decision tree parameters are supported:\r
+cross-validation, pruning and class priorities are not used. The full\r
+parameter list is shown below:\r
+\r
+``weak_count``\r
+\r
+    The number of boosting algorithm iterations. ``weak_count*K`` is the total\r
+    number of trees in the GBT model, where ``K`` is the number of output classes\r
+    (equal to one in the case of regression).\r
+    \r
+``loss_function_type``\r
+\r
+    The type of the loss function used for training\r
+    (see :ref:`Training the GBT model`). It must be one of the\r
+    following: ``CvGBTrees::SQUARED_LOSS``, ``CvGBTrees::ABSOLUTE_LOSS``,\r
+    ``CvGBTrees::HUBER_LOSS``, ``CvGBTrees::DEVIANCE_LOSS``. The first three\r
+    are used for regression problems, and the last one for\r
+    classification.\r
+    \r
+``shrinkage``\r
+\r
+    Regularization parameter (see :ref:`Training the GBT model`).\r
+    \r
+``subsample_portion``\r
+\r
+    The portion of the whole training set used on each algorithm iteration.\r
+    The subset is generated randomly\r
+    (for more information see\r
+    http://www.salfordsystems.com/doc/StochasticBoostingSS.pdf).\r
+\r
+``max_depth``\r
+\r
+    The maximal depth of each decision tree in the ensemble (see :ref:`CvDTree`).\r
+\r
+``use_surrogates``\r
+\r
+    If ``true``, surrogate splits are built (see :ref:`CvDTree`).\r
+    \r
+By default the following constructor is used:\r
+\r
+.. code-block:: cpp\r
+\r
+    CvGBTreesParams(CvGBTrees::SQUARED_LOSS, 200, 0.01f, 0.8f, 3, false)\r
+        : CvDTreeParams( 3, 10, 0, false, 10, 0, false, false, 0 )\r
+\r
+\r
+\r
+.. index:: CvGBTrees\r
+.. _CvGBTrees:\r
+\r
+CvGBTrees\r
+---------\r
+.. c:type:: CvGBTrees\r
+\r
+GBT model ::\r
+\r
+       class CvGBTrees : public CvStatModel\r
+       {\r
+       public:\r
+\r
+               enum {SQUARED_LOSS=0, ABSOLUTE_LOSS, HUBER_LOSS=3, DEVIANCE_LOSS};\r
+\r
+               CvGBTrees();\r
+               CvGBTrees( const cv::Mat& trainData, int tflag,\r
+                        const Mat& responses, const Mat& varIdx=Mat(),\r
+                        const Mat& sampleIdx=Mat(), const cv::Mat& varType=Mat(),\r
+                        const Mat& missingDataMask=Mat(),\r
+                        CvGBTreesParams params=CvGBTreesParams() );\r
+\r
+               virtual ~CvGBTrees();\r
+               virtual bool train( const Mat& trainData, int tflag,\r
+                        const Mat& responses, const Mat& varIdx=Mat(),\r
+                        const Mat& sampleIdx=Mat(), const Mat& varType=Mat(),\r
+                        const Mat& missingDataMask=Mat(),\r
+                        CvGBTreesParams params=CvGBTreesParams(),\r
+                        bool update=false );\r
+               \r
+               virtual bool train( CvMLData* data,\r
+                        CvGBTreesParams params=CvGBTreesParams(),\r
+                        bool update=false );\r
+\r
+               virtual float predict( const Mat& sample, const Mat& missing=Mat(),\r
+                        const Range& slice = Range::all(),\r
+                        int k=-1 ) const;\r
+\r
+               virtual void clear();\r
+\r
+               virtual float calc_error( CvMLData* _data, int type,\r
+                        std::vector<float> *resp = 0 );\r
+\r
+               virtual void write( CvFileStorage* fs, const char* name ) const;\r
+\r
+               virtual void read( CvFileStorage* fs, CvFileNode* node );\r
+\r
+       protected:\r
+               \r
+               CvDTreeTrainData* data;\r
+               CvGBTreesParams params;\r
+               CvSeq** weak;\r
+               Mat& orig_response;\r
+               Mat& sum_response;\r
+               Mat& sum_response_tmp;\r
+               Mat& weak_eval;\r
+               Mat& sample_idx;\r
+               Mat& subsample_train;\r
+               Mat& subsample_test;\r
+               Mat& missing;\r
+               Mat& class_labels;\r
+               RNG* rng;\r
+               int class_count;\r
+               float delta;\r
+               float base_value;\r
+               \r
+               ...\r
+\r
+       };\r
+\r
+\r
+       \r
+.. index:: CvGBTrees::train\r
+\r
+.. _CvGBTrees::train:\r
+\r
+CvGBTrees::train\r
+----------------\r
+.. c:function:: bool train(const Mat & trainData, int tflag, const Mat & responses, const Mat & varIdx=Mat(), const Mat & sampleIdx=Mat(), const Mat & varType=Mat(), const Mat & missingDataMask=Mat(), CvGBTreesParams params=CvGBTreesParams(), bool update=false)\r
+\r
+.. c:function:: bool train(CvMLData* data, CvGBTreesParams params=CvGBTreesParams(), bool update=false)\r
+    \r
+       Trains a Gradient Boosted Trees model.\r
+       \r
+The first train method follows the common template (see :ref:`CvStatModel::train`).\r
+Both ``tflag`` values (``CV_ROW_SAMPLE``, ``CV_COL_SAMPLE``) are supported.\r
+``trainData`` must be of the ``CV_32F`` type. ``responses`` must be a matrix of type\r
+``CV_32S`` or ``CV_32F``; in both cases it is converted into a ``CV_32F``\r
+matrix inside the training procedure. ``varIdx`` and ``sampleIdx`` must each be a\r
+list of indices (``CV_32S``) or a mask (``CV_8U`` or ``CV_8S``). ``update`` is\r
+a dummy parameter.\r
+\r
+The second form of the :ref:`CvGBTrees::train` function uses :ref:`CvMLData` as a\r
+data set container. ``update`` is still a dummy parameter.\r
+\r
+All parameters specific to the GBT model are passed into the training function\r
+as a :ref:`CvGBTreesParams` structure.\r
+\r
+\r
+.. index:: CvGBTrees::predict\r
+\r
+.. _CvGBTrees::predict:\r
+\r
+CvGBTrees::predict\r
+------------------\r
+.. c:function:: float predict(const Mat & sample, const Mat & missing=Mat(), const Range & slice = Range::all(), int k=-1) const\r
+\r
+    Predicts a response for an input sample.\r
+ \r
+The method predicts the response corresponding to the given sample\r
+(see :ref:`Predicting with GBT model`).\r
+The result is either the class label or the estimated function value.\r
+The :c:func:`predict` method uses the parallel version of the GBT model\r
+prediction if OpenCV is built with the TBB library. In this case, predictions\r
+of single trees are computed in parallel.\r
+\r
+``sample``\r
+\r
+    An input feature vector that has the same format as every training set\r
+    element. Hence, if not all the variables were actually used during training,\r
+    ``sample`` has to contain dummy values in the corresponding positions.\r
+    \r
+``missing``\r
+\r
+    The missing values mask: a one-dimensional matrix of the same size as\r
+    ``sample`` and of the ``CV_8U`` type. ``1`` corresponds to a missing value\r
+    in the same position in the ``sample`` vector. If there are no missing values\r
+    in the feature vector, an empty matrix can be passed instead of the missing mask.\r
+    \r
+``weak_responses``\r
+\r
+    In addition to the prediction of the whole model, the predictions of all the trees\r
+    can be obtained by passing a ``weak_responses`` matrix with :math:`K` rows,\r
+    where :math:`K` is the number of output classes (1 in the case of regression),\r
+    and with as many columns as the ``slice`` length.\r
+    \r
+``slice``\r
+    \r
+    Defines the part of the ensemble used for prediction.\r
+    All trees are used when ``slice = Range::all()``. This parameter is useful to\r
+    get predictions of GBT models with different ensemble sizes while training\r
+    only one model.\r
+    \r
+``k``\r
+    \r
+    In the case of a classification problem, not one but :math:`K` tree\r
+    ensembles are built (see :ref:`Training the GBT model`). By passing this\r
+    parameter, the output can be changed to the sum of the trees' predictions in the\r
+    ``k``'th ensemble only. To get the total GBT model prediction, the ``k`` value\r
+    must be -1. For regression problems ``k`` also has to be equal to -1.\r
+    \r
+\r
+    \r
+.. index:: CvGBTrees::clear\r
+\r
+.. _CvGBTrees::clear:\r
+\r
+CvGBTrees::clear\r
+----------------\r
+.. c:function:: void clear()\r
+\r
+    Clears the model.\r
+    \r
+Deletes the data set information and all the weak models, and sets all internal\r
+variables to the initial state. It is called in :ref:`CvGBTrees::train` and in the\r
+destructor.\r
+\r
+\r
+.. index:: CvGBTrees::calc_error\r
+\r
+.. _CvGBTrees::calc_error:\r
+\r
+CvGBTrees::calc_error\r
+---------------------\r
+.. c:function:: float calc_error( CvMLData* _data, int type, std::vector<float> *resp = 0 )\r
+\r
+    Calculates training or testing error.\r
+    \r
+If :ref:`CvMLData` is used to store the data set, :c:func:`calc_error` can be\r
+used to get the training or testing error easily and (optionally) all predictions\r
+on the training/testing set. If the TBB library is used, the error is computed in\r
+parallel: predictions for different samples are computed at the same time.\r
+For regression problems the mean squared error is returned. For\r
+classification the result is the misclassification error in percent.\r
+\r
+``_data``\r
+\r
+    Data set.\r
+    \r
+``type``\r
+    \r
+    Defines which error should be computed: training (``CV_TRAIN_ERROR``) or test\r
+    (``CV_TEST_ERROR``).\r
+\r
+``resp``\r
+    \r
+    If not ``0``, a vector of predictions on the corresponding data set is\r
+    returned.\r
+\r
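Note on the loss functions documented in gradient_boosted_trees.rst above: the "compute the antigradient" step differs only in the derivative of the chosen loss. The following is a minimal sketch of those antigradients in plain C++, not the code in gbt.cpp. All function names are hypothetical, and delta is passed in explicitly here instead of being estimated as a quantile of the residuals.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helpers for illustration only; they are not OpenCV functions.

// Squared loss L = (y - f)^2 / 2: the antigradient is the residual y - f.
double squaredAntigradient(double y, double f)
{
    return y - f;
}

// Absolute loss L = |y - f|: the antigradient is the sign of the residual.
double absoluteAntigradient(double y, double f)
{
    const double r = y - f;
    return (r > 0.0) ? 1.0 : ((r < 0.0) ? -1.0 : 0.0);
}

// Huber loss: the residual itself inside [-delta, delta], clipped outside.
double huberAntigradient(double y, double f, double delta)
{
    const double r = y - f;
    if (std::fabs(r) <= delta)
        return r;
    return (r > 0.0) ? delta : -delta;
}

// Softmax p_k = exp(f_k) / sum_i exp(f_i), shifted by max(f) for stability.
std::vector<double> classProbabilities(const std::vector<double>& f)
{
    assert(!f.empty());
    double maxF = f[0];
    for (std::size_t i = 1; i < f.size(); ++i)
        if (f[i] > maxF)
            maxF = f[i];

    std::vector<double> p(f.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < f.size(); ++i)
    {
        p[i] = std::exp(f[i] - maxF);
        sum += p[i];
    }
    for (std::size_t i = 0; i < f.size(); ++i)
        p[i] /= sum;
    return p;
}

// Deviance loss: the antigradient with respect to f_k is 1(y == k) - p_k.
// y is the zero-based index of the true class.
std::vector<double> devianceAntigradient(std::size_t y, const std::vector<double>& f)
{
    std::vector<double> g = classProbabilities(f);
    for (std::size_t k = 0; k < g.size(); ++k)
        g[k] = (k == y ? 1.0 : 0.0) - g[k];
    return g;
}
```

For the deviance loss one antigradient vector is produced per class, which is why K tree ensembles are grown for a K-class problem.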
diff --git a/modules/ml/doc/ml.rst b/modules/ml/doc/ml.rst
index 4f0d6bb..1cc01b9 100644
--- a/modules/ml/doc/ml.rst
+++ b/modules/ml/doc/ml.rst
@@ -15,7 +15,7 @@ Most of the classification and regression algorithms are implemented as C++ clas
      support_vector_machines
      decision_trees
      boosting
+    gradient_boosted_trees
      random_trees
      expectation_maximization
      neural_networks
-
diff --git a/modules/ml/include/opencv2/ml/ml.hpp b/modules/ml/include/opencv2/ml/ml.hpp
index 2263fbe..33782cf 100644
--- a/modules/ml/include/opencv2/ml/ml.hpp
+++ b/modules/ml/include/opencv2/ml/ml.hpp
@@ -1571,7 +1571,7 @@ public:
      // Response value prediction
      //
      // API
-    // virtual float predict( const CvMat* sample, const CvMat* missing=0,
+    // virtual float predict_serial( const CvMat* sample, const CvMat* missing=0,
               CvMat* weak_responses=0, CvSlice slice = CV_WHOLE_SEQ,
               int k=-1 ) const;
      
@@ -1594,12 +1594,44 @@ public:
      // RESULT
      // Predicted value.
      */
+    virtual float predict_serial( const CvMat* sample, const CvMat* missing=0,
+            CvMat* weakResponses=0, CvSlice slice = CV_WHOLE_SEQ,
+            int k=-1 ) const;
+            
+    /*
+    // Response value prediction.
+    // Parallel version (when TBB is available)
+    //
+    // API
+    // virtual float predict( const CvMat* sample, const CvMat* missing=0,
+             CvMat* weak_responses=0, CvSlice slice = CV_WHOLE_SEQ,
+             int k=-1 ) const;
+    
+    // INPUT
+    // sample         - input sample of the same type as in the training set.
+    // missing        - missing values mask. missing=0 if there are no
+    //                   missing values in sample vector.
+    // weak_responses  - predictions of all of the trees.
+    //                   not implemented (!)
+    // slice           - part of the ensemble used for prediction.
+    //                   slice = CV_WHOLE_SEQ when all trees are used.
+    // k               - index of the ensemble used.
+    //                   k is in {-1,0,1,..,<count of output classes-1>}.
+    //                   In the case of a classification problem
+    //                   <count of output classes> ensembles are built.
+    //                   If k = -1 ordinary prediction is the result,
+    //                   otherwise function gives the prediction of the
+    //                   k-th ensemble only.
+    // OUTPUT
+    // RESULT
+    // Predicted value.
+    */        
      virtual float predict( const CvMat* sample, const CvMat* missing=0,
              CvMat* weakResponses=0, CvSlice slice = CV_WHOLE_SEQ,
              int k=-1 ) const;
  
      /*
-    // Delete all temporary data.
+    // Deletes all the data.
      //
      // API
      // virtual void clear();
@@ -1607,7 +1639,7 @@ public:
      // INPUT
      // OUTPUT
      // delete data, weak, orig_response, sum_response,
-    //        weak_eval, ubsample_train, subsample_test,
+    //        weak_eval, subsample_train, subsample_test,
      //        sample_idx, missing, lass_labels
      // delta = 0.0
      // RESULT
@@ -1623,7 +1655,7 @@ public:
      //
      // INPUT
      // data  - dataset
-    // type  - defines which error is to compute^ train (CV_TRAIN_ERROR) or
+    // type  - defines which error is to compute: train (CV_TRAIN_ERROR) or
      //         test (CV_TEST_ERROR).
      // OUTPUT
      // resp  - vector of predicitons
@@ -1633,7 +1665,6 @@ public:
      virtual float calc_error( CvMLData* _data, int type,
              std::vector<float> *resp = 0 );
  
-
      /*
      // 
      // Write parameters of the gtb model and data. Write learned model.
@@ -1852,7 +1883,6 @@ protected:
      CvMat* orig_response;
      CvMat* sum_response;
      CvMat* sum_response_tmp;
-    CvMat* weak_eval;
      CvMat* sample_idx;
      CvMat* subsample_train;
      CvMat* subsample_test;
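Note on the predict comment block in ml.hpp above: the roles of slice and k can be illustrated with a small sketch, assuming the per-tree outputs have already been computed. This is not the OpenCV implementation. The names ensembleSums and predictFromSums are hypothetical, and the sketch returns the index of the winning class rather than mapping it back to an original class label.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// treeResponses[c][j] is assumed to hold the already evaluated output of the
// j-th tree in the ensemble for class c (a single row for regression).
// Returns f_c = baseValue + shrinkage * (sum of trees j in [begin, end)).
std::vector<double> ensembleSums(
    const std::vector<std::vector<double> >& treeResponses,
    double baseValue, double shrinkage,
    std::size_t begin, std::size_t end)
{
    std::vector<double> sums(treeResponses.size(), 0.0);
    for (std::size_t c = 0; c < treeResponses.size(); ++c)
    {
        assert(begin <= end && end <= treeResponses[c].size());
        for (std::size_t j = begin; j < end; ++j)
            sums[c] += shrinkage * treeResponses[c][j];
        sums[c] += baseValue;
    }
    return sums;
}

// Regression (one ensemble): the sum itself is the answer.
// Classification with a valid k: the sum of the k-th ensemble only.
// Classification with k == -1: the index of the ensemble with the largest sum.
double predictFromSums(const std::vector<double>& sums, int k = -1)
{
    assert(!sums.empty());
    if (sums.size() == 1)
        return sums[0];
    if (k >= 0 && static_cast<std::size_t>(k) < sums.size())
        return sums[static_cast<std::size_t>(k)];

    std::size_t best = 0;
    for (std::size_t c = 1; c < sums.size(); ++c)
        if (sums[c] > sums[best])
            best = c;
    return static_cast<double>(best);
}
```

Passing a shorter [begin, end) range gives the prediction of a smaller ensemble without retraining, which is the use case the slice parameter is described for.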
diff --git a/modules/ml/src/gbt.cpp b/modules/ml/src/gbt.cpp
index d512fea..60b1469 100644
--- a/modules/ml/src/gbt.cpp
+++ b/modules/ml/src/gbt.cpp
@@ -59,7 +59,7 @@ CvGBTrees::CvGBTrees()
      weak = 0;\r
      default_model_name = "my_boost_tree";\r
      orig_response = sum_response = sum_response_tmp = 0;\r
-    weak_eval = subsample_train = subsample_test = 0;\r
+    subsample_train = subsample_test = 0;\r
      missing = sample_idx = 0;\r
      class_labels = 0;\r
      class_count = 1;\r
@@ -117,7 +117,6 @@ void CvGBTrees::clear()
      cvReleaseMat( &orig_response );\r
      cvReleaseMat( &sum_response );\r
      cvReleaseMat( &sum_response_tmp );\r
-    cvReleaseMat( &weak_eval );\r
      cvReleaseMat( &subsample_train );\r
      cvReleaseMat( &subsample_test );\r
      cvReleaseMat( &sample_idx );\r
@@ -143,7 +142,7 @@ CvGBTrees::CvGBTrees( const CvMat* _train_data, int _tflag,
      data = 0;\r
      default_model_name = "my_boost_tree";\r
      orig_response = sum_response = sum_response_tmp = 0;\r
-    weak_eval = subsample_train = subsample_test = 0;\r
+    subsample_train = subsample_test = 0;\r
      missing = sample_idx = 0;\r
      class_labels = 0;\r
      class_count = 1;\r
@@ -276,7 +275,7 @@ CvGBTrees::train( const CvMat* _train_data, int _tflag,
      {\r
          int sample_idx_len = get_len(_sample_idx);\r
          \r
-        switch (CV_ELEM_SIZE(_sample_idx->type))\r
+        switch (CV_MAT_TYPE(_sample_idx->type))\r
          {\r
              case CV_32SC1:\r
              {\r
@@ -818,20 +817,31 @@ void CvGBTrees::do_subsample()
  \r
  //===========================================================================\r
  \r
-float CvGBTrees::predict( const CvMat* _sample, const CvMat* _missing,\r
-        CvMat* /*weak_responses*/, CvSlice slice, int k) const \r
+float CvGBTrees::predict_serial( const CvMat* _sample, const CvMat* _missing,\r
+        CvMat* weak_responses, CvSlice slice, int k) const \r
  {\r
      float result = 0.0f;\r
  \r
      if (!weak) return 0.0f;\r
  \r
-    float* sum = new float[class_count];\r
-    for (int i=0; i<class_count; ++i)\r
-        sum[i] = base_value;\r
-\r
      CvSeqReader reader;\r
      int weak_count = cvSliceLength( slice, weak[class_count-1] );\r
      CvDTree* tree;\r
+    \r
+    if (weak_responses)\r
+    {\r
+               if (CV_MAT_TYPE(weak_responses->type) != CV_32F)\r
+            return 0.0f;\r
+        if ((k >= 0) && (k<class_count) && (weak_responses->rows != 1))\r
+            return 0.0f;\r
+        if ((k == -1) && (weak_responses->rows != class_count))\r
+            return 0.0f;\r
+        if (weak_responses->cols != weak_count)\r
+            return 0.0f;\r
+    }\r
+    \r
+    float* sum = new float[class_count];\r
+    memset(sum, 0, class_count*sizeof(float));\r
  \r
      for (int i=0; i<class_count; ++i)\r
      {\r
@@ -842,11 +852,16 @@ float CvGBTrees::predict( const CvMat* _sample, const CvMat* _missing,
              for (int j=0; j<weak_count; ++j)\r
              {\r
                  CV_READ_SEQ_ELEM( tree, reader );\r
-                sum[i] += params.shrinkage *\r
-                         (float)(tree->predict(_sample, _missing)->value);\r
+                float p = (float)(tree->predict(_sample, _missing)->value);\r
+                sum[i] += params.shrinkage * p;\r
+                if (weak_responses)\r
+                    weak_responses->data.fl[i*weak_count+j] = p;\r
              }\r
          }\r
      }\r
+    \r
+    for (int i=0; i<class_count; ++i)\r
+        sum[i] += base_value;\r
  \r
      if (class_count == 1)\r
      {\r
@@ -884,6 +899,137 @@ float CvGBTrees::predict( const CvMat* _sample, const CvMat* _missing,
      return float(orig_class_label);\r
  }\r
  \r
+\r
+class Tree_predictor\r
+{\r
+private:\r
+       pCvSeq* weak;\r
+       float* sum;\r
+       const int k;\r
+       const CvMat* sample;\r
+       const CvMat* missing;\r
+    const float shrinkage;\r
+    \r
+#ifdef HAVE_TBB\r
+    static tbb::spin_mutex SumMutex;\r
+#endif\r
+\r
+\r
+public:\r
+       Tree_predictor() : weak(0), sum(0), k(0), sample(0), missing(0), shrinkage(1.0f) {}\r
+       Tree_predictor(pCvSeq* _weak, const int _k, const float _shrinkage,\r
+                                  const CvMat* _sample, const CvMat* _missing, float* _sum ) :\r
+                                  weak(_weak), k(_k), sample(_sample),\r
+                   missing(_missing), sum(_sum), shrinkage(_shrinkage)\r
+       {}\r
+       \r
+    Tree_predictor( const Tree_predictor& p, cv::Split ) :\r
+                       weak(p.weak), k(p.k), sample(p.sample),\r
+            missing(p.missing), sum(p.sum), shrinkage(p.shrinkage)\r
+       {}\r
+\r
+       Tree_predictor& operator=( const Tree_predictor& )\r
+       { return *this; }\r
+       \r
+    virtual void operator()(const cv::BlockedRange& range) const\r
+       {\r
+#ifdef HAVE_TBB\r
+        tbb::spin_mutex::scoped_lock lock;\r
+#endif\r
+        CvSeqReader reader;\r
+               int begin = range.begin();\r
+               int end = range.end();\r
+               \r
+               int weak_count = end - begin;\r
+               CvDTree* tree;\r
+\r
+               for (int i=0; i<k; ++i)\r
+               {\r
+                       float tmp_sum = 0.0f;\r
+                       if ((weak[i]) && (weak_count))\r
+                       {\r
+                               cvStartReadSeq( weak[i], &reader ); \r
+                               cvSetSeqReaderPos( &reader, begin );\r
+                               for (int j=0; j<weak_count; ++j)\r
+                               {\r
+                                       CV_READ_SEQ_ELEM( tree, reader );\r
+                                       tmp_sum += shrinkage*(float)(tree->predict(sample, missing)->value);\r
+                               }\r
+                       }\r
+#ifdef HAVE_TBB\r
+            lock.acquire(SumMutex);\r
+                       sum[i] += tmp_sum;\r
+            lock.release();\r
+#else\r
+            sum[i] += tmp_sum;\r
+#endif\r
+               }\r
+       } // Tree_predictor::operator()\r
+    \r
+}; // class Tree_predictor\r
+\r
+\r
+#ifdef HAVE_TBB\r
+tbb::spin_mutex Tree_predictor::SumMutex;\r
+#endif\r
+\r
+\r
+\r
+float CvGBTrees::predict( const CvMat* _sample, const CvMat* _missing,\r
+            CvMat* /*weak_responses*/, CvSlice slice, int k) const \r
+    {\r
+        float result = 0.0f;\r
+           if (!weak) return 0.0f;\r
+        float* sum = new float[class_count];\r
+        for (int i=0; i<class_count; ++i)\r
+            sum[i] = 0.0f;\r
+           int begin = slice.start_index;\r
+           int end = begin + cvSliceLength( slice, weak[0] );\r
+       \r
+        pCvSeq* weak_seq = weak;\r
+           Tree_predictor predictor = Tree_predictor(weak_seq, class_count,\r
+                                    params.shrinkage, _sample, _missing, sum);\r
+        \r
+//#ifdef HAVE_TBB\r
+//             tbb::parallel_for(cv::BlockedRange(begin, end), predictor,\r
+//                          tbb::auto_partitioner());\r
+//#else\r
+        cv::parallel_for(cv::BlockedRange(begin, end), predictor);\r
+//#endif\r
+\r
+           for (int i=0; i<class_count; ++i)\r
+            sum[i] = sum[i] /** params.shrinkage*/ + base_value;\r
+\r
+        if (class_count == 1)\r
+        {\r
+            result = sum[0];\r
+            delete[] sum;\r
+            return result;\r
+        }\r
+\r
+        if ((k>=0) && (k<class_count))\r
+        {\r
+            result = sum[k];\r
+            delete[] sum;\r
+            return result;\r
+        }\r
+\r
+        float max = sum[0];\r
+        int class_label = 0;\r
+        for (int i=1; i<class_count; ++i)\r
+            if (sum[i] > max)\r
+            {\r
+                max = sum[i];\r
+                class_label = i;\r
+            }\r
+\r
+        delete[] sum;\r
+        int orig_class_label = class_labels->data.i[class_label];\r
+\r
+        return float(orig_class_label);\r
+    }\r
+\r
+\r
  //===========================================================================\r
  \r
  void CvGBTrees::write_params( CvFileStorage* fs ) const\r
@@ -1080,69 +1226,126 @@ void CvGBTrees::read( CvFileStorage* fs, CvFileNode* node )
  \r
  //===========================================================================\r
  \r
+class Sample_predictor\r
+{\r
+private:\r
+       const CvGBTrees* gbt;\r
+       float* predictions;\r
+       const CvMat* samples;\r
+       const CvMat* missing;\r
+    const CvMat* idx;\r
+    CvSlice slice;\r
+\r
+public:\r
+       Sample_predictor() : gbt(0), predictions(0), samples(0), missing(0),\r
+                         idx(0), slice(CV_WHOLE_SEQ)\r
+    {}\r
+\r
+       Sample_predictor(const CvGBTrees* _gbt, float* _predictions,\r
+                                  const CvMat* _samples, const CvMat* _missing,\r
+                   const CvMat* _idx, CvSlice _slice=CV_WHOLE_SEQ) :\r
+                                  gbt(_gbt), predictions(_predictions), samples(_samples),\r
+                   missing(_missing), idx(_idx), slice(_slice)\r
+       {}\r
+       \r
+\r
+    Sample_predictor( const Sample_predictor& p, cv::Split ) :\r
+                       gbt(p.gbt), predictions(p.predictions),\r
+            samples(p.samples), missing(p.missing), idx(p.idx),\r
+            slice(p.slice)\r
+       {}\r
+\r
+\r
+    virtual void operator()(const cv::BlockedRange& range) const\r
+       {\r
+               int begin = range.begin();\r
+               int end = range.end();\r
+\r
+               CvMat x;\r
+        CvMat miss;\r
+\r
+        for (int i=begin; i<end; ++i)\r
+        {\r
+            int j = idx ? idx->data.i[i] : i;\r
+            cvGetRow(samples, &x, j);\r
+            if (!missing)\r
+            {\r
+                predictions[i] = gbt->predict_serial(&x,0,0,slice);\r
+            }\r
+            else\r
+            {\r
+                cvGetRow(missing, &miss, j);\r
+                predictions[i] = gbt->predict_serial(&x,&miss,0,slice);\r
+            }\r
+        }\r
+       } // Sample_predictor::operator()\r
+\r
+}; // class Sample_predictor\r
+\r
+\r
+\r
  // type in {CV_TRAIN_ERROR, CV_TEST_ERROR}\r
  float \r
  CvGBTrees::calc_error( CvMLData* _data, int type, std::vector<float> *resp )\r
  {\r
-    float err = 0;\r
-    const CvMat* values = _data->get_values();\r
+\r
+    float err = 0.0f;\r
+    const CvMat* sample_idx = (type == CV_TRAIN_ERROR) ?\r
+                              _data->get_train_sample_idx() :\r
+                              _data->get_test_sample_idx();\r
      const CvMat* response = _data->get_responses();\r
-    const CvMat* missing = _data->get_missing();\r
-    const CvMat* sample_idx = (type == CV_TEST_ERROR) ?\r
-                              _data->get_test_sample_idx() :\r
-                              _data->get_train_sample_idx();\r
-    //const CvMat* var_types = _data->get_var_types();\r
-    int* sidx = sample_idx ? sample_idx->data.i : 0;\r
-    int r_step = CV_IS_MAT_CONT(response->type) ?\r
-                1 : response->step / CV_ELEM_SIZE(response->type);\r
-    //bool is_classifier = \r
-    //            var_types->data.ptr[var_types->cols-1] == CV_VAR_CATEGORICAL;\r
-    int sample_count = sample_idx ? sample_idx->cols : 0;\r
-    sample_count = (type == CV_TRAIN_ERROR && sample_count == 0) ?\r
-                                        values->rows :\r
-                                        sample_count;\r
-    float* pred_resp = 0;\r
-    if( resp && (sample_count > 0) )\r
+                              \r
+    int n = sample_idx ? get_len(sample_idx) : 0;\r
+    n = (type == CV_TRAIN_ERROR && n == 0) ? _data->get_values()->rows : n;\r
+    \r
+    if (!n)\r
+        return -FLT_MAX;\r
+    \r
+    float* pred_resp = 0;  \r
+    if (resp)\r
      {\r
-        resp->resize( sample_count );\r
+        resp->resize(n);\r
          pred_resp = &((*resp)[0]);\r
      }\r
+    else\r
+        pred_resp = new float[n];\r
+\r
+    Sample_predictor predictor = Sample_predictor(this, pred_resp, _data->get_values(),\r
+            _data->get_missing(), sample_idx);\r
+        \r
+//#ifdef HAVE_TBB\r
+//    tbb::parallel_for(cv::BlockedRange(0,n), predictor, tbb::auto_partitioner());\r
+//#else\r
+    cv::parallel_for(cv::BlockedRange(0,n), predictor);\r
+//#endif\r
+        \r
+    int* sidx = sample_idx ? sample_idx->data.i : 0;\r
+    int r_step = CV_IS_MAT_CONT(response->type) ?\r
+                1 : response->step / CV_ELEM_SIZE(response->type);\r
+    \r
+\r
      if ( !problem_type() )\r
      {\r
-        for( int i = 0; i < sample_count; i++ )\r
+        for( int i = 0; i < n; i++ )\r
          {\r
-            CvMat sample, miss;\r
              int si = sidx ? sidx[i] : i;\r
-            cvGetRow( values, &sample, si ); \r
-            if( missing ) \r
-                cvGetRow( missing, &miss, si );             \r
-            float r = (float)predict( &sample, missing ? &miss : 0 );\r
-            if( pred_resp )\r
-                pred_resp[i] = r;\r
-            int d = fabs((double)r - response->data.fl[si*r_step]) <= FLT_EPSILON ? 0 : 1;\r
+            int d = fabs((double)pred_resp[i] - response->data.fl[si*r_step]) <= FLT_EPSILON ? 0 : 1;\r
              err += d;\r
          }\r
-        err = sample_count ? err / (float)sample_count * 100 : -FLT_MAX;\r
+        err = err / (float)n * 100.0f;\r
      }\r
      else\r
      {\r
-        for( int i = 0; i < sample_count; i++ )\r
+        for( int i = 0; i < n; i++ )\r
          {\r
-            CvMat sample, miss;\r
              int si = sidx ? sidx[i] : i;\r
-            cvGetRow( values, &sample, si );\r
-            if( missing ) \r
-                cvGetRow( missing, &miss, si );             \r
-            float r = (float)predict( &sample, missing ? &miss : 0 );\r
-            if( pred_resp )\r
-                pred_resp[i] = r;\r
-            float d = r - response->data.fl[si*r_step];\r
+            float d = pred_resp[i] - response->data.fl[si*r_step];\r
              err += d*d;\r
          }\r
-        err = sample_count ? err / (float)sample_count : -FLT_MAX;    \r
+        err = err / (float)n;    \r
      }\r
+    \r
      return err;\r
-\r
  }\r
  \r
  \r
@@ -1156,7 +1359,7 @@ CvGBTrees::CvGBTrees( const cv::Mat& trainData, int tflag,
      weak = 0;\r
      default_model_name = "my_boost_tree";\r
      orig_response = sum_response = sum_response_tmp = 0;\r
-    weak_eval = subsample_train = subsample_test = 0;\r
+    subsample_train = subsample_test = 0;\r
      missing = sample_idx = 0;\r
      class_labels = 0;\r
      class_count = 1;\r
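Note on the reworked calc_error in gbt.cpp above: once all predictions are collected, the error is a plain reduction over them. The following is a minimal sketch of the two error measures under the same conventions as the patch (labels compared with an FLT_EPSILON tolerance, -FLT_MAX returned for an empty set). The function names are hypothetical and the inputs are plain vectors rather than CvMLData.

```cpp
#include <cassert>
#include <cfloat>
#include <cmath>
#include <cstddef>
#include <vector>

// Misclassification error in percent. Two labels are treated as equal when
// they differ by no more than FLT_EPSILON. Returns -FLT_MAX for an empty set.
float misclassificationPercent(const std::vector<float>& predicted,
                               const std::vector<float>& actual)
{
    assert(predicted.size() == actual.size());
    const std::size_t n = predicted.size();
    if (n == 0)
        return -FLT_MAX;

    float err = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        if (std::fabs((double)predicted[i] - actual[i]) > FLT_EPSILON)
            err += 1.0f;
    return err / (float)n * 100.0f;
}

// Mean squared error for regression. Returns -FLT_MAX for an empty set.
float meanSquaredError(const std::vector<float>& predicted,
                       const std::vector<float>& actual)
{
    assert(predicted.size() == actual.size());
    const std::size_t n = predicted.size();
    if (n == 0)
        return -FLT_MAX;

    float err = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
    {
        const float d = predicted[i] - actual[i];
        err += d * d;
    }
    return err / (float)n;
}
```

Separating the prediction pass from the reduction is what lets the patch compute the predictions for different samples independently.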
diff --git a/samples/c/tree_engine.cpp b/samples/c/tree_engine.cpp
index 4f41884..2517953 100644
--- a/samples/c/tree_engine.cpp
+++ b/samples/c/tree_engine.cpp
@@ -125,7 +125,10 @@ int main(int argc, char** argv)
          print_result( ertrees.calc_error( &data, CV_TRAIN_ERROR), ertrees.calc_error( &data, CV_TEST_ERROR ), ertrees.get_var_importance() );
  
          printf("======GBTREES=====\n");
-        gbtrees.train( &data, CvGBTreesParams(CvGBTrees::DEVIANCE_LOSS, 100, 0.05f, 0.6f, 10, true));
+               if (categorical_response)
+                       gbtrees.train( &data, CvGBTreesParams(CvGBTrees::DEVIANCE_LOSS, 100, 0.1f, 0.8f, 5, false));
+               else
+                       gbtrees.train( &data, CvGBTreesParams(CvGBTrees::SQUARED_LOSS, 100, 0.1f, 0.8f, 5, false));
          print_result( gbtrees.calc_error( &data, CV_TRAIN_ERROR), gbtrees.calc_error( &data, CV_TEST_ERROR ), 0 ); //doesn't compute importance
      }
      else
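Note on the gbtrees.train call with CvGBTrees::SQUARED_LOSS in the sample above: the training scheme described in the new documentation (best constant model, then repeatedly fit a tree to the antigradient and add it with shrinkage) can be sketched in a few lines. This is an illustration under strong simplifying assumptions, not what CvGBTrees does internally: a single ordered variable, depth-1 trees (stumps) standing in for CvDTree, no subsampling, and squared loss only. All type and function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Depth-1 regression tree on one ordered variable (a stand-in for CvDTree).
struct Stump
{
    double threshold, left, right;
    double predict(double x) const { return x <= threshold ? left : right; }
};

// Least-squares stump: try every sample value as the split threshold.
Stump fitStump(const std::vector<double>& x, const std::vector<double>& target)
{
    Stump best = {0.0, 0.0, 0.0};
    double bestErr = std::numeric_limits<double>::max();
    for (std::size_t s = 0; s < x.size(); ++s)
    {
        double sumL = 0.0, sumR = 0.0;
        std::size_t nL = 0, nR = 0;
        for (std::size_t i = 0; i < x.size(); ++i)
        {
            if (x[i] <= x[s]) { sumL += target[i]; ++nL; }
            else              { sumR += target[i]; ++nR; }
        }
        // nL >= 1 because sample s itself always falls into the left leaf.
        Stump cand = {x[s], sumL / nL, nR ? sumR / nR : 0.0};
        double err = 0.0;
        for (std::size_t i = 0; i < x.size(); ++i)
        {
            const double d = target[i] - cand.predict(x[i]);
            err += d * d;
        }
        if (err < bestErr) { bestErr = err; best = cand; }
    }
    return best;
}

// f(x) = base + shrinkage * sum_i T_i(x)
struct GbtSketch
{
    double base;
    double shrinkage;
    std::vector<Stump> trees;

    double predict(double x) const
    {
        double sum = 0.0;
        for (std::size_t i = 0; i < trees.size(); ++i)
            sum += trees[i].predict(x);
        return base + shrinkage * sum;
    }
};

GbtSketch trainSquaredLoss(const std::vector<double>& x,
                           const std::vector<double>& y,
                           int weakCount, double shrinkage)
{
    assert(!x.empty() && x.size() == y.size());
    GbtSketch model;
    model.shrinkage = shrinkage;

    // Step 1: best constant model; for the squared loss it is the mean of y.
    double mean = 0.0;
    for (std::size_t i = 0; i < y.size(); ++i)
        mean += y[i];
    mean /= y.size();
    model.base = mean;

    std::vector<double> f(y.size(), mean);      // current model predictions
    std::vector<double> antigradient(y.size());
    for (int m = 0; m < weakCount; ++m)
    {
        // Compute the antigradient; for the squared loss it is the residual.
        for (std::size_t i = 0; i < y.size(); ++i)
            antigradient[i] = y[i] - f[i];
        // Grow a tree on the antigradient. For the squared loss the leaf
        // means are already the optimal leaf values, so no extra update.
        const Stump tree = fitStump(x, antigradient);
        // Add the tree to the model and refresh the predictions.
        model.trees.push_back(tree);
        for (std::size_t i = 0; i < y.size(); ++i)
            f[i] += shrinkage * tree.predict(x[i]);
    }
    return model;
}
```

A smaller shrinkage makes each tree correct only part of the remaining residual, so more iterations are needed to reach the same training error.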
modules/ml/doc/gradient_boosted_trees.rst	[new file with mode: 0644]
modules/ml/doc/ml.rst
modules/ml/include/opencv2/ml/ml.hpp
modules/ml/src/gbt.cpp
samples/c/tree_engine.cpp