1 // Copyright 2011 The Chromium Authors
2 // Use of this source code is governed by a BSD-style license that can be
3 // found in the LICENSE file.
5 #include "base/memory/raw_ptr.h"
6 #include "courgette/adjustment_method.h"
19 #include "base/format_macros.h"
20 #include "base/logging.h"
21 #include "base/strings/stringprintf.h"
22 #include "base/time/time.h"
23 #include "courgette/assembly_program.h"
24 #include "courgette/courgette.h"
25 #include "courgette/encoded_program.h"
29 Shingle weighting matching.
31 We have a sequence S1 of symbols from alphabet A1={A,B,C,...} called the 'model'
32 and a second sequence of S2 of symbols from alphabet A2={U,V,W,....} called the
33 'program'. Each symbol in A1 has a unique numerical name or index. We can
34 transcribe the sequence S1 to a sequence T1 of indexes of the symbols. We wish
35 to assign indexes to the symbols in A2 so that when we transcribe S2 into T2, T2
36 has long subsequences that occur in T1. This will ensure that the sequence
37 T1;T2 compresses to be only slightly larger than the compressed T1.
39 The algorithm for matching members of S2 with members of S1 is eager - it makes
40 matches without backtracking, until no more matches can be made. Each variable
41 (symbol) U,V,... in A2 has a set of candidates from A1, each candidate with a
42 weight summarizing the evidence for the match. We keep a VariableQueue of
43 U,V,... sorted by how much the evidence for the best choice outweighs the
44 evidence for the second choice, i.e. prioritized by how 'clear cut' the best
45 assignment is. We pick the variable with the most clear-cut candidate, make the
46 assignment, adjust the evidence and repeat.
48 What has not been described so far is how the evidence is gathered and
49 maintained. We are working under the assumption that S1 and S2 are largely
50 similar. (A different assumption might be that S1 and S2 are dissimilar except
51 for many long subsequences.)
53 A naive algorithm would consider all pairs (A,U) and for each pair assess the
54 benefit, or score, the assignment U:=A. The score might count the number of
55 occurrences of U in S2 which appear in similar contexts to A in S1.
57 To distinguish contexts we view S1 and S2 as a sequence of overlapping k-length
58 substrings or 'shingles'. Two shingles are compatible if the symbols in one
59 shingle could be matched with the symbols in the other symbol. For example, ABC
60 is *not* compatible with UVU because it would require conflicting matches A=U
61 and C=U. ABC is compatible with UVW, UWV, WUV, VUW etc. We can't tell which
62 until we make an assignment - the compatible shingles form an equivalence class.
63 After assigning U:=A then only UVW and UWV (equivalently AVW, AWV) are
64 compatible. As we make assignments the number of equivalence classes of
65 shingles increases and the number of members of each equivalence class
66 decreases. The compatibility test becomes more restrictive.
68 We gather evidence for the potential assignment U:=A by counting how many
69 shingles containing U are compatible with shingles containing A. Thus symbols
70 occurring a large number of times in compatible contexts will be assigned first.
72 Finding the 'most clear-cut' assignment by considering all pairs symbols and for
73 each pair comparing the contexts of each pair of occurrences of the symbols is
74 computationally infeasible. We get the job done in a reasonable time by
75 approaching it 'backwards' and making incremental changes as we make
78 First the shingles are partitioned according to compatibility. In S1=ABCDD and
79 S2=UVWXX we have a total of 6 shingles, each occuring once. (ABC:1 BCD:1 CDD:1;
80 UVW:1 VWX: WXX:1) all fit the pattern <V0 V1 V2> or the pattern <V0 V1 V1>. The
81 first pattern indicates that each position matches a different symbol, the
82 second pattern indicates that the second symbol is repeated.
84 pattern S1 members S2 members
85 <V0 V1 V2>: {ABC:1, BCD:1}; {UVW:1, VWX:1}
86 <V0 V1 V1>: {CDD:1} {WXX:1}
88 The second pattern appears to have a unique assignment but we don't make the
89 assignment on such scant evidence. If S1 and S2 do not match exactly, there
90 will be numerous spurious low-score matches like this. Instead we must see what
91 assignments are indicated by considering all of the evidence.
93 First pattern has 2 x 2 = 4 shingle pairs. For each pair we count the number
94 of symbol assignments. For ABC:a * UVW:b accumulate min(a,b) to each of
96 After accumulating over all 2 x 2 pairs:
101 The second pattern contributes:
110 From this we decide to assign X:=D (because this assignment has both the largest
111 difference above the next candidate (X:=C) and this is also the largest
112 proportionately over the sum of alternatives).
114 Lets assume D has numerical 'name' 77. The assignment X:=D sets X to 77 too.
115 Next we repartition all the shingles containing X or D:
117 pattern S1 members S2 members
118 <V0 V1 V2>: {ABC:1}; {UVW:1}
119 <V0 V1 77>: {BCD:1}; {VWX:1}
120 <V0 77 77>: {CDD:1} {WXX:1}
121 As we repartition, we recalculate the contributions to the scores:
125 All the remaining assignments are now fixed.
127 There is one step in the incremental algorithm that is still infeasibly
128 expensive: the contributions due to the cross product of large equivalence
129 classes. We settle for making an approximation by computing the contribution of
130 the cross product of only the most common shingles. The hope is that the noise
131 from the long tail of uncounted shingles is well below the scores being used to
132 pick assignments. The second hope is that as assignment are made, the large
133 equivalence class will be partitioned into smaller equivalence classes, reducing
136 In the code below the shingles are bigger (Shingle::kWidth = 5).
137 Class ShinglePattern holds the data for one pattern.
139 There is an optimization for this case:
140 <V0 V1 V1>: {CDD:1} {WXX:1}
142 Above we said that we don't make an assignment on this "scant evidence". There
143 is an exception: if there is only one variable unassigned (more like the <V0 77
144 77> pattern) AND there are no occurrences of C and W other than those counted in
145 this pattern, then there is no competing evidence and we go ahead with the
146 assignment immediately. This produces slightly better results because these
147 cases tend to be low-scoring and susceptible to small mistakes made in
148 low-scoring assignments in the approximation for large equivalence classes.
152 namespace courgette {
153 namespace adjustment_method_2 {
155 ////////////////////////////////////////////////////////////////////////////////
157 class AssignmentCandidates;
158 class LabelInfoMaker;
160 class ShinglePattern;
162 // The purpose of adjustment is to assign indexes to Labels of a program 'p' to
163 // make the sequence of indexes similar to a 'model' program 'm'. Labels
164 // themselves don't have enough information to do this job, so we work with a
165 // LabelInfo surrogate for each label.
169 // Just a no-argument constructor and copy constructor. Actual LabelInfo
170 // objects are allocated in std::pair structs in a std::map.
176 assignment_(nullptr),
177 candidates_(nullptr) {}
181 AssignmentCandidates* candidates();
183 raw_ptr<Label> label_; // The label that this info a surrogate for.
185 uint32_t is_model_ : 1; // Is the label in the model?
186 uint32_t debug_index_ : 31; // A small number for naming the label in debug
187 // output. The pair (is_model_, debug_index_) is
190 int refs_; // Number of times this Label is referenced.
193 assignment_; // Label from other program corresponding to this.
195 std::vector<uint32_t> positions_; // Offsets into the trace of references.
198 raw_ptr<AssignmentCandidates> candidates_;
200 void operator=(const LabelInfo*); // Disallow assignment only.
201 // Public compiler generated copy constructor is needed to constuct
202 // std::pair<Label*, LabelInfo> so that fresh LabelInfos can be allocated
203 // inside a std::map.
206 typedef std::vector<LabelInfo*> Trace;
208 std::string ToString(const LabelInfo* info) {
210 base::StringAppendF(&s, "%c%d", "pm"[info->is_model_], info->debug_index_);
211 if (info->label_->index_ != Label::kNoIndex)
212 base::StringAppendF(&s, " (%d)", info->label_->index_);
214 base::StringAppendF(&s, " #%u", info->refs_);
218 // LabelInfoMaker maps labels to their surrogate LabelInfo objects.
219 class LabelInfoMaker {
221 LabelInfoMaker() : debug_label_index_gen_(0) {}
223 LabelInfoMaker(const LabelInfoMaker&) = delete;
224 LabelInfoMaker& operator=(const LabelInfoMaker&) = delete;
226 LabelInfo* MakeLabelInfo(Label* label, bool is_model, uint32_t position) {
227 LabelInfo& slot = label_infos_[label];
228 if (slot.label_ == nullptr) {
230 slot.is_model_ = is_model;
231 slot.debug_index_ = ++debug_label_index_gen_;
233 slot.positions_.push_back(position);
238 void ResetDebugLabel() { debug_label_index_gen_ = 0; }
241 int debug_label_index_gen_;
243 // Note LabelInfo is allocated 'flat' inside map::value_type, so the LabelInfo
244 // lifetimes are managed by the map.
245 std::map<Label*, LabelInfo> label_infos_;
248 struct OrderLabelInfo {
249 bool operator()(const LabelInfo* a, const LabelInfo* b) const {
250 if (a->label_->rva_ < b->label_->rva_) return true;
251 if (a->label_->rva_ > b->label_->rva_) return false;
252 if (a == b) return false;
253 return a->positions_ < b->positions_; // Lexicographic ordering of vector.
257 // AssignmentCandidates is a priority queue of candidate assignments to
258 // a single program LabelInfo, |program_info_|.
259 class AssignmentCandidates {
261 explicit AssignmentCandidates(LabelInfo* program_info)
262 : program_info_(program_info) {}
264 LabelInfo* program_info() const { return program_info_; }
266 bool empty() const { return label_to_score_.empty(); }
268 LabelInfo* top_candidate() const { return queue_.begin()->second; }
270 void Update(LabelInfo* model_info, int delta_score) {
271 LOG_ASSERT(delta_score != 0);
274 LabelToScore::iterator p = label_to_score_.find(model_info);
275 if (p != label_to_score_.end()) {
276 old_score = p->second;
277 new_score = old_score + delta_score;
278 queue_.erase(ScoreAndLabel(old_score, p->first));
279 if (new_score == 0) {
280 label_to_score_.erase(p);
282 p->second = new_score;
283 queue_.insert(ScoreAndLabel(new_score, model_info));
286 new_score = delta_score;
287 label_to_score_.insert(std::make_pair(model_info, new_score));
288 queue_.insert(ScoreAndLabel(new_score, model_info));
290 LOG_ASSERT(queue_.size() == label_to_score_.size());
293 int TopScore() const {
295 int second_value = 0;
296 Queue::const_iterator p = queue_.begin();
297 if (p != queue_.end()) {
298 first_value = p->first;
300 if (p != queue_.end()) {
301 second_value = p->first;
304 return first_value - second_value;
307 bool HasPendingUpdates() { return !pending_updates_.empty(); }
309 void AddPendingUpdate(LabelInfo* model_info, int delta_score) {
310 LOG_ASSERT(delta_score != 0);
311 pending_updates_[model_info] += delta_score;
314 void ApplyPendingUpdates() {
315 // TODO(sra): try to walk |pending_updates_| and |label_to_score_| in
316 // lockstep. Try to batch updates to |queue_|.
317 for (LabelToScore::iterator p = pending_updates_.begin();
318 p != pending_updates_.end();
321 Update(p->first, p->second);
323 pending_updates_.clear();
326 void Print(int max) {
327 VLOG(2) << "score " << TopScore() << " " << ToString(program_info_)
329 if (!pending_updates_.empty())
330 VLOG(2) << pending_updates_.size() << " pending";
332 for (Queue::iterator q = queue_.begin(); q != queue_.end(); ++q) {
333 if (++count > max) break;
334 VLOG(2) << " " << q->first << " " << ToString(q->second);
339 typedef std::map<LabelInfo*, int, OrderLabelInfo> LabelToScore;
340 typedef std::pair<int, LabelInfo*> ScoreAndLabel;
341 struct OrderScoreAndLabelByScoreDecreasing {
342 OrderLabelInfo tie_breaker;
343 bool operator()(const ScoreAndLabel& a, const ScoreAndLabel& b) const {
344 if (a.first > b.first) return true;
345 if (a.first < b.first) return false;
346 return tie_breaker(a.second, b.second);
349 typedef std::set<ScoreAndLabel, OrderScoreAndLabelByScoreDecreasing> Queue;
351 raw_ptr<LabelInfo> program_info_;
352 LabelToScore label_to_score_;
353 LabelToScore pending_updates_;
357 AssignmentCandidates* LabelInfo::candidates() {
358 if (candidates_ == nullptr)
359 candidates_ = new AssignmentCandidates(this);
363 LabelInfo::~LabelInfo() {
367 // A Shingle is a short fixed-length string of LabelInfos that actually occurs
368 // in a Trace. A Shingle may occur many times. We repesent the Shingle by the
369 // position of one of the occurrences in the Trace.
372 static const uint8_t kWidth = 5;
374 struct InterningLess {
375 bool operator()(const Shingle& a, const Shingle& b) const;
378 typedef std::set<Shingle, InterningLess> OwningSet;
380 // We can't disallow the copy constructor because we use std::set<Shingle> and
381 // VS2005's implementation of std::set<T>::set() requires T to have a copy
383 Shingle(const Shingle&) = default;
384 Shingle& operator=(const Shingle&) = delete; // Disallow assignment only.
386 static Shingle* Find(const Trace& trace, size_t position,
387 OwningSet* owning_set) {
388 std::pair<OwningSet::iterator, bool> pair =
389 owning_set->insert(Shingle(trace, position));
390 // pair.first iterator 'points' to the newly inserted Shingle or the
391 // previouly inserted one that looks the same according to the comparator.
393 // const_cast required because key is const. We modify the Shingle
394 // extensively but not in a way that affects InterningLess.
395 Shingle* shingle = const_cast<Shingle*>(&*pair.first);
396 shingle->add_position(position);
400 LabelInfo* at(size_t i) const { return trace_[exemplar_position_ + i]; }
401 void add_position(size_t position) {
402 positions_.push_back(static_cast<uint32_t>(position));
404 int position_count() const { return static_cast<int>(positions_.size()); }
406 bool InModel() const { return at(0)->is_model_; }
408 ShinglePattern* pattern() const { return pattern_; }
409 void set_pattern(ShinglePattern* pattern) { pattern_ = pattern; }
412 bool operator()(const Shingle* a, const Shingle* b) const {
413 // Arbitrary but repeatable (memory-address) independent ordering:
414 return a->exemplar_position_ < b->exemplar_position_;
415 // return InterningLess()(*a, *b);
420 Shingle(const Trace& trace, size_t exemplar_position)
422 exemplar_position_(exemplar_position),
425 const Trace& trace_; // The shingle lives inside trace_.
426 size_t exemplar_position_; // At this position (and other positions).
427 std::vector<uint32_t> positions_; // Includes exemplar_position_.
429 raw_ptr<ShinglePattern>
430 pattern_; // Pattern changes as LabelInfos are assigned.
432 friend std::string ToString(const Shingle* instance);
435 std::string ToString(const Shingle* instance) {
437 const char* sep = "<";
438 for (uint8_t i = 0; i < Shingle::kWidth; ++i) {
439 // base::StringAppendF(&s, "%s%x ", sep, instance.at(i)->label_->rva_);
441 s += ToString(instance->at(i));
444 base::StringAppendF(&s, ">(%" PRIuS ")@{%d}",
445 instance->exemplar_position_,
446 instance->position_count());
451 bool Shingle::InterningLess::operator()(
453 const Shingle& b) const {
454 for (uint8_t i = 0; i < kWidth; ++i) {
455 LabelInfo* info_a = a.at(i);
456 LabelInfo* info_b = b.at(i);
457 if (info_a->label_->rva_ < info_b->label_->rva_)
459 if (info_a->label_->rva_ > info_b->label_->rva_)
461 if (info_a->is_model_ < info_b->is_model_)
463 if (info_a->is_model_ > info_b->is_model_)
465 if (info_a != info_b) {
472 class ShinglePattern {
474 enum { kOffsetMask = 7, // Offset lives in low bits.
475 kFixed = 0, // kind & kVariable == 0 => fixed.
476 kVariable = 8 // kind & kVariable == 1 => variable.
478 // sequence[position + (kinds_[i] & kOffsetMask)] gives LabelInfo for position
479 // i of shingle. Below, second 'A' is duplicate of position 1, second '102'
480 // is duplicate of position 0.
482 // <102, A, 103, A , 102>
483 // --> <kFixed+0, kVariable+1, kFixed+2, kVariable+1, kFixed+0>
485 explicit Index(const Shingle* instance);
486 uint8_t kinds_[Shingle::kWidth];
488 uint8_t unique_variables_;
489 uint8_t first_variable_index_;
491 int assigned_indexes_[Shingle::kWidth];
494 // ShinglePattern keeps histograms of member Shingle instances, ordered by
495 // decreasing number of occurrences. We don't have a pair (occurrence count,
496 // Shingle instance), so we use a FreqView adapter to make the instance
497 // pointer look like the pair.
500 explicit FreqView(const Shingle* instance) : instance_(instance) {}
501 int count() const { return instance_->position_count(); }
502 const Shingle* instance() const { return instance_; }
504 bool operator()(const FreqView& a, const FreqView& b) const {
505 if (a.count() > b.count()) return true;
506 if (a.count() < b.count()) return false;
507 return resolve_ties(a.instance(), b.instance());
510 Shingle::PointerLess resolve_ties;
513 const Shingle* instance_;
516 typedef std::set<FreqView, FreqView::Greater> Histogram;
519 : index_(nullptr), model_coverage_(0), program_coverage_(0) {}
522 index_; // Points to the key in the owning map value_type.
523 Histogram model_histogram_;
524 Histogram program_histogram_;
526 int program_coverage_;
529 std::string ToString(const ShinglePattern::Index* index) {
531 if (index == nullptr) {
534 base::StringAppendF(&s, "<%d: ", index->variables_);
535 const char* sep = "";
536 for (uint8_t i = 0; i < Shingle::kWidth; ++i) {
539 uint32_t kind = index->kinds_[i];
540 int offset = kind & ShinglePattern::kOffsetMask;
541 if (kind & ShinglePattern::kVariable)
542 base::StringAppendF(&s, "V%d", offset);
544 base::StringAppendF(&s, "%d", index->assigned_indexes_[offset]);
546 base::StringAppendF(&s, " %x", index->hash_);
552 std::string HistogramToString(const ShinglePattern::Histogram& histogram,
553 size_t snippet_max) {
555 size_t histogram_size = histogram.size();
556 size_t snippet_size = 0;
557 for (ShinglePattern::Histogram::const_iterator p = histogram.begin();
558 p != histogram.end();
560 if (++snippet_size > snippet_max && snippet_size != histogram_size) {
564 base::StringAppendF(&s, " %d", p->count());
569 std::string HistogramToStringFull(const ShinglePattern::Histogram& histogram,
571 size_t snippet_max) {
574 size_t histogram_size = histogram.size();
575 size_t snippet_size = 0;
576 for (ShinglePattern::Histogram::const_iterator p = histogram.begin();
577 p != histogram.end();
580 if (++snippet_size > snippet_max && snippet_size != histogram_size) {
584 base::StringAppendF(&s, "(%d) ", p->count());
585 s += ToString(&(*p->instance()));
591 std::string ToString(const ShinglePattern* pattern, size_t snippet_max = 3) {
593 if (pattern == nullptr) {
597 s += ToString(pattern->index_);
598 base::StringAppendF(&s, "; %d(%d):",
599 static_cast<int>(pattern->model_histogram_.size()),
600 pattern->model_coverage_);
602 s += HistogramToString(pattern->model_histogram_, snippet_max);
603 base::StringAppendF(&s, "; %d(%d):",
604 static_cast<int>(pattern->program_histogram_.size()),
605 pattern->program_coverage_);
606 s += HistogramToString(pattern->program_histogram_, snippet_max);
612 std::string ShinglePatternToStringFull(const ShinglePattern* pattern,
615 s += ToString(pattern->index_);
617 size_t model_size = pattern->model_histogram_.size();
618 size_t program_size = pattern->program_histogram_.size();
619 base::StringAppendF(&s, " model shingles %" PRIuS "\n", model_size);
620 s += HistogramToStringFull(pattern->model_histogram_, " ", max);
621 base::StringAppendF(&s, " program shingles %" PRIuS "\n", program_size);
622 s += HistogramToStringFull(pattern->program_histogram_, " ", max);
626 struct ShinglePatternIndexLess {
627 bool operator()(const ShinglePattern::Index& a,
628 const ShinglePattern::Index& b) const {
629 if (a.hash_ < b.hash_) return true;
630 if (a.hash_ > b.hash_) return false;
632 for (uint8_t i = 0; i < Shingle::kWidth; ++i) {
633 if (a.kinds_[i] < b.kinds_[i]) return true;
634 if (a.kinds_[i] > b.kinds_[i]) return false;
635 if ((a.kinds_[i] & ShinglePattern::kVariable) == 0) {
636 if (a.assigned_indexes_[i] < b.assigned_indexes_[i])
638 if (a.assigned_indexes_[i] > b.assigned_indexes_[i])
646 static uint32_t hash_combine(uint32_t h, uint32_t v) {
648 return (h * (37 + 0x0000d100)) ^ (h >> 13);
651 ShinglePattern::Index::Index(const Shingle* instance) {
654 unique_variables_ = 0;
655 first_variable_index_ = 255;
657 for (uint8_t i = 0; i < Shingle::kWidth; ++i) {
658 LabelInfo* info = instance->at(i);
662 for ( ; j < i; ++j) {
663 if (info == instance->at(j)) { // Duplicate LabelInfo
668 if (j == i) { // Not found above.
669 if (info->assignment_) {
670 code = info->label_->index_;
671 assigned_indexes_[i] = code;
674 kind = kVariable + i;
676 if (i < first_variable_index_)
677 first_variable_index_ = i;
680 if (kind & kVariable) ++variables_;
681 hash = hash_combine(hash, code);
682 hash = hash_combine(hash, kind);
684 assigned_indexes_[i] = code;
689 struct ShinglePatternLess {
690 bool operator()(const ShinglePattern& a, const ShinglePattern& b) const {
691 return index_less(*a.index_, *b.index_);
693 ShinglePatternIndexLess index_less;
696 struct ShinglePatternPointerLess {
697 bool operator()(const ShinglePattern* a, const ShinglePattern* b) const {
698 return pattern_less(*a, *b);
700 ShinglePatternLess pattern_less;
703 template<int (*Scorer)(const ShinglePattern*)>
704 struct OrderShinglePatternByScoreDescending {
705 bool operator()(const ShinglePattern* a, const ShinglePattern* b) const {
706 int score_a = Scorer(a);
707 int score_b = Scorer(b);
708 if (score_a > score_b) return true;
709 if (score_a < score_b) return false;
710 return break_ties(a, b);
712 ShinglePatternPointerLess break_ties;
715 // Returns a score for a 'Single Use' rule. Returns -1 if the rule is not
717 int SingleUseScore(const ShinglePattern* pattern) {
718 if (pattern->index_->variables_ != 1)
721 if (pattern->model_histogram_.size() != 1 ||
722 pattern->program_histogram_.size() != 1)
725 // Does this pattern account for all uses of the variable?
726 const ShinglePattern::FreqView& program_freq =
727 *pattern->program_histogram_.begin();
728 const ShinglePattern::FreqView& model_freq =
729 *pattern->model_histogram_.begin();
730 int p1 = program_freq.count();
731 int m1 = model_freq.count();
733 const Shingle* program_instance = program_freq.instance();
734 const Shingle* model_instance = model_freq.instance();
735 size_t variable_index = pattern->index_->first_variable_index_;
736 LabelInfo* program_info = program_instance->at(variable_index);
737 LabelInfo* model_info = model_instance->at(variable_index);
738 if (!program_info->assignment_) {
739 if (program_info->refs_ == p1 && model_info->refs_ == m1) {
747 // The VariableQueue is a priority queue of unassigned LabelInfos from
748 // the 'program' (the 'variables') and their AssignmentCandidates.
749 class VariableQueue {
751 typedef std::pair<int, LabelInfo*> ScoreAndLabel;
753 VariableQueue() = default;
755 VariableQueue(const VariableQueue&) = delete;
756 VariableQueue& operator=(const VariableQueue&) = delete;
758 bool empty() const { return queue_.empty(); }
760 const ScoreAndLabel& first() const { return *queue_.begin(); }
762 // For debugging only.
764 for (Queue::const_iterator p = queue_.begin(); p != queue_.end(); ++p) {
765 AssignmentCandidates* candidates = p->second->candidates();
766 candidates->Print(std::numeric_limits<int>::max());
770 void AddPendingUpdate(LabelInfo* program_info, LabelInfo* model_info,
772 AssignmentCandidates* candidates = program_info->candidates();
773 if (!candidates->HasPendingUpdates()) {
774 pending_update_candidates_.push_back(candidates);
776 candidates->AddPendingUpdate(model_info, delta_score);
779 void ApplyPendingUpdates() {
780 for (size_t i = 0; i < pending_update_candidates_.size(); ++i) {
781 AssignmentCandidates* candidates = pending_update_candidates_[i];
782 int old_score = candidates->TopScore();
783 queue_.erase(ScoreAndLabel(old_score, candidates->program_info()));
784 candidates->ApplyPendingUpdates();
785 if (!candidates->empty()) {
786 int new_score = candidates->TopScore();
787 queue_.insert(ScoreAndLabel(new_score, candidates->program_info()));
790 pending_update_candidates_.clear();
794 struct OrderScoreAndLabelByScoreDecreasing {
795 bool operator()(const ScoreAndLabel& a, const ScoreAndLabel& b) const {
796 if (a.first > b.first) return true;
797 if (a.first < b.first) return false;
798 return OrderLabelInfo()(a.second, b.second);
801 typedef std::set<ScoreAndLabel, OrderScoreAndLabelByScoreDecreasing> Queue;
804 std::vector<AssignmentCandidates*> pending_update_candidates_;
808 class AssignmentProblem {
810 AssignmentProblem(const Trace& trace, size_t model_end)
812 model_end_(model_end) {
813 VLOG(2) << "AssignmentProblem::AssignmentProblem " << model_end << ", "
817 AssignmentProblem(const AssignmentProblem&) = delete;
818 AssignmentProblem& operator=(const AssignmentProblem&) = delete;
821 if (model_end_ < Shingle::kWidth ||
822 trace_.size() - model_end_ < Shingle::kWidth) {
823 // Nothing much we can do with such a short problem.
826 instances_.resize(trace_.size() - Shingle::kWidth + 1, nullptr);
827 AddShingles(0, model_end_);
828 AddShingles(model_end_, trace_.size());
830 AddPatternsNeedingUpdatesToQueues();
832 patterns_needing_updates_.clear();
833 while (FindAndAssignBestLeader())
834 patterns_needing_updates_.clear();
835 PrintActivePatterns();
841 typedef std::set<Shingle*, Shingle::PointerLess> ShingleSet;
843 typedef std::set<const ShinglePattern*, ShinglePatternPointerLess>
846 // Patterns are partitioned into the following sets:
848 // * Retired patterns (not stored). No shingles exist for this pattern (they
849 // all now match more specialized patterns).
850 // * Useless patterns (not stored). There are no 'program' shingles for this
851 // pattern (they all now match more specialized patterns).
852 // * Single-use patterns - single_use_pattern_queue_.
853 // * Other patterns - active_non_single_use_patterns_ / variable_queue_.
855 typedef std::set<const ShinglePattern*,
856 OrderShinglePatternByScoreDescending<&SingleUseScore> >
857 SingleUsePatternQueue;
859 void PrintPatternsHeader() const {
860 VLOG(2) << shingle_instances_.size() << " instances "
861 << trace_.size() << " trace length "
862 << patterns_.size() << " shingle indexes "
863 << single_use_pattern_queue_.size() << " single use patterns "
864 << active_non_single_use_patterns_.size() << " active patterns";
867 void PrintActivePatterns() const {
868 for (ShinglePatternSet::const_iterator p =
869 active_non_single_use_patterns_.begin();
870 p != active_non_single_use_patterns_.end();
872 const ShinglePattern* pattern = *p;
873 VLOG(2) << ToString(pattern, 10);
877 void PrintPatterns() const {
879 PrintActivePatterns();
883 void PrintAllPatterns() const {
884 for (IndexToPattern::const_iterator p = patterns_.begin();
885 p != patterns_.end();
887 const ShinglePattern& pattern = p->second;
888 VLOG(2) << ToString(&pattern, 10);
892 void PrintAllShingles() const {
893 for (Shingle::OwningSet::const_iterator p = shingle_instances_.begin();
894 p != shingle_instances_.end();
896 const Shingle& instance = *p;
897 VLOG(2) << ToString(&instance) << " " << ToString(instance.pattern());
902 void AddShingles(size_t begin, size_t end) {
903 for (size_t i = begin; i + Shingle::kWidth - 1 < end; ++i) {
904 instances_[i] = Shingle::Find(trace_, i, &shingle_instances_);
908 void Declassify(Shingle* shingle) {
909 ShinglePattern* pattern = shingle->pattern();
910 if (shingle->InModel()) {
911 pattern->model_histogram_.erase(ShinglePattern::FreqView(shingle));
912 pattern->model_coverage_ -= shingle->position_count();
914 pattern->program_histogram_.erase(ShinglePattern::FreqView(shingle));
915 pattern->program_coverage_ -= shingle->position_count();
917 shingle->set_pattern(nullptr);
920 void Reclassify(Shingle* shingle) {
921 ShinglePattern* pattern = shingle->pattern();
922 LOG_ASSERT(pattern == nullptr);
924 ShinglePattern::Index index(shingle);
925 if (index.variables_ == 0)
928 std::pair<IndexToPattern::iterator, bool> inserted =
929 patterns_.insert(std::make_pair(index, ShinglePattern()));
931 pattern = &inserted.first->second;
932 pattern->index_ = &inserted.first->first;
933 shingle->set_pattern(pattern);
934 patterns_needing_updates_.insert(pattern);
936 if (shingle->InModel()) {
937 pattern->model_histogram_.insert(ShinglePattern::FreqView(shingle));
938 pattern->model_coverage_ += shingle->position_count();
940 pattern->program_histogram_.insert(ShinglePattern::FreqView(shingle));
941 pattern->program_coverage_ += shingle->position_count();
945 void InitialClassify() {
946 for (Shingle::OwningSet::iterator p = shingle_instances_.begin();
947 p != shingle_instances_.end();
949 // GCC's set<T>::iterator::operator *() returns a const object.
950 Reclassify(const_cast<Shingle*>(&*p));
954 // For the positions in |info|, find the shingles that overlap that position.
955 void AddAffectedPositions(LabelInfo* info, ShingleSet* affected_shingles) {
956 const uint8_t kWidth = Shingle::kWidth;
957 for (size_t i = 0; i < info->positions_.size(); ++i) {
958 size_t position = info->positions_[i];
959 // Find bounds to the subrange of |trace_| we are in.
960 size_t start = position < model_end_ ? 0 : model_end_;
961 size_t end = position < model_end_ ? model_end_ : trace_.size();
963 // Clip [position-kWidth+1, position+1)
965 position > start + kWidth - 1 ? position - kWidth + 1 : start;
966 size_t high = position + kWidth < end ? position + 1 : end - kWidth + 1;
968 for (size_t shingle_position = low;
969 shingle_position < high;
970 ++shingle_position) {
971 Shingle* overlapping_shingle = instances_.at(shingle_position);
972 affected_shingles->insert(overlapping_shingle);
977 void RemovePatternsNeedingUpdatesFromQueues() {
978 for (ShinglePatternSet::iterator p = patterns_needing_updates_.begin();
979 p != patterns_needing_updates_.end();
981 RemovePatternFromQueues(*p);
985 void AddPatternsNeedingUpdatesToQueues() {
986 for (ShinglePatternSet::iterator p = patterns_needing_updates_.begin();
987 p != patterns_needing_updates_.end();
989 AddPatternToQueues(*p);
991 variable_queue_.ApplyPendingUpdates();
994 void RemovePatternFromQueues(const ShinglePattern* pattern) {
995 int single_use_score = SingleUseScore(pattern);
996 if (single_use_score > 0) {
997 size_t n = single_use_pattern_queue_.erase(pattern);
999 } else if (pattern->program_histogram_.empty() &&
1000 pattern->model_histogram_.empty()) {
1001 NOTREACHED(); // Should not come back to life.
1002 } else if (pattern->program_histogram_.empty()) {
1005 active_non_single_use_patterns_.erase(pattern);
1006 AddPatternToLabelQueue(pattern, -1);
1010 void AddPatternToQueues(const ShinglePattern* pattern) {
1011 int single_use_score = SingleUseScore(pattern);
1012 if (single_use_score > 0) {
1013 single_use_pattern_queue_.insert(pattern);
1014 } else if (pattern->program_histogram_.empty() &&
1015 pattern->model_histogram_.empty()) {
1016 } else if (pattern->program_histogram_.empty()) {
1019 active_non_single_use_patterns_.insert(pattern);
1020 AddPatternToLabelQueue(pattern, +1);
1024 void AddPatternToLabelQueue(const ShinglePattern* pattern, int sign) {
1025 // For each possible assignment in this pattern, update the potential
1026 // contributions to the LabelInfo queues.
1028 // We want to find for each symbol (LabelInfo) the maximum contribution that
1029 // could be achieved by making shingle-wise assignments between shingles in
1030 // the model and shingles in the program.
1032 // If the shingles in the histograms are independent (no two shingles have a
1033 // symbol in common) then any permutation of the assignments is possible,
1034 // and the maximum contribution can be found by taking the maximum over all
1037 // If the shingles are dependent two things happen. The maximum
1038 // contribution to any given symbol is a sum because the symbol has
1039 // contributions from all the shingles containing it. Second, some
1040 // assignments are blocked by previous incompatible assignments. We want to
1041 // avoid a combinatorial search, so we ignore the blocking.
1043 const size_t kUnwieldy = 5;
1045 typedef std::map<LabelInfo*, int> LabelToScore;
1046 typedef std::map<LabelInfo*, LabelToScore > ScoreSet;
1049 size_t n_model_samples = 0;
1050 for (ShinglePattern::Histogram::const_iterator model_iter =
1051 pattern->model_histogram_.begin();
1052 model_iter != pattern->model_histogram_.end();
1054 if (++n_model_samples > kUnwieldy) break;
1055 const ShinglePattern::FreqView& model_freq = *model_iter;
1056 int m1 = model_freq.count();
1057 const Shingle* model_instance = model_freq.instance();
1060 size_t n_program_samples = 0;
1061 for (ShinglePattern::Histogram::const_iterator program_iter =
1062 pattern->program_histogram_.begin();
1063 program_iter != pattern->program_histogram_.end();
1065 if (++n_program_samples > kUnwieldy) break;
1066 const ShinglePattern::FreqView& program_freq = *program_iter;
1067 int p1 = program_freq.count();
1068 const Shingle* program_instance = program_freq.instance();
1070 // int score = p1; // ? weigh all equally??
1071 int score = std::min(p1, m1);
1073 for (uint8_t i = 0; i < Shingle::kWidth; ++i) {
1074 LabelInfo* program_info = program_instance->at(i);
1075 LabelInfo* model_info = model_instance->at(i);
1076 if ((model_info->assignment_ == nullptr) !=
1077 (program_info->assignment_ == nullptr)) {
1078 VLOG(2) << "ERROR " << i
1079 << "\n\t" << ToString(pattern, 10)
1080 << "\n\t" << ToString(program_instance)
1081 << "\n\t" << ToString(model_instance);
1083 if (!program_info->assignment_ && !model_info->assignment_) {
1084 sums[program_info][model_info] += score;
1089 for (ScoreSet::iterator assignee_iterator = sums.begin();
1090 assignee_iterator != sums.end();
1091 ++assignee_iterator) {
1092 LabelInfo* program_info = assignee_iterator->first;
1093 for (LabelToScore::iterator p = assignee_iterator->second.begin();
1094 p != assignee_iterator->second.end();
1096 LabelInfo* model_info = p->first;
1097 int score = p->second;
1098 int* slot = &maxima[program_info][model_info];
1099 *slot = std::max(*slot, score);
1104 for (ScoreSet::iterator assignee_iterator = maxima.begin();
1105 assignee_iterator != maxima.end();
1106 ++assignee_iterator) {
1107 LabelInfo* program_info = assignee_iterator->first;
1108 for (LabelToScore::iterator p = assignee_iterator->second.begin();
1109 p != assignee_iterator->second.end();
1111 LabelInfo* model_info = p->first;
1112 int score = sign * p->second;
1113 variable_queue_.AddPendingUpdate(program_info, model_info, score);
1118 void AssignOne(LabelInfo* model_info, LabelInfo* program_info) {
1119 LOG_ASSERT(!model_info->assignment_);
1120 LOG_ASSERT(!program_info->assignment_);
1121 LOG_ASSERT(model_info->is_model_);
1122 LOG_ASSERT(!program_info->is_model_);
1124 VLOG(3) << "Assign " << ToString(program_info)
1125 << " := " << ToString(model_info);
1127 ShingleSet affected_shingles;
1128 AddAffectedPositions(model_info, &affected_shingles);
1129 AddAffectedPositions(program_info, &affected_shingles);
1131 for (ShingleSet::iterator p = affected_shingles.begin();
1132 p != affected_shingles.end();
1134 patterns_needing_updates_.insert((*p)->pattern());
1137 RemovePatternsNeedingUpdatesFromQueues();
1139 for (ShingleSet::iterator p = affected_shingles.begin();
1140 p != affected_shingles.end();
1145 program_info->label_->index_ = model_info->label_->index_;
1147 model_info->assignment_ = program_info;
1148 program_info->assignment_ = model_info;
1150 for (ShingleSet::iterator p = affected_shingles.begin();
1151 p != affected_shingles.end();
1156 AddPatternsNeedingUpdatesToQueues();
1159 bool AssignFirstVariableOfHistogramHead(const ShinglePattern& pattern) {
1160 const ShinglePattern::FreqView& program_1 =
1161 *pattern.program_histogram_.begin();
1162 const ShinglePattern::FreqView& model_1 = *pattern.model_histogram_.begin();
1163 const Shingle* program_instance = program_1.instance();
1164 const Shingle* model_instance = model_1.instance();
1165 size_t variable_index = pattern.index_->first_variable_index_;
1166 LabelInfo* program_info = program_instance->at(variable_index);
1167 LabelInfo* model_info = model_instance->at(variable_index);
1168 AssignOne(model_info, program_info);
1172 bool FindAndAssignBestLeader() {
1173 LOG_ASSERT(patterns_needing_updates_.empty());
1175 if (!single_use_pattern_queue_.empty()) {
1176 const ShinglePattern& pattern = **single_use_pattern_queue_.begin();
1177 return AssignFirstVariableOfHistogramHead(pattern);
1180 if (variable_queue_.empty())
1183 const VariableQueue::ScoreAndLabel best = variable_queue_.first();
1184 int score = best.first;
1185 LabelInfo* assignee = best.second;
1187 // TODO(sra): score (best.first) can be zero. A zero score means we are
1188 // blindly picking between two (or more) alternatives which look the same.
1189 // If we exit on the first zero-score we sometimes get 3-4% better total
1190 // compression. This indicates that 'infill' is doing a better job than
1191 // picking blindly. Perhaps we can use an extended region around the
1192 // undistinguished competing alternatives to break the tie.
1194 variable_queue_.Print();
1198 AssignmentCandidates* candidates = assignee->candidates();
1199 if (candidates->empty())
1200 return false; // Should not happen.
1202 AssignOne(candidates->top_candidate(), assignee);
1207 // The trace vector contains the model sequence [0, model_end_) followed by
1208 // the program sequence [model_end_, trace.end())
1209 const Trace& trace_;
1212 // |shingle_instances_| is the set of 'interned' shingles.
1213 Shingle::OwningSet shingle_instances_;
1215 // |instances_| maps from position in |trace_| to Shingle at that position.
1216 std::vector<Shingle*> instances_;
1218 SingleUsePatternQueue single_use_pattern_queue_;
1219 ShinglePatternSet active_non_single_use_patterns_;
1220 VariableQueue variable_queue_;
1222 // Transient information: when we make an assignment, we need to recompute
1223 // priority queue information derived from these ShinglePatterns.
1224 ShinglePatternSet patterns_needing_updates_;
1226 typedef std::map<ShinglePattern::Index,
1227 ShinglePattern, ShinglePatternIndexLess> IndexToPattern;
1228 IndexToPattern patterns_;
1231 class Adjuster : public AdjustmentMethod {
1233 Adjuster() : prog_(nullptr), model_(nullptr) {}
1235 Adjuster(const Adjuster&) = delete;
1236 Adjuster& operator=(const Adjuster&) = delete;
1238 ~Adjuster() = default;
1240 bool Adjust(const AssemblyProgram& model, AssemblyProgram* program) {
1241 VLOG(1) << "Adjuster::Adjust";
1248 prog_->UnassignIndexes();
1251 CollectTraces(model_, &abs32_trace_, &rel32_trace_, true);
1252 size_t abs32_model_end = abs32_trace_.size();
1253 size_t rel32_model_end = rel32_trace_.size();
1254 CollectTraces(prog_, &abs32_trace_, &rel32_trace_, false);
1255 Solve(abs32_trace_, abs32_model_end);
1256 Solve(rel32_trace_, rel32_model_end);
1257 prog_->AssignRemainingIndexes();
1262 void CollectTraces(const AssemblyProgram* program, Trace* abs32, Trace* rel32,
1264 label_info_maker_.ResetDebugLabel();
1266 for (Label* label : program->abs32_label_annotations())
1267 ReferenceLabel(abs32, is_model, label);
1268 for (Label* label : program->rel32_label_annotations())
1269 ReferenceLabel(rel32, is_model, label);
1271 // TODO(sra): we could simply append all the labels in index order to
1272 // incorporate some costing for entropy (bigger deltas) that will be
1273 // introduced into the label address table by non-monotonic ordering. This
1274 // would have some knock-on effects to parts of the algorithm that work on
1275 // single-occurrence labels.
1278 void Solve(const Trace& model, size_t model_end) {
1279 base::Time start_time = base::Time::Now();
1280 AssignmentProblem a(model, model_end);
1282 VLOG(1) << " Adjuster::Solve "
1283 << (base::Time::Now() - start_time).InSecondsF();
1286 void ReferenceLabel(Trace* trace, bool is_model, Label* label) {
1287 trace->push_back(label_info_maker_.MakeLabelInfo(
1288 label, is_model, static_cast<uint32_t>(trace->size())));
1291 raw_ptr<AssemblyProgram> prog_; // Program to be adjusted, owned by caller.
1292 raw_ptr<const AssemblyProgram>
1293 model_; // Program to be mimicked, owned by caller.
1295 LabelInfoMaker label_info_maker_;
1298 ////////////////////////////////////////////////////////////////////////////////
1300 } // namespace adjustment_method_2
1302 AdjustmentMethod* AdjustmentMethod::MakeShingleAdjustmentMethod() {
1303 return new adjustment_method_2::Adjuster();
1306 } // namespace courgette