Improve accuracy of benchmarking
For small code regions, readtsc can give inaccurate results because it does
not account for out-of-order execution. Add x86_tsc_start and x86_tsc_end,
which account for this by following the approach in the Intel white paper at
https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-32-ia-64-benchmark-code-execution-paper.pdf
Using x86_tsc_start/end also adds several instructions to each measurement;
I expect this overhead to be negligible.
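For reference, a minimal sketch of the pattern the white paper describes:
CPUID followed by RDTSC at the start, and RDTSCP followed by CPUID at the
end. The sketch_* names are hypothetical and this is not necessarily the
code added by this change. It assumes GCC or Clang inline assembly and a
CPU that supports RDTSCP; on other architectures the stubs only keep the
file compiling.

```c
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)

/* CPUID is a serializing instruction: it does not execute until all
 * earlier instructions have completed, and later instructions do not
 * start until it has finished. Leaf 0 is requested and the result is
 * discarded. */
static inline void sketch_serialize(void) {
  uint32_t eax = 0, ebx, ecx = 0, edx;
  __asm__ __volatile__("cpuid"
                       : "+a"(eax), "=b"(ebx), "+c"(ecx), "=d"(edx)
                       :
                       : "memory");
  (void)eax; (void)ebx; (void)ecx; (void)edx;
}

/* Start of a measurement: serialize first, so that work issued before
 * the measured region cannot drift past the timestamp read. */
static inline uint64_t sketch_tsc_start(void) {
  uint32_t lo, hi;
  sketch_serialize();
  __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi) : : "memory");
  return ((uint64_t)hi << 32) | lo;
}

/* End of a measurement: RDTSCP waits for the measured code to finish
 * before reading the counter; the CPUID afterwards keeps later
 * instructions from starting before the read. */
static inline uint64_t sketch_tsc_end(void) {
  uint32_t lo, hi, aux;
  __asm__ __volatile__("rdtscp" : "=a"(lo), "=d"(hi), "=c"(aux) : : "memory");
  sketch_serialize();
  (void)aux;
  return ((uint64_t)hi << 32) | lo;
}

#else

/* Assumption: no TSC on this architecture; stubs always return 0. */
static inline uint64_t sketch_tsc_start(void) { return 0; }
static inline uint64_t sketch_tsc_end(void) { return 0; }

#endif
```

A caller would wrap the region under test between the two calls and
subtract the results. The CPUID at the end runs after the counter has been
read, so only the CPUID and RDTSC at the start and the RDTSCP at the end
contribute overhead to the measured interval.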
Change-Id: I54a1c8fa7977c34bf91b422369c96f036c93a08a