Refactor malloc tracing, synchronize most of the data handling.
I had completely forgotten that memory allocated in one thread can
be freed in a different thread, which is perfectly valid. To support
this, the data handler must be centralized after all.
This is not too bad: it lets us merge the data from all threads into
one file with the events properly interleaved, which will help the
later evaluation.
I tried to keep lock contention on the shared handler as low as
possible. Further evaluation and profiling will be done later.
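
For illustration, here is a minimal sketch of what such a centralized
handler could look like. It assumes a single mutex guards both the
pointer map and the shared output stream. All names (TraceData,
handleMalloc, handleFree, demoCrossThreadFree) and the output format
are hypothetical and are not the actual code in this commit.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <ostream>
#include <sstream>
#include <string>
#include <thread>
#include <unordered_map>

// Hypothetical central handler shared by all threads.
class TraceData
{
public:
    explicit TraceData(std::ostream& out) : m_out(out) {}

    // Record an allocation; may be called from any thread.
    void handleMalloc(const void* ptr, std::size_t size)
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        m_sizes[ptr] = size;
        m_out << "+ " << size << ' ' << ptr << ' '
              << std::this_thread::get_id() << '\n';
    }

    // Record a deallocation; the pointer may come from another thread.
    // Returns false if the pointer was never recorded.
    bool handleFree(const void* ptr)
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        auto it = m_sizes.find(ptr);
        if (it == m_sizes.end()) {
            return false;
        }
        m_out << "- " << it->second << ' ' << ptr << ' '
              << std::this_thread::get_id() << '\n';
        m_sizes.erase(it);
        return true;
    }

private:
    std::mutex m_mutex; // guards m_sizes and m_out
    std::ostream& m_out;
    std::unordered_map<const void*, std::size_t> m_sizes;
};

// Allocate in one thread, free in another, return the merged log.
std::string demoCrossThreadFree()
{
    std::ostringstream out;
    TraceData data(out);
    int block = 0; // stands in for a heap allocation

    std::thread producer([&] { data.handleMalloc(&block, 16); });
    producer.join();

    bool known = false;
    std::thread consumer([&] { known = data.handleFree(&block); });
    consumer.join();

    assert(known); // the free found the malloc from the other thread
    return out.str();
}
```

With per-thread storage the free in the second thread would not find
the matching malloc. Because every event goes through the same lock,
the single log also lists the events in the order they were recorded.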