README for corewatcher The corewatcher package provides a daemon for monitoring a system for crashes. Crashes are analyzed and summary crash report information is sent to a crashdb server. The daemon is managed by a systemd unit file, corewatcher.service. Configuration is stored in /etc/corewatcher. Corefiles are assumed to be written to /var/lib/corewatcher by the kernel with /proc/sys/kernel settings of: core_pattern=/var/lib/corewatcher/core_%e_%t core_uses_pid=1 To build and run, use the standard autotools workflow like: ./configure make sudo make install =========================================================================== The corewatcher daemon can be considered to be a state machine with the following 5 possible states and the listed major functions called for state transitions: S1: core_folder has no core_* | | crash happens leading to inotification | S2: core_folder has core_* present | | scan_core_folder() | move_core(fullpath, "to-process") | S3: processed_folder has some core_*.to-process or processed_folder has some core_*.processed, but no associated *.txt | | scan_processed_folder() | create_report() | (calls gdb, creates report summary *.txt) | S4: processed_folder has some core_*.processed, and associated *.txt | | queue_backtrace() . . . unqueueing | | submit_loop(): a sleepy thread whose work condition is set in | queue_backtrace() and in the period timer | "cleanup" thread, submits *.txt and where | successful moves associated core_*.processed | to core_*.submitted | S5: processed_folder has only core_*.submitted and *.txt NOTES: o At daemon start any of the states in the filesystem could exist, so we need to do all of scan_core_folder(), scan_processed_folder() and submit_loop(). o During submission, crash reports are removed from the in-memory pending submission work list. If the curl POST then fails, the associated cores stay in the filesystem as "processed" files, and are placed back on the in-memory submission work list. - if client network is down and comes back up, an event notifier could trigger resubmit by toggling the submit_loop() condition variable - if server or intermediate connectivity was the problem, only a periodic timer can trigger resubmission by setting the work condition for submit_loop() - failed submissions should hang out at the end of the work queue in case there is something truly wrong with them so new reports have a better chance of getting through =========================================================================== Internals: locking & global state o bt_mtx: (submit.c) - protects: o bt_work GCond condition variable o bt_list struct oops linked list o bt_hash GHashTable of core file names o A struct oops may exist off of bt_list and still be referenced by name in be in bt_hash. Such a struct oops must exist if the core name is in the bt_hash. o pq_mtx: (coredump.c) - protects: o pq "processing queue" boolean: the actual queue is represented by the presence of files in filesystem, but this allow threads to signal there are new ones to process o pq_work GCond condition variable