In the previous post (part-1) we explored the various monitoring categories in an enterprise architecture. We also looked at how error tracking solutions can help quickly identify issues in the architecture.
To reiterate, these are the categories that encompass monitoring as a whole.
- Error or Exception Tracking (covered in a previous post)
- Log Management and Tracking (covered in this post)
- Application Performance Monitoring
- Distributed Tracing
- User Experience Tracking
- Infrastructure Monitoring
In this post, we will look at another related area: log management.
What is a log management solution?
A log management solution enables the components of a software architecture to emit log data, and then collects, retains, and visualizes or reports on that data as needed.
When designing a service-oriented architecture or picking components for an enterprise service bus, one of the more common mistakes is to ignore centralized, standardized logging.
Some of the key drivers for adopting a centralized log management solution in an enterprise architecture tend to be operations, auditing, and security (in that order). Often, this decision happens much later, after all other business aspects of the architecture have been dealt with.
This can lead to long troubleshooting windows, sub-optimal logging, a lack of standardized logging mechanisms or nomenclature for log data, poor performance, a lack of clear audit trails, and more.
What to look for in a log management solution
When choosing a log management and tracking solution, here are some key elements that you should consider in the architecture before picking a solution:
- Volume – Volume of logging, frequency, rotation and sizing
- Sources – Source of logs (mobile devices vs. on-prem vs. cloud)
- Encryption – Requirements for encryption of log data in transit or at rest.
- Reliability – understand what it takes to log reliably
- Review the reliability requirements of logging systems and related infrastructure
- Define how and whether logging data needs to be backed up
- Security – restricting access to logging data to only those that should have access.
- Libraries – Choice of logging libraries or systems across various languages
- Management – Management interfaces for the control of these loggers (turning them on or off) and log levels at runtime in production
- Log Levels – Default log levels in production (e.g., INFO) and the performance implications of logging in hot paths in production
- Format – Outline the format of logged data across the enterprise
- Structured (e.g., Zap) vs. unstructured logging (e.g., Apache logging).
- Uniformity in the syntax of log statements across services (date/time, thread ID, component name in the output, initialization and shutdown statements, etc.).
- Operational format requirements for DevOps teams that deal with logs
- Auditing – Functional audit requirements (e.g., login events, Call Detail Records, or other related kinds)
- Searching and Reporting – understand what it takes to easily search through and report on logged data
- Alarming – Certain log errors or warnings may require alerts. Consider using an error tracking solution for this instead.
- Sampling – Ability to sample logs (e.g., sample 10% of users and enable logging only for them)
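The structured-vs-unstructured distinction in the list above can be illustrated with a short sketch. This uses Python's standard `logging` and `json` modules; the logger names and field names are illustrative, not tied to any particular product:

```python
import json
import logging

logging.basicConfig(level=logging.INFO)

# Unstructured: one free-form string that a human can read,
# but a search tool must parse with regexes after the fact.
plain = logging.getLogger("orders.plain")
plain.info("order 1234 shipped to user 42 in 87 ms")

# Structured: every fact is a named field, so it can be indexed
# and queried directly (e.g., "duration_ms > 50").
class JsonFormatter(logging.Formatter):
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            **getattr(record, "fields", {}),  # extra structured fields
        })

structured = logging.getLogger("orders.structured")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
structured.addHandler(handler)
structured.propagate = False  # avoid double output via the root logger

structured.info(
    "order shipped",
    extra={"fields": {"order_id": 1234, "user_id": 42, "duration_ms": 87}},
)
```

The structured variant emits one JSON object per event, which is what makes the searching, segmentation, and alarming features discussed later in this post practical at scale.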
With the above considerations, it may be wise to investigate a centralized structured log management solution if the budget permits it.
If any factors are missing from the above list, please leave a comment or send an email to firstname.lastname@example.org.
What information to log and at what level
Once a log management solution is picked, it is also important to have a framework that lets developers decide what information to log and at what level. This, again, is often overlooked, and leads to arbitrary choices of log data and levels.
Here are some aspects that should be documented as part of the architecture on what gets logged:
- Decide on a common logging format output across services. Preferably use the same configuration for logging across the entire architecture.
- Outline information that MUST NOT be logged. For example, any PII (Personally Identifiable Information) must not be logged, or should be encrypted before being logged.
- Ensure that contextual data associated with errors is readily available in the logs. The last thing you want is to have to turn on debug logging for an error that has already happened.
- Take care in defining what information is too verbose to be logged at the INFO level as opposed to the DEBUG level, especially when INFO is the default level.
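The first two points above can be sketched together: scrub PII before it reaches the log, and attach context to the error itself so no debug re-run is needed. This is a minimal illustration, assuming a simple field denylist; the field names (`email`, `order_id`, etc.) are hypothetical:

```python
import json
import logging

# Fields that must never appear in clear text in the logs (assumed list).
PII_DENYLIST = {"email", "ssn", "phone"}

def scrub(fields):
    """Replace denylisted (PII) fields with a redaction marker."""
    return {k: ("[REDACTED]" if k in PII_DENYLIST else v)
            for k, v in fields.items()}

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("checkout")

def log_error(message, **context):
    # Log the error together with its scrubbed context, so the log line
    # alone is enough to troubleshoot the failure after the fact.
    log.error("%s %s", message, json.dumps(scrub(context)))

log_error("payment declined",
          order_id=9876, email="user@example.com", retry_count=2)
```

A real deployment would more likely enforce scrubbing inside a shared logging library (or a logging `Filter`) so individual services cannot forget it, but the principle is the same.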
Trakerr.IO as a structured logging platform
While there are several tools that allow you to log data, Trakerr.IO is a platform worth using in production: it not only lets you log events in a structured way, but also captures related information such as errors and performance data, all in a single platform.
Trakerr.IO meets many of the requirements we outlined earlier for a log management solution by plugging into popular logging libraries in many languages.
What kind of data to log to Trakerr.IO
We recommend starting by logging only critical events in a structured way with Trakerr.IO. Why? Because if you perform debug- or trace-level logging on your production service for every call, and performance is critical, it may impact performance at large scale. Depending on your use case, it may still be possible to log everything to Trakerr.IO.
Each event in Trakerr.IO is identified by an event type and classification that can be supplied through the API / SDK. The event type and classification are completely customizable and can represent anything within the application.
So you may ask: what defines a critical event, or something that may be useful to log to Trakerr.IO? Here are a few examples of what may be critical:
- Page view events
- User click events
- Application install events
- Application start / shutdown events
- Database connections or calls
- User login / logoff events
- and any troubleshooting-related event
With each of the above, Trakerr.IO’s SDK also lets you log performance, user, and session data along with the log statement.
For example, you can log that a database call took place, along with how many milliseconds the operation took to complete and which database was used for the call.
This provides a more structured log statement that can be logged in a standardized way across different micro-services.
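The database-call example above can be sketched as a generic structured event. Note that this is an illustrative schema built around the event-type and classification concepts described earlier, not the actual Trakerr.IO SDK API:

```python
import json
import time

def db_call_event(db_name, operation, duration_ms):
    """Build a structured event for a database call.

    The schema here (event_type / classification / custom fields)
    is hypothetical, shown only to illustrate the pattern of logging
    the same event shape uniformly across micro-services.
    """
    return {
        "event_type": "DatabaseCall",   # customizable event type
        "classification": operation,    # e.g. "SELECT", "UPDATE"
        "custom": {
            "database": db_name,
            "operation_time_ms": duration_ms,
        },
    }

start = time.monotonic()
# ... run the actual query here ...
elapsed_ms = round((time.monotonic() - start) * 1000)

event = db_call_event("orders-db", "SELECT", elapsed_ms)
print(json.dumps(event))
```

Because every service emits the same event shape, numeric fields such as `operation_time_ms` can later be aggregated (sums, averages, percentiles) across the whole fleet.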
Trakerr.IO lets you dig into logged events quicker
Once this data is logged to Trakerr.IO, you now have powerful search, segmentation and alarming capabilities.
Because this data is indexed, searching through it is fast and straightforward.
Zoom into a user or session
You can now also view related events for a specific user or session.
Capture OS information, IP, CPU and Memory
Capture additional data along with each event, including OS information, IP, CPU and memory utilization automatically when using the Trakerr.IO SDK.
Track performance data like operation time and compute aggregates
Trakerr lets you look at performance data and compute SUM, AVERAGE, and PERCENTILE (25th, 99th, etc.) aggregates on numeric data such as operation time (the time taken for an operation) and more.
Segment data based on criteria
Trakerr also lets you segment the log data by application version, browser, and many other dimensions.
Trakerr.IO’s SDK is offered in many languages
Trakerr can integrate with many of the popular tools available in the market to make your life simpler.
Find out more, sign up for a free trial
Get started tracking errors or exceptions by signing up for a free trial at https://trakerr.io/#/.
We would love to hear your thoughts on Trakerr, so please write to us at email@example.com.