Application Monitoring Tools – the trends (part-2)

In the previous post (part-1) we explored the various monitoring categories in an enterprise architecture. We also looked at how error tracking solutions can help quickly identify issues in the architecture.

To reiterate, these are the categories that together make up monitoring as a whole.

  • Error or Exception Tracking (covered in a previous post)
  • Log Management and Tracking (covered in this post)
  • Application Performance Monitoring
  • Distributed Tracing
  • User Experience Tracking
  • Infrastructure Monitoring

In this post we will look into a related area: log management.

What is a log management solution?

centralized logging.png

A log management solution is one that enables components of a software architecture to log data, collect the logged data, manage or retain logged data and visualize or report on it as needed.

When designing a service-oriented architecture or picking components for an enterprise service bus, one of the more common mistakes is to ignore the architecture for centralized, standardized logging.

The key drivers for adopting a centralized log management solution in an enterprise architecture tend to be operations, auditing and security (in that order). Often the decision to choose a centralized log management solution happens much later, after all other business aspects of the architecture have been dealt with.

This can lead to long troubleshooting windows, sub-optimal logging, lack of standardized logging mechanisms or nomenclature for logging data, poor performance, lack of clear audit trails and more.

What to look for in a log management solution

When choosing a log management and tracking solution, here are some key elements that you should consider in the architecture before picking a solution:

  • Volume – Volume of logging, frequency, rotation and sizing
  • Sources – Source of logs (mobile devices vs. on-prem vs. cloud)
  • Encryption – Requirements for encryption of logging data in transit or at rest
  • Reliability – understand what it takes to log reliably
    • Review the reliability requirements of logging systems and related infrastructure
    • Define how and whether logging data needs to be backed up
  • Security – restricting access to logging data to only those that should have access.
  • Libraries – Choice of logging libraries or systems across various languages
  • Management – Management interfaces for the control of these loggers (turning them on or off) and log levels at runtime in production
  • Log Levels – Default log levels in production (e.g. INFO) and the performance implications of logging in hot paths in production
  • Format – Outline the format of logged data across the enterprise
    • Structured (e.g. Zap) vs. unstructured logging (e.g. Apache logging).
    • Uniformity in syntax of log statements across services (date time, thread id, component name in output, initialization, shutdown statements etc.).
    • Operational format requirements for devops teams to deal with logs
  • Auditing – Functional audit requirements (e.g. login events, Call Detail Records or other related kinds)
  • Searching and Reporting – understand what it takes to easily search through and report on logged data
  • Alarming – Certain log errors or warnings may require alarming. Consider using an error tracking solution instead for this.
  • Sampling – Ability to sample logs (e.g. sample 10% of users and enable logging for them)
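
To make the sampling item above concrete, here is a minimal sketch in Python of per-user log sampling. It assumes that callers attach a `user_id` attribute to each log record (a hypothetical field name, not something any particular library requires) and that hashing the id is an acceptable way to pick the sample. It illustrates the idea only; it is not how any specific log management product implements sampling.

```python
import hashlib
import logging


def in_sample(user_id: str, percent: int) -> bool:
    """Return True if this user falls inside the sampled percentage.

    Hashing the user id keeps the decision stable across requests and
    services, so a sampled user is sampled everywhere.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # bucket in 0..99
    return bucket < percent


class UserSamplingFilter(logging.Filter):
    """Let DEBUG records through only for sampled users.

    Records above DEBUG always pass, so INFO and above are unaffected.
    """

    def __init__(self, percent: int) -> None:
        super().__init__()
        self.percent = percent

    def filter(self, record: logging.LogRecord) -> bool:
        if record.levelno > logging.DEBUG:
            return True
        # "user_id" is a hypothetical attribute supplied via `extra=`.
        user_id = getattr(record, "user_id", None)
        return user_id is not None and in_sample(user_id, self.percent)
```

A filter like this could be attached to a handler with `handler.addFilter(UserSamplingFilter(10))`, with callers passing `extra={"user_id": ...}` on their debug statements.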

With the above considerations, it may be wise to investigate a centralized structured log management solution if the budget permits it.

If there are any factors that are missing in the above list please leave a comment or send an email to blog@trakerr.io.

What information to log and at what level

Once a log management solution is picked, it is also important to have a framework that lets developers choose what information to log and at what level. This again is often overlooked and leads to random choices of logging data and levels.

Here are some aspects that should be documented as part of the architecture on what gets logged:

  • Decide on a common logging format output across services. Preferably use the same configuration for logging across the entire architecture.
  • Outline information that MUST NOT be logged. For example, any PII (Personally Identifiable Information) must either not be logged at all or be encrypted before being logged.
  • Ensure contextual data associated with errors is readily available in the logs. The last thing you want is needing to turn on debug logging for an error that has already happened.
  • Take care in defining what information is too verbose to be logged at an INFO level as opposed to a DEBUG level especially when INFO may be the default level.
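
As a rough illustration of the first two points, the sketch below uses Python's standard `logging` module to emit every record in one JSON shape and to mask fields that must not be logged. The field names in `REDACTED_KEYS`, the `context` attribute and the `service` parameter are assumptions made for this example; a real deployment would define its own list and might encrypt rather than mask.

```python
import json
import logging
from datetime import datetime, timezone

# Hypothetical PII field names; each organization defines its own list.
REDACTED_KEYS = {"email", "ssn", "password"}


class JsonFormatter(logging.Formatter):
    """Render every record in the same JSON shape across services."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        # "context" is a hypothetical attribute supplied via `extra=`.
        context = getattr(record, "context", {})
        safe_context = {
            key: ("[REDACTED]" if key in REDACTED_KEYS else value)
            for key, value in context.items()
        }
        payload = {
            "time": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "service": self.service,
            "thread": record.thread,
            "logger": record.name,
            "message": record.getMessage(),
            "context": safe_context,
        }
        return json.dumps(payload, default=str)
```

Sharing one formatter like this between services is one way to keep the date, thread id and component name consistent in the output.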

Trakerr.IO as a structured logging platform

trakerr-carousel-3.gif

While there are several tools out there that allow you to log data, Trakerr.IO is a great platform to use in production that not only lets you log events in a structured way but also lets you capture other related information like errors and performance data all under a single platform.

Trakerr.IO offers many of the requirements that we outlined earlier for a log management solution by plugging into logging libraries in many popular languages.

What kind of data to log to Trakerr.IO

We recommend starting by logging only critical events in a structured way with Trakerr.IO. Why? Because if you perform debug or trace level logging on your production service for every call, and performance is critical, it may end up impacting performance at large scale. Depending on your use-case it may still be possible to log everything to Trakerr.IO.

Each event in Trakerr.IO is identified by an event type and classification that can be supplied through the API / SDK. The event type and classification are completely customizable and can represent anything within the application.

So you may ask, what defines a critical event or something that may be useful to log to Trakerr.IO? Here are a few examples of what may be critical:

  • Page view events
  • User click events
  • Application install events
  • Application start / shutdown events
  • Database connections or calls
  • User login / logoff events
  • and any troubleshooting related event

With each of the above, Trakerr.IO’s SDK also lets you log Performance, User and Session data along with the log statement.

For example, you can log that a database call took place along with the operation time in milliseconds for that operation to complete and which database was used to make that call.

This provides a more structured log statement that can be logged in a standardized way across different micro-services.
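
The database example can be sketched as follows. This is not the Trakerr.IO SDK; `LogEvent`, `timed_event` and the `sink` callable are hypothetical names used only to show what a structured event with an event type, a classification and an operation time might look like.

```python
import json
import time
from contextlib import contextmanager
from dataclasses import asdict, dataclass, field


@dataclass
class LogEvent:
    """A structured event; the field names are illustrative assumptions."""

    event_type: str
    classification: str
    operation_time_ms: float = 0.0
    user: str = ""
    session: str = ""
    properties: dict = field(default_factory=dict)


@contextmanager
def timed_event(event_type, classification, sink, **properties):
    """Time the wrapped block and hand the finished event to `sink` as JSON.

    `sink` is any callable taking a string, e.g. `print` or a logger method.
    """
    event = LogEvent(event_type, classification, properties=dict(properties))
    start = time.perf_counter()
    try:
        yield event
    finally:
        # Record the elapsed time even if the wrapped block raised.
        event.operation_time_ms = (time.perf_counter() - start) * 1000.0
        sink(json.dumps(asdict(event)))
```

Wrapping a query in `with timed_event("database.call", "info", print, database="orders") as event:` and setting `event.user` inside the block would produce one JSON line containing the event type, the database name, the user and the elapsed milliseconds.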

Trakerr.IO lets you dig into logged events quicker

Once this data is logged to Trakerr.IO, you now have powerful search, segmentation and alarming capabilities.

Searching

Searching now is super easy with this data being indexed.

Zoom into a user or session

You can now also view related events for a specific user or session.

Capture OS information, IP, CPU and Memory

Capture additional data along with each event, including OS information, IP, CPU and memory utilization automatically when using the Trakerr.IO SDK.

FlexDataCapture.png

Track perf data like operational time and compute math

Trakerr lets you look at performance data and compute SUM, AVERAGE and PERCENTILE (25th, 99th percentile etc.) on numeric data such as operation time (the time taken for an operation) and more.

ComputeMetrics.png
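
For readers who want to see what these aggregates mean, here is a small Python sketch that computes them over a list of operation times. The source does not say which percentile method Trakerr uses, so the nearest-rank definition below is an assumption, and `summarize` is a hypothetical helper name.

```python
import statistics


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of
    the data at or below it. `pct` is an integer from 1 to 100."""
    if not values:
        raise ValueError("percentile of an empty list is undefined")
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100, to avoid floating point rounding.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def summarize(operation_times_ms):
    """Return the sum, average, 25th and 99th percentile of the timings."""
    return {
        "sum": sum(operation_times_ms),
        "average": statistics.fmean(operation_times_ms),
        "p25": percentile(operation_times_ms, 25),
        "p99": percentile(operation_times_ms, 99),
    }
```

High percentiles such as the 99th are often more telling than the average for operation times, because a handful of slow calls can hide behind a healthy-looking mean.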

Segment data based on criteria

Trakerr also lets you segment the logging data by application version, browser and many more segments.

Trakerr.IO’s SDK is offered in many languages

Trakerr can integrate with many of the popular tools available in the market to make your life simpler.

Language Support

Find out more, sign up for a free trial

Get started tracking errors or exceptions by signing up for a free trial at https://trakerr.io/#/.

We would love to hear your thoughts on Trakerr, so please write to us at feedback@trakerr.io.

Trakerr Overview

Application Monitoring Tools – the trends (part-1)

A trend over the past decade or so has been the development of Application Monitoring tools that run in production to help developers, devops, operations teams, product leaders and product managers understand how an application or a set of applications is performing.

With the development of agile methodologies and also an increased interest in microservices, the need to monitor applications in real-time in production has become paramount.

Not utilizing such tools can have a direct impact on customer satisfaction, decrease reliability or increase response times to incidents and can result in lost opportunity, lost sales or poor retention.

Some of these tools focus on user experience improvements, some on application performance monitoring (APM) and others on operational issues such as errors.

In this post, we will explore the various tools that exist in the market today and what these tools offer and how you can decide what’s best for your business.

What exists in the market today can be broadly divided into the following categories based on what pain point the tools solve:

  • Error or Exception Tracking (covered in this post)
  • Log Management and Tracking (covered in part-2)
  • Application Performance Monitoring
  • Distributed Tracing
  • User Experience Tracking
  • Infrastructure Monitoring

Some tools overlap into other areas even though each has one primary focus.

We will go over the first in this post and subsequent posts will do a deep-dive into other individual categories.

What is Error or Exception Tracking?

TrakerrErrors.png

Tracking errors or exceptions that are handled or unhandled in applications is critical in maintaining a high standard of software quality. Left unattended, these can start to impact customer experience and eventually sales.

Since logging in production systems is kept to a minimum so as not to affect performance, the primary fallback is to capture errors or warnings. This is where these tools shine.

It is not only important to detect these prior to rollout to production but to also monitor these in production when issues happen.

In the past, developers have resorted to scrounging through log files to detect and track exceptions. These days, with the advent of error or exception tracking tools, developers or operational teams have a lot less to worry about.

Various tools take different approaches to the problem. Some require an SDK to be integrated; others need another operational system deployed alongside the application to monitor it.

What does exception tracking or error tracking get me?

  • Centralized management – All errors or exceptions from all servers, devices in a single place. No more scrounging through logs or various tools such as Jira etc. Consolidate errors into a single cloud tool that works for you.
  • Deduplication – The ability to group similar errors together under one unique entry, so you don’t miss an error in the log when a lot of them are generated
  • Impact and Prioritization – Understanding how an error or exception impacted customers, users, sessions and servers. Ordering of errors by the counts of users impacted or sessions impacted allows you to better prioritize which errors need to be addressed first.
  • Segmentation – Allows you to segment errors by server, by region, by browser, by user, by session, by geo-location, by any custom segments or parameters and detect when an error peaks within a segment (like region or within a browser).
  • Full Context – The ability to capture full stack or error traces from within the application, both handled and unhandled. Most solutions also let developers send or extract custom properties that provide more information, like which user or session caused the error. This way a developer has fewer reasons to have to go through logs.
  • Timeline views – The ability to track when the error happened and with what frequency
  • Alarming – The ability to alarm or notify when an error peaks
  • Integrations – The ability to automatically open a trouble tracking ticket in the tool of your choice when these issues happen
  • Resolution and Regressions – The ability to mark errors as resolved and be notified if the problem were to recur, i.e. regressions
  • Deployment Tracking – The ability to track deployments and to relate errors that happen with a particular deployment
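
Deduplication and prioritization from the list above can be sketched in a few lines of Python. The fingerprint here is simply a hash of the error type and its stack frames, and each occurrence is a dict with hypothetical `type`, `frames` and `user` keys. Real tools use more elaborate grouping rules, so treat this as an illustration of the idea rather than any product's algorithm.

```python
import hashlib
from collections import defaultdict


def fingerprint(error_type, frames):
    """Identify 'the same error' by its type and stack frames."""
    key = error_type + "|" + "|".join(frames)
    return hashlib.sha1(key.encode("utf-8")).hexdigest()[:12]


def group_errors(occurrences):
    """Group occurrences by fingerprint, ordered by users impacted.

    Each occurrence is a dict with "type", "frames" and "user" keys
    (hypothetical field names chosen for this example).
    """
    groups = defaultdict(lambda: {"type": "", "count": 0, "users": set()})
    for occ in occurrences:
        group = groups[fingerprint(occ["type"], occ["frames"])]
        group["type"] = occ["type"]
        group["count"] += 1
        group["users"].add(occ["user"])
    # Errors that touch the most distinct users come first.
    return sorted(groups.values(), key=lambda g: len(g["users"]), reverse=True)
```

Ordering by distinct users rather than raw counts keeps one noisy user from pushing a widely felt error down the list.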

The above are some of the key features that make up an error tracking tool. There are more advanced features such as segmentation, error correlation and a few more that we will be exploring in subsequent posts.

Introducing Trakerr.IO – as an error tracking tool

Trakerr is a new cloud tool that gives developers and product leaders a means to capture application errors or exceptions, as well as some performance metrics, side by side using Trakerr’s SDK.

ApplicationMetrics.png

Trakerr lets you track not just errors but also some performance metrics, such as latencies for calls to databases, CPU and memory usage and more, with a little more work.

Integrating Trakerr.IO

Integrating Trakerr.IO is simple in most languages.

Most languages offer a logging tool to log exceptions. Trakerr.IO plugs directly into these logging libraries and can get you started tracking exceptions in a matter of minutes.

Trakerr.IO offers SDKs in various popular languages to integrate your errors into Trakerr.
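
To show what plugging into a logging library can look like in general, here is a minimal Python sketch built on the standard `logging` module. It is not the Trakerr.IO SDK: `ErrorCaptureHandler` and its `send` callable are hypothetical, and a real integration would transmit the event to a tracking service instead of calling a local function.

```python
import logging
import traceback


class ErrorCaptureHandler(logging.Handler):
    """Forward ERROR-and-above records, with stack traces, to `send`.

    `send` is any callable taking a dict; it stands in for the code
    that would deliver the event to an error tracking backend.
    """

    def __init__(self, send):
        super().__init__(level=logging.ERROR)
        self.send = send

    def emit(self, record):
        try:
            event = {
                "logger": record.name,
                "level": record.levelname,
                "message": record.getMessage(),
                "stacktrace": None,
            }
            if record.exc_info:
                # exc_info is the (type, value, traceback) tuple.
                event["stacktrace"] = "".join(
                    traceback.format_exception(*record.exc_info)
                )
            self.send(event)
        except Exception:
            # Never let a reporting failure break the application.
            self.handleError(record)
```

Once such a handler is added to a logger with `logger.addHandler(...)`, existing `logger.exception(...)` calls are captured without any other code changes, which is why this style of integration tends to be quick to adopt.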

Language Support.png

Who does it benefit?

Error or exception capture tools can be used by

  • Developers or devops
    This is the obvious one: developers get access to full stack traces or error traces along with contextual data like version, deployment, browser, OS, memory or CPU information, with errors grouped together for easy bug tracking in production or other environments.
  • QA or quality engineers
    QA engineers track bugs using bug tracking applications such as Jira. When errors happen in applications, developers or QA create a new error in a bug tracking tool such as Jira for the developers to fix. This can become tedious, and tools such as Trakerr offer the ability to automate this for you by automatically creating the errors in Jira.
  • Product leaders, development managers
    This is a little less obvious, but development managers and product leaders can get a bird’s-eye view of errors happening in the application when an error or exception occurs. It also helps them prioritize which errors to fix first, with the ability to view the number of customers or sessions that were impacted by a specific error.
  • Operation teams
    Operation teams often get good insights into operational issues that happen at various levels. Often this does not capture full information such as stack traces but merely error counts. Having access to full stack traces, the same ones that developers have access to, will be incredibly useful in ensuring that all teams are on the same page.
  • Product or program management teams
    Product or program management teams care about the customer’s experience. With the ability of tools such as Trakerr to track how customers are affected by an error, these teams can now track, improve and communicate the progress of error fixes to customers and set their expectations.

Integrations

Trakerr can integrate with many of the popular tools available in the market to make your life simpler.

Integrations.png

Trakerr Overview.png