Slack has developed a custom telemetry and optimisation pipeline integrating generative AI, leading to significant cost savings and faster Spark workloads on Amazon EMR, marking a shift towards continuous, autonomous performance tuning.
At Slack, engineers confronted a familiar problem for modern data platforms: as ingestion grew to terabytes per day, existing tooling failed to give the visibility required to keep Apache Spark workloads on Amazon EMR both performant and cost‑efficient.
The core of Slack’s approach is a custom Spark listener that emits enriched, context‑aware events for applications, jobs, stages and tasks. Where Spark’s native metrics can be coarse, ephemeral and scattered between UIs and logs, Slack’s listener framework attaches Airflow DAG metadata, YARN and cluster identifiers, user and team context, and detailed failure causes so every event is queryable and durable. Those events are streamed in real time to Kafka, landed into an Apache Iceberg table and then aggregated via a Spark SQL pipeline into a compact, single‑row summary per application ID for downstream analysis. The company says this flow eliminates the “guesswork” of manual tuning and produces environment‑aware recommendations that can be applied automatically or reviewed as pull requests.
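Slack has not published the listener code itself, but the pattern it describes is straightforward to sketch. The snippet below is illustrative only: the class name, Kafka topic, configuration keys and metadata fields are assumptions, not Slack’s implementation.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.SparkConf
import org.apache.spark.scheduler.{SparkListener, SparkListenerApplicationEnd, SparkListenerApplicationStart, SparkListenerStageCompleted}

// Hypothetical listener: enriches Spark scheduler events with orchestration
// and ownership context, then streams them to Kafka as durable, queryable records.
class EnrichedEventListener(conf: SparkConf) extends SparkListener {

  // Context supplied at submit time via Spark conf (key names are illustrative).
  private val appTags = Map(
    "airflow_dag"  -> conf.get("spark.app.tags.airflow_dag", "unknown"),
    "airflow_task" -> conf.get("spark.app.tags.airflow_task", "unknown"),
    "owner_team"   -> conf.get("spark.app.tags.owner_team", "unknown")
  )

  private val producer = {
    val props = new Properties()
    props.put("bootstrap.servers", conf.get("spark.app.telemetry.brokers", "localhost:9092"))
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    new KafkaProducer[String, String](props)
  }

  private def appId: String = conf.get("spark.app.id", "unknown")

  // Serialise the tagged event as a flat JSON object and publish it.
  private def emit(eventType: String, fields: Map[String, Any]): Unit = {
    val payload = (appTags ++ fields + ("event_type" -> eventType))
      .map { case (k, v) => "\"" + k + "\":\"" + v + "\"" }
      .mkString("{", ",", "}")
    producer.send(new ProducerRecord("spark-telemetry", appId, payload))
  }

  override def onApplicationStart(event: SparkListenerApplicationStart): Unit =
    emit("app_start", Map("app_name" -> event.appName, "user" -> event.sparkUser))

  override def onStageCompleted(event: SparkListenerStageCompleted): Unit = {
    val info = event.stageInfo
    emit("stage_completed", Map(
      "stage_id"       -> info.stageId,
      "num_tasks"      -> info.numTasks,
      "failure_reason" -> info.failureReason.getOrElse("none")
    ))
  }

  override def onApplicationEnd(event: SparkListenerApplicationEnd): Unit = {
    emit("app_end", Map("end_time" -> event.time))
    producer.close()
  }
}
```

A listener in this style can be registered through spark.extraListeners so that every application on the cluster emits enriched events without per‑job code changes.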
Slack organises its diagnostics around what it calls the Five Pillars of Spark Monitoring: driving event metadata from Airflow and YARN, user configuration versus actual usage, execution insights focused on skew, spill and retries, runtime distribution across jobs, stages and tasks, and resource health metrics such as peak JVM heap and GC overhead. By comparing intended configurations (for example executor memory, cores per executor, dynamic allocation bounds and shuffle partitions) against observed behaviour, the framework aims to expose over‑ and under‑provisioning and the specific causes of latency or instability. Slack reports 30–50% reductions in compute cost and 40–60% faster job completion times after adopting the system, along with a large drop in developer time spent on tuning.
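The configured‑versus‑observed comparison lends itself to a simple aggregation over the event table. The query below is a sketch of that idea under assumed names: the telemetry.spark_events and telemetry.spark_app_summary tables, the column names and the 2x over‑provisioning heuristic are illustrative rather than Slack’s actual schema.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("spark-app-summary").getOrCreate()

// Roll raw listener events up into a single row per application and flag
// applications whose requested executor memory far exceeds the observed peak heap.
val summary = spark.sql(
  """
    |SELECT
    |  app_id,
    |  max(owner_team)                   AS owner_team,
    |  max(airflow_dag)                  AS airflow_dag,
    |  max(requested_executor_memory_mb) AS requested_executor_memory_mb,
    |  max(peak_jvm_heap_mb)             AS peak_jvm_heap_mb,
    |  max(shuffle_spill_bytes)          AS max_shuffle_spill_bytes,
    |  sum(CASE WHEN event_type = 'task_failed' THEN 1 ELSE 0 END) AS failed_tasks,
    |  max(requested_executor_memory_mb) > 2 * max(peak_jvm_heap_mb) AS likely_overprovisioned
    |FROM telemetry.spark_events
    |GROUP BY app_id
    |""".stripMargin)

// Persist the compact per-application summary for downstream analysis and AI-assisted tuning.
summary.writeTo("telemetry.spark_app_summary").createOrReplace()
```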
To accelerate remediation and scale expertise, Slack pairs its telemetry with an AI‑assisted tuning workflow. Metrics and aggregate diagnostics are surfaced through an internal analytics service and exposed via a Model Context Protocol (MCP) server so developers can connect AI‑assisted coding tools. Foundation models run inside the organisation’s AWS account using Amazon Bedrock to keep sensitive telemetry within the cloud boundary, while a carefully structured prompt enforces deterministic output: application overview, current config, job health summary, resource recommendations and an action summary formatted to be machine‑readable and reproducible. Slack frames the prompt as a way to limit hallucination and produce traceable, repeatable tuning advice.
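The article lists the prompt’s fixed output sections but not its wording, so the template below is only a sketch of how such a deterministic prompt could be assembled; the AppDiagnostics fields and the phrasing are assumptions, and the actual call to Amazon Bedrock is omitted.

```scala
// Sketch of a deterministic prompt template that enforces the fixed output
// sections described in the article; field names and wording are assumptions.
case class AppDiagnostics(
    appId: String,
    currentConf: Map[String, String],
    jobHealth: String,       // e.g. pre-aggregated skew/spill/retry summary
    resourceMetrics: String) // e.g. peak JVM heap, GC overhead

def buildTuningPrompt(d: AppDiagnostics): String =
  s"""You are a Spark tuning assistant. Use ONLY the metrics provided below.
     |Respond with exactly these five sections, in this order:
     |1. Application Overview
     |2. Current Configuration
     |3. Job Health Summary
     |4. Resource Recommendations (machine-readable: one key=value per line)
     |5. Action Summary
     |
     |Application: ${d.appId}
     |Configuration:
     |${d.currentConf.map { case (k, v) => s"  $k=$v" }.mkString("\n")}
     |Job health: ${d.jobHealth}
     |Resource metrics: ${d.resourceMetrics}
     |""".stripMargin
```

Constraining both the inputs and the output structure is what makes the recommendations reproducible enough to diff, audit and feed into automation.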
Slack’s implementation sits alongside, and echoes, features provided by AWS for Spark troubleshooting. Amazon’s documentation describes automated agents and tooling that analyse workload metadata, metrics and logs to surface bottlenecks and code recommendations for PySpark and Scala applications on EMR, while AWS Glue and SageMaker provide complementary routes to generative troubleshooting and integrated ML model workflows within Spark environments. Industry tools for cluster monitoring and alerting, such as Marbot, also highlight how integrating cluster event signals with collaboration platforms can speed incident response. Taken together, these capabilities point to an ecosystem where custom telemetry plus vendor automation can be combined for faster diagnosis and safer, more cost‑efficient execution.
Slack’s narrative emphasises two operational shifts. First, turning telemetry into a durable, richly tagged history enables root‑cause analysis that ties failures and inefficiencies to teams and pipelines rather than to opaque runtime symptoms. Second, embedding deterministic AI workflows into developer tools converts recommendations into low‑friction changes; Slack describes automatic configuration updates and ready‑to‑review pull requests as part of the feedback loop. The company attributes near‑zero configuration waste and a greater than 90% reduction in person‑hours spent on tuning to the combined effect of the metrics framework and AI tooling.
A pragmatic view of Slack’s claims points to implementation trade‑offs. Building and maintaining a custom listener, Kafka pipeline and analytics stack demands engineering effort and operational overhead; organisations must weigh that investment against the expected cost savings and velocity gains. Slack mitigates some risk by keeping models and data inside its AWS account via Bedrock, but organisations with different compliance profiles may need alternative arrangements. Furthermore, while AWS offers built‑in troubleshooting agents and generative features for services such as Glue, adopting a bespoke stack may still be attractive where vendor tooling lacks the specific linking of orchestration, user ownership and long‑term history that Slack describes.
For teams seeking to replicate Slack’s outcomes, the practical starting points are consistent with Slack’s own advice: instrument at multiple levels (application, job, stage, task), capture orchestration and ownership metadata, stream telemetry into a durable store that supports aggregation and ad hoc queries, and codify the mapping from observed metric patterns to tuning actions. Where possible, integrate automated recommendations into developer workflows so fixes are reviewable and reversible, rather than pushed silently. Vendor features from AWS and third‑party monitoring services can reduce implementation effort, but the value often comes from the added context and the discipline to act on it.
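Codifying that last mapping can be as simple as a rule table keyed on the per‑application summary. The sketch below is illustrative only; the thresholds, field names and recommended settings are assumptions, not Slack’s rules.

```scala
// Illustrative rules mapping observed metric patterns to tuning actions.
case class AppSummary(
    peakHeapMb: Long,
    requestedExecutorMemoryMb: Long,
    shuffleSpillBytes: Long,
    shufflePartitions: Int,
    failedTasks: Long)

case class Recommendation(setting: String, value: String, reason: String)

def recommend(s: AppSummary): Seq[Recommendation] = Seq(
  // Requested far more executor memory than was ever used: shrink with headroom.
  Option.when(s.requestedExecutorMemoryMb > 2 * s.peakHeapMb)(
    Recommendation("spark.executor.memory", s"${(s.peakHeapMb * 1.3).toLong}m",
      "peak JVM heap is well below the requested executor memory")),
  // Heavy shuffle spill: partitions are likely too large.
  Option.when(s.shuffleSpillBytes > 10L * 1024 * 1024 * 1024)(
    Recommendation("spark.sql.shuffle.partitions", (s.shufflePartitions * 2).toString,
      "sustained shuffle spill to disk suggests partitions are too large")),
  // Task failures observed: surface for review rather than auto-apply.
  Option.when(s.failedTasks > 0)(
    Recommendation("review", "failure_causes",
      s"${s.failedTasks} failed tasks recorded; inspect the captured failure reasons first"))
).flatten
```

Emitting recommendations as explicit setting/value/reason triples keeps them machine‑readable, which is what makes automatic configuration updates or generated pull requests practical while every change stays reviewable.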
Slack’s account is a useful case study in how organisations combine classical observability with generative AI to compress a decades‑old skill barrier for Spark performance tuning. According to the AWS blog summarising Slack’s work, the result is not merely lower bills and faster jobs but a cultural shift from firefighting to continuous optimisation where teams can focus on analytics value rather than infrastructure troubleshooting. For many enterprises running large Spark fleets on EMR, that is likely to be the more valuable outcome.
Source: Noah Wire Services



