Latest News



DZone.com Feed

Failure Handling in AI Pipelines: Designing Retries Without Creating Chaos (Fri, 06 Mar 2026)
Retries have become an integral part of AI tools and systems. In most systems I have seen, teams approach failures with blanket retrying. This often yields duplicate work, cost spikes, wasted compute, and operational instability. Every unnecessary retry triggers another inference call, an embedding request, or a downstream write, without improving the outcome. In most early-stage AI tools the pattern is simple: if a request fails, a retry is added, and if the retry succeeds intermittently, the logic is considered sufficient. This approach works fine while the application is in a test environment or under low user traffic; as soon as it sees higher traffic and concurrent execution, retries begin to dominate system behavior.
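The alternative to blanket retrying that the excerpt argues for can be sketched as a bounded retry policy: cap the attempts, back off with jitter so concurrent clients do not retry in lockstep, and retry only exceptions known to be transient so permanently failing requests fail fast instead of burning inference calls. This is a minimal illustrative sketch, not the article's implementation; the function names and defaults are assumptions.

```python
import random
import time

def retry_bounded(call, max_attempts=3, base_delay=0.1, max_delay=2.0,
                  retryable=(TimeoutError,)):
    """Retry `call` with capped exponential backoff and full jitter.

    Only exceptions listed in `retryable` trigger a retry; anything else
    (e.g. a validation error) fails fast, so we never pay for repeated
    inference calls on a request that can never succeed.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except retryable:
            if attempt == max_attempts:
                raise  # budget exhausted: surface the failure instead of looping
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            time.sleep(random.uniform(0, delay))  # full jitter de-synchronizes clients

# Example: a flaky downstream call that succeeds on the third attempt.
attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TimeoutError("transient")
    return "ok"

result = retry_bounded(flaky)
```

Under concurrency, the jittered backoff is what prevents the "retry storm" behavior the excerpt describes: failed clients spread their second attempts over a window instead of hammering the dependency simultaneously.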
>> Read More

Reducing Daily PM Overhead With a Chat-Based AI Agent (Fri, 06 Mar 2026)
As a project manager, I have often encountered time losses caused by daily operational routines. Depending on how many departments are involved in development, these delays can range from two extra days per task to one or even two weeks for a relatively small feature. They usually occur in processes not directly related to development itself: clarifying requirements, working in task trackers, searching for information, duplicating work, and constantly switching between tasks. Research supports this: around 90% of professionals say they regularly lose time to inefficient processes and tools, and about half lose more than 10 hours every week as a result.
>> Read More

When Million Requests Arrive in a Minute: Why Reactive Auto Scaling Fails and the Predictive Fix (Fri, 06 Mar 2026)
Reactive autoscaling is a critical safety net. Demand rises, metrics spike, policies trigger, and capacity increases. But flash-crowd events, product drops, major campaigns, and limited-inventory moments do not ramp. They cliff. Users arrive at once, and reactive scaling is structurally late because “scale triggered” is only the start of the journey to usable capacity. If your demand spike arrives faster than your system can warm up, reactive scaling will lag no matter how well you tune it. The fix is planning and verification, scaling before the event, and proving the system is ready before customers arrive.
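The excerpt's core argument can be made concrete with back-of-the-envelope arithmetic: if the spike is cliff-shaped, capacity must be provisioned (and verified) before the event, because reactive scaling cannot close the gap of trigger delay plus warm-up time. The numbers and headroom factor below are illustrative assumptions, not figures from the article.

```python
import math

def instances_needed(peak_rps, per_instance_rps, headroom=0.3):
    """Instances to pre-provision so a flash crowd lands on warm capacity.

    `headroom` adds a safety margin on top of the raw requirement, since a
    cliff-shaped spike leaves no time for reactive scale-out to catch up.
    """
    raw = peak_rps / per_instance_rps
    return math.ceil(raw * (1 + headroom))

def reactive_gap_seconds(warmup_s, trigger_delay_s):
    """Window during which a purely reactive system is under-provisioned:
    metric/trigger delay plus instance warm-up, while the spike is
    already in flight."""
    return trigger_delay_s + warmup_s

# A million requests in a minute is ~16,667 rps; assume 500 rps per instance.
n = instances_needed(1_000_000 / 60, 500)
gap = reactive_gap_seconds(warmup_s=120, trigger_delay_s=30)
```

With these assumed numbers the reactive gap is two and a half minutes, i.e. the entire spike would hit before new capacity is usable, which is exactly why the article advocates scaling before the event and verifying readiness.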
>> Read More

Fabric's Resource Governance and Scaling Pitfalls (Fri, 06 Mar 2026)
Performance and Operational Pitfalls When Scaling BI on Fabric

Microsoft chose cost predictability over elasticity for the Fabric billing model. While Fabric's capacity model simplifies setup, there is a high chance of depleting a capacity's shared compute resources, as well as paying for more resources than necessary.

Common Pitfalls

Fabric capacity scaling is manual; no auto-scaling is available at present. While this provides absolute control over cost, the entire capacity-planning burden falls on the admin.
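Because scaling is manual, the admin needs early warning before shared compute is depleted. A minimal sketch of that monitoring idea, assuming utilization samples are already available as fractions of the capacity: the thresholds and messages here are illustrative, not Fabric defaults or a Microsoft API.

```python
def capacity_alerts(utilization_samples, warn=0.7, critical=0.9):
    """Flag high utilization on a fixed-size capacity.

    With no auto-scaling, sustained readings near the ceiling mean either
    resizing the capacity SKU or shedding heavy workloads before
    throttling kicks in. Thresholds are assumptions for illustration.
    """
    alerts = []
    for i, u in enumerate(utilization_samples):
        if u >= critical:
            alerts.append((i, "critical: resize the capacity or pause heavy workloads"))
        elif u >= warn:
            alerts.append((i, "warn: investigate the heaviest consumers"))
    return alerts

# Four sample readings: one warning-level, one critical-level.
samples = [0.45, 0.72, 0.93, 0.60]
alerts = capacity_alerts(samples)
```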
>> Read More


DevOps Cafe Podcast

DevOps Cafe Ep 79 - Guests: Joseph Jacks and Ben Kehoe (Mon, 13 Aug 2018)
Triggered by Google Next 2018, John and Damon chat with Joseph Jacks (stealth startup) and Ben Kehoe (iRobot) about their public disagreements — and agreements — about Kubernetes and Serverless. 
>> Read More

DevOps Cafe Ep 78 - Guest: J. Paul Reed (Mon, 23 Jul 2018)
John and Damon chat with J. Paul Reed (Release Engineering Approaches) about the field of Systems Safety and Human Factors, which studies why accidents happen and how to minimize their occurrence and impact. Show notes at http://devopscafe.org
>> Read More

DevOps Cafe Ep. 77 - Damon interviews John (Wed, 20 Jun 2018)
A new season of DevOps Cafe is here. The topic of this episode is "DevSecOps." Damon interviews John about what this term means, why it matters now, and the overall state of security. Show notes at http://devopscafe.org
>> Read More