Inside Financial Systems Engineering With Anath Bandhu Chatterjee

    You’ve spent 15+ years building distributed payment systems and microservices infrastructure across major financial institutions. What does working at that scale, across multiple organizations, teach you about how large enterprises actually manage money movement under the hood?

    The honest answer is that it’s far messier than most people imagine. From the outside, a payment looks instant: you tap a card, money moves, done. But behind that tap, dozens of services are working together across multiple data centers, each performing a specific task: granting or denying permission, checking for fraud, writing a record to the ledger, aligning its data with the rest of the system. All of this happens in under thirty milliseconds. When you’re processing hundreds of transactions per second across hundreds of millions of accounts, even a small inconsistency in how two services agree on the state of a transaction can cascade into real financial exposure.

    What surprised me early in my career was how much of the actual complexity isn’t in moving the money; it’s in making sure everyone agrees the money moved. Reconciliation, settlement, and exception handling form the core of the genuine engineering work. I’ve worked on systems where two ledgers had diverged by a fraction of a cent on each transaction. At the scale we processed, those fractions accumulated into significant discrepancies within a few hours. Distributed consensus isn’t an academic concept in this environment; it’s a daily operational reality.
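
    To put rough numbers on how a per-transaction fraction adds up, here is a back-of-the-envelope sketch. The drift size, transaction rate, and duration are hypothetical inputs chosen for illustration; none of them come from the systems described above.

```python
from decimal import Decimal


def accumulated_drift(per_txn_drift: Decimal, txns_per_second: int, hours: int) -> Decimal:
    """Total divergence between two ledgers after the given number of hours."""
    total_txns = txns_per_second * 3600 * hours
    return per_txn_drift * total_txns


# Hypothetical inputs: a tenth of a cent (0.001 dollars) per transaction,
# 500 transactions per second, sustained for three hours.
drift = accumulated_drift(Decimal("0.001"), txns_per_second=500, hours=3)
print(f"Divergence after 3 hours: ${drift}")
```

    Under those assumed inputs the two ledgers end up thousands of dollars apart by the end of a single afternoon, even though no individual transaction is off by a full cent.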

    The other thing working across multiple institutions teaches you is that every organization carries architectural debt from its own history. One place I worked had a core ledger system originally designed for batch processing in the 1990s, and the entire real-time payment layer we built was essentially an elaborate orchestration system to make that legacy core behave as if it were event-driven. Another organization had the opposite problem: modern infrastructure, but regulatory and compliance layers that introduced latency in ways the architecture wasn’t designed to absorb. You end up spending as much time negotiating with organizational constraints as you do writing code.

    Financial institutions carry decades of legacy infrastructure, and the expectations around uptime, compliance, and transaction integrity leave very little room for error. Where does that tension show up most concretely when you’re trying to modernize payments architecture from the inside?

    It shows up hardest at the integration boundary, the seam where a new event-driven microservices layer has to talk to a core system designed around nightly batch cycles and synchronous request-response patterns. Those older systems were created under expectations about timing, data consistency, and failure recovery that differ from the expectations of the new layer. That seam is where most modernization attempts either demand unusual design choices or silently stop.

    One of the most challenging projects I worked on involved modernizing a funds control system whose architecture lacked support for eventual consistency: every operation was expected to be committed immediately and globally visible. The legacy system upheld this rule by taking strict locks at the database layer, which meant throughput hit a ceiling it couldn’t grow beyond. We couldn’t rip and replace because the system was processing live funds around the clock with no maintenance window. So we built a parallel processing path, essentially a shadow ledger running alongside the original, and migrated traffic incrementally while continuously validating that both systems agreed on every balance, every position, every exposure figure. That dual-write reconciliation period lasted months. A discrepancy between the two systems during live trading hours wasn’t a bug ticket; it was a potential regulatory event.

    Data gravity creates a second tension that teams rarely mention. Legacy systems store decades of transaction records, audit logs, and reference data with dependencies that are poorly documented. I worked on a migration where we believed we had identified every dependency on an old service. During the cutover, we discovered three separate reporting tools querying the service’s database tables directly, bypassing any official interface. When the switch began, each buried link became an independent point of failure. Modernization inside a bank demands more than a technology upgrade; it demands an archaeological dig through forgotten code, understanding not only what the system does but why, because a regulatory or business rule is often embedded in that code and no one recalls it until it stops running.

    Microservices decomposition is widely discussed as the path forward for financial systems, yet the operational reality of breaking apart monolithic payment infrastructure is far messier than the architecture diagrams suggest. After doing this work in production environments, what do engineering teams most consistently underestimate going in?

    Three things, and they’re all related.

    The first is distributed data consistency. In a monolith, transactions that touch multiple business domains (debiting one account, crediting another, updating a fee ledger, posting an audit entry) all happen within a single database transaction. One commit, one rollback, everything stays consistent. The instant you split the system into distinct services, each with its own data store, you replace that simplicity with a coordination problem that’s genuinely hard to implement correctly. I’ve worked on payment systems where we spent more engineering time designing and testing the compensation logic (what happens when step three of a five-step saga fails) than building the happy path. Teams that assume ‘we’ll just use eventual consistency’ without fully mapping every partial failure scenario and its business impact are the ones that end up with money stuck in limbo states at two in the morning.
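
    The compensation idea can be sketched in a few lines. This is a minimal in-memory illustration, not the design used in the systems described here: `SagaStep`, `run_saga`, and the step names are hypothetical, and a production saga would also need durable state and a plan for compensations that themselves fail.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SagaStep:
    name: str
    action: Callable[[], None]       # the forward operation
    compensate: Callable[[], None]   # how to undo it if a later step fails


class SagaFailed(Exception):
    pass


def run_saga(steps: List[SagaStep]) -> None:
    """Run steps in order; on failure, undo the completed ones in reverse."""
    completed: List[SagaStep] = []
    for step in steps:
        try:
            step.action()
        except Exception as exc:
            for done in reversed(completed):
                done.compensate()
            raise SagaFailed(
                f"step '{step.name}' failed; compensated {len(completed)} step(s)"
            ) from exc
        completed.append(step)


log: List[str] = []


def fail_fee() -> None:
    # Hypothetical failure at step three of the flow.
    raise RuntimeError("fee ledger unavailable")


steps = [
    SagaStep("debit", lambda: log.append("debit"), lambda: log.append("undo debit")),
    SagaStep("credit", lambda: log.append("credit"), lambda: log.append("undo credit")),
    SagaStep("fee", fail_fee, lambda: log.append("undo fee")),
]

try:
    run_saga(steps)
except SagaFailed as err:
    print(err)
print(log)
```

    The failed step is never compensated because it never completed; only the debit and credit are reversed, newest first.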

    The second is observability. In a monolith, when something goes wrong, you have one application log, one stack trace, one database to query. In a payment system spread across many services, one transaction often passes through eight services, three message queues, and two external integrations. I’ve seen teams invest heavily in decomposition and then realize six months in that they have no way to answer basic operational questions like ‘where exactly is this transaction right now.’ Distributed tracing, structured correlation identifiers that cross every service boundary, and centralized log collection need to exist before the first service split, not as a follow-on project.
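
    One way to carry a correlation identifier across service boundaries is sketched below using only the Python standard library. The header name `X-Correlation-ID` and the helper names are assumptions made for illustration; a real deployment would more likely rely on an established tracing standard and library.

```python
import contextvars
import logging
import uuid
from typing import Dict, Optional

HEADER = "X-Correlation-ID"  # assumed header name, not mandated by any standard

# Holds the identifier for whatever request is currently being handled.
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class CorrelationFilter(logging.Filter):
    """Stamp every log record with the current correlation identifier."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.correlation_id = correlation_id.get()
        return True


def handle_inbound(headers: Dict[str, str]) -> str:
    """Reuse the caller's identifier, or mint one at the entry point."""
    cid = headers.get(HEADER) or str(uuid.uuid4())
    correlation_id.set(cid)
    return cid


def outbound_headers(extra: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Headers for a downstream call, always carrying the identifier."""
    headers = dict(extra or {})
    headers[HEADER] = correlation_id.get()
    return headers


logger = logging.getLogger("payments")
handler = logging.StreamHandler()
handler.addFilter(CorrelationFilter())
handler.setFormatter(logging.Formatter("%(correlation_id)s %(name)s %(message)s"))
logger.addHandler(handler)
logger.setLevel(logging.INFO)

cid = handle_inbound({})            # entry point: no identifier supplied yet
logger.info("authorization started")
downstream = outbound_headers({"Content-Type": "application/json"})
```

    Because every log line and every downstream call carries the same identifier, a centralized log search on that value can reconstruct the path of a single transaction.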

    The third, and probably the most underestimated, is organizational complexity. Conway’s Law keeps showing up in these conversations because it’s true in ways that are almost uncomfortably precise. I participated in multiple decompositions where the technical architecture worked but the organizational structure lagged behind. One team controlled the authorization service; a separate team controlled the settlement service. When a problem crossed both, such as a timeout setting that affected both components, neither team held explicit authority to decide the fix. The result was meetings, escalations, and delayed resolution for what should have been a fifteen-minute configuration change. If you draw service boundaries on a whiteboard without rethinking who owns each part and how teams coordinate during incidents, you create an architecture that will fight the organization at every step.

    Distributed systems in payments introduce failure modes that simply don’t exist in monolithic architectures. How do you design for resilience when a transaction touching six services can fail at any one of them, and the cost of being wrong is measured in dollars, not just downtime?

    In a distributed payment system, failure is normal, not exceptional. The question is never ‘will something fail’; it’s ‘when this specific thing fails, does the system lose money, lock money, or handle it gracefully?’

    The first principle I apply is designing for the unhappy path first. For a typical six-service payment flow, I start with a failure mode taxonomy, enumerating every point in the transaction lifecycle where something can go wrong, what the financial exposure is if it does, and what the recovery mechanism needs to be. That usually produces between 15 and 25 distinct failure scenarios, each with different severity and different remediation. Some can be retried safely because the operation is idempotent. Others absolutely cannot be retried without risking a double debit. That taxonomy becomes the actual engineering specification.
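
    A failure mode taxonomy of this kind can be captured as plain data so that it doubles as a specification. The sketch below is illustrative only: the three entries, the `Exposure` categories, and the field names are hypothetical examples, not the scenarios from any real flow.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Exposure(Enum):
    NONE = "none"
    FUNDS_LOCKED = "funds locked"
    DOUBLE_DEBIT = "double debit"


@dataclass(frozen=True)
class FailureMode:
    step: str          # where in the transaction lifecycle it occurs
    failure: str       # what goes wrong
    exposure: Exposure # financial exposure if it does
    retry_safe: bool   # True only if the operation is idempotent
    recovery: str      # the required recovery mechanism


# Hypothetical entries; a real taxonomy would enumerate every scenario.
TAXONOMY: List[FailureMode] = [
    FailureMode("authorize", "timeout before response", Exposure.NONE,
                True, "retry with the same idempotency key"),
    FailureMode("debit", "timeout after request sent", Exposure.DOUBLE_DEBIT,
                False, "query status, then reconcile"),
    FailureMode("credit", "downstream rejected", Exposure.FUNDS_LOCKED,
                False, "compensate by reversing the debit"),
]


def unsafe_to_retry(taxonomy: List[FailureMode]) -> List[FailureMode]:
    """Scenarios that must never be blindly retried."""
    return [mode for mode in taxonomy if not mode.retry_safe]


for mode in unsafe_to_retry(TAXONOMY):
    print(f"{mode.step}: {mode.failure} -> {mode.recovery}")
```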

    Idempotency is probably the single most important technical discipline in distributed payment resilience, and also the one implemented poorly most often. The concept is simple: if you execute the same operation twice, the result should be identical to executing it once. But achieving true idempotency across a chain of six services, each with its own data store, is genuinely difficult. You need globally unique idempotency keys generated at the entry point and propagated through every downstream call. And you need all of that to perform well under load, because an idempotency check that adds 50 milliseconds per service call across six services has just added 300 milliseconds to your transaction, which at scale can push you past your latency SLAs.
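
    A minimal sketch of an idempotent operation keyed by a caller-supplied identifier follows. `DebitService`, the account name, and the in-memory dictionaries are hypothetical stand-ins; in a real service the key check and the balance update would have to be committed atomically in the same data store, and this sketch does not attempt that.

```python
import uuid
from typing import Dict


def new_idempotency_key() -> str:
    """Generated once at the entry point and passed to every downstream call."""
    return str(uuid.uuid4())


class DebitService:
    def __init__(self) -> None:
        self.balances: Dict[str, int] = {"acct-1": 10_000}  # amounts in cents
        self._results: Dict[str, dict] = {}                 # key -> stored outcome

    def debit(self, idempotency_key: str, account: str, amount: int) -> dict:
        # A repeated key returns the stored outcome instead of debiting again.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.balances[account] -= amount
        result = {"account": account, "debited": amount,
                  "balance": self.balances[account]}
        self._results[idempotency_key] = result
        return result


service = DebitService()
key = new_idempotency_key()
first = service.debit(key, "acct-1", 3_000)
retry = service.debit(key, "acct-1", 3_000)   # e.g. the caller timed out and retried
print(first == retry, service.balances["acct-1"])
```

    The retry produces the same response as the first call and the account is debited once, which is the property that makes a timeout safe to retry.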

    Circuit breaking is the next layer. I learned this the hard way early in my career when a single downstream fraud-scoring service started responding slowly under load. The calling service’s thread pool filled up waiting for responses, which caused it to stop accepting new requests, which backed up the service calling it, and within about 90 seconds we had a full cascade affecting payment flows that had nothing to do with fraud scoring. After that experience, every service-to-service call gets a circuit breaker with carefully tuned thresholds. When a downstream dependency is unhealthy, the circuit opens and the system falls back to a degraded but financially safe mode.
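
    The pattern can be sketched as a small state machine. This is a simplified, single-threaded illustration under assumed thresholds; the class, the `HOLD_FOR_REVIEW` fallback, and the numbers are hypothetical, and it presumes each downstream call already has its own timeout so that slow responses surface as failures.

```python
import time
from typing import Callable, List, Optional


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None   # None means the circuit is closed

    def call(self, func: Callable[[], str], fallback: Callable[[], str]) -> str:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                return fallback()                 # open: skip the dependency entirely
            # Otherwise half-open: let one trial call through.
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()     # open, or re-open after a failed trial
            return fallback()
        self.failures = 0
        self.opened_at = None                     # success closes the circuit
        return result


now = [0.0]                                       # fake clock so the demo is deterministic
breaker = CircuitBreaker(failure_threshold=2, reset_timeout=10.0, clock=lambda: now[0])
attempts: List[int] = []


def slow_fraud_score() -> str:
    attempts.append(1)
    raise TimeoutError("fraud scoring timed out")


def hold_for_review() -> str:
    return "HOLD_FOR_REVIEW"                      # hypothetical financially safe fallback


results = [breaker.call(slow_fraud_score, hold_for_review) for _ in range(5)]
print(results, len(attempts))
```

    After the second failure the breaker stops calling the unhealthy dependency, so the caller's threads are no longer tied up waiting on it.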

    No matter how good your idempotency, circuit breakers, and timeout tuning are, in a distributed system at scale, state will occasionally diverge. You can’t prevent that entirely; what you can do is detect it quickly. Every distributed payment system I build includes a continuous reconciliation process that checks service records against each other at fixed intervals. The moment it finds a discrepancy, it emits an alert. The goal isn’t to build a flawless real-time path; it’s to build a real-time path that’s as resilient as possible, backed by a reconciliation layer that catches anything that slips through before it appears as a loss on a balance sheet.
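
    The core comparison inside such a reconciliation pass can be sketched as follows. The two record sets, the transaction identifiers, and the `Discrepancy` shape are hypothetical; scheduling the pass at intervals and routing the alert are left out.

```python
from typing import Dict, List, NamedTuple, Optional


class Discrepancy(NamedTuple):
    txn_id: str
    ledger_amount: Optional[int]       # None means the ledger has no record
    settlement_amount: Optional[int]   # None means settlement has no record


def reconcile(ledger: Dict[str, int], settlement: Dict[str, int]) -> List[Discrepancy]:
    """Compare two services' views of the same transactions (amounts in cents)."""
    found: List[Discrepancy] = []
    for txn_id in sorted(ledger.keys() | settlement.keys()):
        ledger_amount = ledger.get(txn_id)
        settlement_amount = settlement.get(txn_id)
        if ledger_amount != settlement_amount:
            found.append(Discrepancy(txn_id, ledger_amount, settlement_amount))
    return found


# Hypothetical snapshots taken from two services at the same cutoff.
ledger_view = {"t1": 1_000, "t2": 2_500, "t3": 400}
settlement_view = {"t1": 1_000, "t2": 2_499, "t4": 900}

discrepancies = reconcile(ledger_view, settlement_view)
for item in discrepancies:
    print("ALERT", item)
```

    The pass catches both amount mismatches and records that exist on only one side, which are the two ways diverged state usually shows up.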

    As agentic AI moves closer to the transaction layer in financial services, what do you think the next few years will require from engineering leaders who want to deploy those capabilities without introducing new categories of systemic risk?

    I spend more time on this question than almost any other, because the sector is at a point where excitement about agentic AI in financial services is racing ahead of the engineering discipline required to deploy it safely.

    I want to be specific about what I mean by agentic AI at the transaction layer. I’m not talking about ML models that score fraud risk or predict transaction volumes; those have been in production for years and the operational patterns around them are reasonably well understood. I’m talking about autonomous agents that can initiate, route, modify, or approve financial transactions with minimal human intervention. Agents that can decide to move money, adjust settlement timing, or escalate exceptions, not just recommend those actions, but execute them. That’s a fundamentally different category of system.

    The key thing engineering leaders need to internalize is that agentic AI doesn’t just introduce new bugs; it introduces judgment failures: cases where the agent is functioning correctly from a systems perspective but making decisions that are financially inappropriate, contextually wrong, or subtly misaligned with the institution’s risk posture. The system isn’t broken. It’s working exactly as designed. It’s just doing something you didn’t want it to do, and in a financial context, by the time you notice, the consequences may already be in motion.

    The largest hidden danger lies in emergent behavior when multiple agents run simultaneously. Safety discussions tend to focus on a single rogue agent. The real risk emerges when multiple agents, each operating correctly within its own rules, interact and produce an outcome no one planned. One agent optimizing cash positions across accounts, another routing large payments along the fastest path, each following clear rules, but together capable of draining a key settlement account at the worst possible moment. This mirrors a race condition in a distributed computer system, except the loss isn’t a duplicate row in a ledger. It’s a missed payment that can propagate from one counterparty to the next.

    Engineering leaders who want to succeed here need three capabilities most banks currently lack: a simulation environment that reproduces multi-agent interactions under real market conditions and stress events; a graduated autonomy framework where agents begin with narrow authority and earn expanded permissions only through validated performance; and a real-time human oversight mechanism that allows override within seconds without routing every decision through an approval queue that destroys the speed benefit the agent was designed to deliver.
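
    As one possible shape for the graduated autonomy idea, the sketch below ties an agent's spending authority to its count of validated decisions. Everything in it is an assumption made for illustration: the tier thresholds, the amounts, the class name, and the rule that a single failure resets the agent to the narrowest tier.

```python
from dataclasses import dataclass

# Hypothetical tiers: (validated decisions required, max amount in cents
# the agent may execute without a human). The first tier grants no authority.
TIERS = [(0, 0), (100, 10_000), (1_000, 1_000_000)]


@dataclass
class AgentAuthority:
    validated_decisions: int = 0

    def limit(self) -> int:
        """Largest amount the agent may currently execute on its own."""
        allowed = 0
        for required, cap in TIERS:
            if self.validated_decisions >= required:
                allowed = cap
        return allowed

    def decide(self, amount: int) -> str:
        return "EXECUTE" if amount <= self.limit() else "ESCALATE_TO_HUMAN"

    def record_validated(self) -> None:
        self.validated_decisions += 1

    def record_failure(self) -> None:
        # Assumed policy: any judgment failure drops the agent back to tier zero.
        self.validated_decisions = 0


agent = AgentAuthority()
before = agent.decide(5_000)          # new agent: no autonomous authority yet
for _ in range(100):
    agent.record_validated()
after = agent.decide(5_000)           # within the second tier's cap
too_large = agent.decide(50_000)      # still above the cap, goes to a human
print(before, after, too_large)
```

    Only decisions above the agent's current limit are escalated, so the approval queue sees the exceptions rather than every transaction.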

    The broader point is that introducing agentic AI into the transaction layer is not primarily an ML problem. It’s a systems engineering problem, a governance problem, and at its core a risk management challenge that happens to include machine learning as one component. The safety architecture needs to be built before it’s needed, not after an incident forces it. Once agentic systems are woven into transaction infrastructure and generating revenue, the institutional incentives to constrain them diminish. The foundations have to be right from the start.

    Author

    • Peyman Khosravani is a seasoned expert in blockchain, digital transformation, and emerging technologies, with a strong focus on innovation in finance, business, and marketing. With a robust background in blockchain and decentralized finance (DeFi), Peyman has successfully guided global organizations in refining digital strategies and optimizing data-driven decision-making. His work emphasizes leveraging technology for societal impact, focusing on fairness, justice, and transparency. A passionate advocate for the transformative power of digital tools, Peyman’s expertise spans across helping startups and established businesses navigate digital landscapes, drive growth, and stay ahead of industry trends. His insights into analytics and communication empower companies to effectively connect with customers and harness data to fuel their success in an ever-evolving digital world.