Job Description
**Introduction**
We’re building Astra Serverless, the next generation of distributed, scalable, fault-tolerant, serverless NoSQL data services — powered by Apache Cassandra and extended with native Vector and AI capabilities across multi-cloud environments.
Our customers depend on our platform to serve real-time, mission-critical workloads on a global scale.
Ensuring reliability, performance, and correctness under unpredictable workloads is a non-trivial challenge — and that’s where you come in.
As an engineer on the Quality Engineering and Performance team, you’ll develop and evolve the system-level testing frameworks that validate a distributed database-as-a-service at the scale of massive AI-driven workloads.
You’ll help ensure that new features, performance improvements, and AI-driven extensions meet the highest standards of scalability and resilience.
**Why this role?**
You’ll work at the intersection of distributed systems engineering and test architecture, with hands-on work designing and building automation and frameworks that simulate complex multi-cloud deployments, chaos scenarios, and performance stress conditions.
This is not QA-as-usual: you’ll engineer the test systems that validate an elastic database platform capable of scaling to thousands of non-uniform nodes, self-healing under failure, and integrating real-time vector search and analytics.
If you thrive on deep technical challenges, curiosity, analytical and systems thinking, and building tools other engineers rely on, this role will feel like home.
**Your role and responsibilities**
* Design and develop frameworks for end-to-end and chaos testing of distributed, serverless Cassandra-based systems.
* Engineer automation that validates data correctness, fault tolerance, and performance under complex multi-region and multi-cloud topologies.
* Collaborate closely with your peers in local and remote feature development teams to model real-world scenarios and integrate automated validation into the delivery pipeline.
* Continuously evolve the test infrastructure for scale, speed, and observability — leveraging Kubernetes, Docker, and cloud-native toolchains.
* Profile and tune distributed workloads to uncover systemic bottlenecks and verify that service-level goals are consistently met.
* Contribute code to shared testing frameworks and participate in design and code reviews across teams.
* Own the full cycle of quality engineering — from test design and execution to insights and continuous improvement.
**Required technical and professional expertise**
* Exposure to system-level Java and Python development for testing distributed or cloud systems, including replication, partitioning, consistency, and eventual convergence.
* Eagerness to learn more about and use chaos testing, fault injection, and resilience validation.
* Ability to analyze complex logs and metrics to isolate performance and reliability issues.
* Familiarity with Linux, Kubernetes, Docker, and CI/CD pipelines (Jenkins, GitHub Actions, etc.).
**Preferred technical and professional experience**
* Familiarity with NoSQL technologies (Cassandra, DynamoDB, ScyllaDB, etc.), cloud platforms (AWS, GCP, Azure), and multi-cloud topologies.
* Curiosity-driven mindset, strong communication skills, and a focus on collaboration and craftsmanship.
* Understanding of vector search, AI embeddings, or data-intensive workloads.
IBM is committed to creating a diverse environment and is proud to be an equal-opportunity employer.
All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, gender, gender identity or expression, sexual orientation, national origin, caste, genetics, pregnancy, disability, neurodivergence, age, veteran status, or other characteristics.
IBM is also committed to compliance with all fair employment practices regarding citizenship and immigration status.