Goal
Design and implement a robust, concurrent data processing pipeline that ingests data from multiple sources, validates and transforms records according to business rules, computes aggregations, and exports the results. The pipeline must also support observability, error collection, and cancellation, and ship with automated tests.
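To make the concurrency, cancellation, and error-collection requirements concrete, one possible shape is a chain of channel-connected stages, each running in its own goroutine and stopped via a `context.Context`. The following is only a minimal sketch under that assumption; the `Record`, `Stage`, and `Run` names are illustrative, not part of the requirements.

```go
package pipeline

import "context"

// Record is a generic unit of work flowing through the pipeline
// (a map keeps the sketch simple; a concrete struct would also work).
type Record map[string]any

// Stage reads records from in, processes them, and writes results to the
// channel it returns. Non-fatal problems go to errs; ctx cancels the stage.
type Stage func(ctx context.Context, in <-chan Record, errs chan<- error) <-chan Record

// Run wires the stages into a chain and returns the final output channel.
// Each stage owns the channel it returns and closes it when done, so closure
// propagates from source to sink and the pipeline drains cleanly on cancellation.
func Run(ctx context.Context, source <-chan Record, errs chan<- error, stages ...Stage) <-chan Record {
	out := source
	for _, s := range stages {
		out = s(ctx, out, errs)
	}
	return out
}

// passThrough shows the shape of a stage: forward records until the input
// closes or the context is cancelled.
func passThrough(ctx context.Context, in <-chan Record, errs chan<- error) <-chan Record {
	out := make(chan Record)
	go func() {
		defer close(out)
		for rec := range in {
			select {
			case out <- rec: // a real stage would validate/transform rec here
			case <-ctx.Done():
				return
			}
		}
	}()
	return out
}
```

Under this model, ingestion, validation, transformation, aggregation, and export can each be expressed as one such stage.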
Scope & Deliverables
Deliverables
- Working pipeline implementation (Go)
- REST API for managing pipeline jobs (see endpoints below)
- Unit and integration tests
- README with architecture diagram, run instructions, and example requests
- Sample input files (CSV, JSON) and example API responses
- A short report (≤1 page) summarizing design choices, concurrency model, and trade-offs
Functional Requirements
Core Features
- Data Ingestion
  - Support reading from multiple input sources: CSV files, JSON files, and external HTTP APIs.
  - Allow configurable source definitions per job (see the source sketch after this list).
- Data Validation
  - Validate records against a schema and business rules (see the validation sketch below).
  - Filter or mark invalid records while continuing to process the remainder.
- Data Transformation
  - Apply business rules and transformations such as type conversions, normalization, and enrichment (see the transformation sketch below).
- Data Aggregation
  - Compute summary statistics (e.g., counts, sums, averages) and group-by aggregations (see the aggregation sketch below).
- Data Export
  - Persist processed records and aggregations to a database and/or files (CSV/JSON).
  - Support configurable export targets per job (see the exporter sketch below).
- Progress Tracking
  - Track and expose job progress: records processed, records pending, and percent complete (see the progress sketch below).
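For ingestion, one option is a small source abstraction plus a per-job configuration entry naming each source. The sketch below reuses the `Record` type from the pipeline sketch above; `SourceConfig`, `Source`, and `CSVSource` are illustrative names, not a required design.

```go
package pipeline

import (
	"context"
	"encoding/csv"
	"io"
	"os"
)

// SourceConfig describes one input source in a job definition
// (field names are illustrative).
type SourceConfig struct {
	Type string `json:"type"` // "csv", "json", or "http"
	Path string `json:"path"` // file path or URL
}

// Source emits records onto out until exhausted or cancelled.
type Source interface {
	Read(ctx context.Context, out chan<- Record) error
}

// CSVSource reads a header row and emits one Record per data row.
type CSVSource struct{ Path string }

func (s CSVSource) Read(ctx context.Context, out chan<- Record) error {
	f, err := os.Open(s.Path)
	if err != nil {
		return err
	}
	defer f.Close()

	r := csv.NewReader(f)
	header, err := r.Read()
	if err != nil {
		return err
	}
	for {
		row, err := r.Read()
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}
		rec := Record{}
		for i, col := range header {
			if i < len(row) {
				rec[col] = row[i]
			}
		}
		select {
		case out <- rec:
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}
```

A JSON file source and an HTTP source would implement the same interface, so a job can mix source types freely.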
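Validation could be expressed as rule functions applied per record, marking failures rather than dropping records so processing continues. The `Rule` signature and the `_valid`/`_errors` marker fields below are assumptions for illustration.

```go
package pipeline

import "fmt"

// Rule checks one record and returns a descriptive error when it fails.
type Rule func(Record) error

// RequireField is a sample rule: the named field must be present and non-empty.
func RequireField(name string) Rule {
	return func(r Record) error {
		v, ok := r[name]
		if !ok || v == "" {
			return fmt.Errorf("missing required field %q", name)
		}
		return nil
	}
}

// Validate applies all rules and annotates the record instead of discarding it,
// so downstream stages (and the error report) can see why it failed.
func Validate(r Record, rules []Rule) Record {
	var problems []string
	for _, rule := range rules {
		if err := rule(r); err != nil {
			problems = append(problems, err.Error())
		}
	}
	if len(problems) > 0 {
		r["_valid"] = false
		r["_errors"] = problems
	} else {
		r["_valid"] = true
	}
	return r
}
```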
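Transformations (type conversion, normalization, enrichment) can be small composable functions applied in sequence. The examples below assume string-valued fields as read from CSV and reuse the `Record` type from the pipeline sketch; they are illustrative only.

```go
package pipeline

import (
	"strconv"
	"strings"
)

// Transform rewrites a record in place and reports conversion problems.
type Transform func(Record) error

// NormalizeString trims and lower-cases the named field when it is a string.
func NormalizeString(field string) Transform {
	return func(r Record) error {
		if s, ok := r[field].(string); ok {
			r[field] = strings.ToLower(strings.TrimSpace(s))
		}
		return nil
	}
}

// ParseFloat converts a string field (as read from CSV) into a float64.
func ParseFloat(field string) Transform {
	return func(r Record) error {
		s, ok := r[field].(string)
		if !ok {
			return nil // already converted or absent
		}
		v, err := strconv.ParseFloat(s, 64)
		if err != nil {
			return err
		}
		r[field] = v
		return nil
	}
}
```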
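Group-by counts, sums, and averages can be accumulated incrementally as records stream past, avoiding a second pass over the data. The grouping key and value field names below are assumptions.

```go
package pipeline

// GroupStats accumulates count and sum for one group; Avg is derived.
type GroupStats struct {
	Count int
	Sum   float64
}

func (g GroupStats) Avg() float64 {
	if g.Count == 0 {
		return 0
	}
	return g.Sum / float64(g.Count)
}

// Aggregator groups records by a key field and sums a numeric value field.
type Aggregator struct {
	KeyField   string
	ValueField string
	Groups     map[string]GroupStats
}

func NewAggregator(key, value string) *Aggregator {
	return &Aggregator{KeyField: key, ValueField: value, Groups: map[string]GroupStats{}}
}

// Add folds one record into the running statistics; records without the
// expected fields are skipped rather than failing the job.
func (a *Aggregator) Add(r Record) {
	key, ok := r[a.KeyField].(string)
	if !ok {
		return
	}
	v, ok := r[a.ValueField].(float64)
	if !ok {
		return
	}
	g := a.Groups[key]
	g.Count++
	g.Sum += v
	a.Groups[key] = g
}
```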
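Export targets can sit behind a small interface so a job can be configured with any mix of file and database sinks. The JSON Lines exporter below is one possible target; the `Exporter` interface and `JSONFileExporter` name are illustrative assumptions.

```go
package pipeline

import (
	"encoding/json"
	"os"
)

// Exporter persists processed records to some target (file, database, ...).
type Exporter interface {
	Export(records []Record) error
}

// JSONFileExporter writes records as JSON Lines, one record per line.
type JSONFileExporter struct{ Path string }

func (e JSONFileExporter) Export(records []Record) error {
	f, err := os.Create(e.Path)
	if err != nil {
		return err
	}
	defer f.Close()

	enc := json.NewEncoder(f)
	for _, r := range records {
		if err := enc.Encode(r); err != nil {
			return err
		}
	}
	return nil
}
```

A CSV exporter and a database exporter would implement the same interface, selected per job from the export configuration.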
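Progress can be tracked with atomic counters updated by the stages and read by the REST layer without locking. The snapshot shape below (processed, pending, percent complete) mirrors the requirement; the type and method names are assumptions.

```go
package pipeline

import "sync/atomic"

// Progress is safe for concurrent use: stages call Done, the API reads Snapshot.
type Progress struct {
	total     int64
	processed atomic.Int64
}

func NewProgress(total int64) *Progress { return &Progress{total: total} }

// Done records that n more records have finished processing.
func (p *Progress) Done(n int64) { p.processed.Add(n) }

// Snapshot returns records processed, records pending, and percent complete.
func (p *Progress) Snapshot() (processed, pending int64, percent float64) {
	processed = p.processed.Load()
	pending = p.total - processed
	if p.total > 0 {
		percent = 100 * float64(processed) / float64(p.total)
	}
	return processed, pending, percent
}
```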