Senior Data Architect
No salary information provided
Senior · Full-time
#330791 · Added 12 days ago
Source: Omilia
Tech Stack / Keywords
Architecture · LLM · Snowflake · AWS · ETL · Airflow · Security · AI
Requirements
Technical / Professional Skills:
- 5+ years in data architecture, data engineering, or LLM/ML data infrastructure with ownership of production data systems for ML/AI model development.
- Strong understanding of ML training data requirements for LLM and NLU model development.
- Deep experience with data modeling, schema design, and data pipeline architecture.
- Strong proficiency with Snowflake, AWS S3, and ETL/ELT orchestration tools (Airflow, dbt, or similar).
- Experience defining annotation requirements and managing data annotation workflows.
- Experience with data cataloging, metadata management, and dataset discovery at scale.
- Strong SQL and Python skills for data pipeline development and data quality analysis.
- Experience with data quality frameworks: deduplication, sampling strategies, diversity optimization.
Desirable:
- Hands-on experience with LLM training data preparation including instruction tuning datasets, preference data, RLHF/DPO annotation, synthetic data generation.
- Experience with data anonymization and PII/PCI redaction in ML data pipelines.
- Familiarity with AWS SageMaker ML pipeline integration and active learning/data selection strategies.
- Knowledge of voice/audio data handling, storage, and processing at scale.
Soft / Behavioural Skills:
- Excellent communication skills to translate ML team data needs and explain data architecture decisions.
- Strong cross-functional collaboration skills.
- Analytical mindset for informed trade-off decisions on data quality, diversity, and scale.
- Self-driven ownership mentality.
Formal Requirements:
- Master's degree or PhD in Computer Science, Data Engineering, Information Systems, or related field.
- Experience with conversational AI data is a strong advantage.
- Experience with data governance for regulated industries is a plus.
- Familiarity with NER/NLU-based data processing approaches is desirable.
Responsibilities
- Own the Training Environment data architecture end-to-end: dataset design and schema for all ML training pipelines, including dialog corpora for LLM training, conversational steps for NLU models, annotated evaluation sets, and whole-call recordings for speech-to-speech model development.
- Define and govern data selection and sampling strategy: establish criteria that determine which production conversations have the highest training value, including diversity-optimized sampling, confidence-based filtering, edge-case prioritization, and deduplication strategies.
- Build and maintain the data catalog and dataset discovery infrastructure: enable ML engineers across LLM, NLU, Speech, and Agentic teams to find, understand, and use training data without friction.
- Define annotation pipeline architecture: establish requirements for data labeling — intent annotation, entity tagging, dialog act classification, task completion scoring, and agentic reasoning evaluation — across internal annotators and external vendors.
- Architect the data flywheel: the closed-loop system where real customer conversations feed back into training data collection, curation, annotation, model retraining, and evaluation.
- Own and maintain data pipelines and infrastructure spanning Snowflake, AWS S3, ETL/ELT pipelines (Airflow), and integration with ML training workflows on AWS SageMaker.
- Work directly with LLM, NLU, and Agentic systems teams to understand training data requirements and translate these into concrete dataset specifications and pipeline configurations.
- Define and maintain the data architecture for Omilia's Training Environment: schema design, data flow patterns from production to centralized training infrastructure, storage strategy (Snowflake + S3), cross-pipeline consistency, and clear auditable data lineage, including anonymization requirements.
- Design data quality frameworks that improve model outcomes: content-based deduplication, diversity-maximizing sampling, confidence-based filtering using NLU scores and behavioral signals, and dedicated NLU improvement corpus extraction.
- Define annotation requirements for ML model development and design annotation workflows; evaluate and manage external data annotation vendors.
- Build and maintain the data catalog that enables cross-team dataset discovery: document dataset contents, schemas, lineage, quality metrics, intended use cases, and known limitations; define taxonomy for organizing training datasets.
- Architect the closed-loop data flywheel: production conversations to model training and back to production; define feedback mechanisms for model failure cases.
- Identify gaps in production training data and define requirements for external data acquisition; design data augmentation strategies.
- Work closely with technical leads and senior engineers to align data architecture with model training requirements; collaborate with Platform Engineering, Security & Compliance, and Product Management.
- Maintain comprehensive documentation of data architecture, dataset specifications, pipeline configurations, and data catalog; produce data architecture RFCs and share best practices with ML teams.
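To make the data-quality responsibilities above more concrete, here is a minimal, purely illustrative Python sketch of content-based deduplication combined with confidence-based filtering of production conversations. All names (`content_hash`, `select_training_samples`, the `nlu_confidence` field, and the threshold values) are hypothetical and not taken from Omilia's actual systems.

```python
import hashlib


def content_hash(text: str) -> str:
    """Normalize whitespace and case, then hash, so near-identical
    conversation transcripts collapse to the same key."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def select_training_samples(conversations, min_confidence=0.3, max_confidence=0.9):
    """Keep one copy per distinct content (content-based deduplication)
    and drop examples outside a confidence band: very high-confidence
    cases add little training value, very low ones are often noise."""
    seen = set()
    selected = []
    for conv in conversations:
        key = content_hash(conv["text"])
        if key in seen:
            continue  # duplicate content, skip
        seen.add(key)
        if min_confidence <= conv["nlu_confidence"] <= max_confidence:
            selected.append(conv)
    return selected
```

In a real pipeline this logic would typically run inside an Airflow task over Snowflake/S3 data rather than in-memory lists, and the confidence band would be tuned per model.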
Offer
- Fixed compensation
- Long-term employment with vacation counted in working days
- Professional growth and development (courses, training, etc.)
- Being part of cutting-edge technology products impacting the service industry
- Proficient and fun-to-work-with colleagues
- Apple gear
Paid leave
Training subsidies
Team events
Healthcare
Other information
Omilia is an equal opportunity employer committed to diversity and inclusion. All eligible candidates will be given consideration regardless of race, color, religion, gender, gender identity or expression, sexual orientation, national origin, ancestry, disability, age, or veteran status.
Omilia
3 active job offers