EPAM Systems Interview Question

2. What is MLOps?

Answer: MLOps is the practice of applying DevOps principles to machine learning systems. It covers data management, model development, model versioning, deployment, monitoring, and retraining.

Lifecycle:
Data Collection → Data Validation → Feature Engineering → Model Training → Model Validation → Deployment → Monitoring → Retraining

3. Difference between DevOps and MLOps?

DevOps                        MLOps
Focuses on application code   Focuses on data + model + code
CI/CD                         CI/CD/CT
Version code                  Version code + data + models
Functional testing            Model testing
Performance monitoring        Model drift monitoring

4. What is CI/CD/CT in MLOps?

CI (Continuous Integration): Code Commit → Unit Tests → Build
CD (Continuous Delivery): Build → Deploy
CT (Continuous Training): New Data → Retrain Model → Validate → Deploy

5. How do you version ML models?

Tools: MLflow, DVC, S3, Git

Example:

    import mlflow
    mlflow.sklearn.log_model(model, "customer_churn")

Versions: v1, v2, v3

6. Explain MLflow

Components: Tracking, Projects, Models, Registry

Example:

    import mlflow

    with mlflow.start_run():
        mlflow.log_param("lr", 0.01)
        mlflow.log_metric("accuracy", 0.95)

Interview follow-up: Why MLflow?
Answer: It tracks experiments, compares runs, registers models, and manages deployments.

7. What is Data Drift?

Answer: The input data distribution changes over time.

Example:
Training: Age 20-40
Production: Age 50-80
Model performance drops.

8. What is Concept Drift?

Answer: The relationship between features and target changes.

Example:
Before Covid: online spending low
After Covid: online spending high
Same inputs but different outcomes.

9. How do you detect drift?

Methods: PSI (Population Stability Index), KL Divergence, Wasserstein Distance, KS Test

Example:

    from scipy.stats import ks_2samp
    ks_2samp(train_data, prod_data)

10. How do you monitor models?

Business metrics: revenue, conversion, CTR
Model metrics: accuracy, precision, recall, F1
System metrics: CPU, memory, latency, throughput
Tools: Prometheus, Grafana, ELK

11. Explain Model Retraining Pipeline

New Data → Validation → Feature Engineering → Training → Evaluation → Deployment

Triggers: weekly, monthly, drift detection

12. What is Feature Store?

Answer: A central repository for ML features.
Benefits: feature reuse, consistency, online serving, offline training
Tools: Feast, Tecton

13. Explain Docker in MLOps

Dockerfile:

    FROM python:3.11
    COPY . /app
    WORKDIR /app
    RUN pip install -r requirements.txt
    CMD ["python","app.py"]

Benefits: portability, reproducibility

14. Difference between Docker and Kubernetes?

Docker              Kubernetes
Containerization    Orchestration
Single container    Multiple containers
Packaging           Scaling

15. How do you deploy ML models on Kubernetes?

Steps: Build Docker Image → Push to Registry → Create Deployment → Create Service → Expose API

Deployment (abridged; a complete manifest also needs spec.selector and spec.template):

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: model
    spec:
      replicas: 3

16. What is Canary Deployment?

Answer: Deploy the new model to a small percentage of users.
90% → Old Model
10% → New Model
If successful: 100% New Model

17. Blue-Green Deployment?

Answer: Blue = Production, Green = New Version. Switch traffic instantly.
Benefits: zero downtime, easy rollback

18. How would you deploy a model with zero downtime?

Answer: Kubernetes rolling update, blue-green deployment, or canary deployment.

19. How do you handle large datasets?

Techniques: Spark, partitioning, parallel processing

Example:

    df = df.repartition(100)  # repartition returns a new DataFrame

20. What if training data is 1 TB?

Answer: Never load it all into memory. Use Spark, batch processing, and distributed training.

21. What if model training takes 12 hours?

Answer: Options are distributed training, GPU, hyperparameter optimization, and incremental learning.

22. Explain Kubernetes HPA

Horizontal Pod Autoscaler. When CPU > 70%, scale: 3 Pods → 10 Pods

Example:

    kubectl autoscale deployment model --cpu-percent=70 --min=3 --max=10

23. What happens if a pod crashes?

Answer: Kubernetes automatically recreates it.
Controller: the ReplicaSet maintains the desired state.

24. How do you secure ML APIs?

Methods:
Authentication: JWT, OAuth
Encryption: HTTPS, TLS
Secrets: Kubernetes Secrets, AWS Secrets Manager

25. Explain FastAPI deployment

    from fastapi import FastAPI

    app = FastAPI()

    @app.get("/")
    def predict():
        return {"prediction": 1}

Run: uvicorn app:app

26. What is Model Explainability?

Techniques: SHAP, LIME, feature importance

Example:

    import shap

Shows why a prediction happened.

27. Scenario: Accuracy dropped from 95% to 70%

Approach:
Check: data drift, concept drift, data quality, pipeline failures, feature changes
Then: retrain, validate, redeploy

28. Scenario: Prediction API latency increased

Investigate: CPU, memory, network, database, model size
Optimization: caching, autoscaling, quantization, GPU inference

29. Scenario: Production model gives different results than training

Root causes: feature mismatch, data preprocessing mismatch, version mismatch, missing transformations
Solution: Use the same pipeline object for training and serving.

30. Design an End-to-End MLOps Architecture

Data Sources → Kafka → Spark → Feature Store → Training Pipeline → MLflow → Model Registry → Docker → Kubernetes → FastAPI → Prometheus/Grafana → Retraining Pipeline

Advanced EPAM Follow-up Questions

Why use Kubernetes instead of ECS?
Multi-cloud support, better ecosystem, advanced autoscaling, service mesh support

Why MLflow over DVC?
Experiment tracking, model registry, deployment integration

How
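
Here is a minimal sketch of the PSI check listed under question 9, applied to the age shift from question 7. The function name `psi`, the ten equal-width bins, the `eps` floor, and the synthetic ages are assumptions made for illustration. Alert thresholds such as 0.2 are a common rule of thumb, not a fixed standard.

```python
import math
import random


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between two numeric samples.

    Bin edges are equal-width over the range of `expected`; values of
    `actual` outside that range are clamped into the first or last bin.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 for constant data

    def fractions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # Floor each fraction at eps so the logarithm is always defined.
        return [max(c / len(values), eps) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


rng = random.Random(0)  # fixed seed so the illustration is repeatable
train_ages = [rng.uniform(20, 40) for _ in range(5000)]  # training: ages 20-40
same_ages = [rng.uniform(20, 40) for _ in range(5000)]   # no drift
prod_ages = [rng.uniform(50, 80) for _ in range(5000)]   # production: ages 50-80

print(f"PSI, same distribution: {psi(train_ages, same_ages):.4f}")
print(f"PSI, shifted ages:      {psi(train_ages, prod_ages):.4f}")
```

A value near zero suggests the two samples are distributed alike, while a large value such as the shifted case would be a reasonable trigger for the retraining pipeline in question 11.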

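The 90/10 split from question 16 can be sketched at the application level as a deterministic hash of the user ID. The function name `route`, the labels `old_model` and `new_model`, and the `user-N` IDs are hypothetical; in a Kubernetes setup the split is more often configured at the ingress or service mesh layer than in application code.

```python
import hashlib


def route(user_id: str, canary_percent: int = 10) -> str:
    """Return which model version should serve this user.

    Hashing the ID keeps each user on the same version across requests.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in [0, 100)
    return "new_model" if bucket < canary_percent else "old_model"


users = [f"user-{i}" for i in range(10_000)]
share = sum(route(u) == "new_model" for u in users) / len(users)
print(f"share routed to new model: {share:.1%}")
```

Raising `canary_percent` step by step to 100 completes the rollout, and setting it back to 0 acts as a rollback.
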
Interview Answer

Anonymous

Jun 3, 2026
