Job Description
Responsibilities
- Data Pipeline Development: Design, develop and maintain highly scalable, optimized ETL pipelines using PySpark on the Cloudera Data Platform, ensuring data integrity.
- Ingestion: Implement and manage data ingestion processes from a variety of sources (relational databases, APIs, file systems) to the data lake or data warehouse.
- Transformation and Processing: Use PySpark to process, cleanse and transform large datasets into meaningful formats that support analytical and business needs.
- Optimization: Conduct performance tuning of PySpark code and Cloudera components, optimizing resource utilization and reducing ETL runtime.
- Quality and Validation: Implement data quality checks, monitoring and validation routines to ensure data accuracy and reliability throughout the pipeline.
- Orchestration: Automate data workflows using Apache Oozie, Airflow or similar orchestration tools within the Cloudera environment. ...
Apply for this Position
Ready to join Virtusa? Submit your application.