Data Engineering & Database Development
I design and optimize robust database solutions and data pipelines (ETL) that work with various SQL platforms—whether it's SQL Server, PostgreSQL, or others. Using Python and SQL, I build scalable systems that efficiently extract, transform, and load data, ensuring you have a solid foundation for data-driven decision-making.
Example of ETL implementation:
a. Data Extraction (E)
- Sources: Data might come from multiple sources such as apps, devices, or third-party APIs.
- Access Methods:
- File-based ingestion: Read files (CSV, JSON, Parquet, etc.) from Azure Blob Storage or Azure Data Lake Storage (ADLS).
- Streaming ingestion: Use Structured Streaming to capture real-time data from devices or APIs (a minimal sketch follows the batch example below).
- Example: Reading a JSON file from ADLS:
df_raw = spark.read.json("abfss://<container>@<storage_account>.dfs.core.windows.net/path/to/data/")
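For the streaming ingestion path mentioned above, here is a minimal sketch using Spark Structured Streaming with a file source on ADLS and a Delta sink. The schema fields, paths, and checkpoint location are illustrative assumptions, and a Delta-enabled environment (e.g. Databricks) is assumed.
from pyspark.sql.types import StructType, StructField, StringType, LongType

# A schema is required for streaming file sources; these fields are illustrative assumptions.
event_schema = StructType([
    StructField("user_id", StringType()),
    StructField("timestamp", StringType()),
    StructField("status", StringType()),
    StructField("steps", LongType()),
])

# Incrementally pick up new JSON files as they land in ADLS.
df_stream = (spark.readStream
             .schema(event_schema)
             .json("abfss://<container>@<storage_account>.dfs.core.windows.net/path/to/streaming/"))

# Continuously append incoming records to a raw Delta table (paths are placeholders).
query = (df_stream.writeStream
         .format("delta")
         .option("checkpointLocation", "/delta/_checkpoints/raw_events")
         .outputMode("append")
         .start("/delta/raw_events"))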
b. Data Transformation (T)
- Data Cleaning & Enrichment: Remove duplicates, handle missing values, and transform raw data into a structured format (a short cleaning sketch follows the PySpark example below).
- Data Modeling: Aggregate, join, or pivot data as needed for downstream analytics.
- Using PySpark:
from pyspark.sql.functions import col, to_date, sum as spark_sum

# Keep only active records, derive a calendar date from the event timestamp,
# and aggregate daily step counts per user.
df_transformed = (df_raw.filter(col("status") == "active")
                  .withColumn("date", to_date(col("timestamp")))
                  .groupBy("user_id", "date")
                  .agg(spark_sum("steps").alias("total_steps")))
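To illustrate the cleaning step mentioned above, here is a minimal sketch on the same df_raw; the duplicate-key columns, fill defaults, and required columns are assumptions for illustration.
# Remove exact duplicate events, fill missing step counts, and drop rows
# without the keys needed downstream (column choices are illustrative assumptions).
df_clean = (df_raw
            .dropDuplicates(["user_id", "timestamp"])
            .na.fill({"steps": 0})
            .na.drop(subset=["user_id", "timestamp"]))
df_clean could then replace df_raw as the input to the aggregation above.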
c. Data Loading (L)
- Destination Options:
- Delta Lake: Write data into Delta tables for ACID transactions and optimized reads.
- Data Warehouse: Alternatively, you might load the data into Azure SQL Database or another analytics platform (a JDBC sketch follows the Delta example below).
- Example: Writing data to a Delta table:
df_transformed.write.format("delta").mode("overwrite").save("/delta/gathered_data")
- Registering as a Table (if needed for BI tools):
spark.sql("CREATE TABLE IF NOT EXISTS gathered_data USING DELTA LOCATION '/delta/gathered_data'")