Data over-engineering is one of the most expensive traps in modern tech. A Moroccan SME processing a few gigabytes of data per day does not need Spark, yet we regularly see full Databricks or Spark-on-Kubernetes stacks deployed for use cases a Python script on Azure Functions, AWS Lambda, or Cloud Run would solve at 1% of the cost. This guide helps make the right call between the two approaches in 2026, starting from the real volumes Moroccan teams actually face.
The simple rule: 100 GB per day
Let's state the practical rule up front: above roughly 100 GB of data to process per day, Apache Spark starts to become the right tool. Below that threshold, serverless functions (Azure Functions, AWS Lambda, Google Cloud Run) cover almost every need, at radically lower cost and with far simpler operations.
It is an indicative rule, and exceptions exist, but for 80% of Moroccan SMEs you sit in the "small data" zone that does not justify Spark. Here is why, and how to build a data stack matched to the volumes you actually handle.
Why Spark was designed for problems you don't have
Apache Spark was created in 2009 at Berkeley to solve a specific problem: processing terabytes of data in memory across dozens or hundreds of nodes. It was the natural evolution of Hadoop MapReduce, faster thanks to in-memory processing and an optimized execution graph.
The problem: most companies that adopt Spark never process terabytes in practice. They handle files of 5 to 50 GB a few times a day, and spend their time managing Spark's operational complexity: Databricks or EMR clusters to provision, jobs that crash on partition skew, library dependencies that are hard to pin, and cluster costs out of all proportion to the actual data volume.
For a typical Moroccan SME with 1 to 10 million transactions per month, here are the volumes you really see:
- Application logs: 1 to 10 GB per day
- E-commerce / transactional events: 100 MB to 5 GB per day
- CRM / marketing data: 10 MB to 1 GB per day
- Accounting / financial exports: 1 to 50 MB per day
- Network log / security data: 5 to 50 GB per day
At those volumes, Spark is the equivalent of using a 40-ton truck to deliver a pizza. You pay for infrastructure you do not use, you manage complexity you do not need to manage, and you lose development agility.
When to use serverless functions
Azure Functions, AWS Lambda, and Google Cloud Run together cover 90% of SME data use cases. The most common pattern: a trigger (a file dropped into Blob Storage or S3, an HTTP event, a queue message) fires a function that processes the data, writes the result to a database or file, and exits. No cluster to maintain, no complicated orchestrator, and billing by the millisecond of compute.
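To make the pattern concrete, here is a minimal sketch assuming the Azure Functions Python v2 programming model; the "incoming" container, the `raw_uploads` table, and the `WAREHOUSE_URL` setting are placeholders, and the same shape applies to an S3-triggered Lambda or a Cloud Run service.

```python
import io
import os

import azure.functions as func
import pandas as pd
import sqlalchemy

app = func.FunctionApp()

# Fires whenever a file lands in the (hypothetical) "incoming" container.
@app.blob_trigger(arg_name="blob", path="incoming/{name}",
                  connection="AzureWebJobsStorage")
def process_upload(blob: func.InputStream) -> None:
    # Read the dropped CSV straight into memory (fine at these volumes).
    df = pd.read_csv(io.BytesIO(blob.read()))

    # Light transformation: drop duplicates, normalize column names.
    df = df.drop_duplicates()
    df.columns = [c.strip().lower() for c in df.columns]

    # Write the result to the warehouse, then the function simply exits.
    engine = sqlalchemy.create_engine(os.environ["WAREHOUSE_URL"])
    df.to_sql("raw_uploads", engine, if_exists="append", index=False)
```

The whole job lives in one file, deploys in minutes, and on a consumption plan costs essentially nothing while no file arrives.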
Typical case 1: nightly e-commerce ETL.
Every night at 2 AM, your store exports the day's orders to a CSV file. An Azure Function wakes up, reads the file, loads it into your data warehouse, and refreshes the dashboard. Volume: 50,000 rows, 20 MB. Cost: under $0.01 per run.
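A sketch of that nightly job under the same assumptions; the 2 AM NCRONTAB schedule, the "exports" container, the `orders_latest.csv` name, and the `orders` table are illustrative, not a prescription.

```python
import io
import os

import azure.functions as func
import pandas as pd
import sqlalchemy
from azure.storage.blob import BlobServiceClient

app = func.FunctionApp()

# "0 0 2 * * *" is NCRONTAB for 02:00 every night.
@app.timer_trigger(schedule="0 0 2 * * *", arg_name="timer")
def nightly_orders_etl(timer: func.TimerRequest) -> None:
    # Fetch the day's export (container and blob names are placeholders).
    storage = BlobServiceClient.from_connection_string(os.environ["AzureWebJobsStorage"])
    raw = storage.get_blob_client("exports", "orders_latest.csv").download_blob().readall()

    # ~50,000 rows / ~20 MB: comfortably in-memory on a single worker.
    orders = pd.read_csv(io.BytesIO(raw), parse_dates=["ordered_at"])

    # Load into the warehouse; the BI dashboard reads from this table.
    engine = sqlalchemy.create_engine(os.environ["WAREHOUSE_URL"])
    orders.to_sql("orders", engine, if_exists="append", index=False)
```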
Typical case 2: real-time event enrichment.
Stripe events arrive on a webhook. An Azure Function enriches them (margin calculation, marketing attribution, basic anti-fraud) and pushes them to your data lake. Volume: 1,000 events per hour. Cost: under $5 per month.
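A possible shape for the webhook handler, again with the v2 model; the payload fields (`amount`, `cost`, `utm_source`) and the "enriched-events" queue are hypothetical stand-ins for your own schema.

```python
import json
import os

import azure.functions as func
from azure.storage.queue import QueueClient

app = func.FunctionApp()

@app.route(route="stripe-events", methods=["POST"],
           auth_level=func.AuthLevel.FUNCTION)
def enrich_event(req: func.HttpRequest) -> func.HttpResponse:
    event = req.get_json()

    # Enrichment: gross margin and a naive marketing-attribution flag.
    amount = event.get("amount", 0)
    cost = event.get("cost", 0)
    event["margin"] = amount - cost
    event["attributed"] = event.get("utm_source") is not None

    # Hand the enriched event to a queue; a downstream function lands it
    # in the data lake, keeping this webhook handler fast.
    queue = QueueClient.from_connection_string(
        os.environ["AzureWebJobsStorage"], "enriched-events")
    queue.send_message(json.dumps(event))

    return func.HttpResponse(status_code=202)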
Typical case 3: weekly report generation.
Every Monday morning, a function pulls the week's data, computes KPIs, generates a PDF, emails it to the team. Volume: 500 MB pulled, 200 KPIs computed. Cost: $0.03 per run.
For a typical project, the entire serverless data stack of a Moroccan SME costs $50 to $300 per month. The same need on Spark with Databricks would run $1,500 to $5,000 per month.
When Spark becomes the right tool
Three scenarios legitimize Spark investment.
Scenario 1: Batch processing of very large datasets.
If your business generates more than 100 GB of data per day to process — typically a network log platform, a massive-traffic site (millions of unique daily visitors), or a telecom operator — Spark is the right tool. Distributed parallelization becomes necessary to meet overnight batch SLAs.
Scenario 2: Heavy machine learning pipelines.
If you train ML models on large datasets or run feature engineering on millions of rows, Spark MLlib or Spark + PyTorch on Databricks remains standard. For simpler models (classification, regression on under 1M rows), pandas + scikit-learn in an Azure Function is more than enough.
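As a point of reference, this is roughly what such a "small" model looks like with pandas and scikit-learn; the parquet path and feature names are invented for the sketch.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# A few hundred thousand rows load fine into a single worker's memory.
df = pd.read_parquet("features.parquet")  # placeholder path
X = df[["basket_value", "visits_30d", "days_since_last_order"]]  # assumed features
y = df["churned"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scaling + logistic regression: trains in seconds on <1M rows, no cluster needed.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
print(f"Holdout accuracy: {model.score(X_test, y_test):.3f}")
```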
Scenario 3: Compliance audit and data lineage.
Some sectoral regulations (health, finance, telecom) require a complete audit trail of data transformations. Spark on Databricks provides that audit trail natively through Unity Catalog. On serverless, you must build it by hand — possible, but complex.
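To give a sense of what "building it by hand" means, here is one minimal sketch: every transformation step writes a row to a hand-rolled `audit_log` table with a content fingerprint. The table and its columns are an assumption, not a standard, and a real compliance setup needs far more (schema versions, user identity, retention).

```python
import hashlib
import os
from datetime import datetime, timezone

import pandas as pd
import sqlalchemy

engine = sqlalchemy.create_engine(os.environ["WAREHOUSE_URL"])  # placeholder

def log_step(step: str, source: str, df: pd.DataFrame) -> None:
    """Record one transformation step in a hand-rolled audit_log table."""
    record = pd.DataFrame([{
        "step": step,
        "source": source,
        "row_count": len(df),
        # Content fingerprint so a later audit can check the data was unchanged.
        "content_sha256": hashlib.sha256(
            pd.util.hash_pandas_object(df).values.tobytes()).hexdigest(),
        "logged_at": datetime.now(timezone.utc),
    }])
    record.to_sql("audit_log", engine, if_exists="append", index=False)
```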
Comparison table
| Criterion | Serverless Functions | Apache Spark |
|---|---|---|
| Ideal daily volume | Under 100 GB | Over 100 GB |
| Typical SME monthly cost | $50 to $300 | $1,500 to $10,000 |
| Operational complexity | Low | High |
| Time to production | 1 to 2 weeks | 4 to 12 weeks |
| Skills required | Python or Node.js | Spark + Scala/Python + Kubernetes |
| Startup latency | 100 ms to 5 s (cold start) | 30 s to 5 min (cluster startup) |
| Real-time streaming fit | Limited (Event Hubs / Kinesis) | Yes (Spark Structured Streaming) |
| Heavy ML integration | Limited | Excellent (MLlib) |
Recommended architecture for a Moroccan SME
For 80% of Moroccan SMEs, the ideal 2026 data stack looks like this.
Storage: Azure Blob Storage or AWS S3 as data lake ($50 to $500 per month depending on volume). Managed PostgreSQL (Azure Database for PostgreSQL, AWS RDS) for transactional and business views. BigQuery, Snowflake, or ClickHouse for analytical queries (from $100 per month on small volumes).
Ingestion: Azure Functions or AWS Lambda for triggers. Logic Apps or Step Functions for multi-step orchestrations. Azure Data Factory or AWS Glue for more complex ETL pipelines ($50 to $200 per month).
Transformation: dbt (data build tool) to model your transformations in versioned SQL. It has become the de facto SME standard over 2025-2026, simpler than Spark and more powerful than ad hoc Python scripts.
Visualization: Metabase (self-hosted, free), Looker Studio (free for basic uses), or Superset (self-hosted) depending on your maturity.
This stack covers the essentials of a Moroccan SME's data needs for $200 to $800 per month, and does not require Spark or Kubernetes skills. If your team wants to modernize its data stack without falling into over-engineering, our digital audit often includes this kind of framing. Our custom development team regularly implements these architectures for Moroccan SMEs, avoiding classic traps.
When the crossover becomes inevitable
A growing SME sometimes ends up crossing the 100 GB/day threshold. Here are the signs that it is time to consider Spark.
- Your Azure Functions regularly hit the Consumption plan's 10-minute execution timeout.
- You pay more than $2,000 per month on serverless that runs almost continuously.
- Your data warehouse (BigQuery, Snowflake) bills more than $3,000 per month in query compute.
- Your data scientists complain they cannot load datasets into memory to iterate.
When 2 or 3 of those signals appear simultaneously, it is time to prototype Databricks or Spark on Kubernetes. Not before.
A note on the migration path
If you currently run a Spark stack that does not need it, the migration to a serverless-plus-warehouse architecture typically takes 8 to 16 weeks for a team with the right skills. The first 4 weeks model the existing pipelines in dbt or pure SQL, the next 4 weeks rebuild the ingestion in Functions or Lambda, and the final phase runs both stacks in parallel for verification before cutting over. Done well, the migration pays back in 6 to 12 months through cluster cost savings alone.
FAQ
What is the difference between Azure Functions and Azure Databricks?
Azure Functions is a serverless environment for running short-lived code (up to 10 minutes on the Consumption plan) in response to an event. Azure Databricks is a managed Spark platform for processing large data volumes in distributed mode. The two cover different needs: Functions for small and medium data, Databricks for big data.
Can you do machine learning in Azure Functions?
Yes, for inference with already-trained models. Training is more constrained: you are limited by memory (up to 14 GB on Premium plans) and by execution duration. To train on datasets over 100 MB, prefer Azure ML, AWS SageMaker, or Databricks.
How much does Apache Spark on Kubernetes cost versus Databricks?
Spark on Kubernetes is often 30% to 50% cheaper than Databricks at equivalent compute power, but requires far more operational expertise. For a Moroccan SME, Databricks remains more cost-efficient once you add the cost of operational maintenance.
Does dbt replace Spark for data transformation?
For most analytical transformations (joins, aggregations, dimensional modeling), yes. dbt runs directly in your data warehouse (BigQuery, Snowflake, Redshift, ClickHouse) and does not need Spark. Spark remains relevant for very heavy transformations or distributed machine learning.
Is serverless reliable for critical production pipelines?
Yes, provided you take care of three things: idempotency (your function can be re-executed without side effects), monitoring (centralized logs, error alerting), and dead-letter queue handling (what happens when a function fails three times in a row). With those guardrails, Azure Functions and AWS Lambda offer a 99.95% SLA, better than many poorly operated Spark stacks.
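As an illustration of the idempotency guardrail, here is one possible sketch for a queue-triggered loader: it checks a `processed_events` table before doing any work, so a redelivered message is harmless. The queue name, table, and event id field are assumptions.

```python
import json
import os

import azure.functions as func
import sqlalchemy
from sqlalchemy import text

app = func.FunctionApp()
engine = sqlalchemy.create_engine(os.environ["WAREHOUSE_URL"])  # placeholder

@app.queue_trigger(arg_name="msg", queue_name="enriched-events",
                   connection="AzureWebJobsStorage")
def load_event(msg: func.QueueMessage) -> None:
    event = json.loads(msg.get_body())
    with engine.begin() as conn:
        # Idempotency: skip events already landed (redeliveries become no-ops).
        seen = conn.execute(
            text("SELECT 1 FROM processed_events WHERE event_id = :id"),
            {"id": event["id"]}).first()
        if seen:
            return
        conn.execute(
            text("INSERT INTO processed_events (event_id) VALUES (:id)"),
            {"id": event["id"]})
        # ... the actual load into the fact tables goes here ...
    # If this raises, the platform retries and, after repeated failures,
    # routes the message to the poison/dead-letter queue for inspection.
```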
