OPSWAT MetaDefender Cloud As a Service

About MetaDefender Cloud

MetaDefender Cloud is OPSWAT’s cloud-based, advanced threat prevention and malware analysis platform. Our unique combination of Deep CDR and Multiscanning with over 20 leading AV engines protects organizations from zero-day attacks and increasingly sophisticated malware. MetaDefender Cloud’s sandbox, combined with real-time hash, IP, and domain analysis using OPSWAT’s world-class threat intelligence database, aids malware researchers and provides a deep understanding of existing and potential threats.

The MetaDefender Cloud platform currently handles over 5 million scan requests per day from our customers while maintaining an average scan time of 0.4 seconds.

Why did we develop MetaDefender as a Service (MDaaS)?

To meet the market’s requirements and better support our customers

We wanted to ensure that MetaDefender Cloud could scale to meet changing requirements: the growing need for advanced application security services and the increasing complexity of DevOps security as more applications move to the cloud. With file traffic increasing, MetaDefender Cloud had to maintain and improve its performance to ensure a smooth experience for our end customers.

To enhance monitoring and predictive scaling

We decided to migrate our on-premises architecture to a cloud-native, Kubernetes-based microservices platform managed with infrastructure as code, providing a seamless and consistent experience compared with the previous deployment and monitoring model.

MetaDefender As a Service Architecture

A diagram of MetaDefender as a service architecture

As we migrated to MDaaS, our Multiscanning services moved from the Windows-based AMI to a Kubernetes-based cluster. Administrators can now configure scalability per engine: since engines differ in performance, the slower engines can be scaled up to maintain quick scan times.
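To illustrate the per-engine scaling idea, here is a minimal sketch of how replica counts could be sized so that every engine sustains the same target throughput. The engine names, scan times, and the 44 RPS target (taken from the monitoring section of this article) are illustrative, not OPSWAT's actual configuration.

```python
# Sketch: slower engines need proportionally more replicas to keep up.
# One replica processes ~1/avg_scan_seconds files per second, so
# target_rps * avg_scan_seconds replicas are needed per engine.
import math

def replicas_needed(avg_scan_seconds: float, target_rps: float) -> int:
    return max(1, math.ceil(target_rps * avg_scan_seconds))

# Hypothetical average scan times per engine, in seconds.
engine_scan_times = {"engine_a": 0.2, "engine_b": 1.5, "engine_c": 0.5}
target_rps = 44  # overall requests per second the cluster must sustain

scaling_plan = {name: replicas_needed(t, target_rps)
                for name, t in engine_scan_times.items()}
print(scaling_plan)  # the 1.5 s engine gets far more replicas than the 0.2 s one
```

In a Kubernetes deployment these numbers would typically feed an autoscaling policy rather than be set statically, but the proportionality is the same.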

The file processing flow is now as follows:

1. A message from an external requestor is sent to a “request” Kafka topic (1) with request instructions, such as scanning a file with AV1, AV2, etc., sanitizing it with Deep CDR, or analyzing it with Sandbox.

2. A Lambda extractor (2), which subscribes to these messages, divides the request into a number of separate commands and sends them to another Kafka topic (3), where they are classified and assigned to the relevant engine(s) (4).

3. The engine processing stage (4) is the heart of the system. It contains several engine containers, runs on Amazon Elastic Kubernetes Service (EKS), and can scale in or out based on the workload. Each engine handles a specific type of request, which boosts processing performance.

4. During processing, an S3 bucket (5) is used to store the input and output files.

5. At the same time, a log processing module (6) receives logs from the engines and delivers them to a log analysis system.

6. After the file is processed, the result from each engine is returned to the “results” Kafka topic (7).

7. Finally, a microservices aggregator using AWS Lambda (8) consolidates the results into one report and sends it to a Kafka topic (9), returning it to the requestor.
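The fan-out and fan-in at the heart of this flow (steps 2 and 7) can be sketched in plain Python, independent of Kafka. The message field names and verdict logic below are illustrative assumptions, not MDaaS's actual schema.

```python
# Sketch of the extractor (fan-out) and aggregator (fan-in) steps.
# Field names ("id", "file_key", "engines", "infected") are invented.

def extract_commands(request: dict) -> list[dict]:
    """Split one scan request into one command per requested engine."""
    return [{"request_id": request["id"],
             "engine": engine,
             "file_key": request["file_key"]}
            for engine in request["engines"]]

def aggregate_results(request_id: str, results: list[dict]) -> dict:
    """Consolidate per-engine results into a single report."""
    return {"request_id": request_id,
            "verdict": "infected" if any(r["infected"] for r in results)
                       else "clean",
            "engines": {r["engine"]: r["infected"] for r in results}}

request = {"id": "r1", "file_key": "s3://bucket/sample.bin",
           "engines": ["av1", "av2", "cdr"]}
commands = extract_commands(request)            # fan-out toward topic (3)
results = [{"engine": c["engine"], "infected": c["engine"] == "av2"}
           for c in commands]                   # stand-in engine outputs
report = aggregate_results("r1", results)       # fan-in from topic (7)
print(report["verdict"])  # "infected" because the stand-in av2 result flags it
```

In the real system, each of these functions would sit behind a Kafka consumer (the extractor and aggregator run as AWS Lambda functions), but the transformation itself is this simple split-then-merge.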

Technical Challenges and Solutions

Predicting engine behavior and handling abnormalities

The traditional MD Core AMI deployment allows the engines to run on a powerful machine where they share resources (CPU, RAM, disk, network, etc.) with each other. With the microservices architecture, however, each engine operates individually in a less powerful container, which made it difficult to define the system’s resource requirements.

To address this issue, we used historical data from the old system to set a baseline for each engine and added Datadog monitoring. We kept monitoring the engines’ behavior and fine-tuning the infrastructure until the product achieved superior performance.
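One way to turn historical data into a per-engine baseline, sketched below, is to take a high percentile of observed utilization plus headroom as the container's resource request. The percentile, headroom factor, and sample values here are all invented for illustration; the article does not specify how OPSWAT computed its baselines.

```python
# Sketch: derive a container resource request from historical samples.
# Nearest-rank percentile plus a headroom multiplier (both assumptions).
import math

def baseline(samples: list[float], percentile: float = 0.95,
             headroom: float = 1.2) -> float:
    """Nearest-rank percentile of historical samples, with headroom."""
    ordered = sorted(samples)
    idx = math.ceil(percentile * len(ordered)) - 1
    return ordered[idx] * headroom

# Hypothetical CPU samples for one engine, in millicores.
cpu_samples_mcores = [210, 250, 240, 265, 230, 260, 220, 245, 255, 235]
request_mcores = round(baseline(cpu_samples_mcores))
print(request_mcores)  # p95 of the samples (265) with 20% headroom
```

A baseline like this is only a starting point; as the article notes, the values were then fine-tuned against live Datadog monitoring.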

Maintaining a balance between performance and hosting costs

With the new architecture, MetaDefender Cloud can easily scale to meet our customers’ needs and perform at optimal levels. However, maintenance costs could surge proportionally: without spending checks or a governance model, scaling could become uncontrolled, driving cloud service bills far beyond the initially allocated budget.

Therefore, we conducted frequent architectural reviews with stakeholders to ensure a consistent experience at stable, balanced costs.
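A spending check of the kind described above can be as simple as capping autoscaling at what the budget affords. The prices, budget, and replica counts below are fabricated for the example; the article does not describe OPSWAT's actual governance mechanism.

```python
# Sketch of a budget guardrail: never scale beyond what the monthly
# budget can pay for, even if the workload asks for more replicas.
def cap_replicas(desired: int, cost_per_replica_month: float,
                 monthly_budget: float) -> int:
    affordable = int(monthly_budget // cost_per_replica_month)
    return min(desired, max(1, affordable))

# Hypothetical numbers: autoscaler wants 40 replicas, budget covers 30.
print(cap_replicas(desired=40, cost_per_replica_month=55.0,
                   monthly_budget=1650.0))  # 30: the budget caps the scale-out
```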

Environment simulation

Simulating a production load in a non-production environment without real production data is a challenge. To handle this, we set up parallel workflows so that real data passed through both the old and new architectures, allowing us to assess key metrics side by side. This apples-to-apples comparison let us quickly identify where the new architecture was superior to the old and where it still needed improvement.
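The side-by-side assessment can be sketched as a per-endpoint comparison of the same metric collected from both stacks. The endpoint names and latency figures below are fabricated for illustration.

```python
# Sketch: the same requests flow through both architectures; compare
# the mean latency per endpoint. Negative % change means the new stack
# is faster. All numbers are made up.
from statistics import mean

old_latencies = {"scan": [0.61, 0.58, 0.64], "hash_lookup": [0.09, 0.08, 0.10]}
new_latencies = {"scan": [0.42, 0.39, 0.45], "hash_lookup": [0.08, 0.09, 0.09]}

def compare(old: dict, new: dict) -> dict:
    """Per-endpoint percent change in mean latency (negative = faster)."""
    return {k: round((mean(new[k]) - mean(old[k])) / mean(old[k]) * 100, 1)
            for k in old}

print(compare(old_latencies, new_latencies))
```

Running the comparison over live traffic, rather than synthetic load, is what makes the result apples-to-apples: both stacks see identical inputs at identical times.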

Monitoring, Reporting & Control

Real-time cloud infrastructure monitoring

MetaDefender Cloud puts a strong emphasis on building robust monitoring into its systems to provide a clear view of system health. For a service like MDaaS, which handles over 44 requests per second (RPS) at a 0.6% error rate, relies on several upstream systems and partner ecosystems as its traffic sources, and simultaneously produces heavy traffic for various internal and external downstream systems, it is important to have a strong combination of metrics, alerting, and logging in place.

A dashboard showing real-time cloud infrastructure monitoring

Alerts on high-abnormal-traffic by environments in Datadog

In addition to standard system health metrics such as CPU, memory, and performance, we added several “edge-of-the-service” metrics such as queue growth, service response time, StatusCake checks, and logging to capture aberrations from upstream or downstream systems. We also added trend analysis for important metrics to help catch longer-term degradations. We instrumented MDaaS with Datadog’s real-time stream processing, which let us track events in real time over the wire at container-specific granularity, making debugging easier. Finally, we found it useful to have service-specific alerting to help identify the root causes of issues faster.
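The trend analysis mentioned above can be sketched as a least-squares slope over a sliding window of a metric: a slow, steady climb that would never trip a static threshold still produces a clearly positive slope. The metric name and values are invented; Datadog provides this kind of analysis natively, so the code is only a conceptual illustration.

```python
# Sketch: ordinary least-squares slope of a metric window. A positive
# slope on, say, queue depth flags creeping degradation early.
def trend_slope(values: list[float]) -> float:
    """OLS slope of values over time steps 0..n-1."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den

queue_depth = [100, 103, 101, 106, 108, 107, 111, 113]  # creeping upward
slope = trend_slope(queue_depth)
assert slope > 0  # rising trend: worth alerting before a hard limit is hit
```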


Creating incidents on exceptions that require Site Reliability Engineers' attention in Datadog

SaaS monitoring with the Datadog platform enables teams to onboard quickly and easily, and eliminates the need for ongoing tool maintenance, capacity scaling, updates, or management. This leaves teams more time to work on the core product instead of building a monitoring solution of their own.

Alert notifications from MetaDefender


• By migrating to MDaaS, the engine microservices are now flexible enough to help meet FedRAMP Moderate baseline security control requirements.

• Application performance monitoring is now enhanced with real-time alerting and dashboards. The new microservices architecture enables administrators to monitor the application and each component easily and effectively. It also facilitates easy deployment and scalability.

• Because the infrastructure is defined as code, users can easily edit and distribute configurations while ensuring the desired state of the infrastructure. This makes infrastructure configurations reproducible.

Learn more about MetaDefender Cloud or contact us for more information.
