ScanXml

Collects metadata and extracts embedded files from XML files.

This scanner parses XML files to collect metadata and extract embedded files based on specified tags. It is used in forensic and malware analysis to extract and analyze structured data within XML documents.

Scanner Type: Collection

Options

  • extract_tags (list[str]): Tags whose content is extracted as child files.
  • metadata_tags (list[str]): Tags whose content is logged as metadata.
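As a rough illustration of how these two options are consumed, the sketch below builds an options dictionary and lowercases every tag, mirroring the normalization at the top of the scan method in the source listing. The tag names used here (Target, SCRIPT, Category) are hypothetical examples, not defaults shipped with Strelka.

```python
# Hypothetical option values; the tag names are illustrative only.
options = {
    "extract_tags": ["Target", "SCRIPT"],
    "metadata_tags": ["Category"],
}

# Same normalization as in the scan() listing: each configured tag is
# lowercased, and a missing option falls back to an empty list.
xml_options = {
    "extract_tags": [tag.lower() for tag in options.get("extract_tags", [])],
    "metadata_tags": [tag.lower() for tag in options.get("metadata_tags", [])],
}

print(xml_options)
```

Because both the configured tags and the tags found in the document are lowercased before comparison, matching is case-insensitive.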

Detection Use Cases

  • Embedded File Extraction
    • Extracts files embedded within specific XML tags.
  • Metadata Extraction
    • Collects metadata from specific XML tags.
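The two use cases above can be sketched with the standard library alone. This is a simplified stand-in, not the scanner itself: it uses xml.etree.ElementTree instead of lxml, returns the extracted text instead of emitting child files, and records each matching attribute's own value. The sample document and the names payload and category are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; "payload" and "category" are made-up names.
SAMPLE = b"""<root xmlns="http://example.com/ns">
  <item category="science"><payload>  ZWNobyBoaQ==  </payload></item>
  <item><payload></payload></item>
</root>"""


def collect(data, extract_tags, metadata_tags):
    """Return (extracted_contents, metadata) for the given XML bytes."""
    extracted, metadata = [], []
    for node in ET.fromstring(data).iter():
        # Drop any "{namespace}" prefix and lowercase the local tag name.
        tag = node.tag.rpartition("}")[2].lower()
        text = (node.text or "").strip()
        # Non-empty text of a configured tag is what would become a child file.
        if tag in extract_tags and text:
            extracted.append(text)
        # Attributes whose names are configured are recorded as metadata.
        for name, value in node.attrib.items():
            if name.lower() in metadata_tags:
                metadata.append({"tag": name, "content": value})
    return extracted, metadata


extracted, metadata = collect(SAMPLE, ["payload"], ["category"])
print(extracted)
print(metadata)
```

Note that the empty payload element in the second item produces nothing, since only non-empty text is treated as extractable content.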

Known Limitations

  • Complex or malformed XML structures might lead to incomplete parsing or errors.
  • Excessive files may be scanned or collected if XML MIME types are set in backend.yml.
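To make the first limitation concrete, the sketch below shows how a single mismatched tag aborts parsing of an entire document. It uses the standard library's xml.etree.ElementTree as a stand-in; the scanner itself uses lxml, which in its default (non-recovering) mode also raises on such input. The helper name try_parse is hypothetical.

```python
import xml.etree.ElementTree as ET


def try_parse(data: bytes) -> str:
    """Return 'ok' if the bytes parse as XML, otherwise a short error string."""
    try:
        ET.fromstring(data)
    except ET.ParseError as exc:
        # One mismatched tag is enough to reject the whole document,
        # so nothing from the well-formed part is collected.
        return f"xml_parsing_error: {exc}"
    return "ok"


print(try_parse(b"<a><b>text</b></a>"))
print(try_parse(b"<a><b>text</a>"))
```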

To Do

  • Improve error handling for malformed XML structures.
  • Improve extraction of tags and metadata tags.

References

  • XML File Format Specification (https://www.w3.org/XML/)

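Contributors

  • Josh Liburdi (https://github.com/jshlbrd)
  • Paul Hutelmyer (https://github.com/phutelmyer)
  • Sara Kalupa (https://github.com/skalupa)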

Source code in strelka/src/python/strelka/scanners/scan_xml.py
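from typing import Any, Dict

from lxml import etree

from strelka import strelka
from strelka.auxiliary.iocs import extract_iocs_from_string


class ScanXml(strelka.Scanner):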
    """
    Collects metadata and extracts embedded files from XML files.

    This scanner parses XML files to collect metadata and extract embedded files based on specified tags.
    It is used in forensic and malware analysis to extract and analyze structured data within XML documents.

    Scanner Type: Collection

    Attributes:
        None

    Options:
        extract_tags (list[str]): Tags whose content is extracted as child files.
        metadata_tags (list[str]): Tags whose content is logged as metadata.

    ## Detection Use Cases
    !!! info "Detection Use Cases"
        - **Embedded File Extraction**
            - Extracts files embedded within specific XML tags.
        - **Metadata Extraction**:
            - Collects metadata from specific XML tags.

    ## Known Limitations
    !!! warning "Known Limitations"
        - Complex or malformed XML structures might lead to incomplete parsing or errors.
        - Excessive files may be scanned / collected if XML mimetypes are set in the `backend.yml`

    ## To Do
    !!! question "To Do"
        - Improve error handling for malformed XML structures.
        - Better extraction of tags / metadata tags

    ## References
    !!! quote "References"
        - XML File Format Specification (https://www.w3.org/XML/)

    ## Contributors
    !!! example "Contributors"
        - [Josh Liburdi](https://github.com/jshlbrd)
        - [Paul Hutelmyer](https://github.com/phutelmyer)
        - [Sara Kalupa](https://github.com/skalupa)
    """

    def scan(
        self, data: bytes, file: strelka.File, options: dict, expire_at: int
    ) -> None:
        """
        Parses XML data to extract metadata and files.

        Args:
            data: XML data as bytes.
            file: File object containing metadata about the scan.
            options: Dictionary of scanner options.
            expire_at: Time when the scan should be considered expired.

        Scans the XML file, extracting data and metadata based on the specified tags,
        and emits files as necessary.

        If parsing fails and the file's MIME flavors do not include text/xml, the scanner
        flags the file as not being XML and exits; any other parse failure is flagged as
        a parsing error.
        """

        # Prepare options with case-insensitive tag matching
        xml_options = {
            "extract_tags": [tag.lower() for tag in options.get("extract_tags", [])],
            "metadata_tags": [tag.lower() for tag in options.get("metadata_tags", [])],
        }

        # Initialize scan event data
        self.event["tags"] = []
        self.event["tag_data"] = []
        self.event["namespaces"] = []
        self.event["total"] = {"tags": 0, "extracted": 0}
        self.emitted_files: list[str] = []

        # Parse the XML content
        try:
            xml_buffer = data
            if xml_buffer.startswith(b"<?XML"):
                xml_buffer = b"<?xml" + xml_buffer[5:]
            xml = etree.fromstring(xml_buffer)
            docinfo = xml.getroottree().docinfo
            self.event["doc_type"] = docinfo.doctype if docinfo.doctype else ""
            self.event["version"] = docinfo.xml_version if docinfo.xml_version else ""

            # Recursively process each node in the XML
            self._recurse_node(xml, xml_options)

        except Exception as e:
            # If file given is not an XML file, do not proceed with ScanXML
            if "text/xml" not in file.flavors.get("mime", []):
                self.flags.append(
                    f"{self.__class__.__name__}: xml_file_format_error: File given to ScanXML is not an XML file, "
                    f"scanner did not run."
                )
            else:
                self.flags.append(
                    f"{self.__class__.__name__}: xml_parsing_error: Unable to scan XML file with error: {e}."
                )
            return

        # Finalize the event data for reporting
        self.event["tags"] = list(set(self.event["tags"]))
        self.event["total"]["tags"] = len(self.event["tags"])
        self.event["namespaces"] = list(set(self.event["namespaces"]))
        self.event["emitted_content"] = list(set(self.emitted_files))

        # Extract and add Indicators of Compromise (IOCs)
        self.add_iocs(extract_iocs_from_string(data.decode("utf-8")))

    def _recurse_node(self, node: etree._Element, xml_options: Dict[str, Any]) -> None:
        """
        Recursively processes each XML node to extract data and metadata.

        Args:
            node: The current XML node to process.
            xml_options: Options for data extraction and metadata logging.

        Iterates through XML nodes, extracting data and collecting metadata as specified
        by the scanner options.
        """
        if node is not None and hasattr(node.tag, "__getitem__"):
            namespace, _, tag = node.tag.partition("}")
            namespace = namespace[1:] if namespace.startswith("{") else ""
            tag = tag.lower()

            if tag:
                self.event["tags"].append(tag)
            if namespace:
                self.event["namespaces"].append(namespace)

            # Handle specific content extraction and emission
            if tag in xml_options["extract_tags"]:
                content = node.text.strip() if node.text else ""
                if content:
                    self.emit_file(content, name=tag)
                    self.emitted_files.append(content)
                    self.event["total"]["extracted"] += 1

            # Always process attributes to capture any relevant metadata or data for emission
            self._process_attributes(node, xml_options, tag)

            # Continue to recurse through child nodes to extract data
            for child in node:  # iterating an element yields its children
                self._recurse_node(child, xml_options)

    def _process_attributes(
        self, node: etree._Element, xml_options: Dict[str, Any], tag: str
    ) -> None:
        """
        Processes XML node attributes to extract or log data.

        Args:
            node: XML node whose attributes are being processed.
            xml_options: Configuration options for the scan.
            tag: The tag of the current XML node being processed.

        Logs data from attributes whose names appear in the metadata_tags list.
        As written, attributes are never emitted as files; extract_tags is not consulted here.
        """
        for attr_name, attr_value in node.attrib.items():
            attr_name_lower = attr_name.lower()
            if attr_name_lower in xml_options["metadata_tags"]:
                self.event["tag_data"].append(
                    {"tag": attr_name, "content": str(node.attrib)}
                )

scan(data, file, options, expire_at)

Parses XML data to extract metadata and files.

Parameters:

Name       Type   Description                                       Default
data       bytes  XML data as bytes.                                required
file       File   File object containing metadata about the scan.   required
options    dict   Dictionary of scanner options.                    required
expire_at  int    Time when the scan should be considered expired.  required

Scans the XML file, extracting data and metadata based on the specified tags, and emits files as necessary.

If parsing fails and the file's MIME flavors do not include text/xml, the scanner flags the file as not being XML and exits; any other parse failure is flagged as a parsing error.
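The choice between the two flags can be sketched as a small standalone function. This follows the except branch in the source listing, but the function name choose_flag and its parameters are hypothetical and exist only for illustration.

```python
def choose_flag(scanner_name: str, mime_flavors: list, error: Exception) -> str:
    """Pick the flag to record after XML parsing has failed."""
    if "text/xml" not in mime_flavors:
        # The file was never identified as text/xml, so report a format problem.
        return (
            f"{scanner_name}: xml_file_format_error: File given to ScanXML "
            f"is not an XML file, scanner did not run."
        )
    # The file was identified as text/xml but could not be parsed.
    return (
        f"{scanner_name}: xml_parsing_error: "
        f"Unable to scan XML file with error: {error}."
    )


print(choose_flag("ScanXml", ["text/plain"], ValueError("bad input")))
print(choose_flag("ScanXml", ["text/xml"], ValueError("bad input")))
```

One consequence of checking only for text/xml is that a file flavored solely as application/xml would receive the format error rather than the parsing error if it fails to parse.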

Source code in strelka/src/python/strelka/scanners/scan_xml.py
def scan(
    self, data: bytes, file: strelka.File, options: dict, expire_at: int
) -> None:
    """
    Parses XML data to extract metadata and files.

    Args:
        data: XML data as bytes.
        file: File object containing metadata about the scan.
        options: Dictionary of scanner options.
        expire_at: Time when the scan should be considered expired.

    Scans the XML file, extracting data and metadata based on the specified tags,
    and emits files as necessary.

    If parsing fails and the file's MIME flavors do not include text/xml, the scanner
    flags the file as not being XML and exits; any other parse failure is flagged as
    a parsing error.
    """

    # Prepare options with case-insensitive tag matching
    xml_options = {
        "extract_tags": [tag.lower() for tag in options.get("extract_tags", [])],
        "metadata_tags": [tag.lower() for tag in options.get("metadata_tags", [])],
    }

    # Initialize scan event data
    self.event["tags"] = []
    self.event["tag_data"] = []
    self.event["namespaces"] = []
    self.event["total"] = {"tags": 0, "extracted": 0}
    self.emitted_files: list[str] = []

    # Parse the XML content
    try:
        xml_buffer = data
        if xml_buffer.startswith(b"<?XML"):
            xml_buffer = b"<?xml" + xml_buffer[5:]
        xml = etree.fromstring(xml_buffer)
        docinfo = xml.getroottree().docinfo
        self.event["doc_type"] = docinfo.doctype if docinfo.doctype else ""
        self.event["version"] = docinfo.xml_version if docinfo.xml_version else ""

        # Recursively process each node in the XML
        self._recurse_node(xml, xml_options)

    except Exception as e:
        # If file given is not an XML file, do not proceed with ScanXML
        if "text/xml" not in file.flavors.get("mime", []):
            self.flags.append(
                f"{self.__class__.__name__}: xml_file_format_error: File given to ScanXML is not an XML file, "
                f"scanner did not run."
            )
        else:
            self.flags.append(
                f"{self.__class__.__name__}: xml_parsing_error: Unable to scan XML file with error: {e}."
            )
        return

    # Finalize the event data for reporting
    self.event["tags"] = list(set(self.event["tags"]))
    self.event["total"]["tags"] = len(self.event["tags"])
    self.event["namespaces"] = list(set(self.event["namespaces"]))
    self.event["emitted_content"] = list(set(self.emitted_files))

    # Extract and add Indicators of Compromise (IOCs)
    self.add_iocs(extract_iocs_from_string(data.decode("utf-8")))

Features

The features of this scanner are detailed below. These features represent the capabilities and the type of analysis the scanner can perform. This may include support for Indicators of Compromise (IOC), the ability to emit files for further analysis, and the presence of extended documentation for complex analysis techniques.

Feature           Support
IOC Support
Emit Files
Extended Docs
Malware Scanner
Image Thumbnails

Tastes

Strelka's file distribution system assigns scanners to files based on 'flavors' and 'tastes'. Flavors describe the type of file, typically determined by MIME types from libmagic, matches from YARA rules, or characteristics of parent files. Tastes are the criteria used within Strelka to determine which scanners are applied to which files, with positive and negative tastes defining files to be included or excluded respectively.
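The idea of positive and negative tastes can be illustrated with a deliberately simplified matcher. This is not Strelka's actual distribution code, which is not shown on this page; the function name scanner_applies, the rule that any negative match wins, and the taste lists below are all assumptions made for the sake of the example.

```python
def scanner_applies(flavors, positive, negative):
    """Decide whether a scanner should run on a file with the given flavors.

    Simplified assumption: any negative match excludes the file; otherwise
    at least one positive match is required.
    """
    flavor_set = set(flavors)
    if flavor_set & set(negative):
        return False
    return bool(flavor_set & set(positive))


# Hypothetical taste configuration, for illustration only.
positive_tastes = ["text/xml", "xml_file"]
negative_tastes = ["hypothetical_excluded_flavor"]

print(scanner_applies(["text/xml"], positive_tastes, negative_tastes))
print(scanner_applies(["text/xml", "hypothetical_excluded_flavor"], positive_tastes, negative_tastes))
print(scanner_applies(["image/png"], positive_tastes, negative_tastes))
```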

Source Filetype    Include / Exclude
application/xml
mso_file
soap_file
text/xml
xml_file

Scanner Fields

This section provides a list of fields that are extracted from the files processed by this scanner. These fields include the data elements that the scanner extracts from each file, representing the analytical results produced by the scanner. If the test file is missing or cannot be parsed, this section will not contain any data.

Field Name        Field Type
doc_type          str
elapsed           str
emitted_content   str
emitted_content   list
flags             list
iocs              str
namespaces        str
tag_data          str
tags              str
total             dict
total.extracted   int
total.tags        int
version           str

Sample Event

Below is a sample event generated by this scanner, demonstrating the kind of output that can be expected when it processes a file. This sample is derived from a mock scan event configured in the scanner's test file. If no test file is available, this section will not display a sample event.

    test_scan_event = {
        "elapsed": 0.001,
        "flags": [],
        "tags": unordered(["book", "author", "price", "year", "title"]),
        "tag_data": unordered(
            [
                {"tag": "category", "content": "{'category': 'science'}"},
                {"tag": "category", "content": "{'category': 'science'}"},
            ]
        ),
        "namespaces": unordered(["http://example.com/books"]),
        "total": {"tags": 5, "extracted": 0},
        "doc_type": '<!DOCTYPE bookstore SYSTEM "bookstore.dtd">',
        "version": "1.0",
        "emitted_content": [],
        "iocs": unordered(
            [
                {"ioc": "example.com", "ioc_type": "domain", "scanner": "ScanXml"},
                {"ioc": "www.w3.org", "ioc_type": "domain", "scanner": "ScanXml"},
            ]
        ),
    }
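A few properties of the sample event follow directly from the source listing and can be reproduced in isolation. The sketch below replays the tag-splitting and de-duplication steps on hand-written qualified names (not the real test fixture, which is not shown here) and uses a plain dict in place of an lxml attribute mapping; the helper name split_tag is hypothetical.

```python
def split_tag(qualified_name: str):
    """Replays the partition logic from the _recurse_node listing."""
    namespace, _, tag = qualified_name.partition("}")
    namespace = namespace[1:] if namespace.startswith("{") else ""
    return namespace, tag.lower()


# Hand-written names in ElementTree-style "{namespace}tag" notation.
names = [
    "{http://example.com/books}Book",
    "{http://example.com/books}book",
    "bookstore",  # no namespace prefix
]

tags, namespaces = [], []
for name in names:
    namespace, tag = split_tag(name)
    if tag:
        tags.append(tag)
    if namespace:
        namespaces.append(namespace)

# The scanner de-duplicates both lists with set() before reporting.
unique_tags = sorted(set(tags))
unique_namespaces = sorted(set(namespaces))

# tag_data stores the string form of the whole attribute mapping,
# once per matching attribute, and is not de-duplicated.
content = str({"category": "science"})

print(unique_tags, unique_namespaces, content)
```

With the partition logic as listed, a name without a namespace prefix yields an empty tag and is therefore not recorded, while repeated tags collapse into one entry. The tag_data list keeps duplicates, which is consistent with the two identical category entries in the sample.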