ScanXml
Collects metadata and extracts embedded files from XML files.
This scanner parses XML files to collect metadata and extract embedded files based on specified tags.
It is used in forensic and malware analysis to extract and analyze structured data within XML documents.
Scanner Type: Collection
Options
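extract_tags (list[str]): Tag names, matched case-insensitively, whose text content is emitted as child files.
metadata_tags (list[str]): Attribute names, matched case-insensitively; when an element carries a matching attribute, its attributes are logged under tag_data.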
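As a minimal sketch of how these options are consumed, the snippet below mirrors the lowercasing step that appears at the top of `scan()`. The option values are hypothetical examples chosen for illustration, not defaults shipped with Strelka.

```python
# Hypothetical option values; the tag and attribute names are illustrative only.
options = {
    "extract_tags": ["Script", "BinaryData"],
    "metadata_tags": ["Category"],
}

# Mirror the scanner's normalization: every configured name is lowercased so
# that later comparisons against lowercased names are case-insensitive.
xml_options = {
    "extract_tags": [tag.lower() for tag in options.get("extract_tags", [])],
    "metadata_tags": [tag.lower() for tag in options.get("metadata_tags", [])],
}

print(xml_options)
```

Because `options.get(..., [])` is used, leaving either option out simply yields an empty list rather than an error.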
Detection Use Cases
- Embedded File Extraction
  - Extracts files embedded within specific XML tags.
- Metadata Extraction
  - Collects metadata from specific XML tags.
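The embedded-content use case can be sketched with the standard library. This is a simplified illustration, not the scanner's own logic: it swaps in `xml.etree.ElementTree` for lxml, and `collect_tag_text` is a hypothetical helper name introduced only for this example.

```python
import xml.etree.ElementTree as ET


def collect_tag_text(xml_bytes: bytes, extract_tags: list[str]) -> list[str]:
    """Return the stripped text of every element whose local name is listed."""
    wanted = {tag.lower() for tag in extract_tags}
    root = ET.fromstring(xml_bytes)
    found = []
    for element in root.iter():
        # Namespaced tags look like "{uri}local"; keep only the local part.
        local_name = element.tag.rpartition("}")[2].lower()
        text = (element.text or "").strip()
        if local_name in wanted and text:
            found.append(text)
    return found


sample = b"<root><Payload> aGVsbG8= </Payload><note>ignored</note></root>"
print(collect_tag_text(sample, ["payload"]))
```

In the real scanner each matching text value would be handed to `emit_file` for further scanning; the sketch only returns the values.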
Known Limitations
- Complex or malformed XML structures might lead to incomplete parsing or errors.
- Excessive files may be scanned or collected if XML MIME types are set in backend.yml.
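To illustrate the first limitation, here is a small sketch of how a malformed document surfaces as a parse error that can be turned into a flag-style message. It uses the standard library parser as a stand-in for lxml, and `try_parse` is a hypothetical helper, not part of Strelka.

```python
import xml.etree.ElementTree as ET


def try_parse(xml_bytes: bytes):
    """Return (root, None) on success or (None, message) on failure."""
    try:
        return ET.fromstring(xml_bytes), None
    except ET.ParseError as error:
        # Keep the parser's message so the caller can record it as a flag.
        return None, f"xml_parsing_error: {error}"


root, problem = try_parse(b"<a><b></a>")  # mismatched closing tag
print(problem)
```

Note that parsing is all-or-nothing in this sketch: a single structural error means no tags are collected from the document at all.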
To Do
- Improve error handling for malformed XML structures.
- Improve extraction of tags and metadata tags.
References
- XML File Format Specification (https://www.w3.org/XML/)
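Contributors
- Josh Liburdi (https://github.com/jshlbrd)
- Paul Hutelmyer (https://github.com/phutelmyer)
- Sara Kalupa (https://github.com/skalupa)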
Source code in strelka/src/python/strelka/scanners/scan_xml.py
class ScanXml(strelka.Scanner):
    """
    Collects metadata and extracts embedded files from XML files.

    This scanner parses XML files to collect metadata and extract embedded files based on specified tags.
    It is used in forensic and malware analysis to extract and analyze structured data within XML documents.

    Scanner Type: Collection

    Attributes:
        None

    Options:
        extract_tags (list[str]): Tags whose content is extracted as child files.
        metadata_tags (list[str]): Tags whose content is logged as metadata.

    ## Detection Use Cases
    !!! info "Detection Use Cases"
        - **Embedded File Extraction**
            - Extracts files embedded within specific XML tags.
        - **Metadata Extraction**:
            - Collects metadata from specific XML tags.

    ## Known Limitations
    !!! warning "Known Limitations"
        - Complex or malformed XML structures might lead to incomplete parsing or errors.
        - Excessive files may be scanned / collected if XML mimetypes are set in the `backend.yml`

    ## To Do
    !!! question "To Do"
        - Improve error handling for malformed XML structures.
        - Better extraction of tags / metadata tags

    ## References
    !!! quote "References"
        - XML File Format Specification (https://www.w3.org/XML/)

    ## Contributors
    !!! example "Contributors"
        - [Josh Liburdi](https://github.com/jshlbrd)
        - [Paul Hutelmyer](https://github.com/phutelmyer)
        - [Sara Kalupa](https://github.com/skalupa)
    """

    def scan(
        self, data: bytes, file: strelka.File, options: dict, expire_at: int
    ) -> None:
        """
        Parses XML data to extract metadata and files.

        Args:
            data: XML data as bytes.
            file: File object containing metadata about the scan.
            options: Dictionary of scanner options.
            expire_at: Time when the scan should be considered expired.

        Scans the XML file, extracting data and metadata based on the specified tags,
        and emits files as necessary.

        If given file is not a XML file, then the scanner will append a flag denoting this and exit
        """
        # Prepare options with case-insensitive tag matching
        xml_options = {
            "extract_tags": [tag.lower() for tag in options.get("extract_tags", [])],
            "metadata_tags": [tag.lower() for tag in options.get("metadata_tags", [])],
        }

        # Initialize scan event data
        self.event["tags"] = []
        self.event["tag_data"] = []
        self.event["namespaces"] = []
        self.event["total"] = {"tags": 0, "extracted": 0}
        self.emitted_files: list[str] = []

        # Parse the XML content
        try:
            xml_buffer = data
            if xml_buffer.startswith(b"<?XML"):
                xml_buffer = b"<?xml" + xml_buffer[5:]
            xml = etree.fromstring(xml_buffer)
            docinfo = xml.getroottree().docinfo
            self.event["doc_type"] = docinfo.doctype if docinfo.doctype else ""
            self.event["version"] = docinfo.xml_version if docinfo.xml_version else ""

            # Recursively process each node in the XML
            self._recurse_node(xml, xml_options)

        except Exception as e:
            # If file given is not an XML file, do not proceed with ScanXML
            if "text/xml" not in file.flavors.get("mime", []):
                self.flags.append(
                    f"{self.__class__.__name__}: xml_file_format_error: File given to ScanXML is not an XML file, "
                    f"scanner did not run."
                )
            else:
                self.flags.append(
                    f"{self.__class__.__name__}: xml_parsing_error: Unable to scan XML file with error: {e}."
                )
            return

        # Finalize the event data for reporting
        self.event["tags"] = list(set(self.event["tags"]))
        self.event["total"]["tags"] = len(self.event["tags"])
        self.event["namespaces"] = list(set(self.event["namespaces"]))
        self.event["emitted_content"] = list(set(self.emitted_files))

        # Extract and add Indicators of Compromise (IOCs)
        self.add_iocs(extract_iocs_from_string(data.decode("utf-8")))

    def _recurse_node(self, node: etree._Element, xml_options: Dict[str, Any]) -> None:
        """
        Recursively processes each XML node to extract data and metadata.

        Args:
            node: The current XML node to process.
            xml_options: Options for data extraction and metadata logging.

        Iterates through XML nodes, extracting data and collecting metadata as specified
        by the scanner options.
        """
        if node is not None and hasattr(node.tag, "__getitem__"):
            namespace, _, tag = node.tag.partition("}")
            namespace = namespace[1:] if namespace.startswith("{") else ""
            tag = tag.lower()

            if tag:
                self.event["tags"].append(tag)
            if namespace:
                self.event["namespaces"].append(namespace)

            # Handle specific content extraction and emission
            if tag in xml_options["extract_tags"]:
                content = node.text.strip() if node.text else ""
                if content:
                    self.emit_file(content, name=tag)
                    self.emitted_files.append(content)
                    self.event["total"]["extracted"] += 1

            # Always process attributes to capture any relevant metadata or data for emission
            self._process_attributes(node, xml_options, tag)

            # Continue to recurse through child nodes to extract data
            for child in node.getchildren():
                self._recurse_node(child, xml_options)

    def _process_attributes(
        self, node: etree._Element, xml_options: Dict[str, Any], tag: str
    ) -> None:
        """
        Processes XML node attributes to extract or log data.

        Args:
            node: XML node whose attributes are being processed.
            xml_options: Configuration options for the scan.
            tag: The tag of the current XML node being processed.

        Extracts data from attributes specified in the extract_tags list and logs data
        from attributes specified in the metadata_tags list.
        """
        for attr_name, attr_value in node.attrib.items():
            attr_name_lower = attr_name.lower()
            if attr_name_lower in xml_options["metadata_tags"]:
                self.event["tag_data"].append(
                    {"tag": attr_name, "content": str(node.attrib)}
                )
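The namespace handling in `_recurse_node` can be tried in isolation. The sketch below copies the `partition("}")` logic into a standalone function; `split_tag` is a hypothetical name used only here, and the inputs are illustrative.

```python
def split_tag(qualified_tag: str) -> tuple[str, str]:
    """Split '{namespace}local' into (namespace, lowercased local name)."""
    namespace, _, tag = qualified_tag.partition("}")
    namespace = namespace[1:] if namespace.startswith("{") else ""
    return namespace, tag.lower()


# A namespaced tag yields both parts.
print(split_tag("{http://example.com/books}Book"))

# A tag without a namespace contains no "}", so partition() leaves the
# local-name slot empty and this logic returns two empty strings.
print(split_tag("bookstore"))
```

One consequence worth noting when reading the event output: under this logic an element without a namespace produces an empty tag name, so it would not be added to `tags` or matched against `extract_tags`.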
scan(data, file, options, expire_at)
Parses XML data to extract metadata and files.
Parameters:
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| data | bytes | XML data as bytes. | required |
| file | File | File object containing metadata about the scan. | required |
| options | dict | Dictionary of scanner options. | required |
| expire_at | int | Time when the scan should be considered expired. | required |
Scans the XML file, extracting data and metadata based on the specified tags,
and emits files as necessary.
If the given file is not an XML file, the scanner appends a flag noting this and exits.
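Before parsing, `scan()` rewrites an uppercase `<?XML` declaration to lowercase, since the XML declaration is case-sensitive. A minimal standalone sketch of that step follows; `normalize_declaration` is a hypothetical helper name, not a Strelka function.

```python
def normalize_declaration(data: bytes) -> bytes:
    """Rewrite a leading uppercase '<?XML' to the lowercase '<?xml' form."""
    if data.startswith(b"<?XML"):
        # Replace only the first five bytes; the rest of the buffer is untouched.
        return b"<?xml" + data[5:]
    return data


print(normalize_declaration(b'<?XML version="1.0"?><a/>'))
```

Only this exact five-byte prefix is handled; other casings such as `<?Xml` pass through unchanged in this sketch, as they do in the listed source.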
Features
The features of this scanner are detailed below. These features represent the capabilities and the type of analysis the scanner can perform. This may include support for Indicators of Compromise (IOC), the ability to emit files for further analysis, and the presence of extended documentation for complex analysis techniques.
| Feature | Support |
| --- | --- |
| IOC Support | |
| Emit Files | |
| Extended Docs | |
| Malware Scanner | |
| Image Thumbnails | |
Tastes
Strelka's file distribution system assigns scanners to files based on 'flavors' and 'tastes'. Flavors describe the type of file, typically determined by MIME types from libmagic, matches from YARA rules, or characteristics of parent files. Tastes are the criteria used within Strelka to determine which scanners are applied to which files, with positive and negative tastes defining files to be included or excluded respectively.
| Source Filetype | Include / Exclude |
| --- | --- |
| application/xml | |
| mso_file | |
| soap_file | |
| text/xml | |
| xml_file | |
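The positive and negative taste idea can be sketched as a simple set test. This is a deliberately reduced model and not Strelka's actual assignment code: `is_assigned` is a hypothetical function, and treating the five filetypes above as positive tastes is an assumption made for the example.

```python
def is_assigned(flavors: set[str], positive: set[str], negative: set[str]) -> bool:
    """Hypothetical rule: any negative match excludes, else any positive match includes."""
    if flavors & negative:
        return False
    return bool(flavors & positive)


# Assumed positive tastes for this sketch, taken from the filetypes listed above.
xml_tastes = {"application/xml", "mso_file", "soap_file", "text/xml", "xml_file"}

print(is_assigned({"text/xml"}, xml_tastes, set()))
print(is_assigned({"image/png"}, xml_tastes, set()))
```

A broad positive taste such as a generic XML MIME type sends every matching file to the scanner, which is the situation the Known Limitations section warns about.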
Scanner Fields
This section provides a list of fields that are extracted from the files processed by this scanner. These fields include the data elements that the scanner extracts from each file, representing the analytical results produced by the scanner. If the test file is missing or cannot be parsed, this section will not contain any data.
| Field Name | Field Type |
| --- | --- |
| doc_type | str |
| elapsed | float |
| emitted_content | list |
| flags | list |
| iocs | list |
| namespaces | list |
| tag_data | list |
| tags | list |
| total | dict |
| total.extracted | int |
| total.tags | int |
| version | str |
Sample Event
Below is a sample event generated by this scanner, demonstrating the kind of output that can be expected when it processes a file. This sample is derived from a mock scan event configured in the scanner's test file. If no test file is available, this section will not display a sample event.
test_scan_event = {
    "elapsed": 0.001,
    "flags": [],
    "tags": unordered(["book", "author", "price", "year", "title"]),
    "tag_data": unordered(
        [
            {"tag": "category", "content": "{'category': 'science'}"},
            {"tag": "category", "content": "{'category': 'science'}"},
        ]
    ),
    "namespaces": unordered(["http://example.com/books"]),
    "total": {"tags": 5, "extracted": 0},
    "doc_type": '<!DOCTYPE bookstore SYSTEM "bookstore.dtd">',
    "version": "1.0",
    "emitted_content": [],
    "iocs": unordered(
        [
            {"ioc": "example.com", "ioc_type": "domain", "scanner": "ScanXml"},
            {"ioc": "www.w3.org", "ioc_type": "domain", "scanner": "ScanXml"},
        ]
    ),
}
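In the sample event, each `tag_data` entry stores the element's attributes as a stringified dictionary. A consumer that wants the individual values back could decode them as sketched below. This assumes the `content` field always has the dict-literal shape seen in the sample; `attribute_values` is a hypothetical helper.

```python
import ast

# Entries copied from the sample event above.
tag_data = [
    {"tag": "category", "content": "{'category': 'science'}"},
    {"tag": "category", "content": "{'category': 'science'}"},
]


def attribute_values(entries: list[dict], name: str) -> list[str]:
    """Decode each stringified attribute dict and collect the values for one name."""
    values = []
    for entry in entries:
        # literal_eval only evaluates Python literals, never arbitrary code.
        attributes = ast.literal_eval(entry["content"])
        if name in attributes:
            values.append(attributes[name])
    return values


print(attribute_values(tag_data, "category"))
```

`ast.literal_eval` is used rather than `eval` because the content originates from scanned, untrusted files.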