Collects metadata and extracts embedded files from XML files.
This scanner parses XML files to collect metadata and extract embedded files based on specified tags.
It is used in forensic and malware analysis to extract and analyze structured data within XML documents.
Scanner Type: Collection
Options
extract_tags (list[str]): Tags whose content is extracted as child files.
metadata_tags (list[str]): Tags whose content is logged as metadata.
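As a rough illustration of how these two options might be supplied and normalized, the sketch below builds an options dictionary and lower-cases each configured tag so that matching is case-insensitive, mirroring the normalization in the scanner source. The tag names (`Target`, `Script`, `Author`) and the helper `normalize_options` are hypothetical and are not defaults shipped with Strelka.

```python
# Hypothetical option values; the tag names are illustrative only.
options = {
    "extract_tags": ["Target", "Script"],
    "metadata_tags": ["Author"],
}


def normalize_options(options: dict) -> dict:
    """Lower-case configured tags so matching is case-insensitive.

    Missing options fall back to empty lists.
    """
    return {
        "extract_tags": [tag.lower() for tag in options.get("extract_tags", [])],
        "metadata_tags": [tag.lower() for tag in options.get("metadata_tags", [])],
    }


xml_options = normalize_options(options)
print(xml_options)
```

With both options omitted, the helper returns two empty lists, so nothing would be extracted or logged as tag metadata.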
class ScanXml(strelka.Scanner):
    """
    Collects metadata and extracts embedded files from XML files.

    This scanner parses XML files to collect metadata and extract embedded files based on specified tags.
    It is used in forensic and malware analysis to extract and analyze structured data within XML documents.

    Scanner Type: Collection

    Attributes:
        None

    Options:
        extract_tags (list[str]): Tags whose content is extracted as child files.
        metadata_tags (list[str]): Tags whose content is logged as metadata.

    ## Detection Use Cases
    !!! info "Detection Use Cases"
        - **Embedded File Extraction**
            - Extracts files embedded within specific XML tags.
        - **Metadata Extraction**:
            - Collects metadata from specific XML tags.

    ## Known Limitations
    !!! warning "Known Limitations"
        - Complex or malformed XML structures might lead to incomplete parsing or errors.
        - Excessive files may be scanned / collected if XML mimetypes are set in the `backend.yml`

    ## To Do
    !!! question "To Do"
        - Improve error handling for malformed XML structures.
        - Better extraction of tags / metadata tags

    ## References
    !!! quote "References"
        - XML File Format Specification (https://www.w3.org/XML/)

    ## Contributors
    !!! example "Contributors"
        - [Josh Liburdi](https://github.com/jshlbrd)
        - [Paul Hutelmyer](https://github.com/phutelmyer)
        - [Sara Kalupa](https://github.com/skalupa)

    """

    def scan(
        self, data: bytes, file: strelka.File, options: dict, expire_at: int
    ) -> None:
        """
        Parses XML data to extract metadata and files.

        Args:
            data: XML data as bytes.
            file: File object containing metadata about the scan.
            options: Dictionary of scanner options.
            expire_at: Time when the scan should be considered expired.

        Scans the XML file, extracting data and metadata based on the specified tags,
        and emits files as necessary.

        If given file is not a XML file, then the scanner will append a flag denoting this and exit
        """
        # Prepare options with case-insensitive tag matching
        xml_options = {
            "extract_tags": [tag.lower() for tag in options.get("extract_tags", [])],
            "metadata_tags": [tag.lower() for tag in options.get("metadata_tags", [])],
        }

        # Initialize scan event data
        self.event["tags"] = []
        self.event["tag_data"] = []
        self.event["namespaces"] = []
        self.event["total"] = {"tags": 0, "extracted": 0}
        self.emitted_files: list[str] = []

        # Parse the XML content
        try:
            xml_buffer = data
            if xml_buffer.startswith(b"<?XML"):
                xml_buffer = b"<?xml" + xml_buffer[5:]
            xml = etree.fromstring(xml_buffer)
            docinfo = xml.getroottree().docinfo
            self.event["doc_type"] = docinfo.doctype if docinfo.doctype else ""
            self.event["version"] = docinfo.xml_version if docinfo.xml_version else ""

            # Recursively process each node in the XML
            self._recurse_node(xml, xml_options)

        except Exception as e:
            # If file given is not an XML file, do not proceed with ScanXML
            if "text/xml" not in file.flavors.get("mime", []):
                self.flags.append(
                    f"{self.__class__.__name__}: xml_file_format_error: File given to ScanXML is not an XML file, "
                    f"scanner did not run."
                )
            else:
                self.flags.append(
                    f"{self.__class__.__name__}: xml_parsing_error: Unable to scan XML file with error: {e}."
                )
            return

        # Finalize the event data for reporting
        self.event["tags"] = list(set(self.event["tags"]))
        self.event["total"]["tags"] = len(self.event["tags"])
        self.event["namespaces"] = list(set(self.event["namespaces"]))
        self.event["emitted_content"] = list(set(self.emitted_files))

        # Extract and add Indicators of Compromise (IOCs)
        self.add_iocs(extract_iocs_from_string(data.decode("utf-8")))

    def _recurse_node(self, node: etree._Element, xml_options: Dict[str, Any]) -> None:
        """
        Recursively processes each XML node to extract data and metadata.

        Args:
            node: The current XML node to process.
            xml_options: Options for data extraction and metadata logging.

        Iterates through XML nodes, extracting data and collecting metadata as specified
        by the scanner options.
        """
        if node is not None and hasattr(node.tag, "__getitem__"):
            namespace, _, tag = node.tag.partition("}")
            namespace = namespace[1:] if namespace.startswith("{") else ""
            tag = tag.lower()

            if tag:
                self.event["tags"].append(tag)
            if namespace:
                self.event["namespaces"].append(namespace)

            # Handle specific content extraction and emission
            if tag in xml_options["extract_tags"]:
                content = node.text.strip() if node.text else ""
                if content:
                    self.emit_file(content, name=tag)
                    self.emitted_files.append(content)
                    self.event["total"]["extracted"] += 1

            # Always process attributes to capture any relevant metadata or data for emission
            self._process_attributes(node, xml_options, tag)

            # Continue to recurse through child nodes to extract data
            for child in node.getchildren():
                self._recurse_node(child, xml_options)

    def _process_attributes(
        self, node: etree._Element, xml_options: Dict[str, Any], tag: str
    ) -> None:
        """
        Processes XML node attributes to extract or log data.

        Args:
            node: XML node whose attributes are being processed.
            xml_options: Configuration options for the scan.
            tag: The tag of the current XML node being processed.

        Extracts data from attributes specified in the extract_tags list and logs data
        from attributes specified in the metadata_tags list.
        """
        for attr_name, attr_value in node.attrib.items():
            attr_name_lower = attr_name.lower()
            if attr_name_lower in xml_options["metadata_tags"]:
                self.event["tag_data"].append(
                    {"tag": attr_name, "content": str(node.attrib)}
                )
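The recursive walk in the listing can be tried outside Strelka with a small standalone sketch. The version below is an approximation, not the scanner itself: it swaps `lxml` for the standard library's `xml.etree.ElementTree`, collects results in a plain dictionary instead of `self.event`, and records extracted text instead of emitting child files. The helper names (`split_tag`, `walk`, `summarize`) and the sample document are hypothetical. The sketch also falls back to the bare element name when no `{namespace}` prefix is present, which is a choice made here.

```python
import xml.etree.ElementTree as ET


def split_tag(qualified: str) -> tuple[str, str]:
    """Split '{namespace}Tag' into (namespace, lower-cased tag)."""
    namespace, _, tag = qualified.partition("}")
    if namespace.startswith("{"):
        return namespace[1:], tag.lower()
    # No namespace prefix: partition leaves the whole name in the first slot.
    return "", qualified.lower()


def walk(node: ET.Element, extract_tags: set[str], result: dict) -> None:
    """Record the node's tag and namespace, collect matching text, then recurse."""
    namespace, tag = split_tag(node.tag)
    result["tags"].add(tag)
    if namespace:
        result["namespaces"].add(namespace)
    if tag in extract_tags:
        content = (node.text or "").strip()
        if content:
            result["extracted"].append((tag, content))
    for child in node:
        walk(child, extract_tags, result)


def summarize(data: bytes, extract_tags: list[str]) -> dict:
    """Parse XML bytes and return the unique tags, namespaces, and extracted text."""
    result = {"tags": set(), "namespaces": set(), "extracted": []}
    root = ET.fromstring(data)
    walk(root, {t.lower() for t in extract_tags}, result)
    return result


# Hypothetical sample document with one namespaced element.
sample = (
    b'<doc xmlns:x="urn:example">'
    b"<x:Payload> aGVsbG8= </x:Payload>"
    b"<Note>plain</Note>"
    b"</doc>"
)
summary = summarize(sample, ["payload"])
print(summary)
```

Sets are used for tags and namespaces so that duplicates collapse during the walk, which corresponds to the `list(set(...))` finalization step in the listing.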
The features of this scanner are detailed below. These features represent the capabilities and the type of analysis the scanner can perform. This may include support for Indicators of Compromise (IOC), the ability to emit files for further analysis, and the presence of extended documentation for complex analysis techniques.
Strelka's file distribution system assigns scanners to files based on 'flavors' and 'tastes'. Flavors describe the type of file, typically determined by MIME types from libmagic, matches from YARA rules, or characteristics of parent files. Tastes are the criteria used within Strelka to determine which scanners are applied to which files, with positive and negative tastes defining files to be included or excluded respectively.
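The relationship between flavors and tastes can be pictured with a deliberately simplified model. The sketch below is not Strelka's assignment logic: it assumes a file's flavors are a flat set of strings, that any negative taste match excludes the file, and that otherwise any positive match includes it. The function `scanner_applies` and the flavor `example_excluded_flavor` are hypothetical.

```python
def scanner_applies(
    file_flavors: set[str], positive: set[str], negative: set[str]
) -> bool:
    """Decide whether a scanner should receive a file (simplified model).

    Assumption: a negative taste match wins over any positive match.
    """
    if file_flavors & negative:
        return False
    return bool(file_flavors & positive)


# Hypothetical tastes for an XML scanner.
positive_tastes = {"text/xml", "application/xml"}
negative_tastes = {"example_excluded_flavor"}

print(scanner_applies({"text/xml"}, positive_tastes, negative_tastes))
print(scanner_applies({"text/xml", "example_excluded_flavor"}, positive_tastes, negative_tastes))
print(scanner_applies({"image/png"}, positive_tastes, negative_tastes))
```

In this model the first call includes the file, the second excludes it because of the negative taste, and the third excludes it because no positive taste matches.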
This section provides a list of fields that are extracted from the files processed by this scanner. These fields include the data elements that the scanner extracts from each file, representing the analytical results produced by the scanner. If the test file is missing or cannot be parsed, this section will not contain any data.
Below is a sample event generated by this scanner, demonstrating the kind of output that can be expected when it processes a file. This sample is derived from a mock scan event configured in the scanner's test file. If no test file is available, this section will not display a sample event.