Published February 13, 2023 | Version v1
Dataset Open

Passive Operating System Fingerprinting Revisited - Network Flows Dataset

Description

To evaluate OS fingerprinting methods, we need a dataset that meets the following requirements:

  • First, the dataset needs to be big enough to capture the variability of the data. In this case, we need many connections from different operating systems.
  • Second, the dataset needs to be annotated, which means that the corresponding operating system needs to be known for each network connection captured in the dataset. Therefore, we cannot just capture any network traffic for our dataset; we need to be able to determine the OS reliably.

To meet these requirements, we created the dataset from the traffic of several web servers at our university. This satisfies the first requirement, because the traces come from thousands of devices ranging from user computers and mobile phones to web crawlers and other servers. The ground truth values are obtained from the HTTP User-Agent, which satisfies the second requirement. Even though most traffic is encrypted, the User-Agent can be recovered from the web server logs, which record the details of every connection. By correlating the IP address and timestamp of each log record with the captured traffic, we can add the ground truth to the dataset.
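The correlation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the record keys (`src_ip`, `start`, `end`, `client_ip`, `time`, `user_agent`), the one-second tolerance, and the sample values are all assumptions made for the example.

```python
from datetime import datetime, timedelta

# Assumed tolerance between a flow's lifetime and the log timestamp.
TOLERANCE = timedelta(seconds=1)


def annotate_flows(flows, web_logs, tolerance=TOLERANCE):
    """Attach the User-Agent of a matching web log record to each flow."""
    # Index log records by client IP so each flow only scans its own client.
    logs_by_ip = {}
    for log in web_logs:
        logs_by_ip.setdefault(log["client_ip"], []).append(log)

    annotated = []
    for flow in flows:
        for log in logs_by_ip.get(flow["src_ip"], []):
            # A log record matches when its timestamp falls inside the
            # flow's lifetime, widened by the tolerance on both sides.
            if flow["start"] - tolerance <= log["time"] <= flow["end"] + tolerance:
                annotated.append({**flow, "user_agent": log["user_agent"]})
                break
    return annotated  # flows without a matching log record are dropped


# Made-up sample records (documentation IP range, illustrative timestamps).
t0 = datetime(2023, 1, 1, 8, 0, 0)
flows = [
    {"src_ip": "192.0.2.10", "start": t0, "end": t0 + timedelta(seconds=3)},
    {"src_ip": "192.0.2.20", "start": t0, "end": t0 + timedelta(seconds=1)},
]
web_logs = [
    {"client_ip": "192.0.2.10", "time": t0 + timedelta(seconds=2),
     "user_agent": "ExampleBrowser/1.0"},
]
annotated = annotate_flows(flows, web_logs)
```

Only the first flow has a log record from the same address within its time window, so only that flow receives a label.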

For this dataset, we have selected a cluster of five web servers that host 475 unique university domains for public websites. The monitoring point recording the traffic was placed at the backbone network connecting the university to the Internet.

The dataset used in this paper was collected from approximately 8 hours of university web traffic during a single workday. The logs were collected from Microsoft IIS web servers and converted from the W3C extended logging format to JSON. These logs are referred to as web logs. They are used to annotate the records generated from the packet capture, which was obtained by a network probe tapped into the link to the Internet.
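A conversion from the W3C extended logging format to JSON could look like the sketch below. It relies only on the general layout of that format (a `#Fields:` directive naming the columns, space-separated values, `-` for an empty value). The function name, the sample lines, and the choice to emit one JSON object per line are assumptions; the converter actually used for the dataset is not described.

```python
import json


def w3c_to_json_lines(lines):
    """Yield one JSON object per data line of a W3C extended log."""
    fields = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("#Fields:"):
            # The directive lists the column names for the lines that follow.
            fields = line[len("#Fields:"):].split()
            continue
        if line.startswith("#"):
            continue  # other directives (#Software, #Version, #Date, ...)
        values = line.split(" ")
        if len(values) != len(fields):
            continue  # skip malformed lines and data seen before #Fields
        record = {
            name: (None if value == "-" else value)
            for name, value in zip(fields, values)
        }
        yield json.dumps(record)


# Made-up sample log; IIS typically writes spaces in the User-Agent as "+".
sample = [
    "#Software: Microsoft Internet Information Services 10.0",
    "#Fields: date time c-ip cs-method cs-uri-stem cs(User-Agent) sc-status",
    "2023-01-01 08:00:00 192.0.2.10 GET /index.html "
    "Mozilla/5.0+(X11;+Linux+x86_64) 200",
]
records = [json.loads(text) for text in w3c_to_json_lines(sample)]
```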

The entire dataset creation process consists of seven steps:

  1. The packet capture was processed by the Flowmon flow exporter (https://www.flowmon.com) to obtain primary flow data containing information from TLS and HTTP protocols.
  2. Additional statistical features were extracted using GoFlows flow exporter (https://github.com/CN-TU/go-flows).
  3. The primary flows were filtered to remove incomplete records and network scans.
  4. The flows from both exporters were merged into records containing fields from both sources.
  5. Web logs were filtered to cover the same time frame as the flow records.
  6. Web logs were paired with the flow records based on shared properties (IP address, port, time).
  7. Finally, the User-Agent values were converted into operating system information using the Python version of the open-source tool ua-parser (https://github.com/ua-parser/uap-python). The unstructured User-Agent string in each record was replaced with the resulting OS.
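The last step can be sketched with the `ParseOS` function of uap-python, which returns a dictionary with the keys `family`, `major`, `minor`, `patch`, and `patch_minor`. The mapping of those keys to the dataset's `UA OS ...` columns, the `user_agent` record key, and both helper functions are assumptions made for this illustration, not the authors' code.

```python
def parse_os(user_agent):
    """Parse the OS from a User-Agent string (needs: pip install ua-parser)."""
    # Imported lazily because ua-parser is a third-party package.
    from ua_parser import user_agent_parser
    return user_agent_parser.ParseOS(user_agent)


# Assumed mapping from ParseOS keys to the dataset's column names.
COLUMN_FOR_KEY = {
    "family": "UA OS family",
    "major": "UA OS major",
    "minor": "UA OS minor",
    "patch": "UA OS patch",
    "patch_minor": "UA OS patch minor",
}


def replace_user_agent(record, parsed_os):
    """Return a copy of record with the raw User-Agent replaced by OS columns."""
    result = {key: value for key, value in record.items() if key != "user_agent"}
    for key, column in COLUMN_FOR_KEY.items():
        result[column] = parsed_os.get(key)
    return result
```

With ua-parser installed, a record would be converted with `replace_user_agent(record, parse_os(record["user_agent"]))`. Dropping the raw string keeps only the structured OS label in the record.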

The collected and enriched flows contain 111 data fields that can be used as features for OS fingerprinting or any other data analyses. The fields grouped by their area are listed below:

  • basic flow properties - flow_ID;start;end;L3 PROTO;L4 PROTO;BYTES A;PACKETS A;SRC IP;DST IP;TCP flags A;SRC port;DST port;packetTotalCountforward;packetTotalCountbackward;flowDirection;flowEndReason;
  • IP parameters - IP ToS;maximumTTLforward;maximumTTLbackward;IPv4DontFragmentforward;IPv4DontFragmentbackward;
  • TCP parameters - TCP SYN Size;TCP Win Size;TCP SYN TTL;tcpTimestampFirstPacketbackward;tcpOptionWindowScaleforward;tcpOptionWindowScalebackward;tcpOptionSelectiveAckPermittedforward;tcpOptionSelectiveAckPermittedbackward;tcpOptionMaximumSegmentSizeforward;tcpOptionMaximumSegmentSizebackward;tcpOptionNoOperationforward;tcpOptionNoOperationbackward;synAckFlag;tcpTimestampFirstPacketforward;
  • HTTP - HTTP Request Host;URL;
  • User-agent - UA OS family;UA OS major;UA OS minor;UA OS patch;UA OS patch minor;
  • TLS - TLS_CONTENT_TYPE;TLS_HANDSHAKE_TYPE;TLS_SETUP_TIME;TLS_SERVER_VERSION;TLS_SERVER_RANDOM;TLS_SERVER_SESSION_ID;TLS_CIPHER_SUITE;TLS_ALPN;TLS_SNI;TLS_SNI_LENGTH;TLS_CLIENT_VERSION;TLS_CIPHER_SUITES;TLS_CLIENT_RANDOM;TLS_CLIENT_SESSION_ID;TLS_EXTENSION_TYPES;TLS_EXTENSION_LENGTHS;TLS_ELLIPTIC_CURVES;TLS_EC_POINT_FORMATS;TLS_CLIENT_KEY_LENGTH;TLS_ISSUER_CN;TLS_SUBJECT_CN;TLS_SUBJECT_ON;TLS_VALIDITY_NOT_BEFORE;TLS_VALIDITY_NOT_AFTER;TLS_SIGNATURE_ALG;TLS_PUBLIC_KEY_ALG;TLS_PUBLIC_KEY_LENGTH;TLS_JA3_FINGERPRINT;
  • Packet timings - NPM_CLIENT_NETWORK_TIME;NPM_SERVER_NETWORK_TIME;NPM_SERVER_RESPONSE_TIME;NPM_ROUND_TRIP_TIME;NPM_RESPONSE_TIMEOUTS_A;NPM_RESPONSE_TIMEOUTS_B;NPM_TCP_RETRANSMISSION_A;NPM_TCP_RETRANSMISSION_B;NPM_TCP_OUT_OF_ORDER_A;NPM_TCP_OUT_OF_ORDER_B;NPM_JITTER_DEV_A;NPM_JITTER_AVG_A;NPM_JITTER_MIN_A;NPM_JITTER_MAX_A;NPM_DELAY_DEV_A;NPM_DELAY_AVG_A;NPM_DELAY_MIN_A;NPM_DELAY_MAX_A;NPM_DELAY_HISTOGRAM_1_A;NPM_DELAY_HISTOGRAM_2_A;NPM_DELAY_HISTOGRAM_3_A;NPM_DELAY_HISTOGRAM_4_A;NPM_DELAY_HISTOGRAM_5_A;NPM_DELAY_HISTOGRAM_6_A;NPM_DELAY_HISTOGRAM_7_A;NPM_JITTER_DEV_B;NPM_JITTER_AVG_B;NPM_JITTER_MIN_B;NPM_JITTER_MAX_B;NPM_DELAY_DEV_B;NPM_DELAY_AVG_B;NPM_DELAY_MIN_B;NPM_DELAY_MAX_B;NPM_DELAY_HISTOGRAM_1_B;NPM_DELAY_HISTOGRAM_2_B;NPM_DELAY_HISTOGRAM_3_B;NPM_DELAY_HISTOGRAM_4_B;NPM_DELAY_HISTOGRAM_5_B;NPM_DELAY_HISTOGRAM_6_B;NPM_DELAY_HISTOGRAM_7_B;
  • ICMP - ICMP TYPE;
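Selecting one feature group together with the OS label might look like the following sketch. It assumes that the records are stored as semicolon-delimited text with a header row using the field names listed above. That layout is an assumption, as are the chosen three TCP columns and the sample values, so the file should be checked before relying on it.

```python
import csv
import io

# A small subset of the TCP parameters listed above, plus the label column.
TCP_FEATURES = ["TCP SYN Size", "TCP Win Size", "TCP SYN TTL"]
LABEL = "UA OS family"


def load_features(handle, features=TCP_FEATURES, label=LABEL):
    """Return (feature values, label) pairs from a semicolon-delimited file."""
    reader = csv.DictReader(handle, delimiter=";")
    return [([row[name] for name in features], row[label]) for row in reader]


# Made-up two-row sample standing in for the real file.
sample = io.StringIO(
    "TCP SYN Size;TCP Win Size;TCP SYN TTL;UA OS family\n"
    "60;65535;64;Linux\n"
    "52;64240;128;Windows\n"
)
pairs = load_features(sample)
```

The values are returned as strings, so numeric features still need to be converted before they are passed to a classifier.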

The details of OS distribution grouped by the OS family are summarized in the table below. The Other OS family contains records generated by web crawling bots that do not include OS information in the User-Agent.

OS Family     Number of flows
Other         42474
Windows       40349
Android       10290
iOS           8840
Mac OS X      5324
Linux         1589
Ubuntu        653
Fedora        88
Chrome OS     53
Symbian OS    1
Slackware     1
Linux Mint    1
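The counts in the table sum to 109663 flows. The short sketch below derives each family's share of that total, which makes the class imbalance visible before any classifier is trained. The variable names are illustrative.

```python
# Flow counts per OS family, copied from the table above.
flows_per_os = {
    "Other": 42474,
    "Windows": 40349,
    "Android": 10290,
    "iOS": 8840,
    "Mac OS X": 5324,
    "Linux": 1589,
    "Ubuntu": 653,
    "Fedora": 88,
    "Chrome OS": 53,
    "Symbian OS": 1,
    "Slackware": 1,
    "Linux Mint": 1,
}

total = sum(flows_per_os.values())
shares = {family: count / total for family, count in flows_per_os.items()}

# Print the families with their counts and their fraction of all flows.
for family, count in flows_per_os.items():
    print(f"{family:<12}{count:>7}{shares[family]:>8.1%}")
```

Three of the families are represented by a single flow each, so they cannot be split into training and test sets without merging or excluding them.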

 

Files

anonymized_flows.zip (22.9 MB)
md5:9fdda42ef8c1312b6613ddb7489039e1