
Unstructured data

What is unstructured data?

Unstructured data is information that is not arranged according to a predefined data model or schema, although it commonly has a native, internal structure (e.g., an image or audio file). Because it has no pre-set structure, unstructured data is stored in its native format.

Two common types of unstructured data are text data and multimedia (rich media) data. Unstructured data represents the bulk of collected information, and its volume is growing as digital systems produce more of it.

The value of unstructured data comes from the insights that can be garnered from it using advanced analytics, such as machine learning (ML) and artificial intelligence (AI).

Where structured data reports what is happening in statistics and numbers, unstructured data can help explain why it is happening.

Unstructured data vs structured data

Unstructured data	Structured data
Unstructured data is not actively managed in a transactional system.	Structured data is stored and managed in database environments, such as a relational database management system (RDBMS).
Unstructured data is not organized in a clearly defined framework or model.	Structured data is stored in frameworks of columns and rows relating to pre-set parameters.
Unstructured data is stored in non-relational (NoSQL) databases and data lakes.	Structured data is stored in databases with rows and columns (SQL-based), such as a data warehouse and RDBMS.
Unstructured data is usually stored in its native format.	Structured data exists in predefined formats.
Unstructured data is qualitative, identifying patterns and trends that explain why something is happening.	Structured data is quantitative, identifying patterns and trends that explain what is happening.
Unstructured data is difficult to analyze, requiring advanced analytics tools, such as machine learning (ML) and natural language processing (NLP).	Structured data is easy to analyze with simple tools, such as spreadsheets.
Unstructured data is highly scalable and can encompass any data type.	Structured data has less scalability than unstructured data and is limited to fixed data types.
Unstructured data supports predictive analytics.	Structured data supports statistical analytics.

Examples of unstructured data

Broad categories of unstructured data are rich media (i.e., multimedia) and text files. Examples of unstructured data include:

Customer feedback
Emails
Geospatial data (e.g., maps, elevation models, and population data)
Images (e.g., JPG, PNG, and TIFF)
Internet of Things (IoT) data (e.g., sensor data, ticker data, and device data)
Online reviews (e.g., Google Reviews, Yelp, Consumer Reports)
Open-ended survey responses
Satellite imagery
Server, website, and application logs
Social media posts (e.g., Facebook, X, Instagram, TikTok)
Speech, music, and other sound recordings (e.g., MP3, WAV, and FLAC)
Surveillance data (e.g., health, security, and behavioral)
Text files (e.g., doc, pages, RTF, and txt)
Videos (e.g., MP4, AVI, and MOV)
Weather data (e.g., temperature, wind speed, and rainfall)

What is semi-structured data?

Like unstructured data, semi-structured data does not have a pre-set format. However, it has more structure than unstructured data because it includes internal categories, meta tags, and markings. These separate and differentiate the underlying unstructured data into groups, pairings, and hierarchies.

Another similarity is that semi-structured data, like unstructured data, does not fit neatly into the fixed rows and columns of a relational database. Examples of semi-structured data and related data formats include the following.

Email

Email is the most commonly cited example of semi-structured data. It is organized into categories, such as date, sender, recipient, and subject, but the content of the email body or message is unstructured data. In addition, email messages are stored in folders, such as Inbox, Sent, Trash, Spam, or custom folders.
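The split between structured headers and an unstructured body can be illustrated with a minimal sketch using Python's standard `email` module. The sample message, the addresses, and the `split_email` helper are invented for illustration and are not part of any particular product.

```python
from email import message_from_string

# Hypothetical raw message, invented for illustration.
RAW_EMAIL = """\
From: alice@example.com
To: support@example.com
Subject: Order arrived damaged
Date: Tue, 12 Dec 2023 09:30:00 +0000

Hi, the box was crushed when it arrived. Can I get a replacement?
"""


def split_email(raw: str) -> dict:
    """Separate the categorized header fields from the free-text body."""
    msg = message_from_string(raw)
    return {
        "sender": msg["From"],
        "recipient": msg["To"],
        "subject": msg["Subject"],
        "date": msg["Date"],
        # The body has no fixed structure; it is kept as plain text.
        "body": msg.get_payload().strip(),
    }


record = split_email(RAW_EMAIL)
print(record["subject"])
print(record["body"])
```

The header fields could be loaded into table columns directly, while the body would still need text analytics to be interpreted.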

Web pages

Web pages are organized into hierarchical categories with top-level and sub-navigation (e.g., Company as a top-level and About, Leadership, and Careers as sub-navigation). Web pages use the loose structure of HTML to display unstructured data.

HTML

HTML (HyperText Markup Language) is a hierarchical markup language that is used to display data, such as web pages. HTML is semi-structured in that it uses tags to annotate and display unstructured data (e.g., text and images).
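As a rough illustration of how HTML tags wrap unstructured content, the sketch below uses Python's standard `html.parser` module to pull the text and image references out of a small page. The page markup and the `TextAndImages` class name are assumptions made up for this example.

```python
from html.parser import HTMLParser


class TextAndImages(HTMLParser):
    """Collect visible text and image sources from an HTML document."""

    def __init__(self):
        super().__init__()
        self.text = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        # The <img> tag annotates an image; its src attribute locates the file.
        if tag == "img":
            self.images.append(dict(attrs).get("src"))

    def handle_data(self, data):
        # Text between tags is the unstructured content being displayed.
        if data.strip():
            self.text.append(data.strip())


# Hypothetical page, invented for illustration.
PAGE = (
    "<html><body><h1>About</h1><p>We build tools.</p>"
    '<img src="team.png" alt="Team"></body></html>'
)

parser = TextAndImages()
parser.feed(PAGE)
parser.close()
print(parser.text)
print(parser.images)
```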

Semi-structured documents

CSV, XML, and JSON are three formats commonly used for semi-structured data.

CSV (comma-separated values) stores plain text as a series of values separated by commas.
XML (extensible markup language) stores data as elements, attributes, and text marked with tags.
JSON (JavaScript object notation) is a text format that stores data as objects made up of key-value pairs.

Social media posts, which consist of unstructured data, are often organized into semi-structured data using CSV, XML, or JSON.
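To make the three formats concrete, here is a minimal sketch that stores the same hypothetical social media post as CSV, XML, and JSON and reads each one back with Python's standard library. The field names (`user`, `likes`, `text`) and the sample values are assumptions chosen for illustration.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# The same hypothetical post in three semi-structured formats.
CSV_TEXT = "user,likes,text\nsam,12,Loving the new release!\n"
XML_TEXT = (
    '<post user="sam"><likes>12</likes>'
    "<text>Loving the new release!</text></post>"
)
JSON_TEXT = '{"user": "sam", "likes": 12, "text": "Loving the new release!"}'


def from_csv(text):
    # CSV: values separated by commas, named by the header row.
    row = next(csv.DictReader(io.StringIO(text)))
    return {"user": row["user"], "likes": int(row["likes"]), "text": row["text"]}


def from_xml(text):
    # XML: data held in elements, attributes, and tagged text.
    root = ET.fromstring(text)
    return {
        "user": root.get("user"),
        "likes": int(root.findtext("likes")),
        "text": root.findtext("text"),
    }


def from_json(text):
    # JSON: an object made up of key-value pairs.
    return json.loads(text)


records = [from_csv(CSV_TEXT), from_xml(XML_TEXT), from_json(JSON_TEXT)]
print(records[0] == records[1] == records[2])
```

In each case the `text` field remains free-form; only the wrapper around it is structured.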

NoSQL databases

NoSQL (not only structured query language or non-SQL) databases are non-relational databases used to store semi-structured and unstructured data. The main types of NoSQL databases are document, key-value, wide-column, and graph.

Electronic data interchange (EDI)

EDI replaces paper business documents, such as purchase orders, inventory information, and invoices, with an electronic document transmission system. Standard formats (e.g., ANSI X12, EDIFACT, TRADACOMS, and ebXML) provide a common structure for sharing unstructured data.

Uses for unstructured data

Unstructured data is primarily used for business intelligence (BI) and analytics. The following are examples of how organizations use unstructured data.

Customer service

Unstructured data can be mined to improve digital and human customer service interactions by:

Helping agents find answers to customers’ questions more quickly
Improving chatbot-based routing
Surfacing the most frequently asked questions

Infrastructure and manufacturing

All types of organizations that maintain infrastructure can use unstructured data (e.g., sensor data and system logs) for predictive analytics to optimize operations by:

Detecting equipment failures before they occur
Identifying areas where maintenance is required
Increasing the efficacy of cybersecurity systems
Monitoring usage and identifying patterns
Preventing system crashes

Product development

Unstructured data analysis provides valuable insights that guide product development, such as:

Finding ways to improve products or services
Predicting future product interest
Identifying market trends
Monitoring competition

Regulatory compliance

Analysis of unstructured data can facilitate regulatory compliance efforts by supporting:

Data governance
Enforcement of data access policies
Identification of sensitive information
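Identification of sensitive information in free text can be sketched, in a very simplified form, with pattern matching. The two regular expressions below (an email address and a US Social Security number style pattern) and the `find_sensitive` helper are illustrative assumptions; real compliance tooling relies on far more robust detection than this.

```python
import re

# Deliberately simple, illustrative patterns; not production-grade detectors.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "ssn_like": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def find_sensitive(text):
    """Return a dict mapping each pattern label to the matches found in text."""
    hits = {}
    for label, pattern in PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            hits[label] = matches
    return hits


# Hypothetical free-text note, invented for illustration.
NOTE = "Customer jo@example.com called; SSN on file is 123-45-6789."
print(find_sensitive(NOTE))
```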

Sales and marketing

Retailers and many other types of organizations analyze unstructured data to:

Anticipate customers’ needs
Enable targeted marketing
Enhance customer satisfaction
Identify purchase trends
Improve customers’ experience
Make better product or service recommendations for new and existing customers
Determine timing for upsell programs for existing customers
Understand customers’ sentiments about products, customer service, and brands

Challenges of unstructured data

Difficult data governance

Organizations struggle to enforce data governance rules on unstructured data, such as:

Access controls
Encryption requirements
Privacy rights request responses
Retention and deletion periods

Difficulty using unstructured data

Unstructured data must be transformed into a machine-readable format before it can be processed
Unstructured data requires indexing and a schema to be useful
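As a minimal sketch of that transformation, the example below parses raw log lines into records with a fixed set of fields (a simple schema) and then builds an index by severity level. The log format, the field names, and the helper functions are assumptions invented for illustration.

```python
import re
from collections import defaultdict

# Hypothetical application log lines, invented for illustration.
LOG_LINES = [
    "2023-12-12 09:30:01 ERROR disk /dev/sda1 is 95% full",
    "2023-12-12 09:30:05 INFO user login succeeded",
    "2023-12-12 09:31:17 ERROR connection to db timed out",
]

# The named groups act as the schema: date, time, level, message.
LINE_RE = re.compile(
    r"^(?P<date>\S+) (?P<time>\S+) (?P<level>[A-Z]+) (?P<message>.*)$"
)


def parse_line(line):
    """Turn one raw log line into a dict, or None if it does not match."""
    match = LINE_RE.match(line)
    return match.groupdict() if match else None


def build_index(records):
    """Map each severity level to the positions of the records that have it."""
    index = defaultdict(list)
    for position, record in enumerate(records):
        index[record["level"]].append(position)
    return dict(index)


records = [r for r in map(parse_line, LOG_LINES) if r is not None]
index = build_index(records)
print(index)
```

Once the lines are parsed and indexed, a query such as "all ERROR entries" becomes a lookup rather than a scan over raw text.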

Increased vulnerability to cyber attacks

Disparate, distributed unstructured data often lacks proper data protections
Volumes of unstructured data increase the attack surface

Regulatory non-compliance

Unstructured data often goes unchecked and includes sensitive information
Unregulated data can lead to numerous legal and compliance risks

Difficulties with scale

Unable to process unwieldy volumes of unstructured data
Expensive to store the quantity of unstructured data
Extensive resources required to maintain the storage and processing systems for massive volumes of unstructured data

Siloed data

Unstructured data is collected and stored in data silos across multiple locations (e.g., chats, emails, and audio logs)
Disparate information stored across multiple systems

Untold value in unstructured data

Unstructured data is arguably one of the greatest business assets available. With the right tools and services, organizations can glean a wide range of insights from it. Internally generated data, external data, and the combination of the two allow organizations to identify trends and predict future behavior, giving them critical information to make data-driven tactical decisions and strategic plans.

Date: December 12, 2023
Reading time: 6 minutes
