Unstructured data

What is unstructured data?

Unstructured data is information that is not arranged according to a predefined data model or schema, although it commonly has a native, internal structure (e.g., an image or audio file). Because it has no pre-set structure, unstructured data is stored in its native format.

Two common types of unstructured data are text data and multimedia data (also called rich data). Unstructured data represents the bulk of collected information, and its volume keeps growing as digital systems produce more of it.

The value of unstructured data comes from the insights that can be garnered from it using advanced analytics, such as machine learning (ML) and artificial intelligence (AI).

Unstructured data can explain far more than the statistics and numbers associated with structured data.

Unstructured data vs structured data

Examples of unstructured data

Broad categories of unstructured data are rich media (i.e., multimedia) and text files. Examples of unstructured data include:

  1. Customer feedback
  2. Emails
  3. Geospatial data (e.g., maps, elevation models, and population data)
  4. Images (e.g., JPG, PNG, and TIFF)
  5. Internet of Things (IoT) data (e.g., sensor data, ticker data, and device data)
  6. Online reviews (e.g., Google Reviews, Yelp, Consumer Reports)
  7. Open-ended survey responses
  8. Satellite imagery
  9. Server, website, and application logs
  10. Social media posts (e.g., Facebook, X, Instagram, TikTok)
  11. Speech, music, and other sound recordings (e.g., MP3, WAV, and FLAC)
  12. Surveillance data (e.g., health, security, and behavioral)
  13. Text files (e.g., DOC, Pages, RTF, and TXT)
  14. Videos (e.g., MP4, AVI, and MOV)
  15. Weather data (e.g., temperature, wind speed, and rainfall)

What is semi-structured data?

Like unstructured data, semi-structured data does not have a pre-set format. However, it has somewhat more structure than unstructured data because it includes internal categories, meta tags, and markings. These are used to separate and organize the data into groups, pairings, and hierarchies.

Another similarity between semi-structured data and unstructured data is that neither fits neatly into the fixed rows and columns of a relational database. Examples of semi-structured data and related data formats include the following.

Email

Email is the most commonly cited example of semi-structured data. It is organized in categories, such as date, sender, recipient, and subject, but the content of the email body or message is unstructured data. In addition, email messages are stored in folders, such as Inbox, Sent, Trash, Spam, or custom folders.
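As a rough illustration of that split, the sketch below parses a made-up message with Python's standard email module. The addresses and message text are hypothetical; the point is only that the headers can be read as named fields while the body remains free text.

```python
from email import message_from_string

# A hypothetical message: structured headers, a blank line, then a free-text body.
raw_message = """\
From: alice@example.com
To: support@example.com
Subject: Late delivery
Date: Mon, 03 Mar 2025 09:15:00 +0000

Hi, my order arrived a week late and the box was damaged.
"""

msg = message_from_string(raw_message)

# The headers are the structured part: fixed names with predictable values.
headers = {name: msg[name] for name in ("From", "To", "Subject", "Date")}

# The body is unstructured: it has no predefined fields to query.
body = msg.get_payload().strip()

print(headers)
print(body)
```

Filtering by sender or date works directly on the headers, whereas learning what the customer is unhappy about requires analyzing the body text.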

Web pages

Web pages are organized into hierarchical categories with top-level and sub-navigation (e.g., Company as a top-level and About, Leadership, and Careers as sub-navigation). Web pages use the loose structure of HTML to display unstructured data.

HTML

HTML (HyperText Markup Language) is a hierarchical markup language used to display content such as web pages. Its semi-structured character comes from the tags it uses to annotate and display unstructured data (e.g., text and images).
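To make the distinction between markup and content concrete, here is a minimal sketch using Python's built-in html.parser module. The page fragment and the TextAndTags class name are invented for illustration; the parser simply records which tags it sees and which text they wrap.

```python
from html.parser import HTMLParser


class TextAndTags(HTMLParser):
    """Collects tag names (the structure) separately from text (the content)."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())


# A hypothetical page fragment.
page = "<html><body><h1>About</h1><p>We build tools.</p><img src='team.png'></body></html>"

parser = TextAndTags()
parser.feed(page)
parser.close()

print(parser.tags)
print(parser.text)
```

The tags form a predictable hierarchy, while the text and the image they annotate carry no schema of their own.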

Semi-structured documents

CSV, XML, and JSON are three formats commonly used for semi-structured data.

  1. CSV (comma-separated values) stores plain text as a series of values separated by commas.
  2. XML (Extensible Markup Language) stores data as elements, attributes, and text marked with tags.
  3. JSON (JavaScript Object Notation) is a text format that stores data as objects made up of key-value pairs.

Social media posts, which consist of unstructured data, are often organized into semi-structured data using CSV, XML, or JSON.
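For example, a single social media post could be wrapped in any of the three formats. The sketch below uses only Python's standard library; the field names (user, posted, text) and the post itself are assumptions made up for the illustration.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# The same hypothetical post expressed in each format.
csv_text = 'user,posted,text\nsam,2025-03-03,"Great service, fast shipping"\n'
xml_text = '<post user="sam" posted="2025-03-03"><text>Great service, fast shipping</text></post>'
json_text = '{"user": "sam", "posted": "2025-03-03", "text": "Great service, fast shipping"}'

# CSV: the header row names the columns, and each later row is one record.
from_csv = dict(next(csv.DictReader(io.StringIO(csv_text))))

# XML: attributes and child elements carry the fields.
root = ET.fromstring(xml_text)
from_xml = {
    "user": root.get("user"),
    "posted": root.get("posted"),
    "text": root.findtext("text"),
}

# JSON: key-value pairs map directly onto a dictionary.
from_json = json.loads(json_text)

print(from_csv == from_xml == from_json)
```

In all three cases the wrapper supplies the labels, while the text field itself is still free-form prose.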

NoSQL databases

NoSQL (not only SQL, or non-SQL) databases are non-relational databases used to store semi-structured and unstructured data. The main types of NoSQL databases are document, key-value, wide-column, and graph.

Electronic data interchange (EDI)

EDI replaces paper business documents, such as purchase orders, inventory information, and invoices, with an electronic document transmission system. Standard formats (e.g., ANSI X12, EDIFACT, TRADACOMS, and ebXML) provide a common structure for sharing unstructured data.

Uses for unstructured data

Unstructured data is primarily used for business intelligence (BI) and analytics. The following are examples of how organizations use unstructured data.

Customer service

Unstructured data can be mined to improve digital and human customer service interactions by:

  1. Helping agents find answers to customers’ questions more quickly
  2. Improving chatbot-based routing
  3. Surfacing the most frequently asked questions

Infrastructure and manufacturing

All types of organizations that maintain infrastructure can use unstructured data (e.g., sensor data and system logs) for predictive analytics to optimize operations by:

  1. Detecting equipment failures before they occur
  2. Identifying areas where maintenance is required
  3. Increasing the efficacy of cybersecurity systems
  4. Monitoring usage and identifying patterns
  5. Preventing system crashes
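A first step toward this kind of analysis is usually turning raw log lines into records that can be counted and compared. The sketch below assumes a made-up log layout (timestamp, host, level, message) and hypothetical device names; real logs vary widely, so the pattern would need to be adapted.

```python
import re
from collections import Counter

# Assumed line layout: "<timestamp> <host> <LEVEL> <free-text message>".
LOG_LINE = re.compile(
    r"(?P<timestamp>\S+) (?P<host>\S+) (?P<level>[A-Z]+) (?P<message>.*)"
)

# Hypothetical log excerpt.
raw_logs = """\
2025-03-03T09:00:01Z pump-01 INFO pressure nominal
2025-03-03T09:00:05Z pump-02 ERROR vibration above threshold
2025-03-03T09:00:09Z pump-02 ERROR vibration above threshold
2025-03-03T09:00:12Z pump-01 WARN temperature rising
"""

# Keep only the lines that match the assumed layout.
records = [
    m.groupdict()
    for line in raw_logs.splitlines()
    if (m := LOG_LINE.fullmatch(line))
]

# Repeated errors from one device can be an early sign that it needs attention.
errors_by_host = Counter(r["host"] for r in records if r["level"] == "ERROR")

print(errors_by_host.most_common())
```

Counting errors per device is only a simple stand-in for predictive analytics, but it shows how structure has to be extracted before any pattern can be detected.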

Product development

Unstructured data analysis provides valuable insights that guide product development, such as:

  1. Finding ways to improve products or services
  2. Predicting future product interest
  3. Identifying market trends
  4. Monitoring competition

Regulatory compliance

Analysis of unstructured data can facilitate regulatory compliance efforts by supporting:

  1. Data governance
  2. Enforcement of data access policies
  3. Identification of sensitive information
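Identification of sensitive information, for instance, often starts with pattern matching over free text. The following is a deliberately simplified sketch: the two regular expressions (an email address and a US Social Security number-like pattern) and the find_sensitive helper are assumptions for illustration and would miss many real-world cases.

```python
import re

# Simplified, illustrative patterns; production scanners use far more robust rules.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "ssn_like": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def find_sensitive(text):
    """Return every match for each pattern found in a block of free text."""
    return {label: pattern.findall(text) for label, pattern in PATTERNS.items()}


# A hypothetical support note containing fabricated personal data.
note = "Customer jane.doe@example.com called; SSN on file is 123-45-6789."

print(find_sensitive(note))
```

Once matches are located, files can be flagged for access restrictions, encryption, or retention rules.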

Sales and marketing

Retailers and many other types of organizations analyze unstructured data to:

  1. Anticipate customers’ needs
  2. Enable targeted marketing
  3. Enhance customer satisfaction
  4. Identify purchase trends
  5. Improve customers’ experience
  6. Make better product or service recommendations for new and existing customers
  7. Determine timing for upsell programs for existing customers
  8. Understand customers’ sentiments about products, customer service, and brands

Challenges of unstructured data

Difficult data governance

Organizations struggle to enforce data governance rules on unstructured data, such as:

  1. Access controls
  2. Encryption requirements
  3. Privacy rights request responses
  4. Retention and deletion periods

Difficulty using unstructured data

  1. Must be transformed into a machine-readable format before it can be processed
  2. Requires indexing and a schema to be useful
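To show what indexing means in practice, the sketch below builds a small inverted index that maps each word to the documents containing it. The document identifiers, the sample text, and the build_index helper are hypothetical, and the tokenization is intentionally naive.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map each lowercase token to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in re.findall(r"[a-z0-9]+", text.lower()):
            index[token].add(doc_id)
    return index


# Hypothetical free-text items from different sources.
docs = {
    "chat-1": "The app crashed after the update",
    "email-7": "Refund request: app subscription",
    "review-3": "Great update, no complaints",
}

index = build_index(docs)

print(sorted(index["app"]))
print(sorted(index["update"]))
```

Without an index like this, every question about the data means rereading every document in full.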

Increased vulnerability to cyber attacks

  1. Disparate, distributed unstructured data often lacks proper data protections
  2. Volumes of unstructured data increase the attack surface

Regulatory non-compliance

  1. Unstructured data often goes unchecked and includes sensitive information
  2. Unregulated data can lead to numerous legal and compliance risks

Difficulties with scale

  1. Unable to process unwieldy volumes of unstructured data
  2. Expensive to store the quantity of unstructured data
  3. Extensive resources required to maintain the storage and processing systems for massive volumes of unstructured data

Siloed data

  1. Unstructured data is collected and stored in data silos across multiple locations (e.g., chats, emails, and audio logs)
  2. Disparate information stored across multiple systems

Untold value in unstructured data

Unstructured data is arguably one of the greatest business assets available. With the right tools and services, organizations can glean extensive insights from it. Internally generated data, external data, and the combination of the two allow organizations to identify trends and predict future behavior, giving them critical information to make data-driven tactical decisions and strategic plans.
