Unstructured data
What is unstructured data?
Unstructured data is information that is not organized according to a predefined data model or schema, although it commonly has a native, internal structure (e.g., an image or audio file). Because it has no pre-set structure, unstructured data is stored in its native format.
Two common types of unstructured data are text data and multimedia (or rich media) data. Unstructured data represents the bulk of collected information, and its volume keeps growing as digital systems produce more of it.
The value of unstructured data comes from the insights that can be garnered from it using advanced analytics, such as machine learning (ML) and artificial intelligence (AI).
Unstructured data can explain far more than the statistics and numbers associated with structured data.
Unstructured data vs structured data
Examples of unstructured data
Broad categories of unstructured data are rich media (i.e., multimedia) and text files. Examples of unstructured data include:
- Customer feedback
- Emails
- Geospatial data (e.g., maps, elevation models, and population data)
- Images (e.g., JPG, PNG, and TIFF)
- Internet of Things (IoT) data (e.g., sensor data, ticker data, and device data)
- Online reviews (e.g., Google Reviews, Yelp, Consumer Reports)
- Open-ended survey responses
- Satellite imagery
- Server, website, and application logs
- Social media posts (e.g., Facebook, X, Instagram, TikTok)
- Speech, music, and other sound recordings (e.g., MP3, WAV, and FLAC)
- Surveillance data (e.g., health, security, and behavioral)
- Text files (e.g., doc, pages, RTF, and txt)
- Videos (e.g., MP4, AVI, and MOV)
- Weather data (e.g., temperature, wind speed, and rainfall)
What is semi-structured data?
Like unstructured data, semi-structured data does not have a pre-set format. However, it has more structure than unstructured data because it includes internal categories, meta tags, and markings. These are used to separate the unstructured content into groups, pairings, and hierarchies.
Another similarity is that neither semi-structured data nor unstructured data fits neatly into the fixed tables of a relational database. Examples of semi-structured data and related data formats include the following.
Email
Email is one of the most commonly cited examples of semi-structured data. Each message is organized into fields, such as date, sender, recipient, and subject, but the content of the email body is unstructured data. In addition, email messages are stored in folders, such as Inbox, Sent, Trash, Spam, or custom folders.
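The split between organized fields and a free-text body can be sketched with Python's standard `email` module. This is a minimal illustration, not a production parser: the message, the addresses, and the `split_email` helper are all invented for the example.

```python
from email import message_from_string
from email.policy import default

# Hypothetical message, invented for illustration.
RAW_MESSAGE = """\
From: alice@example.com
To: support@example.com
Subject: Order arrived damaged
Date: Mon, 03 Mar 2025 09:15:00 +0000

Hi, the box was crushed when it arrived and the screen is cracked.
Can you send a replacement?
"""


def split_email(raw: str) -> dict:
    """Return the structured header fields and the free-text body separately."""
    msg = message_from_string(raw, policy=default)
    return {
        # Header fields are the structured part: fixed names, one value each.
        "sender": str(msg["From"]),
        "recipient": str(msg["To"]),
        "subject": str(msg["Subject"]),
        "date": str(msg["Date"]),
        # The body is the unstructured part: free text with no schema.
        "body": msg.get_content().strip(),
    }


record = split_email(RAW_MESSAGE)
print(record["subject"])
print(record["body"])
```

The header fields could be loaded into table columns as they are, while the body would still need text analysis before it could answer a question such as "what was damaged?".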
Web pages
Web pages are organized into hierarchical categories with top-level and sub-navigation (e.g., Company as a top-level and About, Leadership, and Careers as sub-navigation). Web pages use the loose structure of HTML to display unstructured data.
HTML
HTML (HyperText Markup Language) is a hierarchical markup language that is used to display data, such as web pages. HTML is semi-structured in that it uses annotations (tags) to display unstructured data (e.g., text and images).
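As a rough sketch of how the tags wrap unstructured content, the snippet below uses Python's standard `html.parser` to pull the text and image references out of a small page. The page markup and the `TextAndImageExtractor` class are assumptions made up for this example.

```python
from html.parser import HTMLParser


class TextAndImageExtractor(HTMLParser):
    """Collect visible text chunks and image sources from an HTML string."""

    def __init__(self):
        super().__init__()
        self.text_chunks = []
        self.image_sources = []

    def handle_starttag(self, tag, attrs):
        # The tag name and its attributes are the semi-structured annotations.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.image_sources.append(src)

    def handle_data(self, data):
        # Text between tags is the unstructured content being displayed.
        stripped = data.strip()
        if stripped:
            self.text_chunks.append(stripped)


# Hypothetical page, invented for illustration.
PAGE = (
    "<html><body><h1>About</h1>"
    "<p>We build tools.</p>"
    "<img src='team.jpg' alt='Team'>"
    "</body></html>"
)

extractor = TextAndImageExtractor()
extractor.feed(PAGE)
extractor.close()
print(extractor.text_chunks)
print(extractor.image_sources)
```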
Semi-structured documents
CSV, XML, and JSON are three formats commonly used for semi-structured data.
- CSV (comma-separated values) stores plain text as rows of values separated by commas.
- XML (Extensible Markup Language) stores data as elements, attributes, and text marked with tags.
- JSON (JavaScript Object Notation) is a text format that stores data as objects made up of key-value pairs.
Social media posts, comprised of unstructured data, are often organized into semi-structured data using CSV, XML, or JSON.
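To make the three formats concrete, the sketch below encodes one hypothetical social media post in CSV, XML, and JSON and parses each with Python's standard library. The field names (`user`, `likes`, `text`) and the sample post are assumptions chosen for illustration, not a real platform's export format.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# The same hypothetical post, encoded three ways.
CSV_TEXT = "user,likes,text\nsam,12,Loving the new release\n"
XML_TEXT = "<post user='sam' likes='12'><text>Loving the new release</text></post>"
JSON_TEXT = '{"user": "sam", "likes": 12, "text": "Loving the new release"}'


def from_csv(text: str) -> dict:
    # CSV carries no types, so the numeric field must be converted by hand.
    row = next(csv.DictReader(io.StringIO(text)))
    return {"user": row["user"], "likes": int(row["likes"]), "text": row["text"]}


def from_xml(text: str) -> dict:
    # Attributes and child elements are both plain strings in XML.
    root = ET.fromstring(text)
    return {
        "user": root.get("user"),
        "likes": int(root.get("likes")),
        "text": root.findtext("text"),
    }


def from_json(text: str) -> dict:
    # JSON distinguishes numbers from strings, so no conversion is needed.
    return json.loads(text)


records = [from_csv(CSV_TEXT), from_xml(XML_TEXT), from_json(JSON_TEXT)]
for record in records:
    print(record)
```

In all three cases the keys give the post a loose structure, while the `text` value itself remains free-form content.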
NoSQL databases
NoSQL (not only structured query language or non-SQL) databases are non-relational databases used to store semi-structured and unstructured data. The main types of NoSQL databases are document, key-value, wide-column, and graph.
Electronic data interchange (EDI)
EDI replaces paper business documents, such as purchase orders, inventory information, and invoices, with an electronic document transmission system. Standard formats (e.g., ANSI X12, EDIFACT, TRADACOMS, and ebXML) provide a common structure for sharing unstructured data.
Uses for unstructured data
Unstructured data is primarily used for business intelligence (BI) and analytics. The following are examples of how organizations use unstructured data.
Customer service
Unstructured data can be mined to improve digital and human customer service interactions by:
- Helping agents find answers to customers’ questions more quickly
- Improving chatbot-based routing
- Surfacing the most frequently asked questions
Infrastructure and manufacturing
All types of organizations that maintain infrastructure can use unstructured data (e.g., sensor data and system logs) for predictive analytics to optimize operations by:
- Detecting equipment failures before they occur
- Identifying areas where maintenance is required
- Increasing the efficacy of cybersecurity systems
- Monitoring usage and identifying patterns
- Preventing system crashes
Product development
Unstructured data analysis provides valuable insights that guide product development, such as:
- Finding ways to improve products or services
- Predicting future product interest
- Identifying market trends
- Monitoring competition
Regulatory compliance
Analysis of unstructured data can facilitate regulatory compliance efforts by supporting:
- Data governance
- Enforcement of data access policies
- Identification of sensitive information
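As a minimal sketch of identifying sensitive information in free text, the snippet below scans a string with two regular expressions. The patterns (a simplified email address and a US Social Security number layout) and the `find_sensitive` helper are illustrative assumptions; real compliance tooling uses far broader and more carefully validated detection.

```python
import re

# Simplified, illustrative patterns. They will miss some valid values and
# may flag some harmless strings.
PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def find_sensitive(text: str) -> dict:
    """Return every match for each pattern label found in the text."""
    return {label: pattern.findall(text) for label, pattern in PATTERNS.items()}


# Hypothetical note, invented for illustration.
NOTE = "Contact jane.doe@example.com about SSN 123-45-6789."
findings = find_sensitive(NOTE)
print(findings)
```

A scan like this could feed the other two bullets: files with findings can be flagged for stricter access policies and governance review.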
Sales and marketing
Retailers and many other types of organizations analyze unstructured data to:
- Anticipate customers’ needs
- Enable targeted marketing
- Enhance customer satisfaction
- Identify purchase trends
- Improve customers’ experience
- Make better product or service recommendations for new and existing customers
- Determine timing for upsell programs for existing customers
- Understand customers’ sentiments about products, customer service, and brands
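One simple way to approximate customer sentiment is lexicon-based scoring: count positive and negative words in each piece of feedback. The sketch below is a toy version of that idea. The word lists, the reviews, and the function names are invented for illustration, and production systems typically rely on trained ML models instead.

```python
import re

# Tiny hypothetical lexicons; a real one would hold thousands of entries.
POSITIVE = {"great", "love", "fast", "helpful"}
NEGATIVE = {"broken", "slow", "rude", "refund"}


def sentiment_score(text: str) -> int:
    """Positive-word count minus negative-word count for one piece of text."""
    words = re.findall(r"[a-z']+", text.lower())
    positives = sum(word in POSITIVE for word in words)
    negatives = sum(word in NEGATIVE for word in words)
    return positives - negatives


def sentiment_label(score: int) -> str:
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


# Hypothetical reviews, invented for illustration.
REVIEWS = [
    "Love the product, support was fast and helpful",
    "Arrived broken and support was rude",
    "It is a phone",
]

for review in REVIEWS:
    print(sentiment_label(sentiment_score(review)), "-", review)
```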
Challenges of unstructured data
Difficult data governance
Organizations struggle to enforce data governance rules on unstructured data, such as:
- Access controls
- Encryption requirements
- Privacy rights request responses
- Retention and deletion periods
Difficulty using unstructured data
- Unstructured data must be transformed into a machine-readable format before it can be processed
- It requires indexing and a schema to be useful
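The two steps above, transforming raw data and then adding a schema and an index, can be sketched for application logs. The log layout, the field names, and the helper functions below are assumptions made for the example; real logs vary widely and often need several patterns.

```python
import re
from collections import defaultdict

# Assumed layout: "<timestamp> <LEVEL> <service>: <message>".
# The named groups act as the schema imposed on each line.
LOG_PATTERN = re.compile(
    r"^(?P<timestamp>\S+) (?P<level>[A-Z]+) (?P<service>[\w-]+): (?P<message>.*)$"
)

# Hypothetical log lines, invented for illustration.
LOG_LINES = [
    "2025-03-03T09:15:00Z ERROR payments: card declined for order 1042",
    "2025-03-03T09:15:02Z INFO web: GET /checkout 200",
    "2025-03-03T09:15:05Z ERROR payments: timeout contacting gateway",
    "this line does not match the assumed layout",
]


def parse_logs(lines):
    """Split lines into structured records and lines that fit no schema."""
    records, rejected = [], []
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match:
            records.append(match.groupdict())
        else:
            rejected.append(line)
    return records, rejected


def index_by(records, field):
    """Map each value of a field to the positions of the records holding it."""
    index = defaultdict(list)
    for position, record in enumerate(records):
        index[record[field]].append(position)
    return dict(index)


records, rejected = parse_logs(LOG_LINES)
level_index = index_by(records, "level")
print(level_index)
print(rejected)
```

Keeping the rejected lines, rather than discarding them, shows how much of the data the assumed schema fails to cover.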
Increased vulnerability to cyber attacks
- Disparate, distributed unstructured data often lacks proper data protections
- Volumes of unstructured data increase the attack surface
Regulatory non-compliance
- Unstructured data often goes unchecked and includes sensitive information
- Unregulated data can lead to numerous legal and compliance risks
Difficulties with scale
- Unwieldy volumes of unstructured data can exceed an organization's processing capacity
- Large quantities of unstructured data are expensive to store
- Extensive resources are required to maintain the storage and processing systems for massive volumes of unstructured data
Siloed data
- Unstructured data is collected and stored in data silos across multiple sources (e.g., chats, emails, and audio logs)
- Disparate information stored across multiple systems
Untold value in unstructured data
Unstructured data is arguably one of the greatest business assets available. With the right tools and services, organizations can draw a wide range of insights from it. Internally generated data, external data, and the combination of the two allow organizations to identify trends and predict future behavior, giving them critical information to make data-driven tactical decisions and strategic plans.