Splitting Strings on Tokens and Aggregating on Split Words

As a data engineer or analyst, you frequently encounter files with names that contain separate words separated by an underscore (_). You want to split these strings into individual words and then count the occurrences of each word. In this article, we’ll explore how to achieve this in SQL using a combination of generated numbers tables, substring functions, and derived tables.

Background

When working with files, it’s common to encounter names that include extensions or additional information. For example, the file name HD_VIDEO_720p.mov contains two separate words separated by an underscore: HD and VIDEO. You want to extract these individual words and count their occurrences.

SQL Solution Overview

To solve this problem, we’ll use a few SQL techniques:

Generated Numbers Table: We’ll create a generated numbers table with values that cover the maximum number of words in a filename.
Substring Functions: We’ll use substring functions to split the file name into individual words separated by underscores.
Derived Tables: We’ll create derived tables to manipulate and transform the data.

SQL Solution

Step 1: Create Generated Numbers Table

We need to create a generated numbers table that covers the maximum number of words in a filename. In this case, we assume the maximum number of words is 12 (e.g., MY_NEW_MOVIE.mov).

-- Create generated numbers table with values from 1 to 12
WITH numbers AS (
    SELECT 1 n UNION ALL
    SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL
    SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL
    SELECT 8 UNION ALL SELECT 9 UNION ALL SELECT 10 UNION ALL
    SELECT 11 UNION ALL SELECT 12
)

Step 2: Split File Names into Individual Words

We’ll use the SUBSTRING_INDEX function to split each file name into individual words separated by underscores.

-- Split file names into individual words
SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(LEFT(file_name, LENGTH(file_name)-4), '_', numbers.n), '_', -1) AS word
FROM (
    -- Join generated numbers table with asset table
    select 1 n union all
    select 2 union all select 3 union all select 4 union all
    select 5 union all select 6 union all select 7 union all
    select 8 union all select 9 union all select 10 union all
    select 11 union all select 12
) numbers
JOIN asset ON LENGTH(file_name)
              - LENGTH(REPLACE(file_name, '_', '')) >= numbers.n - 1

Step 3: Count Occurrences of Each Word

We’ll create a derived table to count the occurrences of each word. We’ll use a self-join to join the same derived table with itself on the word column.

-- Count occurrences of each word
SELECT w1.word, COUNT(w2.word) AS count
FROM (
    -- Derived table with split words
    SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(LEFT(file_name, LENGTH(file_name)-4), '_', numbers.n), '_', -1) AS word
    FROM (
        select 1 n union all
        select 2 union all select 3 union all select 4 union all
        select 5 union all select 6 union all select 7 union all
        select 8 union all select 9 union all select 10 union all
        select 11 union all select 12
    ) numbers
    JOIN asset ON LENGTH(file_name)
              - LENGTH(REPLACE(file_name, '_', '')) >= numbers.n - 1
) w1
LEFT JOIN (
    SELECT word AS word, COUNT(*) AS count
    FROM (
        -- Derived table with split words
        SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(LEFT(file_name, LENGTH(file_name)-4), '_', numbers.n), '_', -1) AS word
        FROM (
            select 1 n union all
            select 2 union all select 3 union all select 4 union all
            select 5 union all select 6 union all select 7 union all
            select 8 union all select 9 union all select 10 union all
            select 11 union all select 12
        ) numbers
        JOIN asset ON LENGTH(file_name)
              - LENGTH(REPLACE(file_name, '_', '')) >= numbers.n - 1
    ) w2
    GROUP BY word
) w2 ON w1.word = w2.word
GROUP BY w1.word

Step 4: Execute Query and Display Results

We’ll execute the query and display the results.

-- Execute query and display results
SELECT word, COUNT(*) AS count
FROM (
    -- Derived table with split words
    SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(LEFT(file_name, LENGTH(file_name)-4), '_', numbers.n), '_', -1) AS word
    FROM (
        select 1 n union all
        select 2 union all select 3 union all select 4 union all
        select 5 union all select 6 union all select 7 union all
        select 8 union all select 9 union all select 10 union all
        select 11 union all select 12
    ) numbers
    JOIN asset ON LENGTH(file_name)
              - LENGTH(REPLACE(file_name, '_', '')) >= numbers.n - 1
) w
GROUP BY word

Output

The query will return the following output:

word	count
16X9	1
235	1
2398	1
720p	2
8CH	1
DISNEY	1
EN	1
FEATURE	1
FINAL	1
FRSUB	1
HD	2
L9714343	1
LG	1
MOVIE	2
MY	1
NEW	1
TYLERPERRY	1
VIDEO	1

Conclusion

In this article, we explored how to split a string into individual words separated by underscores and count the occurrences of each word using SQL. We used a combination of generated numbers tables, substring functions, and derived tables to achieve this.

By following these steps, you can solve similar problems involving file names with multiple words separated by underscores. Remember to adjust the maximum number of values in the generated numbers table according to your specific use case.

Last modified on 2024-05-15