Splitting Strings on Tokens and Aggregating on Split Words
As a data engineer or analyst, you frequently encounter files with names that contain separate words separated by an underscore (_). You want to split these strings into individual words and then count the occurrences of each word. In this article, we’ll explore how to achieve this in SQL using a combination of generated numbers tables, substring functions, and derived tables.
Background
When working with files, it’s common to encounter names that include extensions or additional information. For example, the file name HD_VIDEO_720p.mov contains two separate words separated by an underscore: HD and VIDEO. You want to extract these individual words and count their occurrences.
SQL Solution Overview
To solve this problem, we’ll use a few SQL techniques:
- Generated Numbers Table: We’ll create a generated numbers table with values that cover the maximum number of words in a filename.
- Substring Functions: We’ll use substring functions to split the file name into individual words separated by underscores.
- Derived Tables: We’ll create derived tables to manipulate and transform the data.
SQL Solution
Step 1: Create Generated Numbers Table
We need to create a generated numbers table that covers the maximum number of words in a filename. In this case, we assume the maximum number of words is 12 (e.g., MY_NEW_MOVIE.mov).
-- Create generated numbers table with values from 1 to 12
WITH numbers AS (
SELECT 1 n UNION ALL
SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL
SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL
SELECT 8 UNION ALL SELECT 9 UNION ALL SELECT 10 UNION ALL
SELECT 11 UNION ALL SELECT 12
)
Step 2: Split File Names into Individual Words
We’ll use the SUBSTRING_INDEX function to split each file name into individual words separated by underscores.
-- Split file names into individual words
SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(LEFT(file_name, LENGTH(file_name)-4), '_', numbers.n), '_', -1) AS word
FROM (
-- Join generated numbers table with asset table
select 1 n union all
select 2 union all select 3 union all select 4 union all
select 5 union all select 6 union all select 7 union all
select 8 union all select 9 union all select 10 union all
select 11 union all select 12
) numbers
JOIN asset ON LENGTH(file_name)
- LENGTH(REPLACE(file_name, '_', '')) >= numbers.n - 1
Step 3: Count Occurrences of Each Word
We’ll create a derived table to count the occurrences of each word. We’ll use a self-join to join the same derived table with itself on the word column.
-- Count occurrences of each word
SELECT w1.word, COUNT(w2.word) AS count
FROM (
-- Derived table with split words
SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(LEFT(file_name, LENGTH(file_name)-4), '_', numbers.n), '_', -1) AS word
FROM (
select 1 n union all
select 2 union all select 3 union all select 4 union all
select 5 union all select 6 union all select 7 union all
select 8 union all select 9 union all select 10 union all
select 11 union all select 12
) numbers
JOIN asset ON LENGTH(file_name)
- LENGTH(REPLACE(file_name, '_', '')) >= numbers.n - 1
) w1
LEFT JOIN (
SELECT word AS word, COUNT(*) AS count
FROM (
-- Derived table with split words
SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(LEFT(file_name, LENGTH(file_name)-4), '_', numbers.n), '_', -1) AS word
FROM (
select 1 n union all
select 2 union all select 3 union all select 4 union all
select 5 union all select 6 union all select 7 union all
select 8 union all select 9 union all select 10 union all
select 11 union all select 12
) numbers
JOIN asset ON LENGTH(file_name)
- LENGTH(REPLACE(file_name, '_', '')) >= numbers.n - 1
) w2
GROUP BY word
) w2 ON w1.word = w2.word
GROUP BY w1.word
Step 4: Execute Query and Display Results
We’ll execute the query and display the results.
-- Execute query and display results
SELECT word, COUNT(*) AS count
FROM (
-- Derived table with split words
SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(LEFT(file_name, LENGTH(file_name)-4), '_', numbers.n), '_', -1) AS word
FROM (
select 1 n union all
select 2 union all select 3 union all select 4 union all
select 5 union all select 6 union all select 7 union all
select 8 union all select 9 union all select 10 union all
select 11 union all select 12
) numbers
JOIN asset ON LENGTH(file_name)
- LENGTH(REPLACE(file_name, '_', '')) >= numbers.n - 1
) w
GROUP BY word
Output
The query will return the following output:
| word | count |
|---|---|
| 16X9 | 1 |
| 235 | 1 |
| 2398 | 1 |
| 720p | 2 |
| 8CH | 1 |
| DISNEY | 1 |
| EN | 1 |
| FEATURE | 1 |
| FINAL | 1 |
| FRSUB | 1 |
| HD | 2 |
| L9714343 | 1 |
| LG | 1 |
| MOVIE | 2 |
| MY | 1 |
| NEW | 1 |
| TYLERPERRY | 1 |
| VIDEO | 1 |
Conclusion
In this article, we explored how to split a string into individual words separated by underscores and count the occurrences of each word using SQL. We used a combination of generated numbers tables, substring functions, and derived tables to achieve this.
By following these steps, you can solve similar problems involving file names with multiple words separated by underscores. Remember to adjust the maximum number of values in the generated numbers table according to your specific use case.
Last modified on 2024-05-15