Everyone who's worked with a web analytics table has written that one regex to pull the domain out of a URL. It starts simple. Then you hit a URL with no protocol. Then one with a subdomain. Then a country-code TLD that your pattern didn't see coming. Three hours later, your regex is a line wrap long and you're not sure what it does anymore.
You don't have to. BigQuery has a NET function family for exactly this.
Two functions you need to know
NET.HOST()
Returns the full host, including subdomains.
NET.HOST("https://developers.google.com/bigquery") -- developers.google.comUseful when subdomain matters — referrer analysis, multi-tenant SaaS logs, anywhere the subdomain carries signal.
NET.REG_DOMAIN()
Returns just the registrable domain. Subdomain stripped, path ignored.
NET.REG_DOMAIN("https://developers.google.com/bigquery") -- google.comUseful when you want to group things: "all traffic from Google" regardless of which subdomain it came from.
A practical example
Drop these into a staging model:
SELECT
raw_url,
NET.HOST(raw_url) AS host,
NET.REG_DOMAIN(raw_url) AS root_domain
FROM {{ source('analytics', 'raw_sessions') }}
WHERE raw_url IS NOT NULLTwo clean columns. Zero regex. Done.
A note on NULL handling
Both functions return NULL for malformed or unrecognizable input. That's a feature, not a bug — you get explicit signal you can test on, instead of a silent empty string.
Add it to your dbt tests:
columns:
- name: root_domain
tests:
- not_null:
config:
severity: warnA non-zero NULL rate gets surfaced instead of hiding in your data.
Where this lives in a real stack
URL parsing belongs in your staging layer, not in a BI tool's calculated field and definitely not in ad-hoc dashboard SQL. Parse once, store clean columns, have everything downstream consume the clean version.
That keeps three things true:
- Definitions stay consistent across tools.
- Query cost drops (no repeated regex on every dashboard load).
- A change to parsing logic is a one-file PR.
The NET namespace has more
NET.IP_FROM_STRING() and NET.IP_TO_STRING() exist if you're doing IP work. The full reference has a handful of other utilities worth a scan.
The bigger point
Infrastructure gets reliable when teams use standardized primitives instead of ad-hoc ones. A small decision made consistently early — like parse URLs in staging, using the standard function — saves you from tens of dashboards each doing it their own slightly-wrong way.
Boring choices, made once, compound well.
We help teams make hundreds of these boring-but-compounding decisions across their data stack. If your BigQuery project could use a cleanup pass, let's talk.