A data pipeline and analytical framework that transformed 269,353 raw USPTO patent records into actionable venture capital intelligence on cancer therapy innovation.
BioVenture Capital Partners (BVCP), a venture capital firm specializing in life sciences, needed a systematic way to evaluate cancer therapy innovation trends across four decades of USPTO patent filings. Their investment committee faced three core challenges:
A three-phase pipeline: Extract and Transform raw data, Load into a normalized schema, then Analyze through SQL procedures.
Python ETL script that parsed the raw Cancer Moonshot CSV, handled missing values, normalized dates, and decomposed multi-valued fields (domains, classifications, grants) into separate relational entities ready for database ingestion.
Designed a hub-and-spoke relational schema with 6 normalized tables in PostgreSQL 16. The patents table serves as the central hub, with spokes for classifications, domains, NIH grants, FDA approvals, and patent families.
Developed 10 reusable SQL analytical procedures targeting domain momentum, NIH-FDA translation rates, cross-domain innovation, and patent family clustering to surface investment-grade insights for BVCP.
Six normalized tables centered around the patents hub, reducing redundancy from the flat-file source while enabling flexible analytical joins.
Central Hub
Each procedure was designed to answer a specific investment question for BVCP's due diligence process.
Which cancer therapy domains have the fastest-growing filing volume (2011-2016)?
Which domains are accelerating — growth speeding up, not just growing?
Which patents carry both NIH grant funding AND FDA drug approval?
What percentage of patents in each domain have NIH grant linkage?
Which organizations produce the most NIH-linked patents across the broadest domain spread?
Which post-2008 patents in top-momentum domains have the largest patent families?
How many patents span 3+ active domains, indicating cross-domain innovation?
Which domain pairs most frequently co-occur on the same patent?
What is the balance of diagnostic vs. treatment patents per domain over time?
Do high-filing domains also have high FDA approval linkage?
Patent records normalized from the Cancer Moonshot dataset into 6 relational tables
Patents spanning 2+ therapeutic domains, revealing widespread cross-domain innovation
FDA linkage rate for most high-volume domains — a massive patent-to-approval translation gap
Patent families identified, with the largest concentrating in high-momentum domains
The analysis delivers five strategic recommendations for BVCP's cancer therapy investment strategy:
ETL pipeline, data cleaning, CSV parsing and transformation with Pandas and psycopg2
Relational database, hub-and-spoke schema design, analytical SQL procedures
Database administration, query execution, and result visualization
Cancer Moonshot Patent Dataset — 269,353 records spanning 1976 to 2016
I apply the same analytical rigor to research partnerships and data-driven projects. Let's discuss how structured thinking can inform your investment or research decisions.