Muhammad Aliyu
← Back to Projects Quantic MSBA · Business Analytics Capstone

Cancer Moonshot Patent Analytics

A data pipeline and analytical framework that transformed 269,353 raw USPTO patent records into actionable venture capital intelligence on cancer therapy innovation.

Client BioVenture Capital Partners
Dataset 269,353 USPTO Records
Stack Python + PostgreSQL 16
Deliverables 6 Tables · 10 SQL Procedures
Context

The Business Problem

BioVenture Capital Partners (BVCP), a venture capital firm specializing in life sciences, needed a systematic way to evaluate cancer therapy innovation trends across four decades of USPTO patent filings. Their investment committee faced three core challenges:

  1. Volume Overload — The Cancer Moonshot dataset contains 269,353 patent records spanning 1976 to 2016, with 29 columns of unstructured data unsuitable for analytical querying.
  2. Signal Extraction — Identifying which therapeutic domains are accelerating, which have strong NIH-to-FDA translation pipelines, and where cross-domain innovation is emerging.
  3. Due Diligence Efficiency — BVCP needed repeatable, query-driven analytics rather than one-off manual reviews to support quarterly investment decisions.
Approach

Solution Architecture

A three-phase pipeline: Extract and Transform raw data, Load into a normalized schema, then Analyze through SQL procedures.

01

Extract & Transform

Python ETL script that parsed the raw Cancer Moonshot CSV, handled missing values, normalized dates, and decomposed multi-valued fields (domains, classifications, grants) into separate relational entities ready for database ingestion.

Python 3 Pandas Data Cleaning
02

Load & Model

Designed a hub-and-spoke relational schema with 6 normalized tables in PostgreSQL 16. The patents table serves as the central hub, with spokes for classifications, domains, NIH grants, FDA approvals, and patent families.

PostgreSQL 16 pgAdmin 4 Data Modeling
03

Analyze & Report

Developed 10 reusable SQL analytical procedures targeting domain momentum, NIH-FDA translation rates, cross-domain innovation, and patent family clustering to surface investment-grade insights for BVCP.

SQL Analytics Window Functions Business Intelligence
Data Model

Hub-and-Spoke Schema

Six normalized tables centered around the patents hub, reducing redundancy from the flat-file source while enabling flexible analytical joins.

patents

Central Hub

  • patent_id (PK)
  • family_id (FK)
  • application_number
  • filing_date
  • grant_date
  • patent_title
266,756 rows

patent_families

  • family_id (PK)
  • family_size
87,474 rows

patent_classifications

  • id (PK)
  • patent_id (FK)
  • classification_code
  • classification_type
34,975 rows

patent_domains

  • id (PK)
  • patent_id (FK)
  • domain_name
426,258 rows

nih_grants

  • id (PK)
  • patent_id (FK)
  • grant_number
  • recipient_organization
31,553 rows

fda_approvals

  • id (PK)
  • patent_id (FK)
  • fda_application_number
  • drug_trade_name
  • approval_date
  • applicant, ingredient
1,090 rows
Analysis

10 Analytical Procedures

Each procedure was designed to answer a specific investment question for BVCP's due diligence process.

01

Domain Momentum

Which cancer therapy domains have the fastest-growing filing volume (2011-2016)?

Finding: DNA/RNA/Protein Sequence and Data Science showed the strongest growth momentum, signaling a shift toward precision medicine and computational approaches.
02

Domain Acceleration

Which domains are accelerating — growth speeding up, not just growing?

Finding: Domains that were niche pre-2008 showed exponential acceleration post-2012, while traditional Drugs & Chemistry plateaued in growth rate.
03

Dual-Linked Patents

Which patents carry both NIH grant funding AND FDA drug approval?

Finding: Only a small subset of patents carry both markers — these represent the highest-confidence commercial translation opportunities for VC investment.
04

NIH Linkage by Domain

What percentage of patents in each domain have NIH grant linkage?

Finding: Government-funded research concentrates in specific therapeutic domains, revealing where public investment is betting on breakthrough therapies.
05

Top NIH Organizations

Which organizations produce the most NIH-linked patents across the broadest domain spread?

Finding: A handful of research universities and medical centers dominate NIH-linked output — potential partnership or licensing targets for BVCP.
06

High Family-Size Patents

Which post-2008 patents in top-momentum domains have the largest patent families?

Finding: Large patent families signal commercially valuable IP protected across jurisdictions — these concentrate in the accelerating domains.
07

Multi-Domain Patents

How many patents span 3+ active domains, indicating cross-domain innovation?

Finding: 113,484 patents span 2+ domains. Multi-domain patents signal platform technologies with broader therapeutic applicability — high-value VC targets.
08

Domain Co-Occurrence

Which domain pairs most frequently co-occur on the same patent?

Finding: The strongest co-occurrences reveal convergence zones between therapeutic areas, identifying where combinatory innovation is highest.
09

Diagnostic vs. Treatment Ratio

What is the balance of diagnostic vs. treatment patents per domain over time?

Finding: Domains skewing toward diagnostics represent earlier-stage markets where companion diagnostic strategies could drive investment returns.
10

Filing Volume vs. FDA Linkage

Do high-filing domains also have high FDA approval linkage?

Finding: Most high-volume domains have FDA linkage rates below 0.5%, revealing a massive translation gap — the central insight for BVCP's investment thesis.
Results

Key Findings at a Glance

269K

Patent records normalized from the Cancer Moonshot dataset into 6 relational tables

113K

Patents spanning 2+ therapeutic domains, revealing widespread cross-domain innovation

<0.5%

FDA linkage rate for most high-volume domains — a massive patent-to-approval translation gap

87K

Patent families identified, with the largest concentrating in high-momentum domains

Recommendations

Business Implications for BVCP

The analysis delivers five strategic recommendations for BVCP's cancer therapy investment strategy:

  1. Prioritize Accelerating Domains — Focus due diligence on DNA/RNA/Protein Sequence and genomic-related domains showing the steepest momentum curves post-2011. These represent the frontier of cancer therapy innovation.
  2. Target Dual-Linked Patents — The small cohort of patents with both NIH funding and FDA approval represents validated, commercially viable IP. Pursue licensing or partnership with the organizations holding these patents.
  3. Monitor the Translation Gap — The extreme disparity between filing volume and FDA linkage rates means most patent-heavy domains have unproven commercial pathways. Use FDA linkage rate as a risk filter in portfolio screening.
  4. Exploit Cross-Domain Convergence — Multi-domain patents and high-frequency domain pair co-occurrences identify platform technologies with diversified therapeutic applicability and lower concentration risk.
  5. Build Ongoing Intelligence — The 6-table schema and 10 reusable SQL procedures provide a repeatable analytics framework. BVCP can refresh the analysis annually as new USPTO data becomes available.
Technology

Tech Stack

Python 3

ETL pipeline, data cleaning, CSV parsing and transformation with Pandas and psycopg2

PostgreSQL 16

Relational database, hub-and-spoke schema design, analytical SQL procedures

pgAdmin 4

Database administration, query execution, and result visualization

USPTO Dataset

Cancer Moonshot Patent Dataset — 269,353 records spanning 1976 to 2016

Interested in this kind of work?

I apply the same analytical rigor to research partnerships and data-driven projects. Let's discuss how structured thinking can inform your investment or research decisions.