It would seem that I have far too much time on my hands. After the post about a Star Trek “test”, I started wondering if there could be any data to back it up and… well here we go:
The Next Generation
| Name    | Percentage of Lines |
| ------- | ------------------- |
| PICARD  | 20.16 |
| RIKER   | 11.64 |
| DATA    | 10.1  |
| LAFORGE | 6.93  |
| WORF    | 6.14  |
| TROI    | 5.4   |
| CRUSHER | 5.11  |
| WESLEY  | 2.32  |
DS9
| Name    | Percentage of Lines |
| ------- | ------------------- |
| SISKO   | 13.0 |
| KIRA    | 8.23 |
| BASHIR  | 7.79 |
| O’BRIEN | 7.31 |
| ODO     | 7.26 |
| QUARK   | 6.98 |
| DAX     | 5.73 |
| WORF    | 3.18 |
| JAKE    | 2.31 |
| GARAK   | 2.29 |
| NOG     | 2.01 |
| ROM     | 1.89 |
| DUKAT   | 1.76 |
| EZRI    | 1.53 |
Voyager
| Name     | Percentage of Lines |
| -------- | ------------------- |
| JANEWAY  | 17.7 |
| CHAKOTAY | 8.76 |
| EMH      | 8.34 |
| PARIS    | 7.63 |
| TUVOK    | 6.9  |
| KIM      | 6.57 |
| TORRES   | 6.45 |
| SEVEN    | 6.1  |
| NEELIX   | 4.99 |
| KES      | 2.06 |
Enterprise
| Name   | Percentage of Lines |
| ------ | ------------------- |
| ARCHER | 24.52 |
| T’POL  | 13.09 |
| TUCKER | 12.72 |
| REED   | 7.34  |
| PHLOX  | 5.71  |
| HOSHI  | 4.63  |
| TRAVIS | 3.83  |
| SHRAN  | 1.26  |
Discovery
Note: This is a limited dataset, as the source site only has transcripts for seasons 1, 2, and 4.
| Name     | Percentage of Lines |
| -------- | ------------------- |
| BURNHAM  | 22.92 |
| SARU     | 8.2   |
| BOOK     | 6.21  |
| STAMETS  | 5.44  |
| TILLY    | 5.17  |
| LORCA    | 4.99  |
| TARKA    | 3.32  |
| TYLER    | 3.18  |
| GEORGIOU | 2.96  |
| CULBER   | 2.83  |
| RILLAK   | 2.17  |
| DETMER   | 1.97  |
| OWOSEKUN | 1.79  |
| ADIRA    | 1.63  |
| COMPUTER | 1.61  |
| ZORA     | 1.6   |
| VANCE    | 1.07  |
| CORNWELL | 1.07  |
| SAREK    | 1.06  |
| T’RINA   | 1.02  |
If anyone is interested, here’s the (rather hurried) Python used:
    import re
    from collections import defaultdict
    from pathlib import Path

    # Only purely numeric file names (e.g. "101.htm") are treated as episodes.
    EPISODE_REGEX = re.compile(r"^\d+\.html?$")
    # A spoken line starts with an all-caps speaker name followed by ": ".
    LINE_REGEX = re.compile(r"^(?P<name>[A-Z']+): ")

    # Directory holding the downloaded transcripts, one subdirectory per series.
    EPISODES = Path("www.chakoteya.net")
    DISCO = EPISODES / "STDisco17"
    ENT = EPISODES / "Enterprise"
    TNG = EPISODES / "NextGen"
    DS9 = EPISODES / "DS9"
    VOY = EPISODES / "Voyager"


    class CharacterLines:
        def __init__(self, path: Path) -> None:
            self.path = path
            self.line_count = defaultdict(int)

        def collect(self) -> None:
            """Count spoken lines per character across every episode file."""
            for episode in self.path.glob("*.htm*"):
                if EPISODE_REGEX.match(episode.name):
                    # errors="replace" avoids crashing on undecodable bytes;
                    # matched names are plain ASCII, so counts are unaffected.
                    for line in episode.read_text(errors="replace").split("\n"):
                        if m := LINE_REGEX.match(line):
                            self.line_count[m.group("name")] += 1

        @property
        def as_percentages(self) -> dict[str, float]:
            """Share of lines per character, highest first, above 1% only."""
            total = sum(self.line_count.values())
            if total == 0:
                # No transcripts found; avoid dividing by zero.
                return {}
            r = {}
            for k, v in self.line_count.items():
                percentage = round(v * 100 / total, 2)
                if percentage > 1:
                    r[k] = percentage
            return dict(reversed(sorted(r.items(), key=lambda item: item[1])))

        def render(self) -> None:
            """Print the series name followed by a markdown table."""
            print(self.path.name)
            print("| Name             | Percentage of Lines |")
            print("| ---------------- | ------------------- |")
            for character, pct in self.as_percentages.items():
                print(f"| {character:16} | {pct} |")


    if __name__ == "__main__":
        for series in (TNG, DS9, VOY, ENT, DISCO):
            counter = CharacterLines(series)
            counter.collect()
            counter.render()
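If you want to poke at the matching logic without downloading any transcripts, here's a rough sketch that runs the same regex over a few lines of text. The sample lines are made up for illustration (they are not from a real transcript), and `count_speakers` and `to_percentages` are hypothetical helper names, not part of the script above.

```python
import re
from collections import Counter

# Same pattern as the script: an all-caps speaker name, then ": ".
LINE_REGEX = re.compile(r"^(?P<name>[A-Z']+): ")

# Made-up sample text, purely for illustration.
SAMPLE = """\
PICARD: Engage.
[Bridge]
RIKER: Aye, sir.
PICARD: Make it so.
DATA: Intriguing.
O'BRIEN: Energizing.
(The ship jumps to warp)
PICARD [OC]: Report.
"""


def count_speakers(text: str) -> Counter:
    """Count lines per speaker; lines that don't match are ignored."""
    counts = Counter()
    for line in text.split("\n"):
        if m := LINE_REGEX.match(line):
            counts[m.group("name")] += 1
    return counts


def to_percentages(counts: Counter) -> dict[str, float]:
    """Convert raw counts to percentages, most frequent speaker first."""
    total = sum(counts.values())
    return {name: round(n * 100 / total, 2) for name, n in counts.most_common()}


counts = count_speakers(SAMPLE)
print(to_percentages(counts))
```

One thing this shows: a line where something sits between the name and the colon (like the `PICARD [OC]: ` line in the sample) doesn't match and so isn't counted, which means the real percentages are probably a little off for anyone with lines in that shape.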
Thanks for sharing. I notice chakoteya.net has TOS scripts. Is there any reason they weren’t included in the analysis?
Honestly, it’s 'cause I forgot to include it! I’ll see if I can add it tonight. Check back in 24hrs :-)
Thanks for the update.
Poor Chekov has almost no lines, but Koenig was great as Bester on B5.