It would seem that I have far too much time on my hands. After the post about a Star Trek “test”, I started wondering if there could be any data to back it up and… well here we go:

The Next Generation

Name	Percentage of Lines
PICARD	20.16
RIKER	11.64
DATA	10.1
LAFORGE	6.93
WORF	6.14
TROI	5.4
CRUSHER	5.11
WESLEY	2.32

DS9

Name	Percentage of Lines
SISKO	13.0
KIRA	8.23
BASHIR	7.79
O’BRIEN	7.31
ODO	7.26
QUARK	6.98
DAX	5.73
WORF	3.18
JAKE	2.31
GARAK	2.29
NOG	2.01
ROM	1.89
DUKAT	1.76
EZRI	1.53

Voyager

Name	Percentage of Lines
JANEWAY	17.7
CHAKOTAY	8.76
EMH	8.34
PARIS	7.63
TUVOK	6.9
KIM	6.57
TORRES	6.45
SEVEN	6.1
NEELIX	4.99
KES	2.06

Enterprise

Name	Percentage of Lines
ARCHER	24.52
T’POL	13.09
TUCKER	12.72
REED	7.34
PHLOX	5.71
HOSHI	4.63
TRAVIS	3.83
SHRAN	1.26

Discovery

Note: This is a limited dataset, as the source site only has transcripts for seasons 1, 2, and 4.

Name	Percentage of Lines
BURNHAM	22.92
SARU	8.2
BOOK	6.21
STAMETS	5.44
TILLY	5.17
LORCA	4.99
TARKA	3.32
TYLER	3.18
GEORGIOU	2.96
CULBER	2.83
RILLAK	2.17
DETMER	1.97
OWOSEKUN	1.79
ADIRA	1.63
COMPUTER	1.61
ZORA	1.6
VANCE	1.07
CORNWELL	1.07
SAREK	1.06
T’RINA	1.02

If anyone is interested, here’s the (rather hurried) Python used:

#!/usr/bin/env python3

#
# This script assumes that you've already downloaded all the episode lines from
# the fantastic chakoteya.net:
#
# wget --accept=html,htm --relative --wait=2 --include-directories=/STDisco17/ http://www.chakoteya.net/STDisco17/episodes.html -m
# wget --accept=html,htm --relative --wait=2 --include-directories=/Enterprise/ http://www.chakoteya.net/Enterprise/episodes.htm -m
# wget --accept=html,htm --relative --wait=2 --include-directories=/Voyager/ http://www.chakoteya.net/Voyager/episode_listing.htm -m
# wget --accept=html,htm --relative --wait=2 --include-directories=/DS9/ http://www.chakoteya.net/DS9/episodes.htm -m
# wget --accept=html,htm --relative --wait=2 --include-directories=/NextGen/ http://www.chakoteya.net/NextGen/episodes.htm -m
#
# Then you'll probably have to convert the following files to UTF-8 as they
# differ from the rest:
#
# * Voyager/709.htm
# * Voyager/515.htm
# * Voyager/416.htm
# * Enterprise/41.htm
#

import re
from collections import defaultdict
from pathlib import Path

EPISODE_REGEX = re.compile(r"^\d+\.html?$")
LINE_REGEX = re.compile(r"^(?P<name>[A-Z']+): ")

EPISODES = Path("www.chakoteya.net")
DISCO = EPISODES / "STDisco17"
ENT = EPISODES / "Enterprise"
TNG = EPISODES / "NextGen"
DS9 = EPISODES / "DS9"
VOY = EPISODES / "Voyager"


class CharacterLines:
    def __init__(self, path: Path) -> None:
        self.path = path
        self.line_count = defaultdict(int)

    def collect(self) -> None:
        for episode in self.path.glob("*.htm*"):
            if EPISODE_REGEX.match(episode.name):
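                # Decode explicitly as UTF-8 (see the conversion note above)
                # rather than relying on the platform's default encoding.
                for line in episode.read_text(encoding="utf-8").split("\n"):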
                    if m := LINE_REGEX.match(line):
                        self.line_count[m.group("name")] += 1

    @property
    def as_percentages(self) -> dict[str, float]:
        total = sum(self.line_count.values())
        r = {}
        for k, v in self.line_count.items():
            percentage = round(v * 100 / total, 2)
            if percentage > 1:
                r[k] = percentage
        return dict(sorted(r.items(), key=lambda item: item[1], reverse=True))

    def render(self) -> None:
        print(self.path.name)
        print("| Name             | Percentage of Lines |")
        print("| ---------------- | ------------------- |")
        for character, pct in self.as_percentages.items():
            print(f"| {character:16} | {pct} |")


if __name__ == "__main__":
    for series in (TNG, DS9, VOY, ENT, DISCO):
        counter = CharacterLines(series)
        counter.collect()
        counter.render()
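
The header comment says a handful of files have to be converted to UTF-8 by hand. As an alternative, here is a minimal sketch of a more tolerant reader that tries UTF-8 first and falls back to another encoding. The helper name `read_transcript` is hypothetical, and `cp1252` as the fallback is only a guess: the post does not say what encoding those files actually use.

```python
from pathlib import Path


def read_transcript(path: Path, fallback: str = "cp1252") -> str:
    """Return the file's text, trying UTF-8 first and then `fallback`.

    Hypothetical helper, not part of the original script. The fallback
    encoding is an assumption; swap in whatever the odd files really use.
    """
    raw = path.read_bytes()
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # errors="replace" keeps going even if a byte is undefined in the
        # fallback codec, substituting U+FFFD for it.
        return raw.decode(fallback, errors="replace")
```

With something like this in place, the `read_text` call in `collect()` could be swapped for `read_transcript(episode)`, which may remove the need for the manual conversion step.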

deegeese@sopuli.xyz, 7 months ago
Thanks for sharing. I notice chakoteya.net has TOS scripts. Is there any reason they weren’t included in the analysis?

Daniel Quinn@lemmy.ca (OP), 7 months ago
Honestly, it’s 'cause I forgot to include it! I’ll see if I can add it tonight. Check back in 24hrs :-)
- deegeese@sopuli.xyz, 7 months ago
  Thanks for the update.
  
  Poor Chekov has almost no lines, but Koenig was great as Bester on B5.

The number of lines for each character by percentage of the series
