Explore the IMDB Dataset using Malloy

Malloy is a new data programming language. The following is an analysis of the IMDB dataset using Malloy.

To run these examples, press '.' from github.com (which will bring you into VSCode and github.dev) and when prompted, install the Malloy extension.

  • Basic IMDB Queries - Simple queries against the IMDB data model
  • Harvard CS50 - Harvard's introductory CS class has a section on SQL. Here are the questions and answers written in Malloy instead of SQL.
  • Stars and Jobs - Movie people work lots of jobs. Take a look at the top stars, what they work on, how long they have been working and what they do.
  • Directors - Who are the top directors? What did they direct? In each genre?
  • Steven Spielberg - What did he direct? When? Who does he commonly work with and on what?
  • Genre Analysis - Top movies, top people, popularity over time
  • Genre Matrix - Genre combinations tell you what to expect in a movie. What are the top movies in each combination pair?
  • Star Trek Dashboard - All the Star Trek Movies
  • Tom Hanks in Detail - A very complete dashboard about Tom Hanks

About the IMDb Dataset

IMDb makes data available for download via their website. Used with permission. For personal / educational use only.

Tables

People - All of the people who worked on a title, ranging from actors and directors to composers and crew.

Movies - The titles table (aliased to movies in the model) has been filtered to include only titles with more than 10,000 ratings. Note: This model frequently uses ratings.numVotes, the number of votes the title has received. While ratings.averageRating indicates how much people liked a movie, numVotes is a better proxy for the overall popularity of a title.

Principals - A mapping table between people and titles, principals links the cast/crew to the titles they worked on.
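
To make the relationship between the three tables concrete, here is a minimal Python sketch (not Malloy, and not the model's actual code). The rows are made-up toy data, and the key names tconst, nconst and category are assumptions borrowed from IMDb's public TSV column names. It shows how a principals row resolves to a person and a title, and how ranking by numVotes can differ from ranking by averageRating.

```python
from collections import defaultdict

# Hypothetical toy rows for illustration only; real IMDb values differ.
movies = {
    "tt1": {"title": "Movie A", "averageRating": 9.1, "numVotes": 12_000},
    "tt2": {"title": "Movie B", "averageRating": 7.4, "numVotes": 850_000},
}
people = {"nm1": "Person X", "nm2": "Person Y"}

# Each principals row links one person to one title in one job category.
principals = [
    {"tconst": "tt1", "nconst": "nm1", "category": "director"},
    {"tconst": "tt2", "nconst": "nm1", "category": "actor"},
    {"tconst": "tt2", "nconst": "nm2", "category": "composer"},
]


def credits_by_person(people, movies, principals):
    """Resolve the mapping table into person name -> [(title, category)]."""
    credits = defaultdict(list)
    for row in principals:
        name = people[row["nconst"]]
        title = movies[row["tconst"]]["title"]
        credits[name].append((title, row["category"]))
    return dict(credits)


def rank_titles(movies, key):
    """Return titles sorted by the given numeric field, highest first."""
    ordered = sorted(movies.values(), key=lambda m: m[key], reverse=True)
    return [m["title"] for m in ordered]


credits = credits_by_person(people, movies, principals)
by_rating = rank_titles(movies, "averageRating")  # how much voters liked it
by_votes = rank_titles(movies, "numVotes")        # how many people voted
```

In this toy data the two rankings disagree: the title with the higher average rating has far fewer votes, which is why the model leans on numVotes as its popularity measure.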

Included Files

imdb.malloy: This is the base model. It names the three tables used, defines relationships between them, and declares measures, dimensions and queries to be used for analysis.

imdb-queries.malloy: Extends the base model to add a large variety of potentially interesting queries.

imdb-queries2.malloy: Extends the base model to add a much more curated set of potentially interesting queries. The movies2 source includes full indexing for field/value search.

Building the parquet files

The original data is in TSV format. Run 'make' in this directory to download it and create a small subset of the IMDb data to explore.
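
The Makefile's actual steps are not reproduced here. As a rough illustration of what subsetting the TSV data could look like, the Python sketch below keeps only ratings rows above the 10,000-vote threshold mentioned in the Tables section. It assumes the three-column layout of IMDb's title.ratings.tsv (tconst, averageRating, numVotes), uses a made-up in-memory sample instead of the real download, and stops short of writing parquet, which would need a third-party library.

```python
import csv
import io

MIN_VOTES = 10_000  # threshold taken from the Tables section

# Hypothetical sample rows in the assumed title.ratings.tsv layout.
SAMPLE_TSV = (
    "tconst\taverageRating\tnumVotes\n"
    "tt0000001\t5.7\t2100\n"
    "tt0000002\t8.8\t2500000\n"
    "tt0000003\t7.0\t10000\n"
)


def popular_titles(tsv_file, min_votes=MIN_VOTES):
    """Return the rows whose numVotes is strictly greater than min_votes."""
    # QUOTE_NONE treats quote characters as ordinary text inside fields.
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [row for row in reader if int(row["numVotes"]) > min_votes]


kept = popular_titles(io.StringIO(SAMPLE_TSV))
kept_ids = [row["tconst"] for row in kept]
```

Passing an open file handle for a real ratings file in place of the StringIO sample would apply the same filter to the full download.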