Downloading and examining the MovieLens 25m data

Author

Douglas Bates

Published

July 5, 2022

Load the packages to be used

Code

using Arrow         # compact, cross-language data tables
using CSV
using DataAPI       # generics for common data representations
using DataFrames
using Dates
using Downloads
using Missings
using SparseArrays
using TypedTables
using ZipFile

Download and re-configure the data

The data are available as a .zip archive at

ml25murl = "https://files.grouplens.org/datasets/movielens/ml-25m.zip";

Download

Download the zip file, if necessary.

zipfile = last(splitpath(ml25murl))
isfile(zipfile) || Downloads.download(ml25murl, zipfile)

true

Read the CSV file from the archive and write as Arrow

isdir("data") || mkdir("data")
ratingsarrow = joinpath("data", "ratings.arrow")
if !isfile(ratingsarrow)
    Arrow.write(
        ratingsarrow,
        CSV.File(
            only(
                filter(f -> endswith(f.name, "ratings.csv"), ZipFile.Reader(zipfile).files),
            );
            delim = ',',
            header = 1,
            types = [Int32, Int32, Float32, Int32],
            pool = [false, true, true, false],
        );
        compress = :lz4,
    )
end;

As the CSV file is being read, the second and third columns, which are movieId and rating, are “pooled” so they will be written as DictEncoded in the arrow file.

Read the data from the Arrow file

tbl = Arrow.Table(ratingsarrow)

Arrow.Table with 25000095 rows, 4 columns, and schema:
 :userId     Int32
 :movieId    Int32
 :rating     Float32
 :timestamp  Int32

The schema indicates that movieId and rating are Int32 and Float32, respectively but they are stored as DictEncoded

typeof(tbl.movieId), typeof(tbl.rating)

(Arrow.DictEncoded{Int32, Int32, Arrow.Primitive{Int32, Vector{Int32}}}, Arrow.DictEncoded{Float32, Int8, Arrow.Primitive{Float32, Vector{Float32}}})

Usually DictEncoding (analogous to the factor type in R) is used to save space. For example, the rating column could be stored as 32-bit floats but there are only 10 distinct values

show(sort(unique(tbl.rating)))

Float32[0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]

and these are stored as 8-bit integer indices into a table of 10 Float32 values.

Having the movieId column dict-encoded is not for saving space but rather to be able to use the indices into the DictEncoding table, called the refarray, as row numbers when constructing a sparse matrix. There are many gaps in the original movie numbers and we want to use a contiguous set of row numbers.

extrema(tbl.movieId), length(unique(tbl.movieId)), extrema(DataAPI.refarray(tbl.movieId))

((1, 209171), 59047, (1, 59047))

The timestamp column is in the old-style Unix form of an Int32 representing the number of seconds since the epoch (midnight, January 1, 1970).

These values can be converted to the DateTime type with Dates.unix2datetime.

show(first(Dates.unix2datetime.(tbl.timestamp), 3))

[DateTime

("2006-05-17T15:34:04"), DateTime("2006-05-17T12:26:57"), DateTime("2006-05-17T12:27:08")]

The adjacency matrix of the bipartite graph

A cross-tabulation of users and movies they have rated can be considered as the adjacency matrix of a bipartite graph. If a user rates a movie there is an edge in the graph between the user and the movie. All edges are between a user and a movie which is what makes the graph bipartitie (two separate groups of vertices and the only edges are between the groups).

Whether that is particularly helpful in the cases I am considering I don’t know.

I will call this adjacency matrix A21 as it is in the [2,1] position of a matrix we usually call A (we’re quite creative about naming things).

A21 = sparse(DataAPI.refarray(tbl.movieId), tbl.userId, true)

59047×162541 SparseMatrixCSC{Bool, Int32} with 25000095 stored entries:
⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿
⢿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿
⠈⢿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿
⠀⠀⠈⠻⢿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿
⠀⠀⠀⠀⠀⠉⠛⠻⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠉⠙⠛⠛⠻⠿⠿⢿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠘⠛⠿⠿⠿⠿⠿⠿⢿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠉⠉⠉⠉⠉⠉

Base.summarysize(A21) / (2^20)   # total size of the sparse matrix in MB

119.82994365692139

Although the display of the matrix appears to be nearly dense, it is, in fact, quite sparse

nnz(A21) / length(A21)

0.002604839052572928

Diagonal blocks in A

There are two other non-redundant blocks in A, both of which are diagonal. The diagonal of A11 is simply the number of movies each user has rated and the diagonal of A22 is the number of ratings for each movie.

Number of movies rated per user

A11diag = vec(Int32.(sum(A21; dims=1)))
A11diag'

1×162541 adjoint(::Vector{Int32}) with eltype Int32:
 70  184  656  242  101  26  25  155  …  487  65  79  101  154  47  88  182

extrema(A11diag), sum(≤(100), A11diag)

((20, 32202), 99439)

Apparently someone rated over 32,000 movies; one hopes they are or were a professional movie critic. Nearly 100,000 of the 162,000 users represented here rated between 20 and 100 movies.

Number of users who rated each movie

A22diag = vec(Int32.(sum(A21; dims=2)))
A22diag'

1×59047 adjoint(::Vector{Int32}) with eltype Int32:
 79672  7058  6616  1269  10895  11935  …  1  1  1  1  1  1  1  1  1  1  1  1

Notice that there are many movies that have only a single rating in this subset of the overall ratings data.

extrema(A22diag), sum(isone, A22diag), sum(<(4), A22diag)

((1, 81491), 10298, 22854)

That is, over 1/6 of these 59,047 movies have only one rating and over 1/3 have fewer than 4 ratings.

With so little information these movies will not affect parameter estimates for models but they will cost us dearly in terms of the amount of storage required, which in the methods described below is quadratic in the number of movies. Reducing the number of movies by 1/3 cuts the amount of storage to 4/9 of the original amount.

Computational methods

I wish to obtain the Cholesky factor of the symmetric blocked matrix, \(\bbLambda'\bbA\bbLambda + \bbI\), where

\[ \bbA = \begin{bmatrix} \bbA_{1,1} & \bbA_{2,1}^\prime \\ \bbA_{2,1} & \bbA_{2,2} \end{bmatrix} \tag{1}\]

and \(\bbLambda\) is diagonal with non-negative diagonal elements. (In other cases \(\bbLambda\) can be block-diagonal with small triangular blocks, which is why the formula includes the transpose.)

In fact, in this case \(\bbLambda\) consists of two non-negative multiples of identity matrices, the first the same size as \(\bbA_{1,1}\) and the second the same size as \(\bbA_{2,2}\). To experiment with computational methods you can just set \(\bbLambda = \bbI\).

Suppose we write the lower-triangular Cholesky factor as

\[ \bbL = \begin{bmatrix} \bbL_{1,1} & \mathbf{0} \\ \bbL_{2,1} & \bbL_{2,2} \end{bmatrix} \tag{2}\]

Then \(\bbL_{1,1}\) and \(\bbL_{2,1}\) have the same non-zero patterns as \(\bbA_{1,1}\) and \(\bbA_{2,1}\) but \(\bbL_{2,2}\) will, in general, experience fill-in. I haven’t checked but I expect that \(\bbL_{2,2}\) will have nearly complete fill-in in this case, even after a fill-reducing permutation. At least I expect that there will be sufficient fill-in that it will be easiest to store it as a dense triangular matrix.

Storage problems

My problem is that I don’t have enough memory/swap space to store such a dense matrix.

As an Array{Float64, 2} it would require nearly 26 GB.

abs2(length(A22diag)) * 8 / (2^30)

25.976808436214924

Various options I have considered are:

Get access to a computer with more memory.
Use Float32 instead of Float64
Use mmap arrays
Use a packed symmetric storage, probably rectangular full packed
Trim the movies with fewer than 4 ratings
Some combination of the above

Suggestions?

Movie titles and genres

Create a single table containing the information on each movie, including the movie numbers in the imdb.com and themoviedb.org (tmdb) databases.

moviesarrow = joinpath("data", "movies.arrow")
if !isfile(moviesarrow)
    movies = CSV.read(
        only(filter(f -> endswith(f.name, "movies.csv"), ZipFile.Reader(zipfile).files)),
        DataFrame;
        types=[Int32, String, String],
        pool=[false, false, true]
    )
    replace!(movies.genres, "(no genres listed)" => "")
    title = String[]
    year = Union{Missing,Int16}[]
    yearregex = r"(.*)\((\d+)\)+\s*$"
    for t in movies.title
        m = match(yearregex, t)
        if isnothing(m)
            push!(title, strip(t))
            push!(year, missing)
        else
            push!(title, strip(first(m.captures)))
            push!(year, parse(Int16, last(m.captures)))
        end
    end
    movies.title = title
    movies.year = year
    disallowmissing!(
        leftjoin!(
            movies,
            CSV.read(
                only(filter(f -> endswith(f.name, "links.csv"), ZipFile.Reader(zipfile).files)),
                DataFrame;
                types=[Int32,Int32,Int32]
            );
            on=:movieId
        );
        error=false,
    )
    select!(movies, :title, :year, :genres, :tmdbId, :imdbId, :movieId)
    Arrow.write(moviesarrow, movies; compress=:lz4)
    Table(movies)
end

Table with 6 columns and 62423 rows:
      title                 year  genres                tmdbId  imdbId  movieId
    ┌──────────────────────────────────────────────────────────────────────────
 1  │ Toy Story             1995  Adventure|Animation…  862     114709  1
 2  │ Jumanji               1995  Adventure|Children|…  8844    113497  2
 3  │ Grumpier Old Men      1995  Comedy|Romance        15602   113228  3
 4  │ Waiting to Exhale     1995  Comedy|Drama|Romance  31357   114885  4
 5  │ Father of the Bride…  1995  Comedy                11862   113041  5
 6  │ Heat                  1995  Action|Crime|Thrill…  949     113277  6
 7  │ Sabrina               1995  Comedy|Romance        11860   114319  7
 8  │ Tom and Huck          1995  Adventure|Children    45325   112302  8
 9  │ Sudden Death          1995  Action                9091    114576  9
 10 │ GoldenEye             1995  Action|Adventure|Th…  710     113189  10
 11 │ American President,…  1995  Comedy|Drama|Romance  9087    112346  11
 12 │ Dracula: Dead and L…  1995  Comedy|Horror         12110   112896  12
 13 │ Balto                 1995  Adventure|Animation…  21032   112453  13
 14 │ Nixon                 1995  Drama                 10858   113987  14
 15 │ Cutthroat Island      1995  Action|Adventure|Ro…  1408    112760  15
 16 │ Casino                1995  Crime|Drama           524     112641  16
 17 │ Sense and Sensibili…  1995  Drama|Romance         4584    114388  17
 18 │ Four Rooms            1995  Comedy                5       113101  18
 19 │ Ace Ventura: When N…  1995  Comedy                9273    112281  19
 20 │ Money Train           1995  Action|Comedy|Crime…  11517   113845  20
 21 │ Get Shorty            1995  Comedy|Crime|Thrill…  8012    113161  21
 22 │ Copycat               1995  Crime|Drama|Horror|…  1710    112722  22
 23 │ Assassins             1995  Action|Crime|Thrill…  9691    112401  23
 ⋮  │          ⋮             ⋮             ⋮              ⋮       ⋮        ⋮

More information on individual movies can be retrieved from https://themoviedb.org/movie/<tmdbId>