= last(splitpath(ml25murl))
zipfile isfile(zipfile) || Downloads.download(ml25murl, zipfile)
true
Douglas Bates
July 5, 2022
\[ \newcommand\bbA{{\mathbf{A}}} \newcommand\bbI{{\mathbf{I}}} \newcommand\bbL{{\mathbf{L}}} \newcommand\bbLambda{{\boldsymbol{\Lambda}}} \]
Load the packages to be used
The data are available as a .zip
archive at
Download the zip file, if necessary.
isdir("data") || mkdir("data")
ratingsarrow = joinpath("data", "ratings.arrow")
if !isfile(ratingsarrow)
Arrow.write(
ratingsarrow,
CSV.File(
only(
filter(f -> endswith(f.name, "ratings.csv"), ZipFile.Reader(zipfile).files),
);
delim = ',',
header = 1,
types = [Int32, Int32, Float32, Int32],
pool = [false, true, true, false],
);
compress = :lz4,
)
end;
As the CSV file is being read, the second and third columns, which are movieId
and rating
, are “pooled” so they will be written as DictEncoded
in the arrow file.
Arrow.Table with 25000095 rows, 4 columns, and schema:
:userId Int32
:movieId Int32
:rating Float32
:timestamp Int32
The schema indicates that movieId
and rating
are Int32
and Float32
, respectively but they are stored as DictEncoded
(Arrow.DictEncoded{Int32, Int32, Arrow.Primitive{Int32, Vector{Int32}}}, Arrow.DictEncoded{Float32, Int8, Arrow.Primitive{Float32, Vector{Float32}}})
Usually DictEncoding
(analogous to the factor
type in R) is used to save space. For example, the rating
column could be stored as 32-bit floats but there are only 10 distinct values
and these are stored as 8-bit integer indices into a table of 10 Float32
values.
Having the movieId
column dict-encoded is not for saving space but rather to be able to use the indices into the DictEncoding table, called the refarray
, as row numbers when constructing a sparse matrix. There are many gaps in the original movie numbers and we want to use a contiguous set of row numbers.
((1, 209171), 59047, (1, 59047))
The timestamp
column is in the old-style Unix form of an Int32 representing the number of seconds since the epoch (midnight, January 1, 1970).
These values can be converted to the DateTime
type with Dates.unix2datetime
.
A cross-tabulation of users and movies they have rated can be considered as the adjacency matrix of a bipartite graph. If a user rates a movie there is an edge in the graph between the user and the movie. All edges are between a user and a movie which is what makes the graph bipartitie (two separate groups of vertices and the only edges are between the groups).
Whether that is particularly helpful in the cases I am considering I don’t know.
I will call this adjacency matrix A21
as it is in the [2,1] position of a matrix we usually call A
(we’re quite creative about naming things).
59047×162541 SparseMatrixCSC{Bool, Int32} with 25000095 stored entries:
⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿
⢿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿
⠈⢿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿
⠀⠀⠈⠻⢿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿
⠀⠀⠀⠀⠀⠉⠛⠻⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠉⠙⠛⠛⠻⠿⠿⢿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠘⠛⠿⠿⠿⠿⠿⠿⢿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠉⠉⠉⠉⠉⠉
Although the display of the matrix appears to be nearly dense, it is, in fact, quite sparse
There are two other non-redundant blocks in A
, both of which are diagonal. The diagonal of A11
is simply the number of movies each user has rated and the diagonal of A22
is the number of ratings for each movie.
1×162541 adjoint(::Vector{Int32}) with eltype Int32:
70 184 656 242 101 26 25 155 … 487 65 79 101 154 47 88 182
Apparently someone rated over 32,000 movies; one hopes they are or were a professional movie critic. Nearly 100,000 of the 162,000 users represented here rated between 20 and 100 movies.
1×59047 adjoint(::Vector{Int32}) with eltype Int32:
79672 7058 6616 1269 10895 11935 … 1 1 1 1 1 1 1 1 1 1 1 1
Notice that there are many movies that have only a single rating in this subset of the overall ratings data.
That is, over 1/6 of these 59,047 movies have only one rating and over 1/3 have fewer than 4 ratings.
With so little information these movies will not affect parameter estimates for models but they will cost us dearly in terms of the amount of storage required, which in the methods described below is quadratic in the number of movies. Reducing the number of movies by 1/3 cuts the amount of storage to 4/9 of the original amount.
I wish to obtain the Cholesky factor of the symmetric blocked matrix, \(\bbLambda'\bbA\bbLambda + \bbI\), where
\[ \bbA = \begin{bmatrix} \bbA_{1,1} & \bbA_{2,1}^\prime \\ \bbA_{2,1} & \bbA_{2,2} \end{bmatrix} \tag{1}\]
and \(\bbLambda\) is diagonal with non-negative diagonal elements. (In other cases \(\bbLambda\) can be block-diagonal with small triangular blocks, which is why the formula includes the transpose.)
In fact, in this case \(\bbLambda\) consists of two non-negative multiples of identity matrices, the first the same size as \(\bbA_{1,1}\) and the second the same size as \(\bbA_{2,2}\). To experiment with computational methods you can just set \(\bbLambda = \bbI\).
Suppose we write the lower-triangular Cholesky factor as
\[ \bbL = \begin{bmatrix} \bbL_{1,1} & \mathbf{0} \\ \bbL_{2,1} & \bbL_{2,2} \end{bmatrix} \tag{2}\]
Then \(\bbL_{1,1}\) and \(\bbL_{2,1}\) have the same non-zero patterns as \(\bbA_{1,1}\) and \(\bbA_{2,1}\) but \(\bbL_{2,2}\) will, in general, experience fill-in. I haven’t checked but I expect that \(\bbL_{2,2}\) will have nearly complete fill-in in this case, even after a fill-reducing permutation. At least I expect that there will be sufficient fill-in that it will be easiest to store it as a dense triangular matrix.
My problem is that I don’t have enough memory/swap space to store such a dense matrix.
As an Array{Float64, 2}
it would require nearly 26 GB.
Various options I have considered are:
Get access to a computer with more memory.
Use Float32
instead of Float64
Use mmap
arrays
Use a packed symmetric storage, probably rectangular full packed
Trim the movies with fewer than 4 ratings
Some combination of the above
Suggestions?
Create a single table containing the information on each movie, including the movie numbers in the imdb.com
and themoviedb.org
(tmdb) databases.
moviesarrow = joinpath("data", "movies.arrow")
if !isfile(moviesarrow)
movies = CSV.read(
only(filter(f -> endswith(f.name, "movies.csv"), ZipFile.Reader(zipfile).files)),
DataFrame;
types=[Int32, String, String],
pool=[false, false, true]
)
replace!(movies.genres, "(no genres listed)" => "")
title = String[]
year = Union{Missing,Int16}[]
yearregex = r"(.*)\((\d+)\)+\s*$"
for t in movies.title
m = match(yearregex, t)
if isnothing(m)
push!(title, strip(t))
push!(year, missing)
else
push!(title, strip(first(m.captures)))
push!(year, parse(Int16, last(m.captures)))
end
end
movies.title = title
movies.year = year
disallowmissing!(
leftjoin!(
movies,
CSV.read(
only(filter(f -> endswith(f.name, "links.csv"), ZipFile.Reader(zipfile).files)),
DataFrame;
types=[Int32,Int32,Int32]
);
on=:movieId
);
error=false,
)
select!(movies, :title, :year, :genres, :tmdbId, :imdbId, :movieId)
Arrow.write(moviesarrow, movies; compress=:lz4)
Table(movies)
end
Table with 6 columns and 62423 rows:
title year genres tmdbId imdbId movieId
┌──────────────────────────────────────────────────────────────────────────
1 │ Toy Story 1995 Adventure|Animation… 862 114709 1
2 │ Jumanji 1995 Adventure|Children|… 8844 113497 2
3 │ Grumpier Old Men 1995 Comedy|Romance 15602 113228 3
4 │ Waiting to Exhale 1995 Comedy|Drama|Romance 31357 114885 4
5 │ Father of the Bride… 1995 Comedy 11862 113041 5
6 │ Heat 1995 Action|Crime|Thrill… 949 113277 6
7 │ Sabrina 1995 Comedy|Romance 11860 114319 7
8 │ Tom and Huck 1995 Adventure|Children 45325 112302 8
9 │ Sudden Death 1995 Action 9091 114576 9
10 │ GoldenEye 1995 Action|Adventure|Th… 710 113189 10
11 │ American President,… 1995 Comedy|Drama|Romance 9087 112346 11
12 │ Dracula: Dead and L… 1995 Comedy|Horror 12110 112896 12
13 │ Balto 1995 Adventure|Animation… 21032 112453 13
14 │ Nixon 1995 Drama 10858 113987 14
15 │ Cutthroat Island 1995 Action|Adventure|Ro… 1408 112760 15
16 │ Casino 1995 Crime|Drama 524 112641 16
17 │ Sense and Sensibili… 1995 Drama|Romance 4584 114388 17
18 │ Four Rooms 1995 Comedy 5 113101 18
19 │ Ace Ventura: When N… 1995 Comedy 9273 112281 19
20 │ Money Train 1995 Action|Comedy|Crime… 11517 113845 20
21 │ Get Shorty 1995 Comedy|Crime|Thrill… 8012 113161 21
22 │ Copycat 1995 Crime|Drama|Horror|… 1710 112722 22
23 │ Assassins 1995 Action|Crime|Thrill… 9691 112401 23
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮
More information on individual movies can be retrieved from https://themoviedb.org/movie/<tmdbId>