Catholic Diocese of Homabay


PageRank

PageRank

Micro Tells Macro

Standing on the Shoulders of Giants

Predicting the Popularity of Micro-Videos via a Transductive Model

PageRank:
Standing on
the Shoulders
of Giants
Wholly new forms of encyclopedias will appear, ready made with
a mesh of associative trails running through them, ready to be
dropped into the Memex and there
amplified.
Bush’s prediction came true in 1989,
when Tim Berners-Lee proposed the
Hypertext Markup Language (HTML) to
keep track of experimental data at the
European Organization for Nuclear
Research (CERN). In the original farsighted proposal in which Berners-Lee
attempts to persuade CERN management to adopt the new global hypertext system we can read the following
paragraphb:
We should work toward a universal linked information system, in
which generality and portability
are more important than fancy
graphics techniques and complex
extra facilities. The aim would be
to allow a place to be found for any
information or reference which
one felt was important, and a way
of finding it afterwards. The result
should be sufficiently attractive
to use that the information contained would grow past a critical
threshold.
b http://www.w3.org/History/1989/proposal.html
PageRank3 is a Web page ranking technique that has
been a fundamental ingredient in the development
and success of the Google search engine. The
method is still one of the many signals Google uses to
determine which pages are most important.a
The main
idea behind PageRank is to determine the importance
of a Web page in terms of the importance assigned to
the pages hyperlinking to it. In fact, this thesis is not
new, and has been previously successfully exploited
in different contexts. We review the PageRank method
and link it to some renowned previous techniques
that we have found in the fields of Web information
retrieval, bibliometrics, sociometry, and econometrics.
In 1945 Vannevar Bush wrote a celebrated article
in The Atlantic Monthly entitled “As We May Think”
describing a futuristic device he called Memex.5
Bush wrote:
a http://www.google.com/corporate/tech.html
key insights
Great, pioneering research paved the
way for PageRank, the well-known
algorithm of Brin and Page that became
the basis for the Google search engine
and that is still one of the many signals
Google uses to determine which pages
are most important.
This article examines the mathematical
predecessors to PageRank in ostensibly
disparate disciplines such as Web
information retrieval, bibliometrics,
sociometry, and econometrics.
In tracing these early efforts we find the
fundamental idea behind PageRank—
its circular thesis that a Web page is
important if it is pointed to by other
important pages—was not entirely
new, and had basis in work going
back decades.
credit tk
ju
ne 2011
| vo
l. 54
|
no. 6
|
c
ommunications of
the acm 93
94 communications of the acm | june 2011 | vol. 54 | no. 6
review articles
on a page, the higher the PageRank
importance of the page.
A little more formally, the method
can be described as follows. Let us
denote by qi
the number of distinct outgoing (hyper)links of page i. Let H = (hi,j
)
be a square matrix of size equal to the
number n of Web pages such that hi,j
=
1/qi
if there exists a link from page i to
page j and hi,j
= 0 otherwise. The value
hi,j
can be interpreted as the probability that the random surfer moves from
page i to page j by clicking on one of the
distinct links of page i. The PageRank
πj
of page j is recursively defined as
or, in matrix notation, π = π H. Hence,
the PageRank of page j is the sum of the
PageRank scores of pages i linking to j,
weighted by the probability of going
from i to j. In words, the PageRank
thesis reads as follows:
A Web page is important if it is
pointed to by other important
pages.
There are in fact three distinct factors
that determine the PageRank of a page:
the number of links it receives; the link
propensity, that is, the number of outgoing links, of the linking pages; and
the PageRank of the linking pages. The
first factor is not surprising: the more
links a page receives, the more important it is perceived. Reasonably, the
link value depreciates proportionally to
the number of links given out by a page:
endorsements coming from parsimonious pages are worthier than those
emanated by spendthrift ones. Finally,
not all pages are created equal: links
from important pages are more valuable than those from obscure ones.
Unfortunately, this ideal model has
two problems that prevent the solution
of the system. The first one is due to
the presence of dangling nodes, that
are pages with no forward links.c
These
pages capture the random surfer
indefinitely. Notice that a dangling
node corresponds to a row in matrix
H with all entries equal to 0. To tackle
c The term dangling refers to the fact that many
dangling nodes are in fact pendent Web pages
found by the crawling spiders but whose links
have not been yet explored.
As we all know, the proposal was accepted and later implemented in a
mesh—this was the only name that Berners-Lee originally used to describe the
Web—of interconnected documents
that rapidly grew beyond the CERN
threshold, as Berners-Lee anticipated,
and became the World Wide Web.
Today, the Web is a huge, dynamic,
self-organized, and hyperlinked data
source, very different from traditional document collections that are
nonlinked, mostly static, centrally
collected and organized by specialists. These features make Web information retrieval quite different from
traditional information retrieval and
call for new search abilities, like automatic crawling and indexing of the
Web. Moreover, early search engines
ranked responses using only a content
score, which measures the similarity between the page and the query.
One simple example is just a count
of the number of times the query
words occur on the page, or perhaps
a weighted count with more weight on
title words. These traditional querydependent techniques suffered under
the gigantic size of the Web and the
death grip of spammers.
In 1998, Sergey Brin and Larry Page
revolutionized the field of Web information retrieval by introducing the
notion of an importance score, which
gauges the status of a page, independently from the user query, by analyzing the topology of the Web graph.
The method was implemented in the
famous PageRank algorithm and both
the traditional content score and the
new importance score were efficiently
combined in a new search engine
named Google.
Ranking Web Pages using
PageRank
We briefly recall how the PageRank
method works keeping the mathematical machinery to the minimum.
Interested readers can more thoroughly
investigate the topic in a recent book
by Langville and Meyer that elegantly
describes the science of search engine
rankings in a rigorous yet playful style.16
We start by providing an intuitive
interpretation of PageRank in terms of
random walks on graphs.22 The Web is
viewed as a directed graph of pages connected by hyperlinks. A random surfer
starts from an arbitrary page and simply keeps clicking on successive links at
random, bouncing from page to page.
The PageRank value of a page corresponds to the relative frequency the random surfer visits that page, assuming
that the surfer goes on infinitely. The
more time spent by the random surfer
Figure 1. A PageRank instance with solution. Each node is labeled with its PageRank score.
Scores have been normalized to sum to 100. We assumed α = 0.85.
A
3.3
B
38.4
C
34.3
D
3.9
E
8.1
F
3.9
G
1.6
H
1.6
I
1.6
L
1.6
M
1.6
review articles
june 2011 | vol. 54 | no. 6 | communications of the acm 95
the problem of dangling nodes, the
corresponding rows in H are replaced
by the uniform probability vector
u = 1/ne, where e is a vector of length
n with all components equal to 1.
Alternatively, one may use any fixed
probability vector in place of u. This
means the random surfer escapes
from the dangling page by jumping to
a randomly chosen page. We call S the
resulting matrix.
The second problem with the ideal
model is that the surfer can get trapped
into a bucket of the Web graph, which is
a reachable strongly connected component without outgoing edges toward the
rest of the graph. The solution proposed
by Brin and Page is to replace matrix S
by the Google matrix
G = aS + (1 − a) E
where E is the teleportation matrix
with identical rows each equal to the
uniform probability vector u, and a
is a free parameter of the algorithm
often called the damping factor.
Alternatively, a fixed personalization
probability vector v can be used in
place on u. In particular, the personalization vector can be exploited to
bias the result of the method toward
certain topics. The interpretation of
the new system is that, with probability a the random surfer moves
forward by following links, and, with
the complementary probability 1 −a
the surfer gets bored of following
links and enters a new destination in
the browser’s URL line, possibly unrelated to the current page. The surfer
is hence teleported, like a Star Trek
character, to that page, even if there
exists no link connecting the current
and the destination pages in the Web
universe. The inventors of PageRank
propose to set the damping factor
a = 0.85, meaning that after about five
link clicks the random surfer chooses
a random page.
The PageRank vector is then defined
as the solution of equation:
π = πG (1)
An example is provided in Figure
1. Node A is a dangling node, while
nodes B and C form a bucket. Notice
the dynamics of the method: page C
receives just one link but from the most
important page B; its importance is
much higher than that of page E, which
receives many more links, but from
anonymous pages. Pages G, H, I, L, and
M do not receive endorsements; their
scores correspond to the minimum
amount of status of each page.
Typically, the normalization condition is also added. In this case,
Equation 1 becomes π=aπS+(1−a)u.
The latter distinguishes two factors
contributing to the PageRank vector:
an endogenous factor equal to πS, which
takes into consideration the real topology of the Web graph, and an exogenous
factor equal to the uniform probability
vector u, which can be interpreted as a
minimal amount of status assigned to
each page independently of the hyperlink graph. The parameter a balances
between these two factors.
Computing the PageRank Vector
Does Equation 1 have a solution?
Is the solution unique? Can we efficiently compute it? The success of the
PageRank method rests on the answers
to these queries. Luckily, all these
questions have nice answers.
Thanks to the dangling nodes
patch, matrix S is a stochastic matrix,d
and clearly the teleportation matrix E
is also stochastic. It follows that G is
stochastic as well, since it is defined
as a convex combination of stochastic matrices S and E. It is easy to show
that, if G is stochastic, Equation 1 has
always at least one solution. Hence, we
have got at least one PageRank vector.
Having two independent PageRank
vectors, however, would be already
too much: which one should we use to
rank Web pages? Here, a fundamental
result of algebra comes to the rescue:
Perron–Frobenius theorem.
7, 24 It states
that, if A is an irreduciblee
nonnegative square matrix, then there exists a
unique vector x, called the Perron vector, such that xA = rx, x > 0, and ,
where r is the maximum eigenvalue
of A in absolute value, that algebraists
call the spectral radius of A. The Perron
vector is the left dominant eigenvector
of A, that is, the left eigenvector assod This simply means that all rows sum up to 1.
e A matrix is irreducible if and only if the directed
graph associated with it is strongly connected,
that is, for every pair i and j of graph nodes there
are paths leading from i to j and from j to i.
ciated with the largest eigenvalue in
magnitude.
The matrix S is most likely reducible,
since experiments have shown that the
Web has a bow-tie structure fragmented
into four main continents that are not
mutually reachable, as first observed by
Broder et al.4
Thanks to the teleportation trick, however, the graph of matrix
G is strongly connected. Hence G is irreducible and Perron–Frobenius theorem
applies.f
Therefore, a positive PageRank
vector exists and is furthermore unique.
Interestingly, we can arrive at the
same result using Markov theory.
20 The
above described random walk on the Web
graph, modified with the teleportation
jumps, naturally induces a finite-state
Markov chain, whose transition matrix
is the stochastic matrix G. Since G is
irreducible, the chain has a unique stationary distribution corresponding to
the PageRank vector.
A last crucial question remains: can
we efficiently compute the PageRank
vector? The success of PageRank is
largely due to the existence of a fast
method to compute its values: the
power method, a simple iteration
method to find the dominant eigenpair
f Since G is stochastic, its spectral radius is 1.
Table 1. PageRank history
Year Author Contribution
1906 Markov Markov theory20
1907 Perron Perron theorem24
1912 Frobenius Perron–Frobenius
theorem7
1929 von Mises and
PollaczekGeiringer
Power method31
1941 Leontief Econometric
model18
1949 Seeley Sociometric
model29
1952 Wei Sport ranking
model32
1953 Katz Sociometric
model11
1965 Hubbell Sociometric
model10
1976 Pinski
and Narin
Bibliometric
model26
1998 Kleinberg HITS14
1998 Brin and Page PageRank3
96 communications of the acm | june 2011 | vol. 54 | no. 6
review articles
and the code is freely available at the
author’s Web page.
Hubs and Authorities on the Web
Hypertext Induced Topic Search
(HITS) is a Web page ranking
method proposed by Kleinberg.14, 15
The connections between HITS and
PageRank are striking. Despite the
close conceptual, temporal, and
even geographical proximity of the
two approaches, it appears that HITS
and PageRank have been developed
independently. In fact, both papers
presenting PageRank3
and HITS15 are
today citational blockbusters: the
PageRank article collected 6,167 citations, while the HITS paper has been
cited 4,617 times.h
HITS thinks of Web pages as authorities and hubs. HITS circular thesis
reads as follows:
Good authorities are pages that
are pointed to by good hubs and
good hubs are pages that point to
good authorities.
Let L = (l
i, j
) be the adjacency matrix
of the Web graph, i.e., l
i, j
= 1 if page i
links to page j and l
i, j
= 0 otherwise. We
denote with LT the transpose of L. HITS
defines a pair of recursive equations as
follows, where x is the authority vector
containing the authority scores and
y is the hub vector containing the hub
scores:
x(k)
= LTy(k−1)
(2) y(k)
= Lx(k)
where k ≥ 1 and y(0) = e, the vector of all
ones. The first equation tells us that
authoritative pages are those pointed
to by good hub pages, while the second
equation claims that good hubs are
pages that point to authoritative pages.
Notice that Equation 2 is equivalent to
x(k)
= LTLx(k−1)
(3) y(k)
= LLTy(k−1)
It follows that the authority vector x is
the dominant right eigenvector of the
authority matrix A=LT
L, and the hub
vector y is the dominant right eigenvector of the hub matrix H=LLT
. This is very
similar to the PageRank method, except
h Source: Google Scholar on February 5, 2010.
of a matrix developed by von Mises and
Pollaczek-Geiringer.31 It works as follows on the Google matrix G. Let π (0) =
u = 1/ne. Repeatedly compute π(k+1) =
π(k) G until ||π(k+1) − π(k)
|| < , where ||·||
measures the distance between the
two successive PageRank vectors and
 is the desired precision.
The convergence rate of the power
method is approximately the rate at
which ak
approaches to 0: the closer a to
unity, the lower the convergence speed
of the power method. If, for instance, a =
0.85, as many as 43 iterations are sufficient to gain 3 digits of accuracy, and
142 iterations are enough for 10 digits of accuracy. Notice that the power
method applied to matrix G can be
easily expressed in terms of matrix
H, which, unlike G, is a very sparse
matrix that can be stored using a linear
amount of memory with respect to the
size of the Web.
Standing on the
Shoulders of Giants
Dwarfs standing on the shoulders of
giants is a Western metaphor meaning “One who develops future intellectual pursuits by understanding the
research and works created by notable
thinkers of the past.”g
The metaphor
was famously uttered by Isaac Newton:
“If I have seen a little further it is by
standing on the shoulders of Giants.”
Moreover, “Stand on the shoulders of
giants” is Google Scholar’s motto: “the
phrase is our acknowledgement that
much of scholarly research involves
building on what others have already
discovered.”
There are many giants upon whose
shoulders PageRank firmly stands:
Markov,20 Perron,24 Frobenius,7
von
Mises and Pollaczek-Geiringer31 provided at the beginning of the 1900s the
necessary mathematical machinery to
investigate and effectively solve the
PageRank problem. Moreover, the circular PageRank thesis has been previously exploited in different contexts,
including Web information retrieval,
bibliometrics, sociometry, and econometrics. In the following, we review
these contributions and link them to
the PageRank method. Table 1 shows a
brief summary of PageRank history. All
the ranking techniques surveyed in this
paper have been implemented in R27
g From the Wikipedia page for Standing on the
shoulders of giants.
Figure 2. A HITS instance with solution (compare with PageRank scores in Figure 1). Each
node is labeled with its authority (top) and hub (bottom) scores. Scores have been normalized
to sum to 100. The dominant eigenvalue for both authority and hub matrices is 10.7.
A
4.7
0.0
B
45.9
0.0
C
0.0
8.1
D
5.3
8.9
E
38.9
9.9
F
5.3
14.9
G
0.0
14.9
H
0.0
14.9
I
0.0
14.9
L
0.0
6.8
M
0.0
6.8
review articles
june 2011 | vol. 54 | no. 6 | communications of the acm 97
The connections
between HITS
and PageRank are
striking. Despite the
close conceptual,
temporal, and
even geographical
proximity of
the two approaches,
it appears that HITS
and PageRank have
been developed
independently.
the use of the authority and hub matrices instead of the Google matrix.
To compute the dominant eigenpair (eigenvector and eigenvalue) of the
authority matrix we can again exploit
the power method as follows: let x(0)=e.
Repeatedly compute x(k)
=Ax(k−1) and
normalize x(k)
=x(k)
/m(x(k)
), where m(x(k)
)
is the signed component of maximal
magnitude, until the desired precision
is achieved. It follows that x(k)
converges
to the dominant eigenvector x (the
authority vector) and m(x(k)
) converges
to the dominant eigenvalue (the spectral radius, which is not necessarily 1).
The hub vector y is then given by y=Lx.
While the convergence of the power
method is guaranteed, the computed
solution is not necessarily unique, since
the authority and hub matrices are not
necessarily irreducible. A modification similar to the teleportation trick
used for the PageRank method can be
applied to HITS to recover the uniqueness of the solution.35
An example of HITS is given in
Figure 2. We stress the difference
among importance, as computed by
PageRank, and authority and hubness,
as computed by HITS. Page B is both
important and authoritative, but it is
not a good hub. Page C is important
but by no means authoritative. Pages
G, H, and I are neither important nor
authoritative, but they are the best
hubs of the network, since they point to
good authorities only. Notice that the
hub score of B is 0 although B has one
outgoing edge; unfortunately for B, the
only page C linked by B has no authority.
Similarly, C has no authority because
it is pointed to only by B, whose hub
score is zero. This shows the difference
between indegree and authority, as well
as between outdegree and hubness.
Finally, we observe that nodes with null
authority scores (respectively, null hub
scores) correspond to isolated nodes in
the graph whose adjacency matrix is the
authority matrix A (respectively, the hub
matrix H).
An advantage of HITS with respect to
PageRank is that it provides two scores
at the price of one. The user is hence
provided with two rankings: the most
authoritative pages about the research
topic, which can be exploited to investigate in depth a research subject, and the
most hubby pages, which correspond
to portal pages linking to the research
topic from which a broad search can be
started. A disadvantage of HITS is the
higher susceptibility of the method to
spamming: while it is difficult to add
incoming links to our favorite page, the
addition of outgoing links is much easier. This leads to the possibility of purposely inflating the hub score of a page,
indirectly influencing also the authority
scores of the pointed pages.
HITS is related to a matrix factorization technique known as singular
value decomposition.
6
According to
this technique, the adjacency matrix
L can be written as the matrix product
USVT, where the columns of U, called
left-singular vectors, are the orthonormal eigenvectors of the hub matrix LLT,
the columns of V, called right-singular
vectors, are the orthonormal eigenvectors of the authority matrix LTL, and S
is a diagonal matrix whose diagonal
elements, called singular values, correspond to the square roots of the
eigenvalues of the hub matrix (or,
equivalently, of the authority matrix).
It follows that the HITS authority and
hub vectors correspond, respectively,
to the right- and left-singular vectors
associated with the highest singular
value of L.
HITS also has a connection to bibliometrics.6
Two typical bibliometric
methods to identify similar publications are co-citation, in which publications are related when they are cited by
the same papers, and co-reference, in
which papers are related when they cite
the same papers. The authority matrix is
a co-citation matrix and the hub matrix
is a co-reference matrix. Indeed, since
A = LT
L, the element ai,j
of the authority
matrix contains the number of times
pages i and j are both linked by a third
page (ai, j
is the number of inlinks of i).
Moreover, since H = LLT
, the element hi, j
of the hub matrix contains the number
of times both pages i and j link to a third
page (hi,i
is the number of outlinks of i).
Hence, good authorities are pages that
are frequently co-cited with other good
authorities, and good hubs are pages
that frequently co-reference with other
good hubs.
A following algorithm that incorporates ideas from both PageRank and
HITS is SALSA17: like HITS, SALSA computes both authority and hub scores,
and like PageRank, these scores are
obtained from Markov chains.
98 communications of the acm | june 2011 | vol. 54 | no. 6
review articles
reputed journals are weighted as those
from obscure journals. In 1976 Gabriel
Pinski and Francis Narin developed an
innovative journal ranking method.26
The method measures the influence
of a journal in terms of the influence
of the citing journals. The Pinski and
Narin thesis is:
A journal is influential if it is cited
by other influential journals.
This is the same circular thesis of the
PageRank method. Given a source
time window T1
and a previous target
time window T2
, the journal citation
system can be viewed as a weighted
directed graph in which nodes are journals and there is an edge from journal i
to journal j if there is some article published in i during T1
that cites an article
published in j during T2
. The edge is
weighted with the number ci, j
of such
citations from i to j. Let be
the total number of cited references of
journal i.
In the method described by Pinski
and Narin, a citation matrix H = (hi, j
)
is constructed such that hi, j
= ci, j
/cj
. The
coefficient hi,j is the amount of citations received by journal j from journal
i per reference given out by journal j.
For each journal an influence score is
determined which measures the relative journal performance per given
reference. The influence score πj
of
journal j is defined as
or, in matrix notation:
π = πΗ (4)
Hence, journals j with a large total
influence πj
cj
are those that receive significant endorsements from influential journals. Notice that the influence
per reference score πj
of a journal j is a
size-independent measure, since the
formula normalizes by the number of
cited references cj
contained in articles
of the journal, which is an estimation
of the size of the journal. Moreover, the
normalization neutralizes the effect
of journal self-citations, which are
citations between articles in the same
journal. These citations are indeed
counted both at the numerator and at
Bibliometrics
Bibliometrics, also known as scientometrics, is the quantitative study of
the process of scholarly publication
of research achievements. The most
mundane aspect of this branch of
information and library science is the
design and application of bibliometric
indicators to determine the influence
of bibliometric units like scholars and
academic journals. The Impact Factor
is, undoubtedly, the most popular
and controversial journal bibliometric
indicator available at the moment. It is
defined, for a given journal and a fixed
year, as the mean number of citations
in the year to papers published in the
two previous years. It has been proposed in 1963 by Eugene Garfield, the
founder of the Institute for Scientific
Information (ISI), working together
with Irv Sher.8
Journal Impact Factors
are currently published in the popular
Journal Citation Reports by ThomsonReuters, the new owner of the ISI.
The Impact Factor does not take
into account the importance of the
citing journals: citations from highly
Figure 3. An instance with solution of the journal ranking method proposed by Pinski and
Narin. Nodes are labeled with influence scores and edges with the citation flow between
journals. Scores have been normalized to sum to 100.
A
28.0
B
44.7
8
5
1
D
14.9 5
C
12.4 8
10
5
Figure 4. An example of the Katz model using two attenuation factors: a = 0.9 and a = 0.1
(the spectral radius of the adjacency matrix L is 1). Each node is labeled with the Katz score
corresponding to a = 0.9 (top) and a = 0.1 (bottom). Scores have been normalized to sum to 100.
june 2011 | vol. 54 | no. 6 | communications of the acm 99
the denominator of the influence score
formula. This avoids overinflating
journals that engage in the practice of
opportunistic self-citations.
It can be proved that the spectral
radius of matrix H is 1, hence the influence score vector corresponds to the
dominant eigenvector of H.
9
In principle, the uniqueness of the solution and
the convergence of the power method
to it are not guaranteed. Nevertheless,
both properties are not difficult to
obtain in real cases. If the citation
graph is strongly connected, then the
solution is unique. When journals
belong to the same research field,
this condition is typically satisfied.
Moreover, if there exists a self-loop in
the graph, that is an article that cites
an article in the same journal, then the
power method converges.
Figure 3 provides an example of the
Pinski and Narin method. Notice that
the graph is strongly connected and
has a self-loop, hence the solution is
unique and can be computed with the
power method. Both journals A and
C receive the same number of citations and give out the same number
of references. Nevertheless, the influence of A is bigger, since it is cited by
a more influential journal (B instead
of D). Furthermore, A and D receive
the same number of citations from
the same journals, but D is larger than
A, since it contains more references,
hence the influence of A is higher.
Similar recursive methods have been
independently proposed by Liebowitz
and Palmer19 and Palacios-Huerta
and Volij23 in the context of ranking
of economics journals. Recently, various PageRank-inspired bibliometric
indicators to evaluate the importance
of journals using the academic citation network have been proposed and
extensively tested: journal PageRank,2
Eigenfactor,34 and SCImago.28
Sociometry
Sociometry, the quantitative study
of social relationships, contains
remarkably old PageRank predecessors. Sociologists were the first to use
the network approach to investigate the
properties of groups of people related in
some way. They devised measures like
indegree, closeness, betweeness, as well
as eigenvector centrality, which are still
used today in modern (not necessarily
social) network analysis.21 In particular,
eigenvector centrality uses the same central ingredient of PageRank applied to a
social network:
A person is prestigious if he is
endorsed by prestigious people.
John R. Seeley in 1949 is probably the
first in this context to use the circular
argument of PageRank.29 Seeley reasons in terms of social relationships
among children: each child chooses
other children in a social group with
a non-negative strength. The author
notices that the total choice strengths
received by each children is inadequate as an index of popularity, since it
does not consider the popularity of the
chooser. Hence, he proposes to define
the popularity of a child as a function
of the popularity of those children who
chose the child, and the popularity of
the choosers as a function of the popularity of those who chose them and so
in an “indefinitely repeated reflection.”
Seeley exposes the problem in terms of
linear equations and uses Cramer’s
rule to solve the linear system. He does
not discuss the issue of uniqueness.
Another model is proposed in
1953 by Leo Katz.11 Katz views a social
network as a directed graph where
nodes are people and person i is connected by an edge to person j if i
chooses, or endorses, j. The status of
member i is defined as the number of
weighted paths reaching j in the network, a generalization of the indegree
measure. Long paths are weighted less
than short ones, since endorsements
devalue over long chains. Notice that
this method indirectly takes account
of who endorses as well as how many
endorse an individual: if a node i
points to a node j and i is reached by
many paths, then the paths leading to
i arrive also at j in one additional step.
Katz builds an adjacency matrix
L = (l
i,j
) such that l
i,j
= 1 if person i chooses
person j and l
i, j
= 0 otherwise. He defines
a matrix , where a is an
attenuation constant. Notice that the (i, j)
component of Lk
is the number of paths
of length k from i to j, and this number
is attenuated by ak
in the computation
of W. Hence, the (i,j) component of the
limit matrix W is the weighted number
of arbitrary paths from i to j. Finally, the
status of member , that is,
the number of weighted paths reaching j. If the attenuation factor a<1/r(L),
with r(L) the spectral radius of L, then
the above series for W converges.
Figure 4 illustrates the method with
an example. Notice the important role
of the attenuation factor: when it is
large (close to 1/r(L)), long paths are
devalued smoothly, and Katz scores
are strongly correlated with PageRank
ones. In the shown example, PageRank
and Katz methods provide the same
ranking of nodes when the attenuation
factor is 0.9. On the other hand, if the
attenuation factor is small (close to 0),
then the contribution given by paths
longer than 1 rapidly declines, and thus
Katz scores converge to indegrees, the
number of incoming links of nodes.
In the example, when the attenuation factor drops to 0.1, nodes C and E
switch their positions in the ranking:
node E, which receives many short
paths, significantly increases its score,
while node C, which is the destination
of just one short path and many (devalued) long ones, significantly decreases
its score.
In 1965 Charles H. Hubbell generalizes the proposal of Katz.10 Given
a set of members of a social context, Hubbell defines a matrix W=(wi, j
)
Figure 5. An instance of the Hubbell model with solution: each node is labeled with its
prestige score and each edge is labeled with the endorsement strength between the
connected members; negative strength is highlighted with dashed edges. The minimal
amount of status has been fixed to 0.2 for all members.
Alice
0.49
Bob
0.41
0.4 David
-0.9
-0.4
0.4
-0.4
Charles
0.2
0.6
0.1
-0.1
0.8
100 communications of the acm | june 2011 | vol. 54 | no. 6
review articles
economy of a country may be divided
into any desired number of sectors,
called industries, each consisting of
firms producing a similar product.
Each industry requires certain inputs
in order to produce a unit of its own
product, and sells its products to other
industries to meet their ingredient
requirements. The aim is to find prices
for the unit of product produced by
each industry that guarantee the reproducibility of the economy, which holds
when each sector balances the costs
for its inputs with the revenues of its
outputs. In 1973, Leontief earned the
Nobel Prize in economics for his work
on the input–output model. An example is provided in Table 2.
Let qi,j
denote the quantity produced
by the ith industry and used by the jth
industry, and qi
be the total quantity
produced by sector i, that is, .
Let A = (ai,j
) be such that ai, j
= qi, j
/qj
; each
coefficient ai,j
represents the amount
of product (produced by industry) i consumed by industry j that is necessary to
produce a unit of product j. Let πj
be the
price for the unit of product produced
by each industry j. The reproducibility
of the economy holds when each sector
j balances the costs for its inputs with
the revenues of its outputs, that is:
By dividing each balance equation by qj
we have
or, in matrix notation,
π = πA (6)
such that wi, j
is the strength at which i
endorses j. Interestingly, these weights
can be arbitrary, and in particular, they
can be negative. The prestige of a member is recursively defined in terms of
the prestige of the endorsers and takes
account of the endorsement strengths:
π = πW + v (5)
The term v is an exogenous vector such
that vi
is a minimal amount of status
assigned to i from outside the system.
The original aspects of the method
are the presence of an exogenous initial input and the possibility of giving
negative endorsements. A consequence
of negative endorsements is that the
status of an actor can also be negative. An actor who receives a positive
(respectively, negative) judgment from
a member of positive status increases
(respectively, decreases) his prestige.
On the other hand, and interestingly,
receiving a positive judgment from a
member of negative status makes a negative contribution to the prestige of the
endorsed member (if you are endorsed
by some person affiliated to the Mafia
your reputation might drop indeed).
Moreover, receiving a negative endorsement from a member of negative status
makes a positive contribution to the
prestige of the endorsed person (if the
same Mafioso opposes you, then your
reputation might raise).
Figure 5 shows an example for the
Hubbell model. Notice that Charles
does not receive any endorsement and
hence has the minimal amount of status given by default to each member.
David receives only negative judgments;
interestingly, the fact that he has a positive self-opinion further decreases his
status. A better strategy for him, knowing in advance of his negative status,
would be to negatively judge himself,
acknowledging the negative judgment
given by the other members.
Equation 5 is equivalent to π(I−W) =
v, where I is the identity matrix, that is
. The series converge if and only if the spectral radius
of W is less than 1. It is now clear that
the Hubbell model is a generalization
of the Katz model to general matrices
that adds an initial exogenous input v.
Indeed, the Katz equation for social status is , where e is a vector of
all ones. In an unpublished note Vigna
traces the history of the mathematics
of spectral ranking and shows there is
a reduction from the path summation
formulation of Hubbell–Katz to the
eigenvector formulation with teleportation of PageRank and vice versa.30 In the
mapping the attenuation constant is
the counterpart of the PageRank damping factor, and the exogenous vector
corresponds to the PageRank personalization vector. The interpretation of
PageRank as a sum of weighted paths is
also investigated by Baeza-Yates et al.1
Spectral ranking methods have also
been exploited to rank sport teams in
competitions that involve teams playing in pairs.13,32 The underlying idea is
that a team is strong if it won against
other strong teams. Much of the art of
the sport ranking problem is how to
define the matrix entries ai, j
expressing
how much team i is better than team j
(e.g., we could pick ai, j
to be 1 if j beats
i, 0.5 if the game ended in a tie, and 0
otherwise).12
Econometrics
We conclude with a succinct description of the input–output model developed in 1941 by Nobel Prize winner
Wassily W. Leontief in the field of
econometrics—the quantitative study
of economic principles.18 According to
the Leontief input–output model, the
Table 2. An input–output table for an economy with three sectors with the balance solution.
Agriculture Industry Family Total Price Revenue
Agriculture 7.5 6 16.5 30 20 600
Industry 14 6 30 50 15 750
Family 80 180 40 300 3 900
Cost 600 750 900
Each row shows the output of a sector to other sectors of the economy. Each column shows the inputs received by a sector from other sectors. For each sector we also show
total quantity produced, equilibrium unitary price, total cost, and total revenue. Notice that each sector balances costs and revenues.
review articles
june 2011 | vol. 54 | no. 6 | communications of the acm 101
Hence, highly remunerated industries
(industries j with high total revenue πj
qj
)
are those that receive substantial inputs
from highly remunerated industries,
a circularity that closely resembles the
PageRank thesis.25 With the same argument used by Geller9
for the Pinski and
Narin bibliometric model we can show
that the spectral radius of matrix A is 1,
thus the equilibrium price vector π is
the dominant eigenvector of matrix A.
Such a solution always exists, although
it might not be unique, unless A is irreducible. Notice the striking similarity
of the Leontief closed model with that
proposed by Pinski and Narin. An open
Leontief model adds an exogenous
demand and creates a surplus of revenue (profit). It is described by the
equation π = π A + v, where v is the profit
vector. Hubbell himself observes the
similarity between his model and the
Leontief open model.10
It might seem disputable to juxtapose PageRank and Leontief methods.
To be sure, the original motivation
of Leontief work was to give a formal
method to find equilibrium prices for
the reproducibility of the economy
and to use the method to estimate
the impact on the entire economy
of the change in demand in any sectors of the economy. Leontief, to the
best of our limited knowledge, was
not motivated by an industry ranking
problem. On the other hand, the motivation underlying the other methods
described in this paper is the ranking of
a set of homogeneous entities. Despite
the original motivations, however,
there are more than coincidental similarities between the Leontief open and
closed models and the other ranking
methods described in this paper. These
connections motivated the discussion
of the Leontief contribution, which is
probably the least known among the
surveyed methods within the computing community.
Conclusion
The classic notion of quality of information is related to the judgment
given by few field experts. PageRank
introduced an original notion of quality of information found on the Web:
the collective intelligence of the Web,
formed by the opinions of the millions
of people that populate this universe, is
exploited to determine the importance,
9. Geller, N.L. On the citation influence methodology of
Pinski and Narin. Inf. Process. Manage. 14, 2 (1978),
93–95.
10. Hubbell, C.H. An input–output approach to clique
identification. Sociometry 28 (1965), 377–399.
11. Katz, L. A new status index derived from sociometric
analysis. Psychometrika 18 (1953), 39–43.
12. Keener, J.P. The Perron–Frobenius theorem and the
ranking of football teams. SIAM Rev. 35, 1 (1993),
80–93.
13. Kendall, M.G. Further contributions to the theory of
paired comparisons. Biometrics 11, 1 (1955), 43–62.
14. Kleinberg, J.M. Authoritative sources in a hyperlinked
environment. In ACM–SIAM Symposium on Discrete
Algorithms, 1998, 668–677.
15. Kleinberg, J.M. Authoritative sources in a hyperlinked
environment. J. ACM 46, 5 (1999), 604–632.
16. Langville, A.N., Meyer, C.D. Google’s PageRank and
Beyond: The Science of Search Engine Rankings.
Princeton University Press, 2006.
17. Lempel, R., Moran, S. The stochastic approach for
link-structure analysis (SALSA) and the TKC effect.
Comput. Netw. 33, 1–6 (2000), 387–401.
18. Leontief, W.W. The Structure of American Economy,
1919–1929. Harvard University Press, Cambridge,
1941.
19. Liebowitz, S.J., Palmer, J.P. Assessing the relative
impacts of economics journals. J. Econ. Lit. 22 (1984),
77–88.
20. Markov, A. Rasprostranenie zakona bol’shih chisel
na velichiny, zavisyaschie drug ot druga.
Izvestiya Fiziko-matematicheskogo obschestva
pri Kazanskom universitete, 2-ya seriya 15, 94
(1906), 135–156.
21. Newman, M.E.J. Network Analysis: An Introduction.
Oxford University Press, Oxford, U.K., 2010.
22. Page, L., Brin, S., Motwani, R., Winograd, T. The
PageRank citation ranking: Bringing order to the
Web. Technical Report 1999–66, Stanford InfoLab,
November 1999. Retrieved June 1, 2010, from http://
ilpubs.Stanford.edu:8090/422/
23. Palacios-Huerta, I., Volij, O. The measurement of
intellectual influence. Econometrica 72 (2004),
963–977.
24. Perron, O. Zur theorie der matrices. Math. Ann. 64,
2 (1907), 248–263.
25. Pillai, S.U., Suel, T., Cha, S. The Perron–Frobenius
theorem: Some of its applications. IEEE Signal
Process. Mag. 22, 2 (2005), 62–75.
26. Pinski, G., Narin, F. Citation influence for journal
aggregates of scientific publications: Theory, with
application to the literature of physics. Inf. Process.
Manage. 12, 5 (1976), 297–312.
27. R Development Core Team. R: A Language and
Environment for Statistical Computing. R Foundation
for Statistical Computing, Vienna, Austria, 2007.
ISBN 3-900051-07-0. Accessed June 1, 2010,
at http://www.R-project.org
28. SCImago. SJR—SCImago Journal & Country Rank.
Accessed June 1, 2010, at http://www.scimagojr.com,
2007.
29. Seeley, J.R. The net of reciprocal influence: A
problem in treating sociometric data. Can. J. Psychol.
3 (1949), 234–240.
30. Vigna, S. Spectral ranking, 2010. Retrieved
June 1, 2010 from http://arxiv.org/abs/
0912.0238
31. von Mises, R., Pollaczek-Geiringer, H. Praktische
verfahren der gleichungsauflôsung. Z. Angew.
Math. Mech. 9, 58–77 (1929), 152–164.
32. Wei, T.H. The algebraic foundations of ranking theory.
PhD thesis, Cambridge University, 1952.
33. Weingart, P. Impact of bibliometrics upon the science
system: Inadvertent consequences? Scientometrics
62, 1 (2005), 117–131.
34. West, J., Althouse, B., Bergstrom, C., Rosvall, M.,
Bergstrom, T. Eigenfactor.org—Ranking and
mapping scientific knowledge. Accessed June 1,
2010, at http://www.eigenfactor.org, 2007.
35. Zheng, A.X., Ng, A.Y., Jordan, M.I. Stable
algorithms for link analysis. In International
ACM SIGIR Conference on Research and
Development in Information Retrieval, 2001,
258–266.
Massimo Franceschet (massimo.franceschet@udine.it)
is a researcher in the Department of Mathematics and
Computer Science, University of Udine, Italy.
© 2011 ACM 0001-0782/11/06 $
ABSTRACT
Micro-videos, a new form of user generated contents
(UGCs), are gaining increasing enthusiasm. Popular microvideos have enormous commercial potential in many ways,
such as online marketing and brand tracking. In fact, the
popularity prediction of traditional UGCs including tweets,
web images, and long videos, has achieved good theoretical
underpinnings and great practical success. However, little
research has thus far been conducted to predict the
popularity of the bite-sized videos. This task is nontrivial due to three reasons: 1) micro-videos are short
in duration and of low quality; 2) they can be described
by multiple heterogeneous channels, spanning from social,
visual, acoustic to textual modalities; and 3) there are
no available benchmark dataset and discriminant features
that are suitable for this task. Towards this end, we
present a transductive multi-modal learning model. The
proposed model is designed to find the optimal latent
common space, unifying and preserving information from
different modalities, whereby micro-videos can be better
represented. This latent space can be used to alleviate
the information insufficiency problem caused by the brief
nature of micro-videos. In addition, we built a benchmark
dataset and extracted a rich set of popularity-oriented
features to characterize the popular micro-videos. Extensive
experiments have demonstrated the effectiveness of the
proposed model. As a side contribution, we have released the
dataset, codes and parameters to facilitate other researchers.
Keywords
Micro-Videos; Popularity Prediction; Multi-View Learning
1. INTRODUCTION
The last couple of years have witnessed the unprecedented
growth of smart mobile devices, enabling users to record
life stories vividly with short videos and then instantly
∗Xuemeng Song is the corresponding author.
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. Copyrights for components of this work owned by others than
ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or
republish, to post on servers or to redistribute to lists, requires prior specific permission
and/or a fee. Request permissions from permissions@acm.org.
MM ’16, October 15-19, 2016, Amsterdam, Netherlands
⃝c 2016 ACM. ISBN 978-1-4503-3603-1/16/10. . . $15.00
DOI: http://dx.doi.org/10.1145/2964284.2964314
upload them to social media websites, such as Snapchat1
and Vine2
. Since micro-videos, acting more like ‘live action
photographs’, are usually short in length, they need little
bandwidth and hence gain tremendous user enthusiasm.
The limits regarding the maximum length of micro-videos
on Snapchat and Vine, are set as 10 and 6 seconds,
respectively. Considering Vine as an example, its video
length distribution over our collected 303, 242 Vine videos
is illustrated in Figure 1. Micro-videos, representing a new
form of user generated contents (UGCs), can be viewed,
discussed and even reposted by users once they are uploaded,
which leads to their rapid rise. It is reported that Vine hit
1.5 billion daily loops3
in 20154
and Snapchat hit 7 billion
daily views in 20165
. As a video messaging platform, it is
hard to crawl micro-videos from Snapchat. Hence, we focus
on Vine here, which can be gathered more easily for research.
Interestingly, among the tremendous volume of microvideos, some popular ones will be widely viewed and spread
by users, while many only gain little attention. This
phenomena is similar to many existing social media sites,
such as Twitter6
. For example, the micro-video about
the explosion that interrupted during the France-Germany
soccer match in 2015 has been successfully looped by
over 330 million times. Obviously, if we can identify the
hot and popular micro-videos in advance, it will benefit
many applications, such as online marketing and network
reservation. Regarding online marketing, the accurate early
prediction of popular micro-videos can facilitate companies’
planning of advertising campaigns and thus maximizing
their revenues. For network service providers, they can
timely reserve adequate distributed storage and bandwidth
for popular ones, based on the prediction. Therefore, it is
highly desirable to develop an effective scheme to accurately
predict the popularity of micro-videos.
However, the popularity prediction of micro-videos is
non-trivial due to the following challenges. First of
all, due to the short duration of micro-videos, each
modality can only provide limited information, the socalled modality limitation. Fortunately, micro-videos always
involve multiple modalities, namely, social, visual, acoustic
1
https://snapchat.com.
2
https://vine.co.
3
Loops refer to the times a micro-video has been viewed.
4
http://blog.vine.co.
5
http://goo.gl/YYnBbd.
6
https://twitter.com.
898
Figure 1: Duration distribution over our collected
303, 242 micro-videos.
and textual7 modalities. In a sense, these modalities are corelated rather than independent and essentially characterize
the same micro-videos. Therefore, the major challenge
lies on how to effectively fuse micro-videos’ heterogeneous
clues from multiple modalities [31, 30, 36]. The most
naive strategies are early fusion and late fusion [27]. They,
however, fail to account for the relatedness among multiple
modalities. Therefore, it is important to take modality
relatedness into consideration. Secondly, due to certain
external factors, such as camera shaking and lighting
condition, some modalities of the micro-videos, such as
visual or acoustic ones, may be of poor quality, which
is another kind of modality limitation. Hence, learning
directly from the original feature spaces of modalities,
which was adopted by most multi-modal learning methods
may be imprecise. Consequently, to improve the learning
performance, how to compensate the noisy modality with
reliable ones poses a crucial challenge for us. The last
challenge we are facing is the lack of benchmark dataset to
support our research. We found that both the contents and
social influence of micro-video publishers would account for
their popularity. As far as we know, the only available microvideo dataset [24] does not contain the important textual
and social modalities and is lack of popularity indicators as
the ground truth, which makes it unsuitable for our research.
It is thus necessary to build a comprehensive micro-video
dataset and further extract a rich set of discriminant features
to enhance the learning performance.
To address the aforementioned challenges, we present
a novel Transductive Multi-modAL Learning approach,
TMALL for short, to predicting the popularity of microvideos. As illustrated in Figure 2, we first crawl a
representative micro-video dataset from Vine and develop
a rich set of popularity-oriented features from multimodalities. We then perform multi-modal learning to
predict the popularity of micro-videos, which seamlessly
takes the modality relatedness and modality limitation
into account by utilizing a common space shared by all
modalities. We assume that there exists an optimal common
space, which maintains the original intrinsic characteristics
of micro-videos in the original spaces. In the light of this, all
modalities are forced to be correlated. Meanwhile, microvideos with different popularity can be better separated
in such optimal common space, as compared to that of
each single modality. In a sense, we alleviate the modality
limitation problem. Extensive experiments on this realworld dataset have well-validated our work.
Our main contributions can be summarized in threefold:
7Micro-videos are usually associated with certain textual data, such
as video descriptions given by the video owners.
• We approached the popularity prediction of microvideos by proposing a TMALL model, which is able
to simultaneously model the modality relatedness and
handle the modality limitations by introducing a
common space shared among all modalities. Moreover,
we have derived its closed-form solution rigorously.
• We developed a rich set of popularity-oriented
features from multiple modalities to comprehensively
characterize the popular micro-videos. Apart from
numerical results, we also provided several deep
insights based on the experimental results.
• We constructed a large-scale micro-video dataset,
comprising of 303, 242 micro-videos, 98, 166 users and
120, 324 following relationships. We have released our
compiled dataset, codes and parameters8
to facilitate
other researchers to repeat our experiments and verify
their proposed approaches.
The remainder of this paper is structured as follows. Section
2 reviews the related work. Data preparation is introduced
in Section 3. Sections 4 and 5 detail the proposed TMALL
model and the feature extraction, respectively. Section 6
presents the experimental results and analysis, followed by
our concluding remarks in Section 7.
2. RELATED WORK
Popularity predication and multi-view learning are both
related to this work.
2.1 Popularity Prediction
Due to its enormous commercial potential, popularity
prediction of UGCs has attracted great attention from both
the industry and academia [12, 20, 5, 29, 11]. Hong et al.
[12] explored the popularity prediction of tweets, which is
measured by the number of future retweets, by introducing
a rich set of features, such as topological features, temporal
features and meta features. Beyond the text popularity
prediction, McParlane et al. [20] focused on the popularity
of images. They extracted some sophisticated features and
cast the task of image popularity prediction as a problem
of binary classification, where the given image would be
classified as popular or not. Later, Cappallo et al. [5]
proposed a latent ranking method for the image popularity
prediction solely based on the visual content. The proposed
method was evaluated on several image datasets collected
from micro-blogging and photo-sharing websites. Although
huge success has been achieved by these approaches, limited
efforts have thus far been dedicated to the problem of video
popularity prediction, where multiple modalities coexist.
Noting this gap, Trzcinski et al. [29] shifted from images
to videos, and studied the problem of video popularity
prediction utilizing both the visual clues and the early
popularity pattern of the video once it is released. However,
this approach suffered from two limitations. First, the
proposed approach can only work on videos that have been
published over a certain period. Second, the authors only
used the traditional machine learning method—Support
Vector Regression (SVR), which failed to make full use of the
relationship among modalities. As a complement, we aim to
timely predict the popularity of a given micro-video even
before it get published by proposing a novel multi-modal
learning scheme.
8The dataset can be accessible via http://acmmm2016.wix.com/
micro-videos.
899
Figure 2: Micro-video popularity prediction via our proposed TMALL model.
2.2 Multi-View Learning
To deal with data containing multiple modalities, multiview learning is a highly feasible paradigm. Multi-view
learning is designed to improve the learning performance
by introducing a function to model each view and jointly
optimizing all functions. Existing work follows this line
can be roughly classified into two categories: co-training
and subspace learning. Co-training algorithms usually
train separate learners on distinct views, which are then
imposed to be consistent across views. Sindhwani et. al.
[26] introduced a co-regularization framework for multiview semi-supervised learning, as an extension of supervised
regularization algorithms. Noticing that corruption may
exist among different views, Christoudias et al. [7]
proposed an approach for multi-view learning taking the
view disagreement into consideration. In contrast, subspace
learning approaches hold the general assumption that
different views are generated from a latent view. Chaudhuri
et al. [6] first employed canonical correlation analysis (CCA)
to learn an efficient subspace, on which traditional machine
learning algorithms can be applied. Gao et al. [9] later
introduced a novel multi-view subspace clustering method,
which is able to simultaneously perform clustering on the
subspace of each view and guarantee the consistency among
multiple views by a common clustering structure.
Overall, compelling success has been achieved by multiview learning models on various problems, such as
categorization [26, 28], clustering [6, 9] and multimedia
retrieval [18, 19]. However, to the best of our knowledge,
limited efforts have been dedicated to applying multi-view
learning in the context of micro-video popularity prediction,
which is the major concern of our work.
3. DATA COLLECTION
This section details the dataset setup, which covers the
crawling strategy, and ground truth construction.
3.1 Crawling Strategy
Our micro-video dataset was crawled from one of the most
prominent micro-video sharing social networks, Vine. Beside
the historical uploaded micro-videos, Vine also archives
users’ profiles and their social connections.
In particular, we first randomly selected 10 active Vine
users from Rankzoo9
, which provides the top 1, 000 active
9
https://rankzoo.com.
users on Vine, as the seed users. We then adopted the
breadth-first crawling strategy to expand the seed users
by crawling their followers. Considering that these seed
users may have millions of followers, we practically only
retained the first 1, 000 returned followers for each seed
user to improve the crawling efficiency. After three layers
of crawling, we harvested a densely connected user set
consisting of 98, 166 users as well as 120, 324 following
relationships among users. For each user, his/her brief
profile was crawled, containing full name, description,
location, follower count, followee count, like count, post
count and loop count of all post videos. Besides, we
also collected the timeline (the micro-video posting history,
including the repostings from others.) of each user between
July 1st and October 1st, 2015. Finally, we obtained 1.6
million video postings, including a total number of 303, 242
unique micro-videos with a total duration of 499.8 hours.
3.2 Ground Truth Construction
We employed four popularity-related indicators, namely,
the number of comments (n comments), the number of likes
(n likes), the number of reposts (n reposts), and the number
of loops/views (n loops) to measure the popularity of microvideos. Figure 3 illustrates the proportion of micro-videos
regarding each of the four indicators in our dataset. It is
noticed that the distributions of them are different, and
each of them measure one aspect of the popularity. In order
to comprehensively and precisely measure the popularity of
each micro-video, yi, we linearly fuse all the four indicators:
yi =
(n reposts + n comments + n likes + n loops)
4
. (1)
4. OUR PROPOSED TMALL MODEL
4.1 Notation
We first declare several notations. We employ bold capital
letters (e.g., X) and bold lowercase letters (e.g., x) to denote
matrices and vectors, respectively. We use non-bold letters
(e.g., x) to represent scalars, and Greek letters (e.g., β) as
parameters. If not clarified, all vectors are in column form.
Without loss of generality, suppose we have N labeled
samples and M unlabeled samples with K > 2 modalities.
It is worth noting that the unlabeled samples also serve
as testing samples. Zk stands for the number of features
generated from the k-th modality. Then the k-th modality
can be represented as Xk ∈ R
(N+M)×Zk . The popularity of
900
Figure 3: Distribution of the number of comments,
likes, reposts and loops of micro-videos in our
dataset.
all the videos are denoted by y = {y1, y2, · · · , yN }
T ∈ R
N .
Let f = {f1, f2, · · · , fN , fN+1, fN+2, · · · , fN+M}
T ∈ R
N+M
stand for the predicted results regarding popularity for all
samples, including the labeled and unlabeled ones. We
aim to jointly learn the common space X0 ∈ R
(N+M)×Z0
shared by multiple modalities and the popularity for the M
unlabeled micro-videos.
Our proposed model targets at reasoning from observed
training micro-videos to testing ones. Such prediction
belongs to transductive learning, in which both labeled
samples as well as unlabeled samples are available for
training. It hence obtains better performance. In contrast,
inductive model is reasoning from observed training cases to
general rules, which are then applied to the test cases.
4.2 Problem Formulation
It is apparent that different modalities may contribute
distinctive and complementary information about microvideos. For example, textual modality gives us hints about
the topics of the given micro-video; acoustic and visual
modalities may respectively convey location and situation of
micro-videos, and user modality demonstrates the influence
of the micro-video publisher. These clues jointly contribute
to the popularity of a micro-video. Obviously, due to the
noise and information insufficiency of each modality, it may
be suboptimal to conduct learning directly from each single
modality separately. In contrast, we assume that there
exists an optimal latent space, in which micro-videos can be
better described. Moreover, the optimal latent space should
maintain the original intrinsic characteristics conveyed by
multi-modalities of the given micro-videos. Therefore,
we penalize the disagreement of the normalized Laplacian
matrix between the latent space and each modality. In
particular, we formalize this assumption as follows. Let
Sk ∈ R
(N+M)×(N+M) be the similarity matrix10, which is
computed by the Gaussian similarity function as follows,
Sk(i, j) =



exp(








x
i
k − x
j
k








2
2σ2
k
)
, if i ̸= j;
0 , if i = j.
(2)
where x
i
k and x
j
k
are the micro-video pairs in the k-th
modality space. Thereinto, the radius parameter σk is
simply set as the median of the Euclidean distances over
all video pairs in the k-th modality. We then derive the
corresponding normalized Laplacian matrix as follows,
L(Sk) = I − D
− 1
2
k SkD
− 1
2
k
, (3)
10To facilitate the illustration, k ranges from 0 to K.
where I is a (N + M) × (N + M) identity matrix and Dk ∈
R
(N+M)×(N+M)
is the diagonal degree matrix, whose (u, u)-
th entry is the sum of the u-th row of Sk. Since Sk(i, j) > 0,
we can derive that tr(L(Sk)) > 0. We thus can formulate
the disagreement penalty between the latent space and the
original modalities as,
∑K
k=1

1
tr(L(S0))L(S0) −
1
tr(L(Sk))L(Sk) ∥
2
F , (4)
where tr(A) is the trace of matrix A and



·




F
denotes
the Frobenius norm of matrix. In addition, inspired by
[32], considering that similar micro-videos attempt to have
similar popularity in the latent common space, we adopt the
following regularizer,
1
2
N∑
+M
m=1
N∑
+M
n=1
( f(x
m
0 )

D0(xm
0
)

f(x
n
0 )

D0(x
n
0
)
)2
S0(m, n)
= f
T L(S0)f. (5)
Based upon these formulations, we can define the loss
function that measures the empirical error on the training
samples. As reported in [22], the squared loss usually yields
good performance as other complex ones. We thus adopt the
squared loss in our algorithm for simplicity and efficiency.
In particular, since we do not have the labels for testing
samples, we only consider the squared loss regarding the N
unlabeled samples to guarantee the learning performance.
We ultimately reach our objective function as,
min
f ,L(S0)
∑N
i=1
(yi − fi)
2 + µf
T L(S0)f
+ λ
∑K
k=1

1
tr(L(S0))L(S0) −
1
tr(L(Sk))L(Sk) ∥
2
F , (6)
where λ and µ are both nonnegative regularization
parameters. To be more specific, λ penalizes the
disagreement among the latent space and modalities, and µ
encourages that similar popularity will be assigned to similar
micro-videos.
4.3 Alternative Optimization
To simplify the representation, we first define that,



L˜ =
1
tr(L(S0))L(S0),
L˜k =
1
tr(L(Sk))L(Sk).
(7)
Therefore, the objective function can be transformed to,
min
f
∑N
i=1
(yi − fi)
2 + λ
∑K
k=1



L˜ − L˜k




2
F
+ µf
T Lf ˜ ,
subject to tr(L(S0)) = 1. (8)
Furthermore, to optimize L˜ more efficiently, inspired by the
property that tr(L˜k) = 1, we let,
L(S0) = ∑K
k=1
βkL˜k, subject to ∑K
k=1
βk = 1. (9)
901
Consequently, we have,
L˜ =
1
tr(L(S0))L(S0) =
∑K
k=1
βkL˜k,
subject to ∑K
k=1
βk = 1. (10)
Interestingly, we find that βk can be treated as the corelated degree between the latent common space and each
modality. It is worth noting that we do not impose the
constraint of β ≥ 0, since we want to keep both positive
and negative co-relations. A positive coefficient indicates
the positive correlation between the modality space and the
latent common space, while a negative coefficient reflects the
negative correlation, which may be due to the noisy data of
the modality. The larger the βk is, the higher correlation
between the latent space and the k-th modality will be. In
the end, the final objective function can be written as,
min
f ,β
∑N
i=1
(yi − fi)
2 + λ
∑K
k=1



∑K
i=1 βiL˜i − L˜k




2
F
+
µf
T ∑K
k=1
βkL˜kf + θ



β




2
, subject to e
T β = 1, (11)
where β = [β1, β2, · · · , βK]
T ∈ R
K and e = [1, 1, · · · , 1]T ∈
R
K. θ is the regularization parameter, introduced to avoid
the overfitting problem. We denote the objective function
of Eqn.(11) as Γ. We adopt the alternating optimization
strategy to solve the two variables f and β in Γ. In
particular, we optimize one variable while fixing the other
one in each iteration. We keep this iterative procedure until
the Γ converges.
4.3.1 Computing βj with f fixed
We first fix f and transform the objective function Γ as,
min
β
λ
∑K
k=1
N∑
+M
t=1





M(t)β −˜l
(t)
k






2
F
+ µg
T β + θ



β




2
,
subject to e
T β = 1, (12)
where g = [f
T L˜1f,f
T L˜2f, · · · ,f
T L˜Kf]
T ∈ R
K, M(t) =
[˜l
(t)
1
,˜l
(t)
2
, · · · ,˜l
(t)
K ] ∈ R
(N+M)×K and ˜l
(t)
k ∈ R
N+M denotes
the t-th column of L˜k. For simplicity, we replace ˜l
(t)
K with
˜l
(t)
k
e
T β, as e
T β = 1. With the help of Lagrangian, Γ can
be rewritten as follows.
min
β
λ
∑K
k=1
N∑
+M
t=1





(M(t) −˜l
(t)
k
e
T







2
F
+ µg
T β + δ(1 − e
T β)
+ θ



β




2
, (13)
where δ is a nonnegative Lagrange multiplier. Taking
derivative of Eqn.(13) with respect to β, we have,
∂Γ
∂β
= Hβ + µg − δe, (14)
where,
H = 2[(λ
∑K
k=1
N∑
+M
t=1
(M(t) −˜l
(t)
k
e
T
)
T
(M(t) −˜l
(t)
k
e
T
)
)
+ θI
]
,
(15)
and I is a K × K identity matrix. Setting Eqn.(14) to zero,
we have,
β = H
−1
(δe − µg). (16)
Substituting Eqn.(16) into e
T β = 1, we have,



δ =
1 + µe
T H−1g
e
T H−1e
,
β = H
−1
[
e + µe
T H−1ge
e
T H−1e
− µg
]
.
(17)
According to the definition of positive-definite matrix, H
is always positive definite and hence invertible. Therefore,
H−1
is also positive definite, which ensures e
T H−1
e > 0.
4.3.2 Computing f with βj fixed
With fixed βj , taking derivative of Γ with respect to fi,
where 1 ≤ i ≤ N, we have,
∂Γ
∂fi
= 2(fi − yi) + 2µ
N∑
+M
j=1
L˜(i, j)fj . (18)
We then take derivative of the Γ with respect to fi, where
N + 1 ≤ i ≤ N + M. We reach,
∂Γ
∂fi
= 2µ
N∑
+M
j=1
L˜(i, j)fj . (19)
In a vector-wise form, we restate the solution of f as follows,
f = G
−1
ˆy, (20)
where G = ˆI+µ
∑K
k=1 βkL˜k, ˆy = {y1, y2, · · · , yN , 0, 0, · · · , 0}
and ˆI ∈ R
(N+M)×(N+M)
is defined as follows,
Iˆ(i, j) = {
1 if i = j, and 1 ≤ i ≤ N,
0 otherwise.
(21)
5. FEATURE EXTRACTION
It is apparent that both the publisher influence and
content influence contribute to the popularity of UGCs.
In particular, we characterized the publisher influence via
the social modality, and the content influence via visual,
acoustic and textual modalities. For content influence,
we first examined the popular micro-videos in our dataset
and propose three common characteristics of online microvideos. For each characteristic, we then explained the
insights, and transformed it into a set of features for
video representation. Finally, we developed a rich set of
popularity-oriented features from each modality.
5.1 Observations
Universal Appeal. The subjects of widely popular
micro-videos cannot be something that can only be
appreciated by a small group of people. Therefore, the topics
and objects contained in micro-videos should be something
common so that to be interpreted the same way across
people and cultures. To capture this characteristic, we
extracted Sentence2Vector feature from the textual modality
and deep object feature from the visual one.
Emotional Content. People are naturally drawn to
things that arouse their emotions. Micro-videos showing
funny animals or lovely babies make people feel urge to
902
share them to express the same emotions. As a result, microvideos that are highly emotional are more likely to be shared.
Therefore, we extracted textual sentiment, visual sentiment
features for each video as well as several acoustic features,
which is widely used in emotion recognition in music [33].
High Quality and Aesthetic Design. When people
share information on social networks, people are actually
showing a little piece of themselves to their audience.
Therefore, high quality and aesthetic design of the content,
which could reflect the taste of people, is another important
characteristic of popular micro-videos. Color histogram,
aesthetic feature and visual quality feature were thus
extracted to encode such characteristic. In addition, the
acoustic features we extracted are frequently used in music
modeling, which could help to detect music in the audio
track of micro-videos [17].
5.2 Social Modality
It is intuitive that micro-videos posted by users, who has
more followers or has a verified account, are more likely to
be propagated, and thus tend to receive a higher number
of audiences. To characterize the influence of micro-video
publishers, we developed the following publisher-centric
features for micro-videos.
• Follower/Followee Count. The number of followers
and followees of the given micro-video publisher.
• Loop Count. The total number of loops received by
all the posts of the publisher.
• Post Count. The number of posts generated by the
publisher.
• Twitter Verification. A binary value indicating
whether the publisher has been verified by Twitter11
.
5.3 Visual Modality
Due to the short-length of micro-videos, the visual content
is usually highly related to a single theme, which enables
us to only employ a few key frames to represent the whole
micro-video. Inspired by this, we extracted the visual
features from certain key frames. The mean pooling was
performed across all the key frames to create a fixed-length
vector representation of each micro-video.
5.3.1 Color Histogram
It has been found that most basic visual features (i.e.,
intensity and the mean value of different color channels in
HSV space) except color histogram, have little correlation
with popularity [15]. Color histogram has outstanding
correlation due to the fact that striking colors tend to catch
users’ eyes. Therefore, we only extracted color histogram
as the basic visual feature to characterize popular microvideos. To reduce the size of color space, we grouped the
color space into 50 distinct colors, which results in a 50-D
vector for each frame.
5.3.2 Object Features
It has been studied that popular UGCs are strongly
correlated with the objects contained in the videos [10]. We
believe that the presence of certain objects affect microvideos’ popularity. For example, micro-videos with ‘cute
11A Vine account can be verified by Twitter, if it is linked to a verified
Twitter account.
dogs’ or ‘beautiful girls’ are more likely to be popular
than those with ‘desks’ and ‘stones’. We thus employed
the deep convolutional neural networks (CNNs) [16], a
powerful model for image recognition problems [35], to
detect objects in micro-videos. Specifically, we applied the
well-trained AlexNet DNN provided by the Caffe software
package [13] to the input key frames. The output of the
fc7 layer and the final 1, 000-way softmax layer in AlexNet
is a probability distribution over the 1, 000 class labels
predefined in ImageNet. We treat them as our feature
representation of each frame. In the end, a mean pooling
was performed over the frames to generate a single 4, 096-D
vector and 1, 000-D vector for each micro-video.
5.3.3 SentiBank Features
We performed the sentiment analysis of the visual
modality due to that the sentiment of UGCs has been
proven to be strongly correlated with their popularity [10].
In particular, we extracted the visual sentiment features
based on the deep CNNs model which was trained on the
SentiBank dataset[4]. SentiBank contains 2, 089 concepts
and each of them invokes specific sentiments such as ‘cute
girls’ and ‘funny animals’. Therefore, after mean pooling
among keyframes, each micro-video is represented by a
2, 089-D vector.
5.3.4 Aesthetic Features
Aesthetic features are a set of handful selected features
related to the principles of the nature and appreciation of
beauty, which have been studied and found to be effective
in popularity prediction [8]. Intuitively, micro-videos that
are objectively aesthetic are more likely to be popular. We
employed the released tool12 [3] to extract the following
aesthetic features: a) dark channel feature; b) luminosity
feature; c) s3 sharpness; d) symmetry; e) low depth of field;
f) white balance; g) colorfulness; h) color harmony, and i)
eye sensitivity, at 3 × 3 grids over each key frame. We then
calculated: a) normalized area of dominant object; and b)
normalized distances of centroid of dominant objects with
respect to four stress points at frame level. In the end, we
obtained 149-D aesthetic features for each micro-video.
5.3.5 Visual Quality Assessment Features
It is important that the visual quality of popular contents
are maintained at an acceptable level, given rising consumer
expectations of the quality of multimedia content delivered
to them [25]. In particular, we employed the released tool13
to extract the micro-videos quality features based on the
motion and spatio-temporal information, which have been
proven to correlate highly with human visual judgments of
quality. This results in a 46-D features.
5.4 Acoustic Modality
Acoustic modality usually works as an important
complement to visual modality in many video-related tasks,
such as video classification [34]. In fact, audio channels
embedded in the micro-videos may also contribute to
the popularity of micro-videos to a large extent. For
example, the audio channel may indicate the quality of a
given micro-video and convey rich background information
about the emotion as well as the scene contained in the
12http://www.ee.columbia.edu/˜subh/Software.php.
13http://live.ece.utexas.edu/.
903
micro-video, which significantly affects the popularity of a
micro-video. The acoustic information is especially useful
for the cases where the visual features could not carry
enough information. Therefore, we adopted the following
widely-used acoustic features, i.e., Mel-Frequency Cepstral
Coefficients (MFCC) [17] and Audio-Six (i.e., Energy
Entropy, Signal Energy, Zero Crossing Rate, Spectral
Rolloff, Spectral Centroid, and Spectral Flux [33]). These
features are frequently used in different audio-related tasks,
such as emotion detection and music recognition. We finally
obtained a 36-D acoustic feature vector for each micro-video.
5.5 Textual Modality
Micro-videos are usually associated with textual modality
in the form of descriptions, such as “when Leo finally gets
the Oscar” and “Puppy dog dreams”, which may precisely
summarize the micro-videos. Such summarization may
depict the topics and sentiment information regarding the
micro-videos, which has been proven to be of significance in
online article popularity prediction [2].
5.5.1 Sentence2Vector
We found that the popular micro-videos are sometimes
related to the topics of the textual descriptions. This
observation propels us to conduct content analysis over
the textual descriptions of micro-videos. Considering the
short-length of descriptions, to perform content analysis,
we employed the state-of-the-art textual feature extraction
tool Sentence2Vector14, which was developed on the basis of
work embedding algorithm Word2Vector [21]. In this way,
we extracted 100-D features for video descriptions.
5.5.2 Textual Sentiment
We also analyze the sentiments over texts, which has been
proven to play an important role in popularity prediction
[1]. With the help of the Sentiment Analysis tool in
Stanford CoreNLP tools15, we assigned each micro-video a
sentiment score ranging from 0 to 4 and they correspond to
very negative, negative, neutral, positive, and very positive,
respectively.
6. EXPERIMENT
In this section, we conducted extensive experiments to
comparatively verify our model.
6.1 Experiment Settings
The remaining experiments were conducted over a cluster
of 50 servers equipped with Intel Xeon(2x) CPU E5-2620
v3 at 2.40 GHz on 64 GB RAM, 24 cores and 64-bit Linux
operating system. Regarding the deep feature extraction,
we deployed Caffe framework [13] on a server equipped with
a NVIDIA Titan Z GPU. The experimental results reported
in this paper were based on 10-fold cross-validation. In each
round of the 10-fold cross-validation, we split our dataset
into two chunks: 90% of the micro-videos were used for
training, 10% were used for testing. We report performance
in terms of normalised Mean Square Error (nMSE) [22]
between the predicted popularity and the actual popularity.
The nMSE is an estimator of the overall deviations between
14https://github.com/klb3713/sentence2vec.
15http://stanfordnlp.github.io/CoreNLP/.
predicted and measured values. It is defined as,
nMSE =

i=1 (pi − ri)
2

i=1 r
2
i
, (22)
where pi is the predicted value and ri is the target value in
ground truth.
We have three key parameters as shown in Eqn.(8). The
optimal values of these parameters were carefully tuned with
the training data in each of the 10 fold. We employed
the grid search strategy to obtain the optimal parameters
between 10−5
to 102 with small but adaptive step sizes.
In particular, the step sizes were 0.00001, 0.0001, 0.001,
0.01, 0.1, 1 and 10 for the range of [0.00001,0.0001],
[0.0001,0.001], [0.001,0.01], [0.01,0.1], [0.1,1], [1,10] and
[10,100], respectively. The parameters corresponding to the
best nMSE were used to report the final results. For other
compared systems, the procedures to tune the parameters
are analogous to ensure the fair comparison. Considering
one fold as an example, we observed that our model reached
the optimal performance at λ = 1, µ = 0.01 and θ = 100.
6.2 On Model Comparison
To demonstrate the effectiveness of our proposed TMALL
model, we carried out experiments with several state-of-theart multi-view learning approaches:
• Early Fusion. The first baseline concatenates the
features extracted from the four modalities into
a single joint feature vector, on which traditional
machine learning models can be applied. In this work,
we adopted the widely used regression model—SVR,
and implemented it with the help of scikit-learn [23].
• Late Fusion. The second baseline first separately
predicts the popularity of micro-videos from each
modality via SVR model, and then linearly integrates
them to obtain the final results.
• regMVMT. The third baseline is the regularized
multi-view learning model [37]. This model only
regulates the relationships among different views
within the original space.
• MSNL. The fourth one is the multiple social network
learning (MSNL) model proposed in [27]. This model
takes the source confidence and source consistency into
consideration.
• MvDA. The fifth baseline is a multi-view discriminant
analysis (MvDA) model [14], which aims to learn a
single unified discriminant common space for multiple
views by jointly optimizing multiple view-specific
transforms, one for each view. The model exploits both
the intra-view and inter-view correlations.
Table 1 shows the performance comparison among
different models. From this table, we have the following
observations: 1) TMALL outperforms the Early Fusion
and Late Fusion. Regarding the Early Fusion, features
extracted from various sources may not fall into the same
semantic space. Simply appending all features actually
brings in a certain amount of noise and ambiguity. Besides,
Early Fusion may lead to the curse of dimensionality since
the final feature vector would be of very high dimension.
For the Late Fusion, the fused result however might not
be reasonably accurate due to two reasons. First, a single
modality might not be sufficiently descriptive to represent
904
Table 1: Performance comparison between our
proposed TMALL model and several state-of-theart baselines in terms of nMSE.
Methods nMSE p-value
Early Fusion 59.931 ± 41.09 9.91E-04
Late Fusion 8.461 ± 5.34 3.25E-03
regMVMT 1.058 ± 0.05 1.88E-03
MSNL 1.098 ± 0.13 1.42E-02
MvDA 0.982 ± 7.00E-03 9.91E-04
TMALL 0.979 ± 9.42E-03 –
the complex semantics of the videos. Separate results
would be thus suboptimal and the integration may not
result in a desired outcome. Second, it is labor-intensive
to tune the fusion weights for different modalities. Even
worse, the optimal parameters for one application cannot be
directly applied to another one. 2) TMALL achieves better
performance, as compared with regMVMT and MSNL. This
could be explained that linking different modalities via a
unified latent space is better than imposing disagreement
penalty directly over original spaces. 3) The less satisfactory
performance of MvDA indicates that it is necessary to
explore the consistency among different modalities when
building the latent space. And 4) as compared to the multiview learning baselines, such as regMVMT, MSNL, and
MvDA, our model stably demonstrates its advantage. This
signals that the proposed transductive models can achieve
higher performance than inductive models under the same
experimental settings. This can be explained by the fact
that TMALL leverages the knowledge of testing samples.
Moreover, we performed the paired t-test between
TMALL and each baseline on 10-fold cross validation. We
found that all the p-values are much smaller than 0.05, which
shows that the performance improvements of our proposed
model over other baselines are statistically significant.
6.3 On Modality Comparison
To verify the effectiveness of multi-modal integration,
we also conducted experiments over different modality
combinations of the four modalities. Table 2 summarizes
the multi-modal analysis and the paired t-test results. It is
obvious that the more modalities we incorporated, the better
performance we can obtain. This implies the complementary
relationships rather than mutual conflicting relationships
among the different modalities. Moreover, we found that
removing features from any of these four modalities suffers
from a decrease in performance. In a sense, this is
consensus with the old saying “two heads are better than
one”. Additionally, as the performance obtained from
different combinations are not the same, this validates that
incorporating β which controls the confidence of different
modalities is reasonable. Interestingly, we observed that the
combination without social modality obtains the worst result
which indicates that the social modality plays a pivotal role
in micro-video propagation, as compared to visual, textual
or acoustic modality. This also validates that the features
developed from social modality are much discriminative,
even though they are with low-dimensions. On the other
hand, the textual modality contributes the least among all
modalities, as the performance of our model without textual
modality still achieves good performance. This may be
Table 2: Performance comparison among different
modality combinations with respect to nMSE. We
denote T, V, A and S as textual, visual, acoustic
and social modality, respectively.
View combinations nMSE p-value
T+V+A 0.996 ± 4.20E-03 2.62E-05
T+A+S 0.982 ± 4.27E-03 2.59E-05
T+V+S 0.982 ± 4.13E-03 3.05E-04
V+A+S 0.981 ± 5.16E-03 2.16E-05
T+V+A+S 0.979 ± 9.42E-03 –
Table 3: Performance comparison among different
visual features with respect to nMSE.
Features nMSE p-value
Color Histogram 0.996 ± 6.88E-03 1.94E-04
Object Feature 0.994 ± 6.71E-03 2.47E-04
Visual Sentiment 0.994 ± 6.72E-03 2.49E-04
Aesthetic Feature 0.984 ± 6.95E-03 4.44E-01
ALL 0.979 ± 9.42E-03 –
caused by the sparse textual description, which is usually
given in one short sentence.
6.4 On Visual Feature Comparison
To further examine the discriminative visual features we
extracted, we conducted experiments over different kinds of
visual features using TMALL. We also performed significant
test to validate the advantage of combining multiple
features. Table 3 comparatively shows the performance of
TMALL in terms of different visual feature configurations.
It can be seen that the object, visual sentiment and aesthetic
features achieve similar improvement in performance, as
compared to color histogram features. This reveals that
micro-videos’ popularity is better reflected by their content,
sentiment and design, including what objects they contain,
which emotion they convey and what design standards they
follow. This is highly consistent with our oberservations
and also implies that micro-videos which aim to gain high
popularity need to be well designed and considered more
from the visual content.
6.5 Illustrative Examples
To gain the insights of the influential factors in the task
of popularity prediction of micro-videos, we comparatively
illustrate a few representative examples in Figure 4. From
this figure, we have the following observations: 1) Figure
4(a) shows three micro-video pairs. Each of the three microvideo pairs describes the similar semantics, i.e., animals,
football game and sunset, respectively, but they were
published by different users. The publishers of the videos
in top row are much more famous than those of the bottom.
We found that the corresponding popularity of micro-videos
in the second row are much lower than those in the first
row, although they have no significant difference from the
perspective of video contents, which clearly justifies the
importance of social modality. 2) Figure 4(b) illustrates
three micro-video pairs, where each pair of micro-videos
were published by the same user. However, the micro-videos
in the first row achieve much higher popularity than those
in the second row, which demonstrates that the contents
of micro-videos also contribute to their popularity. In
905
(a) Illustration of three micro-video pairs, and each pair was published by two distinct users. The publishers of the videos
in top row are much more famous than those of the bottom.
(b) Illustration of three micro-video pairs, and each pair was published by the same user. The videos in the first row are
much more acoustically comfortable, visually joyful, and aesthetically beautiful than those in the second row.
(c) Illustration of three popular micro-videos with different textual descriptions, which contains superstar names, hot events,
and detail information, respectively.
Figure 4: Comparative illustration of video examples. They respectively justify the importance of social,
acoustic as well as visual, and textual modalities. We use three key frames to represent each video.
particular, the comparisons in Figure 4(b), from left to
right, are i) the existence of ‘skillful pianolude’ compared
with ‘noisy dance music’, ii) ‘funny animals’ compared with
‘motionless dog’, and iii) ‘beautiful flowers’ compared with
‘gloomy sky’. These examples indicate the necessity of
developing acoustic features, visual sentiment and visual
aesthetic features for the task of micro-video popularity.
And 3) Figure 4(c) shows a group of micro-videos, whose
textual descriptions contain either superstar names, hot
hashtags, or informative descriptions. These micro-videos
received a lot of loops, comments, likes and reposts. These
examples thus reflect the value of textual modality.
6.6 Complexity Analysis
To theoretically analyze the computational cost of our
proposed TMALL model, we first compute the complexity
in the construction of H and g, as well as the inverse of
matrices H and G. The construction of H has the time
complexity of O(K2
(N + M)). Fortunately, H keeps the
same in each iteration, and thus can be computed by offline.
The computation of g needs the time cost O(K(N + M)
2
).
In addition, computing the inverse of H and G has the
complexity of O(K3
) and O((N + M)
3
) , respectively. The
computation cost of β in Eqn.(17) is O(K2
). Therefore,
the speed bottleneck lies in the computation of the inverse
of G. In practice, the proposed TMALL model converges
very fast, which on average takes less than 10 iterations.
Overall, the learning process over 9, 720 micro-videos can
be accomplished within 50 seconds.
7. CONCLUSION AND FUTURE WORK
This paper presents a novel transductive multi-modal
learning method (TMALL), to predict the popularity of
micro-videos. In particular, TMALL works by learning
an optimal latent common space from multi-modalities of
the given micro-videos, in which the popularity of microvideos are much more distinguishable. The latent common
space is capable of unifying and preserving information
from different modalities, and it helps to alleviate the
modality limitation problem. To verify our model, we built
a benchmark dataset and extracted a rich set of popularityoriented features to characterize micro-videos from multiple
perspectives. By conducting extensive experiments, we
draw the following conclusions: 1) the optimal latent
common space exists and works; 2) the more modalities
we incorporate to learn the common space, the more
906
discriminant it is; and 3) the features extracted to describe
the social and content influence are representative. As a side
research contribution, we have released the dataset, codes
and parameters to facilitate other researchers. In the future,
we plan to incorporate the cross-domain knowledge, such as
the hot topics on Twitter, to enhance the performance of
popularity prediction.
8. ACKNOWLEDGMENTS
This research is supported by the National Research
Foundation, Prime Minister’s Office, Singapore under its
IRC@SG Funding Initiative.
9. REFERENCES
[1] J. Berger. Arousal increases social transmission of information.
Psychological science, 22(7):891–893, 2011.
[2] J. Berger and K. L. Milkman. What makes online content
viral? Journal of Marketing Research, 49(2):192–205, 2012.
[3] S. Bhattacharya, B. Nojavanasghari, T. Chen, D. Liu, S.-F.
Chang, and M. Shah. Towards a comprehensive computational
model foraesthetic assessment of videos. In Proceedings of the
ACM Multimedia Conference, pages 361–364. ACM, 2013.
[4] D. Borth, R. Ji, T. Chen, T. M. Breuel, and S. Chang.
Large-scale visual sentiment ontology and detectors using
adjective noun pairs. In Proceedings of the ACM Multimedia
Conference, pages 223–232. ACM, 2013.
[5] S. Cappallo, T. Mensink, and C. G. M. Snoek. Latent factors of
visual popularity prediction. In Proceedings of the ACM on
International Conference on Multimedia Retrieval, pages
195–202. ACM, 2015.
[6] K. Chaudhuri, S. M. Kakade, K. Livescu, and K. Sridharan.
Multi-view clustering via canonical correlation analysis. In
Proceedings of the International Conference on Machine
Learning, pages 129–136. ACM, 2009.
[7] C. M. Christoudias, R. Urtasun, and T. Darrell. Multi-view
learning in the presence of view disagreement. CoRR,
abs/1206.3242, 2012.
[8] S. Dhar, V. Ordonez, and T. L. Berg. High level describable
attributes for predicting aesthetics and interestingness. In
Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, pages 1657–1664. IEEE, 2011.
[9] H. Gao, F. Nie, X. Li, and H. Huang. Multi-view subspace
clustering. In IEEE International Conference on Computer
Vision, pages 4238–4246. IEEE, 2015.
[10] F. Gelli, T. Uricchio, M. Bertini, A. D. Bimbo, and S. Chang.
Image popularity prediction in social media using sentiment
and context features. In Proceedings of the ACM Multimedia
Conference, pages 907–910. ACM, 2015.
[11] X. He, M. Gao, M. Kan, Y. Liu, and K. Sugiyama. Predicting
the popularity of web 2.0 items based on user comments. In
Proceedings of the International ACM SIGIR Conference on
Research and Development in Information Retrieval, pages
233–242. ACM, 2014.
[12] L. Hong, O. Dan, and B. D. Davison. Predicting popular
messages in twitter. In Proceedings of the ACM International
Conference on World Wide Web, pages 57–58. ACM, 2011.
[13] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. B.
Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional
architecture for fast feature embedding. In Proceedings of the
ACM Multimedia Conference, pages 675–678. ACM, 2014.
[14] M. Kan, S. Shan, H. Zhang, S. Lao, and X. Chen. Multi-view
discriminant analysis. IEEE Transactions on Pattern Analysis
and Machine Intelligence, 38(1):188–194, 2016.
[15] A. Khosla, A. D. Sarma, and R. Hamid. What makes an image
popular? In Proceedings of the ACM International
Conference on World Wide Web, pages 867–876, 2014.
[16] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet
classification with deep convolutional neural networks. In
Proceedings of the Annual Conference on Neural Information
Processing Systems, pages 1106–1114. NIPS Foundation, 2012.
[17] Z. Li, J. Wang, J. Cai, Z. Duan, H. Wang, and Y. Wang.
Non-reference audio quality assessment for online live music
recordings. In Proceedings of the ACM Multimedia
Conference, pages 63–72. ACM, 2013.
[18] A. Liu, W. Nie, Y. Gao, and Y. Su. Multi-modal clique-graph
matching for view-based 3d model retrieval. IEEE
Transactions on Image Processing, 25(5):2103–2116, 2016.
[19] A. Liu, Z. Wang, W. Nie, and Y. Su. Graph-based
characteristic view set extraction and matching for 3d model
retrieval. Information Sciences, 320:429–442, 2015.
[20] P. J. McParlane, Y. Moshfeghi, and J. M. Jose. ”nobody comes
here anymore, it’s too crowded”; predicting image popularity on
flickr. In Proceedings of the ACM International Conference
on Multimedia Retrieval, page 385. ACM, 2014.
[21] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean.
Distributed representations of words and phrases and their
compositionality. In Proceedings of the Annual Conference on
Neural Information Processing Systems, pages 3111–3119.
NIPS Foundation, 2013.
[22] L. Nie, L. Zhang, Y. Yang, M. Wang, R. Hong, and T.-S. Chua.
Beyond doctors: future health prediction from multimedia and
multimodal observations. In Proceedings of the ACM
Multimedia Conference, pages 591–600. ACM, 2015.
[23] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel,
B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss,
V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau,
M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn:
Machine learning in Python. Journal of Machine Learning
Research, 12:2825–2830, 2011.
[24] M. Redi, N. O’Hare, R. Schifanella, M. Trevisiol, and
A. Jaimes. 6 seconds of sound and vision: Creativity in
micro-videos. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, pages 4272–4279.
IEEE, 2014.
[25] M. A. Saad, A. C. Bovik, and C. Charrier. Blind prediction of
natural video quality. IEEE Transactions on Image
Processing, 23(3):1352–1365, 2014.
[26] V. Sindhwani, P. Niyogi, and M. Belkin. A co-regularization
approach to semi-supervised learning with multiple views. In
Proceedings of the International Conference on Machine
Learning, pages 74–79. ACM, 2005.
[27] X. Song, L. Nie, L. Zhang, M. Akbari, and T. Chua. Multiple
social network learning and its application in volunteerism
tendency prediction. In Proceedings of the International ACM
SIGIR Conference on Research and Development in
Information Retrieval, pages 213–222, 2015.
[28] X. Song, L. Nie, L. Zhang, M. Liu, and T. Chua. Interest
inference via structure-constrained multi-source multi-task
learning. In Proceedings of the International Joint Conference
on Artificial Intelligence, pages 2371–2377. AAAI Press, 2015.
[29] T. Trzcinski and P. Rokita. Predicting popularity of online
videos using support vector regression. CoRR, abs/1510.06223,
2015.
[30] M. Wang, X. Hua, R. Hong, J. Tang, G. Qi, and Y. Song.
Unified video annotation via multigraph learning. IEEE
Transactions on Circuits and Systems for Video Technology,
19(5):733–746, 2009.
[31] M. Wang, H. Li, D. Tao, K. Lu, and X. Wu. Multimodal
graph-based reranking for web image search. IEEE
Transactions on Image Processing, 21(11):4649–4661, 2012.
[32] M. Wang, X. Liu, and X. Wu. Visual classification by
1
-hypergraph modeling. IEEE Transactions on Knowledge
and Data Engineering, 27(9):2564–2574, 2015.
[33] B. Wu, E. Zhong, A. Horner, and Q. Yang. Music emotion
recognition by multi-label multi-layer multi-instance multi-view
learning. In Proceedings of the ACM Multimedia Conference,
pages 117–126. ACM, 2014.
[34] Z. Wu, Y. Jiang, J. Wang, J. Pu, and X. Xue. Exploring
inter-feature and inter-class relationships with deep neural
networks for video classification. In Proceedings of the ACM
Multimedia Conference, pages 167–176. ACM, 2014.
[35] H. Zhang, X. Shang, W. Yang, H. Xu, H. Luan, and T. Chua.
Online collaborative learning for open-vocabulary visual
classifiers. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition. IEEE, 2016.
[36] H. Zhang, M. Wang, R. Hong, and T. Chua. Play and rewind:
optimizing binary representations of videos by self-supervised
temporal hashing. In Proceedings of the ACM Multimedia
Conference. ACM, 2016.
[37] J. Zhang and J. Huan. Inductive multi-task learning with
multiple view data. In Proceedings of the ACM SIGKDD
International Conference on Knowledge Discovery and Data
Mining, pages 543–551. ACM,