Reading List

Books

Compulsory

Randomized Algorithms. Rajeev Motwani and Prabhakar Raghavan.
Approximation Algorithms. Vijay V. Vazirani.
Mathematics for Computer Science. Eric Lehman, F Thomson Leighton, Albert R Meyer. Online at http://courses.csail.mit.edu/6.042/spring12/mcs.pdf

Reference books

Introduction to Algorithms. Thomas H. Cormen.
Computational Geometry: An Introduction. Franco P. Preparata and Michael I. Shamos.
Algorithms and Data Structures for External Memory. Jeffrey Scott Viter.
Probability and Computing. Mitzenmacher and Upfal.
- Note: this book is more basic compared with Motwani’s book
Computational Complexity. Christos H. Papadimitriou.
Combinatorial Optimization: Algorithms and Complexity. Christos H. Papadimitriou and Kenneth Steiglitz.
Computers and Intractability: A Guide to the Theory of NP. M. R. Garey and D. S. Johnson.

Data Stream

Michael Greenwald, Sanjeev Khanna: Space-Efficient Online Computation of Quantile Summaries. SIGMOD Conference 2001: 58-66
Mayur Datar, Aristides Gionis, Piotr Indyk, Rajeev Motwani: Maintaining stream statistics over sliding windows (extended abstract). SODA 2002: 635-644
Xuemin Lin, Hongjun Lu, Jian Xu, Jeffrey Xu Yu: Continuously Maintaining Quantile Summaries of the Most Recent N Elements over a Data Stream. ICDE 2004: 362-373
Kyriakos Mouratidis, Spiridon Bakiras, Dimitris Papadias: Continuous monitoring of top-k queries over sliding windows. SIGMOD Conference 2006: 635-646
Gurmeet Singh Manku, Rajeev Motwani: Approximate Frequency Counts over Data Streams. VLDB 2002: 346-357
Richard M. Karp, Scott Shenker, Christos H. Papadimitriou: A simple algorithm for finding frequent elements in streams and bags. ACM Trans. Database Syst. 28: 51-55 (2003)
Pedro Domingos, Geoff Hulten: Mining high-speed data streams. KDD 2000: 71-80
Daniel M. Kane, Jelani Nelson, David P. Woodruff: An optimal algorithm for the distinct elements problem. PODS 2010: 41-52
Edith Cohen: Size-Estimation Framework with Applications to Transitive Closure and Reachability. J. Comput. Syst. Sci. 55(3): 441-453 (1997)
- Note: need to read its TR version.
Noga Alon, Yossi Matias, Mario Szegedy: The Space Complexity of Approximating the Frequency Moments. STOC 1996: 20-29
Graham Cormode, S. Muthukrishnan: An Improved Data Stream Summary: The Count-Min Sketch and Its Applications. LATIN 2004: 29-38
Rui Zhang, Nick Koudas, Beng Chin Ooi, Divesh Srivastava: Multiple Aggregations Over Data Streams. SIGMOD Conference 2005: 299-310
Phillip B. Gibbons, Yossi Matias: Synopsis Data Structures for Massive Data Sets. SODA 1999: 909-910

Mini Textbook

Graham Cormode, Minos N. Garofalakis, Peter J. Haas, Chris Jermaine: Synopses for Massive Data: Samples, Histograms, Wavelets, Sketches. Foundations and Trends in Databases 4(1-3): 1-294 (2012)
S. Muthukrishnan: Data Streams: Algorithms and Applications. Foundations and Trends in Theoretical Computer Science 1(2) (2005)
- Check out the 120 page version online

Graph

Graph Mining

Xifeng Yan, Philip S. Yu, Jiawei Han: Graph Indexing: A Frequent Structure-based Approach. SIGMOD Conference 2004: 335-346
James Cheng, Yiping Ke, Wilfred Ng, An Lu: Fg-index: towards verification-free query processing on graph databases. SIGMOD Conference 2007: 857-872
Chen Chen, Cindy Xide Lin, Matt Fredrikson, Mihai Christodorescu, Xifeng Yan, Jiawei Han: Mining Graph Patterns Efficiently via Randomized Summaries. PVLDB 2(1): 742-753 (2009)

Similarity Search

Zhiping Zeng, Anthony K. H. Tung, Jianyong Wang, Jianhua Feng, Lizhu Zhou: Comparing Stars: On Approximating Graph Edit Distance. PVLDB 2(1): 25-36 (2009)
Gaoping Zhu, Xuemin Lin, Ke Zhu, Wenjie Zhang, Jeffrey Xu Yu: TreeSpan: efficiently computing similarity all-matching. SIGMOD Conference 2012: 529-540
Wook-Shin Han, Jinsoo Lee, Jeong-Hoon Lee: Turboiso: towards ultrafast and robust subgraph isomorphism search in large graph databases. SIGMOD Conference 2013: 337-348

Pattern Matching and Searching

Dennis Shasha, Jason Tsong-Li Wang, Rosalba Giugno: Algorithmics and Applications of Tree and Graph Searching. PODS 2002: 39-52
Wenfei Fan, Jianzhong Li, Shuai Ma, Nan Tang, Yinghui Wu, Yunpeng Wu: Graph Pattern Matching: From Intractable to Polynomial Time. PVLDB 3(1): 264-275 (2010)
Lei Zou, Jinghui Mo, Lei Chen, M. Tamer Özsu, Dongyan Zhao: gStore: Answering SPARQL Queries via Subgraph Matching. PVLDB 4(8): 482-493 (2011)

Shortest Path

Edith Cohen, Eran Halperin, Haim Kaplan, Uri Zwick: Reachability and distance queries via 2-hop labels. SODA 2002: 937-946
Holger Bast, Stefan Funke, Peter Sanders, Dominik Schultes: Fast Routing in Road Networks with Transit Nodes, Science, 2007
- Note: need to read their technical paper though.
Yufei Tao, Cheng Sheng, Jian Pei: On k-skip shortest paths. SIGMOD Conference 2011: 421-432.
- VC Dimension

Miscellaneous

Glen Jeh, Jennifer Widom: SimRank: a measure of structural-context similarity. KDD 2002: 538-543
Sanjeev Arora, Satish Rao, Umesh V. Vazirani: Geometry, flows, and graph-partitioning algorithms. Commun. ACM 51(10): 96-105, 2008
Jure Leskovec, Jon M. Kleinberg, Christos Faloutsos: Graphs over time: densification laws, shrinking diameters and possible explanations. KDD 2005: 177-187

Spatial-temporal and Multi-dimensional Databases

Spatial Databases

Norbert Beckmann, Hans-Peter Kriegel, Ralf Schneider, Bernhard Seeger: The R*-Tree: An Efficient and Robust Access Method for Points and Rectangles. SIGMOD Conference 1990: 322-331
Yannis Theodoridis, Timos K. Sellis: A Model for the Prediction of R-tree Performance. PODS 1996: 161-171

kNN, Reverse kNN

Hanan Samet, Jagan Sankaranarayanan, Houman Alborzi: Scalable network distance browsing in spatial databases. SIGMOD Conference 2008: 43-54
- may contain errors
Yufei Tao, Dimitris Papadias, Xiang Lian: Reverse kNN Search in Arbitrary Dimensionality. VLDB 2004: 744-755

Skyline

Dimitris Papadias, Yufei Tao, Greg Fu, Bernhard Seeger: Progressive skyline computation in database systems. ACM Trans. Database Syst. 30(1): 41-82 (2005)
Jian Pei, Yidong Yuan, Xuemin Lin, Wen Jin, Martin Ester, Qing Liu, Wei Wang, Yufei Tao, Jeffrey Xu Yu, Qing Zhang: Towards multidimensional subspace skyline analysis. ACM Trans. Database Syst. 31(4): 1335-1381 (2006)

Moving Objects or Queries, Trajectory

Simonas Saltenis, Christian S. Jensen, Scott T. Leutenegger, Mario A. Lopez: Indexing the Positions of Continuously Moving Objects. SIGMOD Conference 2000: 331-342
Jun Zhang, Manli Zhu, Dimitris Papadias, Yufei Tao, Dik Lun Lee: Location-based Spatial Queries. SIGMOD Conference 2003: 443-454
Muhammad Aamir Cheema, Xuemin Lin, Wenjie Zhang, Ying Zhang: Influence zone: Efficiently processing reverse k nearest neighbors queries. ICDE 2011: 577-588
Zaiben Chen, Heng Tao Shen, Xiaofang Zhou, Yu Zheng, Xing Xie: Searching trajectories by locations: an efficiency study. SIGMOD Conference 2010: 255-266

Top-k

Ronald Fagin, Amnon Lotem, Moni Naor: Optimal aggregation algorithms for middleware. J. Comput. Syst. Sci. 66(4): 614-656 (2003)
Gao Cong, Christian S. Jensen, Dingming Wu: Efficient Retrieval of the Top-k Most Relevant Spatial Web Objects. PVLDB 2(1): 337-348 (2009)

High-dimensional Space

Jon M. Kleinberg: Two Algorithms for Nearest-Neighbor Search in High Dimensions. STOC 1997: 599-608
Sariel Har-Peled, Piotr Indyk, Rajeev Motwani: Approximate Nearest Neighbor: Towards Removing the Curse of Dimensionality. Theory of Computing 8(1): 321-350 (2012)
Moses Charikar: Similarity estimation techniques from rounding algorithms. STOC 2002: 380-388
H. V. Jagadish, Beng Chin Ooi, Kian-Lee Tan, Cui Yu, Rui Zhang: iDistance: An adaptive B+-tree based indexing method for nearest neighbor search. ACM Trans. Database Syst. 30(2): 364-397 (2005)

Similarity Queries for Texts

Edit Distance

W. Masek and M. Paterson, A Faster Algorithm Computing String Edit Distances, Journal of Computer and System Sciences, vol. 20, pp. 18–31, 1980.
H. Berghel and D. Roach, An extension of ukkonen’s enhanced dynamic programming asm algorithm, ACM Trans. Inf. Syst., vol. 14, pp. 94–106, Jan. 1996.
Gene Myers: A Fast Bit-Vector Algorithm for Approximate String Matching Based on Dynamic Programming. J. ACM 46(3): 395-415 (1999).

Prefix/Suffix Filtering and Mismatch Filtering

R. J. Bayardo, Y. Ma, and R. Srikant, Scaling up all pairs similarity search, in WWW, 2007.
Chuan Xiao, Wei Wang, Xuemin Lin, Jeffrey Xu Yu. Efficient Similarity Joins for Near Duplicate Detection. WWW 2008.

Edit Similarity Queries

Chen Li, Bin Wang, Xiaochun Yang: VGRAM: Improving Performance of Approximate Queries on String Collections Using Variable-Length Grams. VLDB 2007: 303-314
S. Chaudhuri and R. Kaushik, Extending autocompletion to tolerate errors, in Proceedings of the 2009 ACM SIGMOD International Conference on Management of data, pp. 707–718, ACM, 2009.
G. Li, D. Deng, J. Wang, and J. Feng, Pass-join: A partition-based method for similarity joins, PVLDB, vol. 5, no. 3, pp. 253–264, 2011.
Jianbin Qin, Wei Wang, Chuan Xiao, Yifei Lu, Xuemin Lin, Haixun Wang. Asymmetric Signature Schemes for Efficient Exact Edit Similarity Query Processing. TODS, 2013.
R. Cole, L.-A. Gottlieb, and M. Lewenstein. Dictionary matching and indexing with errors and don’t cares. In STOC, pages 91–100, 2004.

Enumeration

Gurmeet Singh Manku, Arvind Jain, Anish Das Sarma: Detecting near-duplicates for web crawling. WWW 2007: 141-150.
Yinan Li, Jignesh M. Patel, Allison Terrel: WHAM: A High-Throughput Sequence Alignment Method, TODS 2012.
Chuan Xiao, Jianbin Qin, Wei Wang, Yoshiharu Ishikawa, Koji Tsuda, Kunihiko Sadakane. Efficient Error-tolerant Query Autocopletion. PVLDB 2013.

Probabilistic Algorithms

Jiaqi Zhai, Yin Lou, Johannes Gehrke: ATLAS: a probabilistic algorithm for high dimensional similarity search. SIGMOD Conference 2011: 997-1008
Sergey Ioffe: Improved Consistent Sampling, Weighted Minhash and L1 Sketching. ICDM 2010: 246-255.

Keyword Search

Introduction and Surveys

G. Weikum. DB&IR: both sides now. In SIGMOD, pages 25–30, 2007.
Yi Chen, Wei Wang, Ziyang Liu, Xuemin Lin: Keyword search on structured and semi-structured data. SIGMOD Conference 2009: 1005-1010.
- slides available at http://www.cse.unsw.edu.au/~weiw/project/keyword_sigmod09_tutorial.pptx
- Also check out a later version: Yi Chen, Wei Wang, Ziyang Liu: Keyword-based search and exploration on databases. ICDE 2011: 1380-1383.
  - slides available at http://www.cse.unsw.edu.au/~weiw/project/ICDE11_KWS_tutorial_with_refs.pptx

Relational/Graph Algorithms

Yi Luo, Xuemin Lin, Wei Wang, Xiaofang Zhou. SPARK: Top-k Keyword Query in Relational Databases. SIGMOD 2007.
B. Ding, J. X. Yu, S. Wang, L. Qin, X. Zhang, and X. Lin. Finding top-k min-cost connected trees in databases. In ICDE, 2007
K. Golenberg, B. Kimelfeld, and Y. Sagiv. Keyword proximity search in complex data graphs. In SIGMOD, 2008.
P. P. Talukdar, M. Jacob, M. S. Mehmood, K. Crammer, Z. G. Ives, F. Pereira, and S. Guha. Learning to create data-integrating queries. PVLDB, 1(1):785–796, 2008.
Hao He, Haixun Wang, Jun Yang, Philip S. Yu: BLINKS: ranked keyword searches on graphs. SIGMOD Conference 2007: 305-316

XML

C. Sun, C.-Y. Chan, and A. Goenka. Multiway SLCA-based keyword search in XML data. In WWW, 2007.
Junfeng Zhou, Zhifeng Bao, Wei Wang, Jinjia Zhao, and Xiaofeng Meng. Efficient query processing for XML keyword queries based on the IDList index. VLDB Journal, 2013.
Yufei Tao, Stavros Papadopoulos, Cheng Sheng, Kostas Stefanidis: Nearest keyword search in XML documents. SIGMOD Conference 2011: 589-600.

Effectiveness

Z. Liu and Y. Chen. Reasoning and identifying relevant matches for xml keyword search. PVLDB, 1(1):921–932, 2008.
Arash Termehchy, Marianne Winslett: Using structural information in XML keyword search effectively. ACM Trans. Database Syst. 36(1): 4 (2011).
D. Xin, Y. He, and V. Ganti. Keyword++: A Framework to Improve Keyword Search Over Entity Databases. PVLDB, 3(1):711–722, 2010.
Xiaohui Yu, Huxia Shi: CI-Rank: Ranking Keyword Search Results Based on Collective Importance. ICDE 2012: 78-89.
Ken Q. Pu: Keyword query cleaning using hidden Markov models. KEYS 2009: 27-32.
Zhen Zhang, Bin He, Kevin Chen-Chuan Chang: Understanding Web Query Interfaces: Best-Effort Parsing with Hidden Syntax. SIGMOD Conference 2004: 107-118.

Uncertain Data Management

Anish Das Sarma, Omar Benjelloun, Alon Y. Halevy, Jennifer Widom: Working Models for Uncertain Data. ICDE 2006: 7
- possible world semantics
Reynold Cheng, Dmitri V. Kalashnikov, Sunil Prabhakar: Evaluating Probabilistic Queries over Imprecise Data. SIGMOD Conference 2003: 551-562
- simple query, 1D data
Yufei Tao, Reynold Cheng, Xiaokui Xiao, Wang Kay Ngai, Ben Kao, Sunil Prabhakar: Indexing Multi-Dimensional Uncertain Data with Arbitrary Probability Density Functions. VLDB 2005: 922-933
- indexing
Nilesh N. Dalvi, Dan Suciu: The dichotomy of conjunctive queries on probabilistic structures. PODS 2007: 293-302
- theoretical foundation
Pei Jian, Bin Jiang, Xuemin Lin, Yidong Yuan: Probabilistic Skylines on Uncertain Data. VLDB 2007: 15-26
- skyline
Jian Li, Barna Saha, Amol Deshpande: A Unified Approach to Ranking in Probabilistic Databases. PVLDB 2(1): 502-513 (2009)
- ranking
Pei Jian, Bin Jiang, Xuemin Lin, Yidong Yuan: Probabilistic Skylines on Uncertain Data. VLDB 2007: 15-26
- streaming
Xiang Lian, Lei Chen: A Generic Framework for Handling Uncertain Data with Local Correlations. PVLDB 4(1): 12-21 (2010)
- correlation
Jeffrey Jestes, Feifei Li, Zhepeng Yan, Ke Yi: Probabilistic string similarity joins. SIGMOD Conference 2010: 327-338
- string
George Beskales, Mohamed A. Soliman, Ihab F. Ilyas: Efﬁcient Search for the Top-k Probable Nearest Neighbors in Uncertain Databases. PVLDB 1(1): 326 - 339 (2008).
- nearest neighbor
Jianzhong Li, Zhaonian Zou, Hong Gao: Mining frequent subgraphs over uncertain graph databases under probabilistic semantics. VLDB J. 21(6): 753-777 (2012)
- graph
Wenjie Zhang, Xuemin Lin, Ying Zhang, Muhammad Aamir Cheema, Qing Zhang: Stochastic skylines. ACM Trans. Database Syst. 37(2): 14 (2012)
- more sophisticated model
Michele Dallachiesa, Besmira Nushi, Katsiaryna Mirylenka, Themis Palpanas: Uncertain Time-Series Similarity: Return to the Basics. PVLDB 5(11): 1662-1673 (2012)
- time series
Saket Sathe, Hoyoung Jeung, Karl Aberer: Creating Probabilistic Databases from Imprecise Time-Series Data. ICDE 2011, 327 - 338.
- creating probabilistic databases
Dan Suciu, A Tutorial on Probabilistic Databases, MSRA Summer School 213.
- a tutorial on relational query evaluation
Cheqing Jin, Ke Yi, Lei Chen, Jeffrey Xu Yu, Xuemin Lin: Sliding-window top-k queries on uncertain streams. PVLDB 1(1): 301-312 (2008)