Reading List

Books

Compulsory

  • Randomized Algorithms. Rajeev Motwani and Prabhakar Raghavan.
  • Approximation Algorithms. Vijay V. Vazirani.
  • Mathematics for Computer Science. Eric Lehman, F Thomson Leighton, Albert R Meyer. Online at http://courses.csail.mit.edu/6.042/spring12/mcs.pdf

Reference books

  • Introduction to Algorithms. Thomas H. Cormen.
  • Computational Geometry: An Introduction. Franco P. Preparata and Michael I. Shamos.
  • Algorithms and Data Structures for External Memory. Jeffrey Scott Viter.
  • Probability and Computing. Mitzenmacher and Upfal.
    • Note: this book is more basic compared with Motwani’s book
  • Computational Complexity. Christos H. Papadimitriou.
  • Combinatorial Optimization: Algorithms and Complexity. Christos H. Papadimitriou and Kenneth Steiglitz.
  • Computers and Intractability: A Guide to the Theory of NP. M. R. Garey and D. S. Johnson.

Data Stream

  • Michael Greenwald, Sanjeev Khanna: Space-Efficient Online Computation of Quantile Summaries. SIGMOD Conference 2001: 58-66
  • Mayur Datar, Aristides Gionis, Piotr Indyk, Rajeev Motwani: Maintaining stream statistics over sliding windows (extended abstract). SODA 2002: 635-644
  • Xuemin Lin, Hongjun Lu, Jian Xu, Jeffrey Xu Yu: Continuously Maintaining Quantile Summaries of the Most Recent N Elements over a Data Stream. ICDE 2004: 362-373
  • Kyriakos Mouratidis, Spiridon Bakiras, Dimitris Papadias: Continuous monitoring of top-k queries over sliding windows. SIGMOD Conference 2006: 635-646
  • Gurmeet Singh Manku, Rajeev Motwani: Approximate Frequency Counts over Data Streams. VLDB 2002: 346-357
  • Richard M. Karp, Scott Shenker, Christos H. Papadimitriou: A simple algorithm for finding frequent elements in streams and bags. ACM Trans. Database Syst. 28: 51-55 (2003)
  • Pedro Domingos, Geoff Hulten: Mining high-speed data streams. KDD 2000: 71-80
  • Daniel M. Kane, Jelani Nelson, David P. Woodruff: An optimal algorithm for the distinct elements problem. PODS 2010: 41-52
  • Edith Cohen: Size-Estimation Framework with Applications to Transitive Closure and Reachability. J. Comput. Syst. Sci. 55(3): 441-453 (1997)
    • Note: need to read its TR version.
  • Noga Alon, Yossi Matias, Mario Szegedy: The Space Complexity of Approximating the Frequency Moments. STOC 1996: 20-29
  • Graham Cormode, S. Muthukrishnan: An Improved Data Stream Summary: The Count-Min Sketch and Its Applications. LATIN 2004: 29-38
  • Rui Zhang, Nick Koudas, Beng Chin Ooi, Divesh Srivastava: Multiple Aggregations Over Data Streams. SIGMOD Conference 2005: 299-310
  • Phillip B. Gibbons, Yossi Matias: Synopsis Data Structures for Massive Data Sets. SODA 1999: 909-910

Mini Textbook

  • Graham Cormode, Minos N. Garofalakis, Peter J. Haas, Chris Jermaine: Synopses for Massive Data: Samples, Histograms, Wavelets, Sketches. Foundations and Trends in Databases 4(1-3): 1-294 (2012)
  • S. Muthukrishnan: Data Streams: Algorithms and Applications. Foundations and Trends in Theoretical Computer Science 1(2) (2005)
    • Check out the 120 page version online

Graph

Graph Mining

  • Xifeng Yan, Philip S. Yu, Jiawei Han: Graph Indexing: A Frequent Structure-based Approach. SIGMOD Conference 2004: 335-346
  • James Cheng, Yiping Ke, Wilfred Ng, An Lu: Fg-index: towards verification-free query processing on graph databases. SIGMOD Conference 2007: 857-872
  • Chen Chen, Cindy Xide Lin, Matt Fredrikson, Mihai Christodorescu, Xifeng Yan, Jiawei Han: Mining Graph Patterns Efficiently via Randomized Summaries. PVLDB 2(1): 742-753 (2009)
  • Zhiping Zeng, Anthony K. H. Tung, Jianyong Wang, Jianhua Feng, Lizhu Zhou: Comparing Stars: On Approximating Graph Edit Distance. PVLDB 2(1): 25-36 (2009)
  • Gaoping Zhu, Xuemin Lin, Ke Zhu, Wenjie Zhang, Jeffrey Xu Yu: TreeSpan: efficiently computing similarity all-matching. SIGMOD Conference 2012: 529-540
  • Wook-Shin Han, Jinsoo Lee, Jeong-Hoon Lee: Turboiso: towards ultrafast and robust subgraph isomorphism search in large graph databases. SIGMOD Conference 2013: 337-348

Pattern Matching and Searching

  • Dennis Shasha, Jason Tsong-Li Wang, Rosalba Giugno: Algorithmics and Applications of Tree and Graph Searching. PODS 2002: 39-52
  • Wenfei Fan, Jianzhong Li, Shuai Ma, Nan Tang, Yinghui Wu, Yunpeng Wu: Graph Pattern Matching: From Intractable to Polynomial Time. PVLDB 3(1): 264-275 (2010)
  • Lei Zou, Jinghui Mo, Lei Chen, M. Tamer Özsu, Dongyan Zhao: gStore: Answering SPARQL Queries via Subgraph Matching. PVLDB 4(8): 482-493 (2011)

Shortest Path

  • Edith Cohen, Eran Halperin, Haim Kaplan, Uri Zwick: Reachability and distance queries via 2-hop labels. SODA 2002: 937-946
  • Holger Bast, Stefan Funke, Peter Sanders, Dominik Schultes: Fast Routing in Road Networks with Transit Nodes, Science, 2007
    • Note: need to read their technical paper though.
  • Yufei Tao, Cheng Sheng, Jian Pei: On k-skip shortest paths. SIGMOD Conference 2011: 421-432.
    • VC Dimension

Miscellaneous

  • Glen Jeh, Jennifer Widom: SimRank: a measure of structural-context similarity. KDD 2002: 538-543
  • Sanjeev Arora, Satish Rao, Umesh V. Vazirani: Geometry, flows, and graph-partitioning algorithms. Commun. ACM 51(10): 96-105, 2008
  • Jure Leskovec, Jon M. Kleinberg, Christos Faloutsos: Graphs over time: densification laws, shrinking diameters and possible explanations. KDD 2005: 177-187

Spatial-temporal and Multi-dimensional Databases

Spatial Databases

  • Norbert Beckmann, Hans-Peter Kriegel, Ralf Schneider, Bernhard Seeger: The R*-Tree: An Efficient and Robust Access Method for Points and Rectangles. SIGMOD Conference 1990: 322-331
  • Yannis Theodoridis, Timos K. Sellis: A Model for the Prediction of R-tree Performance. PODS 1996: 161-171

kNN, Reverse kNN

  • Hanan Samet, Jagan Sankaranarayanan, Houman Alborzi: Scalable network distance browsing in spatial databases. SIGMOD Conference 2008: 43-54
    • may contain errors
  • Yufei Tao, Dimitris Papadias, Xiang Lian: Reverse kNN Search in Arbitrary Dimensionality. VLDB 2004: 744-755

Skyline

  • Dimitris Papadias, Yufei Tao, Greg Fu, Bernhard Seeger: Progressive skyline computation in database systems. ACM Trans. Database Syst. 30(1): 41-82 (2005)
  • Jian Pei, Yidong Yuan, Xuemin Lin, Wen Jin, Martin Ester, Qing Liu, Wei Wang, Yufei Tao, Jeffrey Xu Yu, Qing Zhang: Towards multidimensional subspace skyline analysis. ACM Trans. Database Syst. 31(4): 1335-1381 (2006)

Moving Objects or Queries, Trajectory

  • Simonas Saltenis, Christian S. Jensen, Scott T. Leutenegger, Mario A. Lopez: Indexing the Positions of Continuously Moving Objects. SIGMOD Conference 2000: 331-342
  • Jun Zhang, Manli Zhu, Dimitris Papadias, Yufei Tao, Dik Lun Lee: Location-based Spatial Queries. SIGMOD Conference 2003: 443-454
  • Muhammad Aamir Cheema, Xuemin Lin, Wenjie Zhang, Ying Zhang: Influence zone: Efficiently processing reverse k nearest neighbors queries. ICDE 2011: 577-588
  • Zaiben Chen, Heng Tao Shen, Xiaofang Zhou, Yu Zheng, Xing Xie: Searching trajectories by locations: an efficiency study. SIGMOD Conference 2010: 255-266

Top-k

  • Ronald Fagin, Amnon Lotem, Moni Naor: Optimal aggregation algorithms for middleware. J. Comput. Syst. Sci. 66(4): 614-656 (2003)
  • Gao Cong, Christian S. Jensen, Dingming Wu: Efficient Retrieval of the Top-k Most Relevant Spatial Web Objects. PVLDB 2(1): 337-348 (2009)

High-dimensional Space

  • Jon M. Kleinberg: Two Algorithms for Nearest-Neighbor Search in High Dimensions. STOC 1997: 599-608
  • Sariel Har-Peled, Piotr Indyk, Rajeev Motwani: Approximate Nearest Neighbor: Towards Removing the Curse of Dimensionality. Theory of Computing 8(1): 321-350 (2012)
  • Moses Charikar: Similarity estimation techniques from rounding algorithms. STOC 2002: 380-388
  • H. V. Jagadish, Beng Chin Ooi, Kian-Lee Tan, Cui Yu, Rui Zhang: iDistance: An adaptive B+-tree based indexing method for nearest neighbor search. ACM Trans. Database Syst. 30(2): 364-397 (2005)

Similarity Queries for Texts

Edit Distance

  • W. Masek and M. Paterson, A Faster Algorithm Computing String Edit Distances, Journal of Computer and System Sciences, vol. 20, pp. 18–31, 1980.
  • H. Berghel and D. Roach, An extension of ukkonen’s enhanced dynamic programming asm algorithm, ACM Trans. Inf. Syst., vol. 14, pp. 94–106, Jan. 1996.
  • Gene Myers: A Fast Bit-Vector Algorithm for Approximate String Matching Based on Dynamic Programming. J. ACM 46(3): 395-415 (1999).

Prefix/Suffix Filtering and Mismatch Filtering

  • R. J. Bayardo, Y. Ma, and R. Srikant, Scaling up all pairs similarity search, in WWW, 2007.
  • Chuan Xiao, Wei Wang, Xuemin Lin, Jeffrey Xu Yu. Efficient Similarity Joins for Near Duplicate Detection. WWW 2008.

Edit Similarity Queries

  • Chen Li, Bin Wang, Xiaochun Yang: VGRAM: Improving Performance of Approximate Queries on String Collections Using Variable-Length Grams. VLDB 2007: 303-314
  • S. Chaudhuri and R. Kaushik, Extending autocompletion to tolerate errors, in Proceedings of the 2009 ACM SIGMOD International Conference on Management of data, pp. 707–718, ACM, 2009.
  • G. Li, D. Deng, J. Wang, and J. Feng, Pass-join: A partition-based method for similarity joins, PVLDB, vol. 5, no. 3, pp. 253–264, 2011.
  • Jianbin Qin, Wei Wang, Chuan Xiao, Yifei Lu, Xuemin Lin, Haixun Wang. Asymmetric Signature Schemes for Efficient Exact Edit Similarity Query Processing. TODS, 2013.
  • R. Cole, L.-A. Gottlieb, and M. Lewenstein. Dictionary matching and indexing with errors and don’t cares. In STOC, pages 91–100, 2004.

Enumeration

  • Gurmeet Singh Manku, Arvind Jain, Anish Das Sarma: Detecting near-duplicates for web crawling. WWW 2007: 141-150.
  • Yinan Li, Jignesh M. Patel, Allison Terrel: WHAM: A High-Throughput Sequence Alignment Method, TODS 2012.
  • Chuan Xiao, Jianbin Qin, Wei Wang, Yoshiharu Ishikawa, Koji Tsuda, Kunihiko Sadakane. Efficient Error-tolerant Query Autocopletion. PVLDB 2013.

Probabilistic Algorithms

  • Jiaqi Zhai, Yin Lou, Johannes Gehrke: ATLAS: a probabilistic algorithm for high dimensional similarity search. SIGMOD Conference 2011: 997-1008
  • Sergey Ioffe: Improved Consistent Sampling, Weighted Minhash and L1 Sketching. ICDM 2010: 246-255.

Keyword Search

Introduction and Surveys

  • G. Weikum. DB&IR: both sides now. In SIGMOD, pages 25–30, 2007.
  • Yi Chen, Wei Wang, Ziyang Liu, Xuemin Lin: Keyword search on structured and semi-structured data. SIGMOD Conference 2009: 1005-1010.
    • slides available at http://www.cse.unsw.edu.au/~weiw/project/keyword_sigmod09_tutorial.pptx
    • Also check out a later version: Yi Chen, Wei Wang, Ziyang Liu: Keyword-based search and exploration on databases. ICDE 2011: 1380-1383.
      • slides available at http://www.cse.unsw.edu.au/~weiw/project/ICDE11_KWS_tutorial_with_refs.pptx

Relational/Graph Algorithms

  • Yi Luo, Xuemin Lin, Wei Wang, Xiaofang Zhou. SPARK: Top-k Keyword Query in Relational Databases. SIGMOD 2007.
  • B. Ding, J. X. Yu, S. Wang, L. Qin, X. Zhang, and X. Lin. Finding top-k min-cost connected trees in databases. In ICDE, 2007
  • K. Golenberg, B. Kimelfeld, and Y. Sagiv. Keyword proximity search in complex data graphs. In SIGMOD, 2008.
  • P. P. Talukdar, M. Jacob, M. S. Mehmood, K. Crammer, Z. G. Ives, F. Pereira, and S. Guha. Learning to create data-integrating queries. PVLDB, 1(1):785–796, 2008.
  • Hao He, Haixun Wang, Jun Yang, Philip S. Yu: BLINKS: ranked keyword searches on graphs. SIGMOD Conference 2007: 305-316

XML

  • C. Sun, C.-Y. Chan, and A. Goenka. Multiway SLCA-based keyword search in XML data. In WWW, 2007.
  • Junfeng Zhou, Zhifeng Bao, Wei Wang, Jinjia Zhao, and Xiaofeng Meng. Efficient query processing for XML keyword queries based on the IDList index. VLDB Journal, 2013.
  • Yufei Tao, Stavros Papadopoulos, Cheng Sheng, Kostas Stefanidis: Nearest keyword search in XML documents. SIGMOD Conference 2011: 589-600.

Effectiveness

  • Z. Liu and Y. Chen. Reasoning and identifying relevant matches for xml keyword search. PVLDB, 1(1):921–932, 2008.
  • Arash Termehchy, Marianne Winslett: Using structural information in XML keyword search effectively. ACM Trans. Database Syst. 36(1): 4 (2011).
  • D. Xin, Y. He, and V. Ganti. Keyword++: A Framework to Improve Keyword Search Over Entity Databases. PVLDB, 3(1):711–722, 2010.
  • Xiaohui Yu, Huxia Shi: CI-Rank: Ranking Keyword Search Results Based on Collective Importance. ICDE 2012: 78-89.
  • Ken Q. Pu: Keyword query cleaning using hidden Markov models. KEYS 2009: 27-32.
  • Zhen Zhang, Bin He, Kevin Chen-Chuan Chang: Understanding Web Query Interfaces: Best-Effort Parsing with Hidden Syntax. SIGMOD Conference 2004: 107-118.

Uncertain Data Management

  • Anish Das Sarma, Omar Benjelloun, Alon Y. Halevy, Jennifer Widom: Working Models for Uncertain Data. ICDE 2006: 7
    • possible world semantics
  • Reynold Cheng, Dmitri V. Kalashnikov, Sunil Prabhakar: Evaluating Probabilistic Queries over Imprecise Data. SIGMOD Conference 2003: 551-562
    • simple query, 1D data
  • Yufei Tao, Reynold Cheng, Xiaokui Xiao, Wang Kay Ngai, Ben Kao, Sunil Prabhakar: Indexing Multi-Dimensional Uncertain Data with Arbitrary Probability Density Functions. VLDB 2005: 922-933
    • indexing
  • Nilesh N. Dalvi, Dan Suciu: The dichotomy of conjunctive queries on probabilistic structures. PODS 2007: 293-302
    • theoretical foundation
  • Pei Jian, Bin Jiang, Xuemin Lin, Yidong Yuan: Probabilistic Skylines on Uncertain Data. VLDB 2007: 15-26
    • skyline
  • Jian Li, Barna Saha, Amol Deshpande: A Unified Approach to Ranking in Probabilistic Databases. PVLDB 2(1): 502-513 (2009)
    • ranking
  • Pei Jian, Bin Jiang, Xuemin Lin, Yidong Yuan: Probabilistic Skylines on Uncertain Data. VLDB 2007: 15-26
    • streaming
  • Xiang Lian, Lei Chen: A Generic Framework for Handling Uncertain Data with Local Correlations. PVLDB 4(1): 12-21 (2010)
    • correlation
  • Jeffrey Jestes, Feifei Li, Zhepeng Yan, Ke Yi: Probabilistic string similarity joins. SIGMOD Conference 2010: 327-338
    • string
  • George Beskales, Mohamed A. Soliman, Ihab F. Ilyas: Efficient Search for the Top-k Probable Nearest Neighbors in Uncertain Databases. PVLDB 1(1): 326 - 339 (2008).
    • nearest neighbor
  • Jianzhong Li, Zhaonian Zou, Hong Gao: Mining frequent subgraphs over uncertain graph databases under probabilistic semantics. VLDB J. 21(6): 753-777 (2012)
    • graph
  • Wenjie Zhang, Xuemin Lin, Ying Zhang, Muhammad Aamir Cheema, Qing Zhang: Stochastic skylines. ACM Trans. Database Syst. 37(2): 14 (2012)
    • more sophisticated model
  • Michele Dallachiesa, Besmira Nushi, Katsiaryna Mirylenka, Themis Palpanas: Uncertain Time-Series Similarity: Return to the Basics. PVLDB 5(11): 1662-1673 (2012)
    • time series
  • Saket Sathe, Hoyoung Jeung, Karl Aberer: Creating Probabilistic Databases from Imprecise Time-Series Data. ICDE 2011, 327 - 338.
    • creating probabilistic databases
  • Dan Suciu, A Tutorial on Probabilistic Databases, MSRA Summer School 213.
    • a tutorial on relational query evaluation
  • Cheqing Jin, Ke Yi, Lei Chen, Jeffrey Xu Yu, Xuemin Lin: Sliding-window top-k queries on uncertain streams. PVLDB 1(1): 301-312 (2008)