COSC 526 Big Data Mining



  • Feb 21, 2015: The Projects page has been updated. Project deadlines and other details are now listed there in a table; please follow the guidelines on that page.
  • Feb 19, 2015: We had to cancel class due to inclement weather.
  • Feb 10-12, 2015: We will have our guest lecture on graph analytics by Dr. Sukumar (Rangan) Sreenivas from Oak Ridge National Laboratory.
  • Jan 27, 2015: We will have our first guest lecture, on the Berkeley Data Analytics Stack, by Dr. Seung-Hwan Kim from Oak Ridge National Laboratory.
  • Jan 13, 2015: Assignment 0 is due today. Please form project teams.
  • Jan 8, 2015: Belated Happy New Year! Welcome to the first class.

Class Description

The emphasis of this section will be on Big Data. Tentative topics to be covered include: (1) Introduction to big data mining paradigms using (a) distributed computing tools such as MapReduce/Hadoop and (b) multi-core tools, including GPUs and heterogeneous compute resources, (2) Ideas to "munge, manipulate and analyze" large volumes of data, (3) Streaming data analytics, (4) Randomized/probabilistic approaches to constructing matrix decompositions and dimensionality reduction, (5) Similarity search in high-dimensional datasets, (6) Link detection/PageRank and applications, and (7) Graph mining techniques.

Planned datasets for course projects include: (1) large volumes of social media data (over a year's worth of data collected at ORNL; >10 TB), (2) open-source claims data from the Centers for Medicare & Medicaid Services (~80-100 GB, but complex and noisy healthcare-related data), (3) the 1000 Genomes Project (>2 TB of data, but highly complex and noisy biological datasets), and (4) cybersecurity data.

The course grade will be based on three mini-projects (with mini implementation examples), a course project (which begins within the first two weeks of class), and a final poster session.