
CS 591L: Cyber Security and Big Data Analytics (Fall 2014)

Welcome to CS591L! This course is intended for graduate students interested in research in cyber security and/or big data. Both first-year and more senior graduate students are welcome. I also encourage undergraduate seniors who are interested in research to join.

Instructor

Dr. Yanfang (Fanny) Ye, Assistant Professor
Lane Department of Computer Science and Electrical Engineering
West Virginia University

Office: ESB-935
Email: yanfang.ye (at) mail (dot) wvu (dot) edu
Lectures: Thursdays 5:00pm -- 7:30pm in ESB-E207
Office Hours: Fridays 2:30pm -- 4:30pm, or by appointment

Course Description

This course introduces top emerging topics in cyber security and provides the scientific foundation for solving real-world security problems. It also explores the challenges and opportunities of big data, introduces typical big data analytic techniques, and shows how to apply them, especially in the area of cyber security.

Course Prerequisites

There are no official prerequisites.

Textbook

No textbook is required. Course lectures, notes, and other materials can be found under the course materials section on Ecampus or the WVU MIX System.

[References]:

Topics

Cyber Security

Malware attacks and defenses
Phishing fraud and detection
Mobile threats and detection
Internet of Things (IoT) and security

Big Data Analytics

Introduction to big data: challenges and opportunities
Big data analytic techniques (foundation of data mining)
- data pre-processing
- frequent pattern mining
- classification and regression
- ensemble learning
- clustering
- graph mining
Models for big data analytics
- streaming algorithms
- MapReduce and algorithm design

Case Studies

Grading

A: 90-100, B: 80-89, C: 70-79, D: 60-69, F: below 60

Homework (30%) You will be given several homework assignments. You may discuss homework with other students, but each student must write up solutions in their own words without assistance from anyone. Any submitted work that is copied from any source, or that is too similar to another submission to be an independent write-up, will not be given credit.
Group survey (30%) You will be assigned one group survey and presentation. The survey should be conducted on a topic in cyber security and/or big data analytics, for example big data analytics in a specific domain (e.g., cyber security, IoT). The survey should cover at least 20 research papers and some real data sets.

2-3 students per group
(50%) A summary report in ACM Transactions format (at least 6 pages):
http://www.acm.org/publications/latex_style/v2-acmsmall.zip
(50%) A 20-minute presentation + 10-minute Q&A

Group project (40%) You will be assigned one group research project. Project topics may relate to diverse big data analytics application domains, e.g., cyber security or smart devices. You will be required to use cutting-edge big data tools and techniques to solve the proposed research problems.

2-3 students per group
(5%) Fully motivate the problem
(10%) Survey related work
(25%) Develop your own solutions -- substantial novel algorithm development, theoretical analysis, and implementation are expected
(25%) A thorough empirical evaluation, using the given data set(s) or data you collect, and comparing with baseline methods
(25%) A fully developed project report: 12 pages in ACM SIG Tighter Alternate style:
http://www.acm.org/sigs/publications/proceedings-templates#aL2
(10%) A 20-minute presentation + 10-minute Q&A

Schedule (tentative)

Date	Topic	Event
Aug 21	* Course introduction * Tutorial of Cyber Security	HW1 out
Aug 28	* Malware and detection techniques	HW2 out
Sep 4	* Tutorial of Big Data Analytics	HW3 out
Sep 11	* Association Rule Discovery * Case Study 1: AR Discovery for Malware Pattern Analysis	HW4 (Programming assignment). Given: a data set including 1,000 malware sample features (more than 2,000 items), min-support, and min-confidence. Output: frequent itemsets and association rules. Experiments and analysis: implement the Apriori and FP-Growth algorithms, then compare and analyze the performance of each mining algorithm. 5 bonus points will be given if you can propose, implement, and prove your own mining algorithm.
Sep 18	* Classification
Sep 25	* Classification * Ensemble Learning
Oct 2	* Model Evaluation * Regression * Clustering	HW5 out
Oct 9	* Data Preprocessing Techniques	1. Group Survey Announcement 2. Group Project Announcement
Oct 16	* Streaming Algorithm * Map-reduce and algorithm design
Oct 23	* Graph mining * Case Study 2: Graph Mining for Malware Detection
Oct 30	* Case Study 3: Clustering in a Map-reduce Framework for Phishing Fraud Detection * Case Study 4: Data Analytics for IoT and its Security
Nov 13	* Group Survey Presentation
Nov 20	* Hadoop Installation * Programming on Hadoop
Dec 4	* Group Project Presentation
Dec 12	* Group Project Presentation
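
To make the HW4 inputs and outputs in the schedule above concrete, here is a minimal Python sketch of Apriori-style frequent itemset mining followed by association rule generation. It is an illustration only, not a reference solution: the function names (apriori, association_rules) and the toy transactions with API-like item names are hypothetical, and the actual data set format is not specified here. The sketch is not tuned for a data set with more than 2,000 items.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item of the itemset.
        return sum(1 for t in transactions if itemset <= t) / n

    # Level 1: frequent single items.
    current = {}
    for item in {i for t in transactions for i in t}:
        single = frozenset([item])
        s = support(single)
        if s >= min_support:
            current[single] = s
    frequent = dict(current)

    k = 2
    while current:
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a, b in combinations(list(current), 2)
                      if len(a | b) == k}
        # Prune step: keep candidates whose (k-1)-subsets are all frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current
                             for s in combinations(c, k - 1))}
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
        frequent.update(current)
        k += 1
    return frequent


def association_rules(frequent, min_confidence):
    """Return (lhs, rhs, support, confidence) for rules meeting min_confidence."""
    rules = []
    for itemset, supp in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in combinations(itemset, r):
                lhs = frozenset(lhs)
                # Every subset of a frequent itemset is frequent, so it is in the dict.
                confidence = supp / frequent[lhs]
                if confidence >= min_confidence:
                    rules.append((lhs, itemset - lhs, supp, confidence))
    return rules


# Hypothetical toy transactions: each set lists features seen in one sample.
toy_transactions = [
    {"CreateFile", "WriteFile", "RegSetValue"},
    {"CreateFile", "WriteFile"},
    {"CreateFile", "RegSetValue"},
    {"WriteFile", "RegSetValue"},
    {"CreateFile", "WriteFile", "RegSetValue"},
]
toy_frequent = apriori(toy_transactions, min_support=0.5)
toy_rules = association_rules(toy_frequent, min_confidence=0.7)
for lhs, rhs, supp, conf in toy_rules:
    print(sorted(lhs), "->", sorted(rhs), f"support={supp:.2f} confidence={conf:.2f}")
```

On the toy data, the three single items and the three pairs are frequent, while the three-item set appears in only 2 of 5 transactions and is dropped. FP-Growth produces the same frequent itemsets without generating candidates, which is the main point of the performance comparison.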

Group Project

Select one of the following four topics for your group project. In the project, you are required to use cutting-edge data mining and/or big data modeling techniques to solve the proposed research problems.

Group Project 1: Malware Detection Based on Win API Calls (difficulty factor: *1.0)

In this project, you are asked to investigate and extend data mining and/or big data modeling techniques for malware detection based on the extracted feature set (Win API Calls).

You will be given a data set including 50,000 instances, half of which are features extracted from malware and the other half features extracted from benign files. (Download the Data Set)

You are required to propose and develop your own solution (substantial novel algorithm) to build the classification model based on the given data set. Theoretical analysis and implementation are also expected.

A thorough empirical evaluation, using the given data set and comparing with baseline methods, is required.

Then, a fully developed project report in the required format above should be submitted.

Finally, present your project in the class.
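
For the empirical evaluation in Group Project 1, the following Python sketch shows one way to compute common detection metrics from true and predicted labels, so that a proposed classifier and a baseline can be compared on the same test split. The function name detection_metrics, the label encoding (1 = malware, 0 = benign), and the toy labels are assumptions for illustration; the course does not prescribe a particular set of metrics.

```python
def detection_metrics(y_true, y_pred, positive=1):
    """Confusion-matrix metrics for a binary detector (positive = malware)."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1  # malware correctly flagged
        elif pred == positive:
            fp += 1  # benign file flagged as malware
        elif truth == positive:
            fn += 1  # malware missed
        else:
            tn += 1  # benign file correctly passed

    def ratio(num, den):
        # Avoid division by zero when a class is absent.
        return num / den if den else 0.0

    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "false_positive_rate": ratio(fp, fp + tn),
    }


# Hypothetical toy labels: 1 = malware, 0 = benign.
toy_truth = [1, 1, 1, 0, 0, 0, 0, 1]
toy_predictions = [1, 1, 0, 0, 0, 1, 0, 1]
toy_metrics = detection_metrics(toy_truth, toy_predictions)
print(toy_metrics)
```

Reporting the false positive rate alongside recall is useful for malware detection, because flagging benign files carries its own cost.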

Group Project 2: Malware Clustering Based on File Instructions (difficulty factor: *1.0)

In this project, you are asked to investigate and extend data mining and/or big data modeling techniques for malware clustering based on extracted file instruction features.

You will be given a data set including 1,481 malware instances represented by function-based instruction sequences, which can be categorized into 422 malware families.

You are required to propose and develop your own solution (substantial novel algorithm) to partition the given malware instances into clusters. Theoretical analysis and implementation are also expected.

A thorough empirical evaluation, using the given data set and comparing with baseline methods, is required.

Then, a fully developed project report in the required format above should be submitted.

Finally, present your project in the class.
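
Because the Group Project 2 instances come with family labels, a clustering result can be scored against them. The Python sketch below computes cluster purity as one possible external measure; the function name cluster_purity and the toy assignments are hypothetical, and purity is only one of several measures you might report.

```python
from collections import Counter, defaultdict


def cluster_purity(cluster_ids, family_labels):
    """Fraction of instances that belong to the majority family of their cluster."""
    if not family_labels or len(cluster_ids) != len(family_labels):
        raise ValueError("inputs must be non-empty and of equal length")
    members = defaultdict(list)
    for cluster_id, family in zip(cluster_ids, family_labels):
        members[cluster_id].append(family)
    # For each cluster, count how many members share its most common family.
    majority_total = sum(Counter(families).most_common(1)[0][1]
                         for families in members.values())
    return majority_total / len(family_labels)


# Hypothetical toy result: six instances, three clusters, three families.
toy_clusters = [0, 0, 0, 1, 1, 2]
toy_families = ["a", "a", "b", "b", "b", "c"]
toy_purity = cluster_purity(toy_clusters, toy_families)
print(f"purity = {toy_purity:.3f}")
```

Purity reaches 1.0 trivially when every instance is placed in its own cluster, so it is best reported together with the number of clusters or a second measure.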

Group Project 3: Malware Detection Based on File Relation Graphs (difficulty factor: *1.1)

For malware detection, the relations among file samples provide invaluable information about their properties. In this project, you are asked to investigate and extend data mining and/or big data modeling techniques for malware detection based on given file relationships.

You will be given a data set representing the relationship between 69,165 file samples, 3,095 of which are malware, 22,583 of which are benign files, and 45,487 of which are unknown files. (Download the Data Set)

You may need to construct graphs from the given data set to represent the relations between file samples. Then you are required to propose and develop your own solution (substantial novel algorithm) to mine the constructed graphs. Theoretical analysis and implementation are also expected.

A thorough empirical evaluation, using the given data set and comparing with baseline methods, is required.

Then, a fully developed project report in the required format above should be submitted.

Finally, present your project in the class.
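
As a starting point for the graph construction step in Group Project 3, the Python sketch below builds an undirected graph from file relations and spreads known labels to unknown files by neighbor majority vote. This is a simple baseline-style illustration, not the novel algorithm the project asks for. The input format (pairs of file identifiers), the function names build_graph and propagate_labels, and the toy data are all assumptions, since the actual data set layout is not described here.

```python
from collections import defaultdict


def build_graph(edges):
    """Build an undirected adjacency map from (file_a, file_b) relation pairs."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def propagate_labels(graph, seed_labels, max_rounds=10):
    """Label unknown nodes by majority vote of already-labeled neighbors."""
    labels = dict(seed_labels)
    for _ in range(max_rounds):
        updates = {}
        for node in graph:
            if node in labels:
                continue
            votes = defaultdict(int)
            for neighbor in graph[node]:
                if neighbor in labels:
                    votes[labels[neighbor]] += 1
            if not votes:
                continue
            ranked = sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))
            # Assign only on a strict majority; tied nodes stay unknown.
            if len(ranked) == 1 or ranked[0][1] > ranked[1][1]:
                updates[node] = ranked[0][0]
        if not updates:
            break
        # Apply after the sweep so each round uses the previous round's labels.
        labels.update(updates)
    return labels


# Hypothetical toy relations and seed labels.
toy_edges = [("f1", "f2"), ("f2", "f3"), ("f4", "f5"),
             ("f5", "f6"), ("f3", "f7"), ("f6", "f7")]
toy_seeds = {"f1": "malware", "f4": "benign"}
toy_labels = propagate_labels(build_graph(toy_edges), toy_seeds)
print(toy_labels)
```

In the toy graph, f2 and f3 inherit the malware label and f5 and f6 the benign label, while f7 receives one vote from each side and is left unknown.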

Group Project 4: IoT and its Application on Children's Safety (difficulty factor: *1.2)

In recent years, crimes against children and cases of missing children have increased at a high rate. There is therefore an urgent need for safety support systems that help prevent crimes against children and prevent children from getting lost, especially when parents or guardians are not with them, such as on their way to and from school. In this project, based on the children's location histories reported by the smart devices they wear (which can be simulated by smartphones), you are asked to explore the children's life patterns, which capture their general lifestyles and regularities, and to apply data mining and/or big data modeling techniques to learn the children's safe regions and safe routes. When the children are in potential danger (such as staying in an unfamiliar region or deviating from the safe routes), their parents or guardians will receive automatic notifications. You are also asked to explore an effective, energy-efficient positioning scheme for the smart devices that balances location-tracking accuracy against energy overhead.

You should collect the data using a mobile application that you develop yourselves.

You may need to propose and develop your own solution (substantial novel algorithm) to learn the safe regions as well as safe routes of the children based on their location history and detect potential dangers (you can take yourselves as the children and collect your own location history for learning and prediction). Theoretical analysis and implementation are also expected.

A thorough empirical evaluation, using the given data set and comparing with baseline methods, is required.

Then, a fully developed project report in the required format above should be submitted.

Finally, present your project in the class.
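
To illustrate the idea of learning safe regions from a location history in Group Project 4, here is a minimal Python sketch that divides the map into grid cells, treats frequently visited cells as safe, and flags a new position that falls outside them. It is a simplified assumption-laden baseline, not the expected solution: the cell size, the visit threshold, the function names, and the toy coordinates are all hypothetical, and the sketch ignores routes, time of day, and positioning energy cost.

```python
import math
from collections import Counter


def cell_of(lat, lon, cell_deg=0.001):
    """Map a coordinate to a grid cell index (cell_deg is an assumed cell size)."""
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))


def learn_safe_cells(history, min_visits=3, cell_deg=0.001):
    """Return the set of cells visited at least min_visits times."""
    counts = Counter(cell_of(lat, lon, cell_deg) for lat, lon in history)
    return {cell for cell, visits in counts.items() if visits >= min_visits}


def is_potentially_unsafe(lat, lon, safe_cells, cell_deg=0.001):
    """Flag a position whose grid cell is not among the learned safe cells."""
    return cell_of(lat, lon, cell_deg) not in safe_cells


# Hypothetical location history: repeated fixes near two places plus one outlier.
toy_history = [
    (39.63551, -79.95552), (39.63553, -79.95555), (39.63557, -79.95558),
    (39.64612, -79.97341), (39.64615, -79.97345), (39.64618, -79.97348),
    (39.65011, -79.96023),
]
toy_safe_cells = learn_safe_cells(toy_history, min_visits=3)
print(is_potentially_unsafe(39.63555, -79.95554, toy_safe_cells))
print(is_potentially_unsafe(39.70042, -79.90017, toy_safe_cells))
```

A fixed grid is easy to reason about but sensitive to cell boundaries; density-based clustering of stay points is a common alternative to consider when comparing against baselines.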