CS 591L: Cyber Security and Big Data Analytics (Fall 2014)
Welcome to CS 591L! This course is intended for graduate students interested in research in cyber security and/or big data. Both first-year and more senior graduate students are welcome. I also encourage undergraduate seniors who are interested in research to join.
Instructor
Dr. Yanfang (Fanny) Ye, Assistant Professor
Lane Department of Computer Science and Electrical Engineering
West Virginia University
Office: ESB935
Email: yanfang.ye (at) mail (dot) wvu (dot) edu
Lectures: Thursdays 5:00pm - 7:30pm in ESBE207
Office Hours: Fridays 2:30pm - 4:30pm, or by appointment
Course Description
This course introduces leading emerging topics in cyber security and provides the scientific foundation for solving real-world security problems. It also explores the challenges and opportunities of big data, and introduces typical big data analytic techniques and how to apply them, especially in the area of cyber security.
Course Prerequisites
There are no official prerequisites.
Text Book
No textbook is required. Course lectures, notes, and other materials will be available under the course materials section on eCampus or the WVU MIX system.
[References]:
Topics
Cyber Security
 Malware attacks and defenses
 Phishing fraud and detection
 Mobile threats and detection
 Internet of Things (IoT) and security
Big Data Analytics
 Introduction to big data: challenges and opportunities
 Big data analytic techniques (foundation of data mining)
  data preprocessing
  frequent pattern mining
  classification and regression
  ensemble learning
  clustering
  graph mining
 Models for big data analytics
  streaming algorithms
  MapReduce and algorithm design
Case Studies
Grading
A 90-100 B 80-89 C 70-79 D 60-69 F <60
 Homework (30%) You will be given several homework assignments. You may discuss homework with other students, but each student must write up solutions in their own words without assistance from anyone. Any submitted work that is copied from any source, or that is too similar to another submission to be an independent write-up, will not be given credit.
 Group survey (30%) You will be assigned one group survey and presentation. The survey should address a topic in cyber security and/or big data analytics, for example big data analytics in a specific domain (e.g., cyber security, IoT). The survey should cover at least 20 research papers and some real data sets.
 2-3 students per group
 (50%) A summary report in ACM Transaction format (at least 6 pages):
http://www.acm.org/publications/latex_style/v2 acmsmall.zip
 (50%) A 20-minute presentation + 10 minutes Q/A
 Group project (40%) You will be assigned one group research project. Project topics may relate to diverse big data analytics application domains, e.g., cyber security or smart devices. You will be required to use cutting-edge big data tools and techniques to solve the proposed research problems.
 2-3 students per group
 (5%) Fully motivate the problem
 (10%) Survey related work
 (25%) Develop your own solutions: substantial novel algorithm development, theoretical analysis, and implementation are expected
 (25%) A thorough empirical evaluation, using the given data set(s) or data you collect yourselves, with comparison against baseline methods
 (25%) A fully developed project report: 12 pages in ACM SIG Tighter Alternate style:
http://www.acm.org/sigs/publications/proceedingstemplates#aL2
 (10%) A 20-minute presentation + 10 minutes Q/A
Schedule (tentative)
Date  Topic  Event
Aug 21  * Course introduction * Tutorial of Cyber Security  HW1 out
Aug 28  * Malware and detection techniques  HW2 out
Sep 4  * Tutorial of Big Data Analytics  HW3 out
Sep 11  * Association Rule Discovery * Case Study 1: AR Discovery for Malware Pattern Analysis  HW4 (Programming assignment): Given: a data set of features for 1,000 malware samples (# of items > 2000), minimum support (minsupport), and minimum confidence (minconfidence); Output: frequent itemsets and association rules; Experiments and analysis: compare and analyze the performance of each mining algorithm (you are asked to implement the Apriori and FP-Growth algorithms; 5 bonus points will be given if you can propose, implement, and prove your own mining algorithm).
Sep 18  * Classification
Sep 25  * Classification * Ensemble Learning
Oct 2  * Model Evaluation * Regression * Clustering  HW5 out
Oct 9  * Data Preprocessing Techniques  1. Group Survey Announcement 2. Group Project Announcement
Oct 16  * Streaming Algorithm * MapReduce and algorithm design
Oct 23  * Graph mining * Case Study 2: Graph Mining for Malware Detection
Oct 30  * Case Study 3: Clustering in a MapReduce Framework for Phishing Fraud Detection * Case Study 4: Data Analytics for IoT and its Security
Nov 13  * Group Survey Presentation
Nov 20  * Hadoop Installation * Programming on Hadoop
Dec 4  * Group Project Presentation
Dec 12  * Group Project Presentation
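To make the HW4 inputs and outputs concrete, the sketch below illustrates what support and confidence mean on a tiny made-up data set. It is only an illustration of the definitions: it enumerates itemsets by brute force, which is neither Apriori nor FP-Growth and would not scale to the homework data. The API-call names, the thresholds, and the function names are all hypothetical.

```python
from itertools import combinations

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def frequent_itemsets(transactions, min_support):
    """Brute-force enumeration of frequent itemsets (toy data only)."""
    items = sorted(set().union(*transactions))
    result = {}
    for k in range(1, len(items) + 1):
        found = False
        for combo in combinations(items, k):
            s = support(combo, transactions)
            if s >= min_support:
                result[frozenset(combo)] = s
                found = True
        if not found:
            # If no k-itemset is frequent, no larger itemset can be frequent.
            break
    return result

def association_rules(freq, min_confidence):
    """Return (lhs, rhs, support, confidence) for rules lhs -> rhs."""
    rules = []
    for itemset, s in freq.items():
        for r in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), r):
                lhs = frozenset(lhs)
                # confidence(lhs -> rhs) = support(lhs and rhs) / support(lhs)
                conf = s / freq[lhs]
                if conf >= min_confidence:
                    rules.append((lhs, itemset - lhs, s, conf))
    return rules

# Hypothetical toy data: each transaction is the feature set of one sample.
samples = [
    frozenset({"CreateFile", "WriteFile", "RegSetValue"}),
    frozenset({"CreateFile", "WriteFile"}),
    frozenset({"CreateFile", "RegSetValue"}),
    frozenset({"WriteFile", "RegSetValue"}),
    frozenset({"CreateFile", "WriteFile", "RegSetValue"}),
]
freq = frequent_itemsets(samples, min_support=0.5)
for lhs, rhs, s, conf in association_rules(freq, min_confidence=0.7):
    print(sorted(lhs), "->", sorted(rhs), round(s, 2), round(conf, 2))
```

The homework algorithms should produce the same itemsets and rules as this kind of exhaustive check on small inputs, which makes it a possible way to validate an implementation before running the performance comparison.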
Group Project
Select one of the following four topics for your group project. In the project, you are required to use cutting-edge data mining and/or big data modeling techniques to solve the proposed research problems.
Group Project 1: Malware Detection Based on Win API Calls (difficulty factor: *1.0)
In this project, you are asked to investigate and extend data mining and/or big data modeling techniques for malware detection based on the extracted feature set (Win API Calls).
 You will be given a data set of 50,000 instances, half of which are features extracted from malware and the other half features extracted from benign files. (Download the Data Set)
 You are required to propose and develop your own solution (a substantial novel algorithm) to build the classification model based on the given data set. Theoretical analysis and implementation are also expected.
 A thorough empirical evaluation using the given data set, with comparison against baseline methods, is required.
 Then, a fully developed project report in the required format above should be submitted.
 Finally, present your project in class.
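As a point of reference for the "baseline methods" mentioned above, the sketch below shows one possible baseline: a Bernoulli naive Bayes classifier over binary API-call features. It assumes each file is represented as the set of API names it calls, which may not match the actual format of the provided data set. The API names, labels, and function names are hypothetical, and both classes must be present in the training data.

```python
import math

def train_bernoulli_nb(X, y, alpha=1.0):
    """X: list of sets of API names; y: list of labels (1 = malware, 0 = benign)."""
    vocab = sorted(set().union(*X))
    model = {"vocab": vocab, "prior": {}, "prob": {}}
    for c in (0, 1):
        docs = [x for x, label in zip(X, y) if label == c]
        model["prior"][c] = math.log(len(docs) / len(X))
        # Laplace-smoothed probability that a feature is present in class c.
        model["prob"][c] = {
            f: (sum(f in d for d in docs) + alpha) / (len(docs) + 2 * alpha)
            for f in vocab
        }
    return model

def predict(model, x):
    """Return the class (0 or 1) with the higher log-posterior for feature set x."""
    best, best_score = None, -math.inf
    for c in (0, 1):
        score = model["prior"][c]
        for f in model["vocab"]:
            p = model["prob"][c][f]
            score += math.log(p if f in x else 1.0 - p)
        if score > best_score:
            best, best_score = c, score
    return best

# Hypothetical toy training data.
X_train = [
    {"CreateRemoteThread", "WriteProcessMemory"},
    {"WriteProcessMemory", "VirtualAllocEx"},
    {"CreateFile", "ReadFile"},
    {"ReadFile", "CloseHandle"},
]
y_train = [1, 1, 0, 0]
nb_model = train_bernoulli_nb(X_train, y_train)
print(predict(nb_model, {"WriteProcessMemory"}))
```

A baseline like this is only a comparison point; the project itself asks for a novel algorithm evaluated against such baselines.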
Group Project 2: Malware Clustering Based on File Instructions (difficulty factor: *1.0)
In this project, you are asked to investigate and extend data mining and/or big data modeling techniques for malware clustering based on extracted file instruction features.
 You will be given a data set of 1,481 malware instances represented by function-based instruction sequences, which can be categorized into 422 malware families.
 You are required to propose and develop your own solution (a substantial novel algorithm) to partition the given malware instances into clusters. Theoretical analysis and implementation are also expected.
 A thorough empirical evaluation using the given data set, with comparison against baseline methods, is required.
 Then, a fully developed project report in the required format above should be submitted.
 Finally, present your project in class.
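One simple way to get started with this kind of data is sketched below: each sample's instruction sequence is turned into a set of n-grams, samples are linked when their Jaccard similarity exceeds a threshold, and the resulting clusters are scored against the family labels with purity. This is only a naive baseline under assumed inputs. The instruction sequences, the threshold, and all function names are hypothetical, and the real data format may differ.

```python
from collections import Counter

def ngrams(seq, n=2):
    """Set of consecutive n-grams in an instruction sequence."""
    return {tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets (defined as 1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def threshold_clusters(samples, n=2, min_sim=0.3):
    """Single-linkage clustering: link any pair with similarity >= min_sim."""
    grams = [ngrams(s, n) for s in samples]
    parent = list(range(len(samples)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(samples)):
        for j in range(i + 1, len(samples)):
            if jaccard(grams[i], grams[j]) >= min_sim:
                parent[find(i)] = find(j)
    return [find(i) for i in range(len(samples))]

def purity(cluster_ids, family_labels):
    """Fraction of samples that belong to the majority family of their cluster."""
    by_cluster = {}
    for c, f in zip(cluster_ids, family_labels):
        by_cluster.setdefault(c, Counter())[f] += 1
    return sum(max(cnt.values()) for cnt in by_cluster.values()) / len(family_labels)

# Hypothetical toy data: four samples from two families.
toy_samples = [
    ["push", "mov", "call", "ret"],
    ["push", "mov", "call", "pop", "ret"],
    ["xor", "add", "jmp", "xor", "add"],
    ["xor", "add", "jmp", "add"],
]
toy_families = ["A", "A", "B", "B"]
ids = threshold_clusters(toy_samples)
print(ids, purity(ids, toy_families))
```

Purity alone rewards over-splitting (one cluster per sample scores 1.0), so a fuller evaluation would pair it with a measure that penalizes fragmenting a family.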
Group Project 3: Malware Detection Based on File Relation Graphs (difficulty factor: *1.1)
For malware detection, the relations among file samples provide invaluable information about their properties. In this project, you are asked to investigate and extend data mining and/or big data modeling techniques for malware detection based on given file relationships.
 You will be given a data set representing the relationships among 69,165 file samples, 3,095 of which are malware, 22,583 of which are benign files, and 45,487 of which are unknown files. (Download the Data Set)
 You may need to construct graphs based on the given data set to represent the relations between file samples. You are then required to propose and develop your own solution (a substantial novel algorithm) to perform graph mining on the graphs you construct. Theoretical analysis and implementation are also expected.
 A thorough empirical evaluation using the given data set, with comparison against baseline methods, is required.
 Then, a fully developed project report in the required format above should be submitted.
 Finally, present your project in class.
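To illustrate how labels on known files might inform unknown ones through file relations, the sketch below builds an undirected graph from pairs of related files and runs a basic label propagation: known malware is fixed at +1, known benign files at -1, and each unknown file repeatedly takes the average score of its neighbors. This is a generic baseline, not a prescribed method. The edge-list input format, the file identifiers, and the function names are assumptions.

```python
from collections import defaultdict

def build_graph(edges):
    """Build an undirected adjacency map from (file_a, file_b) pairs."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj

def propagate(adj, labels, iterations=20):
    """Propagate labels over the graph.

    labels maps known files to +1.0 (malware) or -1.0 (benign).
    Unknown files start at 0.0 and end with a score in [-1, 1].
    """
    scores = {n: labels.get(n, 0.0) for n in adj}
    for _ in range(iterations):
        new_scores = {}
        for node, neighbors in adj.items():
            if node in labels:
                new_scores[node] = labels[node]  # known labels stay fixed
            else:
                new_scores[node] = sum(scores[m] for m in neighbors) / len(neighbors)
        scores = new_scores
    return scores

# Hypothetical toy relations: m1 is known malware, b1 is known benign.
toy_edges = [("m1", "u1"), ("u1", "u2"), ("u2", "b1"), ("m1", "u3")]
toy_labels = {"m1": 1.0, "b1": -1.0}
toy_scores = propagate(build_graph(toy_edges), toy_labels)
for name in sorted(toy_scores):
    print(name, round(toy_scores[name], 3))
```

In this toy graph, an unknown file connected only to malware ends up with a score of +1, while files on the path between malware and benign files receive intermediate scores whose sign indicates the closer label.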
Group Project 4: IoT and its Application on Children's Safety (difficulty factor: *1.2)
In recent years, crimes against children and cases of missing children have increased at a high rate. There is therefore an urgent need for safety support systems that help prevent crimes against children and help prevent children from getting lost, especially when parents or guardians are not with them, for example on their way to and from school. In this project, based on the children's location histories reported by the smart devices they wear (which can be simulated by smartphones), you are asked to explore the children's life patterns, which capture their general lifestyles and regularities, and to apply data mining and/or big data modeling techniques to learn the children's safe regions and safe routes. When the children are in potential danger (such as staying in an unfamiliar region or deviating from the safe routes), their parents or guardians will receive automatic notifications. You are also asked to explore an effective, energy-efficient positioning scheme for the smart devices that maintains location-tracking accuracy while keeping energy overhead low.
 You should collect the data using a mobile application that you develop yourselves.
 You may need to propose and develop your own solution (a substantial novel algorithm) to learn the children's safe regions and safe routes from their location history and to detect potential dangers (you can take yourselves as the children and collect your own location history for learning and prediction). Theoretical analysis and implementation are also expected.
 A thorough empirical evaluation using the data set you collected, with comparison against baseline methods, is required.
 Then, a fully developed project report in the required format above should be submitted.
 Finally, present your project in class.
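As a rough illustration of what "learning safe regions" could mean, the sketch below divides the map into fixed-size latitude/longitude grid cells, treats cells that appear often enough in the location history as safe, and raises an alert when a new position falls outside them. This is a deliberately naive starting point: it ignores time of day, safe routes, and energy-efficient positioning, all of which the project asks you to address. The cell size, visit threshold, coordinates, and function names are hypothetical.

```python
import math
from collections import Counter

def cell(lat, lon, size_deg=0.001):
    """Map a coordinate to the integer index of its grid cell."""
    return (math.floor(lat / size_deg), math.floor(lon / size_deg))

def learn_safe_cells(history, min_visits=3, size_deg=0.001):
    """Cells visited at least min_visits times in the location history."""
    counts = Counter(cell(lat, lon, size_deg) for lat, lon in history)
    return {c for c, n in counts.items() if n >= min_visits}

def is_alert(lat, lon, safe_cells, size_deg=0.001):
    """True when the current position lies outside every learned safe cell."""
    return cell(lat, lon, size_deg) not in safe_cells

# Hypothetical location history: three fixes near one place, one elsewhere.
toy_history = [
    (39.63505, -79.95505),
    (39.63507, -79.95503),
    (39.63502, -79.95508),
    (39.64005, -79.96005),
]
toy_safe = learn_safe_cells(toy_history)
print(is_alert(39.63504, -79.95506, toy_safe))
print(is_alert(39.70005, -79.90005, toy_safe))
```

A fixed grid in degrees is simple but coarse, since cell width in meters varies with latitude; density-based clustering of location fixes is one alternative you might compare against.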