MTU Courses - COMP9062 - Big Data Processing

Module Details

Module Code:	COMP9062
Title:	Big Data Processing
Long Title:	Big Data Processing
NFQ Level:	Expert
Valid From:	Semester 1 - 2018/19 ( September 2018 )

Duration:	1 Semester

Credits:	5

Field of Study:	4811 - Computer Science

Module Delivered in:	4 programme(s)

Module Description:

Data is now being generated at an unprecedented rate. The volume, velocity and variety of this data mean that traditional database architectures are no longer suitable for storing, managing and analysing it. As a result, organisations are now using distributed systems, in which parts of the data are stored in distributed databases and are managed and analysed by distributed algorithms.
In this module, students will be introduced to distributed architectures, frameworks and algorithms for storing, managing and analysing large-scale datasets. Students will learn how to deal not only with static data but also with data in motion, by performing real-time data analytics.

Learning Outcomes
On successful completion of this module the learner will be able to:
#	Learning Outcome Description
LO1	Appraise how the velocity, volume and variety of data will impact how data is stored, managed and analysed.
LO2	Survey the different tools that constitute a big data framework.
LO3	Process large-scale temporal, geospatial, text and graph datasets using descriptive and analytical tools.
LO4	Design and develop a machine learning algorithm for performing large scale distributed computation.

Dependencies
Module Recommendations
This is prior learning (or a practical skill) that is strongly recommended before enrolment in this module. You may enrol in this module if you have not acquired the recommended learning but you will have considerable difficulty in passing (i.e. achieving the learning outcomes of) the module. While the prior learning is expressed as named MTU module(s) it also allows for learning (in another module or modules) which is equivalent to the learning specified in the named module(s).

Incompatible Modules
These are modules which have learning outcomes that are too similar to the learning outcomes of this module. You may not earn additional credit for the same learning and therefore you may not enrol in this module if you have successfully completed any modules in the incompatible list.
No incompatible modules listed
Co-requisite Modules
No Co-requisite modules listed
Requirements
This is prior learning (or a practical skill) that is mandatory before enrolment in this module is allowed. You may not enrol on this module if you have not acquired the learning specified in this section.
No requirements listed

Indicative Content
The Big Data Revolution. Data storage and data processing: Historical evolution. New infrastructure, data models and processing techniques required to deal with big data. Main challenges: Capturing, storing, searching, analysing and visualising the data.
Distributed Computing. Sequential vs. non-sequential computation. Parallel, concurrent and distributed computing: Definitions and differences. A sequential vs. a distributed framework for processing large-scale datasets: Efficiency, resiliency, scalability. Process communication: Asynchronous message passing, message inbox, priority policies, time limits. Process planner: Dependent processes via links, fault tolerance via monitors and state notification. Actors and Streams.
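As an illustration of the process communication topics above (asynchronous message passing, a message inbox, a priority policy and a receive time limit), the following is a minimal single-machine sketch in Python. The `Actor` class, its method names and the `"stop"` message are hypothetical names chosen for this example; it uses one thread and a standard-library priority queue rather than a real actor runtime such as Erlang's.

```python
import queue
import threading


class Actor:
    """Hypothetical minimal actor: one thread draining a priority inbox."""

    def __init__(self, receive_timeout=0.5):
        self.inbox = queue.PriorityQueue()    # lower number = higher priority
        self.receive_timeout = receive_timeout
        self.handled = []                     # messages processed so far, in order
        self._seq = 0                         # tie-breaker: FIFO within one priority
        self._lock = threading.Lock()
        self._thread = threading.Thread(target=self._run)

    def send(self, message, priority=1):
        """Asynchronous send: enqueue the message and return immediately."""
        with self._lock:
            self._seq += 1
            seq = self._seq
        self.inbox.put((priority, seq, message))

    def _run(self):
        while True:
            try:
                _, _, message = self.inbox.get(timeout=self.receive_timeout)
            except queue.Empty:
                break                         # time limit reached: stop waiting
            if message == "stop":
                break
            self.handled.append(message)

    def start(self):
        self._thread.start()

    def join(self):
        self._thread.join()


if __name__ == "__main__":
    actor = Actor()
    actor.send("low-priority log line", priority=5)
    actor.send("urgent alert", priority=0)
    actor.send("routine update", priority=1)
    actor.start()
    actor.join()
    print(actor.handled)
```

Because all three messages are queued before the actor starts, they are handled in priority order rather than in send order. The actor then stops once the inbox has been empty for the length of the time limit.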
Big Data Framework. Dataset characterisation: Variety, velocity and volume. Data Framework ecosystem overview: Tools to ingest, store, analyse and manage data. Data integration: Extracting, transforming and loading relational and non-relational data. Distributed File system: Cluster components and roles.
Large-Scale Distributed Computation. Map-sort-reduce process: Data processing, key/value-based communication, standard I/O file streaming. Spark: Core, Shell, DataSets and DataFrames. Eager and lazy evaluation. Resilient Distributed Datasets: Transformations and actions, basic API. Distributed processing and persistence: RDD partitions and job execution. Spark streaming: Offline vs. online data processing. Advantages and disadvantages of Spark streaming. Architecture and application flow for Spark streaming. Applications: Text, temporal and geospatial data processing.
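The map-sort-reduce process and its key/value-based communication can be sketched in plain Python with a word count. This is a single-process illustration only: the function names `mapper`, `reducer` and `map_sort_reduce` are assumptions made for this example, and in a real cluster the three phases would run on different nodes, typically exchanging key/value pairs as text lines over standard input and output.

```python
from itertools import groupby
from operator import itemgetter


def mapper(line):
    """Map phase: emit one (word, 1) key/value pair per word in the line."""
    for word in line.lower().split():
        yield (word, 1)


def reducer(key, values):
    """Reduce phase: combine all values that share the same key."""
    return (key, sum(values))


def map_sort_reduce(lines):
    # Map: a generator, so no pair is produced until the sort consumes it.
    pairs = (pair for line in lines for pair in mapper(line))
    # Sort (the "shuffle"): bring identical keys next to each other.
    ordered = sorted(pairs, key=itemgetter(0))
    # Reduce: one call per distinct key, over that key's group of values.
    return [
        reducer(key, (value for _, value in group))
        for key, group in groupby(ordered, key=itemgetter(0))
    ]


if __name__ == "__main__":
    print(map_sort_reduce(["big data", "Big Spark data"]))
    # [('big', 2), ('data', 2), ('spark', 1)]
```

The sort step is what lets the reducer see all values for a key together. The generator in the map step also gives a small taste of lazy evaluation: the pairs are produced only when the sort consumes them.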
Machine Learning for Large-Scale Distributed Computation. Algorithmic design for parallel computing environments: K-means clustering, Decision trees and random forests, graph processing, neural nets, recommender systems. Spark MLlib: Survey of existing algorithms for parallel analysis of large data sets.
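To make the idea of algorithmic design for parallel environments concrete, the sketch below expresses one K-means iteration as a map step over data partitions followed by an associative reduce. It is a plain-Python illustration under stated assumptions, not Spark MLlib's implementation: the names `closest`, `partial_sums`, `merge` and `kmeans_step` are hypothetical, and the partitions are ordinary lists standing in for distributed data.

```python
from functools import reduce


def closest(point, centroids):
    """Index of the centroid nearest to the point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def partial_sums(partition, centroids):
    """Map side: per-cluster coordinate sums and counts for one partition."""
    sums = {}
    for point in partition:
        i = closest(point, centroids)
        total, count = sums.get(i, ([0.0] * len(point), 0))
        sums[i] = ([t + p for t, p in zip(total, point)], count + 1)
    return sums


def merge(a, b):
    """Reduce side: combine two partial results; order of merging is irrelevant."""
    merged = dict(a)
    for i, (total, count) in b.items():
        if i in merged:
            prev_total, prev_count = merged[i]
            merged[i] = ([x + y for x, y in zip(prev_total, total)], prev_count + count)
        else:
            merged[i] = (total, count)
    return merged


def kmeans_step(partitions, centroids):
    """One iteration: new centroid = mean of the points assigned to it."""
    combined = reduce(merge, (partial_sums(p, centroids) for p in partitions), {})
    return [
        [t / combined[i][1] for t in combined[i][0]] if i in combined
        else list(centroids[i])               # an empty cluster keeps its centroid
        for i in range(len(centroids))
    ]


if __name__ == "__main__":
    partitions = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 10.0), (10.0, 12.0)]]
    print(kmeans_step(partitions, [(1.0, 1.0), (9.0, 9.0)]))
    # [[0.0, 1.0], [10.0, 11.0]]
```

Each partition only ships a small dictionary of sums and counts rather than its raw points, which is the property that makes this formulation suitable for distributed execution.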

Module Content & Assessment
Assessment Breakdown	%
Coursework	100.00%

Assessments

Coursework

Assessment Type	Project	% of Total Mark	50
Timing	Week 8	Learning Outcomes	1,2,3
Assessment Description Complete an analytics project by performing a comprehensive analysis of different offline datasets by applying appropriate technologies e.g. MapReduce and Spark. Produce a report comparing and contrasting both approaches in terms of their expressiveness and efficiency.

Assessment Type	Project	% of Total Mark	50
Timing	Week 13	Learning Outcomes	3,4
Assessment Description Perform descriptive and predictive analytics over different offline and online datasets by applying distributed machine learning algorithms. Design and implement at least one of the algorithms proposed.

No End of Module Formal Examination

Reassessment Requirement
Coursework Only This module is reassessed solely on the basis of re-submitted coursework. There is no repeat written examination.

The University reserves the right to alter the nature and timings of assessment

Module Workload

Workload: Full Time
Workload Type	Contact Type	Workload Description	Frequency	Average Weekly Learner Workload	Hours
Lecture	Contact	Lecture delivering theory underpinning learning outcomes	Every Week	2.00	2
Lab	Contact	Practical computer-based lab supporting learning outcomes	Every Week	2.00	2
Independent & Directed Learning (Non-contact)	Non Contact	Student undertakes independent study. The student reads recommended papers and practices implementation.	Every Week	3.00	3
Total Hours					7.00
Total Weekly Learner Workload					7.00
Total Weekly Contact Hours					4.00

Workload: Part Time
Workload Type	Contact Type	Workload Description	Frequency	Average Weekly Learner Workload	Hours
Lecture	Contact	Lecture delivering theory underpinning learning outcomes	Every Week	2.00	2
Lab	Contact	Practical computer-based lab supporting learning outcomes	Every Week	2.00	2
Independent & Directed Learning (Non-contact)	Non Contact	Student undertakes independent study. The student reads recommended papers and practices implementation.	Every Week	3.00	3
Total Hours					7.00
Total Weekly Learner Workload					7.00
Total Weekly Contact Hours					4.00

Module Resources
Recommended Book Resources
Ofer Mendelevitch, Casey Stella and Douglas Eadline. (2017), Practical Data Science with Hadoop and Spark: Designing and Building Effective Analytics at Scale, Pearson Education, [ISBN: 9780134024141].
Nick Pentreath. (2015), Machine Learning with Spark, PACKT Publishing, [ISBN: 9781783288519].
Supplementary Book Resources
Joe Armstrong. (2013), Programming Erlang, Pragmatic Bookshelf, [ISBN: 9781937785536].
Srinath Perera and Thilina Gunarathne. (2013), Hadoop MapReduce Cookbook, PACKT Publishing, [ISBN: 9781849517294].
This module does not have any article/paper resources
Other Resources
Website, Hadoop Cloudera Map-Reduce documentation, https://www.cloudera.com/documentation/enterprise/5-5-x/categories/hub_mapreduce.html
Website, Hadoop Cloudera Spark documentation, https://www.cloudera.com/documentation/enterprise/5-5-x/categories/hub_spark.html

Module Delivered in
Programme Code	Programme	Semester	Delivery
CR_KARIN_9	Master of Science in Artificial Intelligence	1	Mandatory
CR_KINSE_9	Master of Science in Cybersecurity	1	Elective
CR_KSADE_9	Master of Science in Software Architecture & Design	2	Mandatory
CR_KINSY_9	Postgraduate Diploma in Science in Cybersecurity	1	Elective

https://mtu.akarisoftware.com/
