MTU Courses - DATA9002 - Distributed Data Management

Module Details

Module Code:	DATA9002
Title:	Distributed Data Management
Long Title:	Distributed Data Management
NFQ Level:	Expert
Valid From:	Semester 1 - 2017/18 (September 2017)

Duration:	1 Semester

Credits:	5

Field of Study:	4816 - Data Format

Module Delivered in:	4 programme(s)

Module Description:

Big data analytics turns large datasets into high-quality information, providing deeper insights that enable better decisions.
However, big data requires novel data storage and data processing techniques.
In this module, the learner will be introduced to different NoSQL-based data models, their possible combinations and the best use-cases for each of them.
The learner will also compare and contrast different large-scale analytics libraries in terms of their expressiveness and efficiency.

Learning Outcomes
On successful completion of this module the learner will be able to:
#	Learning Outcome Description
LO1	Appraise the challenges posed by big data and the new infrastructure, data models and processing techniques it demands.
LO2	Compare and contrast the main NoSQL-based data models, discriminating the best fit for different use-cases.
LO3	Combine document-oriented and graph-based data models into a fit-for-purpose multi-component system.
LO4	Demonstrate the scalability, flexibility and reliability of a distributed data cluster supporting large data sets.
LO5	Compare and contrast the MapReduce and Spark large-scale analytics libraries in terms of their expressiveness and efficiency.

Dependencies
Module Recommendations
This is prior learning (or a practical skill) that is strongly recommended before enrolment in this module. You may enrol in this module if you have not acquired the recommended learning but you will have considerable difficulty in passing (i.e. achieving the learning outcomes of) the module. While the prior learning is expressed as named MTU module(s) it also allows for learning (in another module or modules) which is equivalent to the learning specified in the named module(s).

Incompatible Modules
These are modules which have learning outcomes that are too similar to the learning outcomes of this module. You may not earn additional credit for the same learning and therefore you may not enrol in this module if you have successfully completed any modules in the incompatible list.
No incompatible modules listed
Co-requisite Modules
No Co-requisite modules listed
Requirements
This is prior learning (or a practical skill) that is mandatory before enrolment in this module is allowed. You may not enrol in this module if you have not acquired the learning specified in this section.
No requirements listed

Indicative Content
The Big Data Revolution. Data storage and data processing: Historical evolution. New infrastructure, data models and processing techniques required to deal with big data. Main challenges: Capture, store, search, analyse and visualise the data.
NoSQL Databases. Alternative to relational databases to address big data challenges. Impedance mismatch, scale-out vs. scale-up. Wide range of data models: Pure key/value, column-based, document-oriented and graph-based. Polyglot persistence. CAP theorem, partition tolerance, BASE vs. ACID transactions.
Document-oriented DBs. Efficient, scalable and resilient data storage: Replication and sharding. Clusters, configuration nodes, shards, chunks of data, shard key range, balancing background operators. Expressive and efficient data queries: JSON-based document representation. Aggregation framework: Commands and pipelines.
Graph-based DBs. Efficient, scalable and resilient data storage: Property graph data model. Nodes, relationships, properties and labels. Expressive and efficient data queries: Cypher declarative SQL-like language. Graph formalism and optimal path-traversal algorithms. Polyglot persistence: On combining document-oriented and graph-based data models for a fit-for-purpose multi-component system.
Large-Scale Data Framework. Storage: Distributed File System. Data nodes vs. name nodes. Large files splitting and distribution algorithms. Analysis: Map-Reduce. Divide-and-conquer algorithm schema. Map-sort-reduce process. Parallel processing. Key/value-based communication. Standard I/O file streaming. Spark: Resilient Distributed Dataset. Transformations and actions, basic API. Lazy evaluation. Context, cluster manager and worker nodes.
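
Illustrative sketch (not part of the official module content). The map-sort-reduce process and its key/value-based communication can be illustrated with a small single-process Python example. The word-count task and the function names (map_phase, sort_phase, reduce_phase, word_count) are assumptions chosen for illustration only; a real Hadoop or Spark job would distribute these phases across worker nodes.

```python
from itertools import groupby
from operator import itemgetter


def map_phase(lines):
    """Emit one (word, 1) pair per word; key/value pairs are the only data passed on."""
    for line in lines:
        for word in line.lower().split():
            yield (word, 1)


def sort_phase(pairs):
    """Sort by key so that all values for the same key become adjacent."""
    return sorted(pairs, key=itemgetter(0))


def reduce_phase(sorted_pairs):
    """Sum the values within each run of identical keys."""
    for key, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield (key, sum(value for _, value in group))


def word_count(lines):
    """Chain the three phases (hypothetical helper, single process only)."""
    return dict(reduce_phase(sort_phase(map_phase(lines))))


if __name__ == "__main__":
    sample = ["big data needs big storage", "data nodes store data"]
    print(word_count(sample))
```

Because each phase only consumes and produces key/value pairs, the map and reduce steps can in principle run on different machines, with the sort step guaranteeing that every value for a given key reaches the same reducer.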

Module Content & Assessment
Assessment Breakdown	%
Coursework	100.00%

Assessments

Coursework

Assessment Type	Practical/Skills Evaluation	% of Total Mark	50
Timing	Week 7	Learning Outcomes	1,2,3
Assessment Description	Given a large data set to be stored and queried, produce a report comparing and contrasting a document-oriented vs. graph-based solution for it. Implement a polyglot persistence-based solution combining two components using the document-oriented and graph-based approaches, respectively.

Assessment Type	Practical/Skills Evaluation	% of Total Mark	50
Timing	Week 12	Learning Outcomes	1,4,5
Assessment Description	Given a large data set to be stored and analysed, produce a report comparing and contrasting a Map-Reduce vs. Spark-based solution for it. Implement the two solutions, comparing them in terms of their expressiveness and efficiency.

No End of Module Formal Examination

Reassessment Requirement
Coursework Only This module is reassessed solely on the basis of re-submitted coursework. There is no repeat written examination.

The University reserves the right to alter the nature and timings of assessment

Module Workload

Workload: Full Time
Workload Type	Contact Type	Workload Description	Frequency	Average Weekly Learner Workload	Hours
Lecture	Contact	Lecture based on Indicative Content	Every Week	1.00	1
Lab	Contact	Lab based on Indicative Content	Every Week	3.00	3
Independent Learning	Non Contact	Independent student learning	Every Week	3.00	3
Total Hours					7.00
Total Weekly Learner Workload					7.00
Total Weekly Contact Hours					4.00

Workload: Part Time
Workload Type	Contact Type	Workload Description	Frequency	Average Weekly Learner Workload	Hours
Lecture	Contact	Lecture based on Indicative Content	Every Week	1.00	1
Lab	Contact	Lab based on Indicative Content	Every Week	3.00	3
Independent Learning	Non Contact	Independent student learning	Every Week	3.00	3
Total Hours					7.00
Total Weekly Learner Workload					7.00
Total Weekly Contact Hours					4.00

Module Resources
Recommended Book Resources
Pramod J. Sadalage and Martin Fowler. (2013), NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence, Addison-Wesley, [ISBN: 9780321826626].
Ofer Mendelevitch, Casey Stella and Douglas Eadline. (2017), Practical Data Science with Hadoop and Spark: Designing and Building Effective Analytics at Scale, Pearson Education, [ISBN: 9780134024141].
Supplementary Book Resources
John Sharp et al. (2013), Data Access for Highly-Scalable Solutions: Using SQL, NoSQL, and Polyglot Persistence, Microsoft patterns & practices, [ISBN: 9781621140306].
Kristina Chodorow. (2013), MongoDB: The Definitive Guide, O'Reilly Media, [ISBN: 9781449344689].
Srinath Perera and Thilina Gunarathne. (2013), Hadoop MapReduce Cookbook, Packt Publishing, [ISBN: 9781849517294].
Supplementary Article/Paper Resources
Sugam Sharma et al. (2014), A Brief Review on Leading Big Data Models, Data Science Journal, 13.
A. B. M. Moniruzzaman and Syed Akhter Hossain. (2013), NoSQL Database: New Era of Databases for Big data Analytics - Classification, Characteristics and Comparison, CoRR/abs/1307.0191.
Landset, S., Khoshgoftaar, T.M., Richter, A.N. et al. (2015), A survey of open source tools for machine learning with big data in the Hadoop ecosystem, Journal of Big Data, 2:24.
Kyong-Ha Lee et al. (2012), Parallel data processing with MapReduce: a survey, ACM SIGMOD, 40:4.
Other Resources
Website, MongoDB documentation, https://docs.mongodb.com/
Website, Neo4j documentation, https://neo4j.com/docs/
Website, Hadoop Cloudera Map-Reduce documentation, https://www.cloudera.com/documentation/enterprise/5-5-x/categories/hub_mapreduce.html
Website, Hadoop Cloudera Spark documentation, https://www.cloudera.com/documentation/enterprise/5-5-x/categories/hub_spark.html

Module Delivered in
Programme Code	Programme	Semester	Delivery
CR_SCOBI_9	Master of Science in Computational Biology	2	Mandatory
CR_SDAAN_9	Master of Science in Data Science & Analytics	2	Mandatory
CR_SNUHA_9	Master of Science in Nutrition & Health Analytics	2	Mandatory
CR_SCPBI_9	Postgraduate Diploma in Science in Computational Biology	2	Mandatory

https://mtu.akarisoftware.com/
