
Hours: 18

Language: English

Summary

In this course you will learn how to get the most from your data by combining statistical analysis, data mining and machine learning on your Big Data resources. You will gain hands-on lab experience that will let you start exploring your own data as soon as you finish the course.

We’ll start by learning about the overall process of working with Big Data and the main challenges in collecting, storing and processing such data. You will learn how open-source solutions from the Apache Hadoop ecosystem can be used in your organization. Next, we will dive into solving practical problems with those tools, starting from scratch and building a cutting-edge solution. First, we will prepare an environment for collecting Big Data from multiple sources and storing it. You will learn how to use ZooKeeper to centralize management of your Hadoop solution and to manage a large number of nodes from the command shell. Next, we will focus on managing data storage and working with the data itself: how to migrate, transform and filter it.
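To give a flavour of the transform-and-filter work described above, here is a miniature version in plain Python. This is illustrative only: the record layout is invented for the example, and in the course the same logic is expressed with cluster tools such as Pig or Hive rather than run locally.

```python
# Illustrative only: a local miniature of the migrate/transform/filter step.
# The sample records (date, channel, amount) are invented for this sketch.
raw_records = [
    "2024-01-05,web,137",
    "2024-01-05,mobile,",      # missing amount -> should be filtered out
    "2024-01-06,web,212",
]

def parse(line):
    """Transform one raw CSV line into a structured record."""
    date, channel, amount = line.split(",")
    return {"date": date, "channel": channel,
            "amount": int(amount) if amount else None}

# Transform every record, then filter out incomplete ones.
clean = [r for r in map(parse, raw_records) if r["amount"] is not None]

for r in clean:
    print(r["date"], r["channel"], r["amount"])
```

At cluster scale the same map-then-filter shape is what a Pig script or a Hive query expresses; only the execution engine changes.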

After that we will focus on processing data with workflows, and especially on how to use MapReduce to get results from our distributed systems. Then you will learn how to analyze data in a distributed environment with Python, Impala and Hive.
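The MapReduce idea itself can be previewed in a few lines of plain Python: a map phase emits key/value pairs, a shuffle groups them by key, and a reduce phase aggregates each group. This is a local sketch of what Hadoop performs across many nodes; the word-count data is invented for the example.

```python
from collections import defaultdict

documents = ["big data big insight", "data mining and big data"]

# Map: emit a (word, 1) pair for every word in every document.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group the emitted values by key, as the framework
# does between the map and reduce phases.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: aggregate each group to a single value per key.
counts = {word: sum(vals) for word, vals in groups.items()}
print(counts)  # {'big': 3, 'data': 3, 'insight': 1, 'mining': 1, 'and': 1}
```

On a real cluster each phase runs in parallel on different nodes, but the three-step shape is exactly the one above.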

The next module will focus on ETL, warehousing and data mining in a Big Data environment. At the end we will go beyond the classic approach to data analysis and use machine learning and data science techniques to get the most out of your data.
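As a small preview of the machine-learning material, the sketch below hand-rolls one-dimensional k-means clustering in pure Python. The data points are invented; in the course the equivalent clustering is run at scale with tools such as Spark or Apache Mahout.

```python
# Illustrative only: one-dimensional k-means with two clusters,
# run locally on invented data.
points = [1.0, 1.5, 2.0, 10.0, 11.0, 12.5]
centroids = [points[0], points[-1]]          # naive initialisation

for _ in range(10):                          # a fixed number of iterations
    # Assignment step: attach each point to its nearest centroid.
    clusters = [[], []]
    for p in points:
        nearest = min((0, 1), key=lambda i: abs(p - centroids[i]))
        clusters[nearest].append(p)
    # Update step: move each centroid to the mean of its cluster.
    centroids = [sum(c) / len(c) for c in clusters]

print(centroids)  # two cluster centres, roughly 1.5 and 11.2
```

The assign/update loop is the whole algorithm; distributed implementations parallelise the assignment step over partitions of the data.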

Our goal is to teach you how to handle Big Data with different solutions and how to gain additional insight into your business.

This course is intended for data analysts, data scientists, big data analysts, developers and IT professionals who want to gain deep knowledge and skills in processing big data with the Hadoop ecosystem.

Target Audience

Data analysts, database analysts, big data analysts, data scientists and IT professionals who want to master big data management and analysis.

Prerequisites

To attend this training you should have experience with basic statistical analysis. It is also recommended that participants understand the basic concepts of object-oriented programming languages: control flow statements such as if, for and foreach, variables, data types, collections and datasets.

Topics Covered

Module 1: Introduction to Big Data
  • Defining Big Data
  • Problems arising with Big Data
  • Overview of the Hadoop ecosystem
  • Hadoop architecture concepts and practical implementation

Module 2: Environment Setup and Management
  • Hadoop stack and environment management
  • Setup and management of the cluster and nodes
  • Using ZooKeeper for centralized management
  • The Hadoop command shell

Module 3: Data Collection and Storage
  • HBase installation and management
  • Nutch and Solr configuration
  • Working with Nutch crawlers
  • Bulk transfer and streaming of data
  • Monitoring the cluster and data

Module 4: Processing Data with Workflows
  • The MapReduce concept, with examples
  • Working with Pig
  • MapReduce in Hive
  • Scheduling in Hadoop
  • Working with Oozie workflows

Module 5: Distributed Data Analysis
  • Data analysis with Python
  • Working with data in Hive and Impala
  • In-memory computing with Spark
  • Distributed analysis and design patterns

Module 6: Reporting and Dashboards
  • Configuration and management of Hunk
  • Creating reports and dashboards
  • Working with Talend

Module 7: ETL and Warehousing
  • Working with Pentaho Data Integration
  • Creating and managing ETL processes
  • Data ingestion
  • Structured data queries with Hive
  • Flume data flows

Module 8: Machine Learning and Data Science
  • ML and data science overview and lifecycle
  • Analytics with higher-level APIs
  • Machine learning with Spark
  • Working with Apache Mahout
  • Using data lakes