Traditional SQL databases struggle with horizontal scaling (adding more machines to share the load). NoSQL databases like Cassandra address this by getting rid of foreign keys and joins, which leads to a denormalized database. When data becomes so large that it has to be distributed across many machines rather than a single server's storage (or RAID array), we use HDFS (Hadoop Distributed File System).
Hive: Hive is a data warehouse system built on top of Hadoop that traditionally relies on MapReduce for query execution. It uses HiveQL, a SQL-like language that is not fully ANSI-SQL compliant, and it does not enforce foreign keys. It is not good for real-time queries because there is a lot of latency.
AWS, Azure, Oracle, Azure Databricks: everyone is trying to sell cloud solutions. What they did is take open-source software and put a layer on top for ease of use, which is where the business value comes from; then they provided the hardware as well to keep themselves in the business. They realize they can't push proprietary languages on users, because that would limit the pool of programmers, so they (somewhat unwillingly) stick with non-proprietary languages. All the providers want to make as many sales as they can, so a lot of them are really just taking open-source software, branding it, and putting it in as part of their system. That isn't a bad thing, but they need to differentiate themselves from other vendors, and while branding is nothing new, it can be a little confusing. Their business model is much like Red Hat's: the value capture comes from support, and because these components sit inside much larger systems that have to be integrated, they take on that next layer of de-risking and charge a premium for it. AWS was initially the dominant player; Azure is catching up, along with Google Cloud and Oracle. There are a number of vendors in this space, and at this point they are all walking a fine line.
Big data becomes more relevant with the advancement of AI. For instance, OpenAI's GPT-4 reportedly has around 1.75 trillion parameters. The model is mostly closed source, and the weights and exact parameter count have not been released. To store and process that much data you need distributed computing.
Spark SQL, Spark Streaming, Spark MLlib (which speeds up machine learning through distributed training), and GraphX all sit on top of the Spark core engine, which in turn runs on a cluster manager: YARN, Mesos, standalone, or Kubernetes. In Spark most of the processing is done in RAM, which is why it is incredibly fast; a cluster can have on the order of 1 TB of memory available. Spark's fundamental data structure is the RDD (Resilient Distributed Dataset).
RDD: an immutable, distributed collection of elements, partitioned across the nodes of a cluster, that can be operated on in parallel using transformations and actions. Spark uses its own distributed computation engine across multiple nodes. Because it relies heavily on RAM, the machines are more expensive: memory costs more than disk storage (where HDFS lives), which makes Spark infrastructure more expensive.
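A minimal PySpark sketch of the idea (the numbers and the local[*] master are placeholders): transformations such as filter and map only describe the computation, and nothing runs until an action such as collect or sum is called.

```python
from pyspark.sql import SparkSession

# Local session for illustration; on a real cluster the master would be
# YARN, Mesos, standalone, or Kubernetes instead of local[*].
spark = SparkSession.builder.master("local[*]").appName("rdd-sketch").getOrCreate()
sc = spark.sparkContext

numbers = sc.parallelize(range(1, 11))   # an RDD, split across partitions

# Transformations are lazy: they build up the lineage, nothing executes yet.
squares_of_evens = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

# Actions trigger the actual parallel execution across the partitions.
print(squares_of_evens.collect())   # [4, 16, 36, 64, 100]
print(squares_of_evens.sum())       # 220

spark.stop()
```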
For example, if you want to calculate a standard deviation with MapReduce, you may need two map/reduce passes: the first sums the values (and counts them) to get the mean, and the second sums the squared deviations from that mean.
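A rough PySpark sketch of that two-pass structure (the sample values are made up; this computes the population standard deviation):

```python
import math
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("stddev-sketch").getOrCreate()
sc = spark.sparkContext

values = sc.parallelize([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Pass 1: map each value to (value, 1), reduce to (sum, count), then derive the mean.
total, count = values.map(lambda x: (x, 1)).reduce(lambda a, b: (a[0] + b[0], a[1] + b[1]))
mean = total / count

# Pass 2: map each value to its squared deviation from the mean, reduce by summing.
sum_sq_dev = values.map(lambda x: (x - mean) ** 2).reduce(lambda a, b: a + b)
stddev = math.sqrt(sum_sq_dev / count)

print(mean, stddev)   # 5.0 2.0 for this sample
spark.stop()
```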
Real-time processing is better served by Cassandra.
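As an illustration (not from the lecture), a key-based lookup with the DataStax Python driver (cassandra-driver) is the kind of low-latency access Cassandra is designed for; the contact point, keyspace, table, and column names below are made up:

```python
from cassandra.cluster import Cluster

# Connect to a local Cassandra node; the address and keyspace are placeholders.
cluster = Cluster(["127.0.0.1"])
session = cluster.connect("shop")   # hypothetical keyspace

# Reads keyed on the partition key (user_id here) go straight to the node
# that owns the row, which is what keeps the latency low.
row = session.execute(
    "SELECT user_id, last_login FROM users WHERE user_id = %s",
    ("u123",),
).one()
print(row)

cluster.shutdown()
```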
The underlying language of Spark is Scala. Scala is a JVM language that interoperates closely with Java (it is not a strict superset of Java, but it does require the JVM). We will use the Python interface to Spark (PySpark).
Hadoop is an open-source software framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from a single server to thousands of machines, each offering local computation and storage. The core of Apache Hadoop consists of a storage part, known as Hadoop Distributed File System (HDFS), and a processing part called MapReduce. Hadoop splits files into large blocks and distributes them across nodes in a cluster, then processes the data in parallel.
Here are the key components and features of Hadoop:
- Hadoop Distributed File System (HDFS): This is the storage system of Hadoop that stores data as blocks in a distributed environment, ensuring high bandwidth across the cluster.
- MapReduce: This is the processing technique and a programming model for distributed computing based on Java. The MapReduce algorithm contains two important tasks, namely Map and Reduce. Map takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). The Reduce task takes the output from a Map as input and combines those data tuples into a smaller set of tuples. (A small word-count sketch in Python follows this list.)
- YARN (Yet Another Resource Negotiator): YARN is a framework for job scheduling and cluster resource management, introduced in later versions of Hadoop to improve scalability and cluster utilization.
- Hadoop Common: These are Java libraries and utilities needed by other Hadoop modules. These utilities support the other Hadoop components.
- Hadoop Ecosystem: Beyond these core components, Hadoop is also supported by an ecosystem of related software packages that extend its functionality or add additional features, such as Apache Hive for SQL-like querying of data, Apache HBase for NoSQL storage, Apache Pig for data flow scripting, Apache Spark for in-memory data processing, and more.
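To make the Map and Reduce roles concrete, here is a minimal word-count sketch using Hadoop Streaming, which lets plain Python scripts act as the mapper and reducer (the file names are placeholders):

```python
#!/usr/bin/env python3
# mapper.py -- Map step: read lines from stdin, emit one "word<TAB>1" pair per word.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- Reduce step: input arrives sorted by key, so counts for the
# same word are adjacent and can be summed in a single pass.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)

if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

These would typically be submitted with the hadoop-streaming jar, along the lines of `hadoop jar hadoop-streaming.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input /data/in -output /data/out` (the jar path and HDFS paths depend on the installation).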
Apache Hive
Apache Hive is a data warehousing and SQL query engine that facilitates reading, writing, and managing large datasets residing in distributed storage using SQL.
Data Model: Hive organizes data into tables, and its model is more akin to traditional relational databases, though it is built on top of Hadoop and works with Hadoop data formats like HDFS and HBase.
Scalability and Performance: Hive is designed for managing and querying large datasets, but it is generally used for batch processing jobs and not for real-time queries. It can handle petabytes of data, but its query latency is typically higher than that of traditional databases or Cassandra.
Use Cases: Hive is best suited for data warehousing tasks. It is commonly used for analytical queries on large datasets, such as big data analytics, business intelligence, and data mining, where queries can be batched.
Query Language: Hive uses HiveQL, which is a SQL-like query language. This allows users familiar with SQL to easily write queries in Hive, although HiveQL is adapted to the Hadoop ecosystem.
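As a small illustration of SQL-style querying in this ecosystem, the snippet below runs a HiveQL-like aggregate through Spark SQL; the sample data and column names are invented, and against a real Hive warehouse you would enable Hive support and query existing tables instead of a temp view:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hiveql-sketch").getOrCreate()

# Tiny in-memory table standing in for a warehouse table.
sales = spark.createDataFrame(
    [("east", 100.0), ("west", 250.0), ("east", 75.0)],
    ["region", "amount"],
)
sales.createOrReplaceTempView("sales")

# A HiveQL/SQL-style aggregate query.
spark.sql("""
    SELECT region, SUM(amount) AS total
    FROM sales
    GROUP BY region
    ORDER BY total DESC
""").show()

spark.stop()
```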
Key Differences
- Data Storage: Cassandra is a NoSQL database designed for online transaction processing (OLTP) with a focus on high availability and scalability across multiple data centers. Hive, however, is built on top of Hadoop and is meant for online analytical processing (OLAP).
- Latency: Cassandra provides lower latency access to data, suitable for real-time applications. Hive is better suited for batch processing with higher latency.
- Query Capability: Hive provides a richer and more SQL-compliant query environment than Cassandra, making it easier to perform complex analyses and joins.
- Write/Read Performance: Cassandra offers fast writes and is optimized for high throughput, while Hive is optimized for complex read operations on large datasets.
Apache Spark is also an open-source project, and it's one of the most popular big data processing frameworks in use today. Developed originally at UC Berkeley's AMPLab, Spark was open sourced in 2010 under a BSD license and later moved to the Apache Software Foundation in 2013, where it is now maintained under the Apache License 2.0. Databricks is built on Spark.
Key Features of Apache Spark:
- Speed: Spark is designed to perform both batch processing (similar to MapReduce) and new workloads like streaming, interactive queries, and machine learning. Spark can run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
- Ease of Use: Spark has easy-to-use APIs for operating on large datasets. It supports a wide range of programming languages including Scala, Java, Python, and R, allowing developers to write applications in their preferred language. Spark also includes a suite of over 80 high-level operators that make it easy to build parallel apps.
- Modular: Spark includes multiple closely integrated components for different big data processing needs. The core Spark engine supports a variety of "big data" tools including:
- Spark SQL: for SQL and structured data processing
- MLlib: for machine learning
- GraphX: for graph processing
- Structured Streaming: for incremental computation and stream processing
- General Purpose: Unlike systems specialized to a specific use case (e.g., text search, graph analytics), Spark is a general-purpose engine for big data processing. It supports a broad range of workloads such as batch applications, iterative algorithms, interactive queries, and streaming.
- Robustness: Spark provides an advanced DAG (directed acyclic graph) execution engine that supports cyclic data flow and in-memory computing. It also offers fault tolerance through "lineage" (it remembers the sequence of operations leading to a certain piece of data), which allows it to recompute lost data if necessary.
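A small PySpark sketch of what lineage looks like in practice (the sample words are placeholders): each RDD records the chain of transformations that produced it, and that chain is what Spark replays to rebuild lost partitions.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("lineage-sketch").getOrCreate()
sc = spark.sparkContext

words = sc.parallelize(["spark", "hive", "spark", "hdfs", "spark"])
counts = (words.map(lambda w: (w, 1))              # transformation
               .reduceByKey(lambda a, b: a + b))   # transformation

# toDebugString() returns the recorded lineage (as bytes in PySpark);
# losing a partition means re-running just this chain to rebuild it.
print(counts.toDebugString().decode())

print(counts.collect())   # action, e.g. [('spark', 3), ('hive', 1), ('hdfs', 1)]
spark.stop()
```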
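And a minimal Structured Streaming sketch, following the standard socket word-count pattern, to show the streaming component listed above; the host and port are placeholders, and something like `nc -lk 9999` can feed it text locally:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("streaming-sketch").getOrCreate()

# Unbounded input: each line arriving on the socket becomes a row.
lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

# Incremental word count, updated as new data arrives.
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```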