Saturday, May 9, 2015

Cloud Computing and Testing: A Simpler View

This write-up summarizes the benefits, design, frameworks, programming, and testing of clouds, both as a service and as an infrastructure.

What is Cloud Computing?
Cloud computing has made a great impact on the IT industry. Data has moved away from personal computers and enterprise application servers to be clustered on the cloud.
Cloud computing is a model that provides a convenient way to access and consume a shared pool of resources containing a wide variety of services: storage, networks, servers, applications, and so on, all on demand. Additionally, provisioning and releasing services is easy to manage and doesn't always require the service provider's intervention.
For this, clouds use large clusters of servers that deliver low-cost technology benefits to consumers through specialized data connections for data processing. Virtualization is often used to multiply the potential of cloud computing.

It has three delivery models:

Infrastructure as a Service (IaaS)
1. It is the basic layer of the cloud
2. Servers, networks, and storage are provided by the service provider
3. Software and everything above it is the cloud consumer's responsibility

Platform as a Service (PaaS)
1. The consumer has no control over the underlying infrastructure
2. The service provider supplies a platform (e.g., a web server, a database, or a content management tool such as WordPress) that supports application development
3. Here you get a virtual machine with all the necessary software installed

Software as a Service (SaaS)
1. The whole application is outsourced to the cloud provider
2. It is the provider's responsibility to manage licensing and access-related issues
3. Examples are Google Docs or any hosted email service

Types of Clouds:

Public Cloud
1. Services are available to everyone
2. The service provider delivers applications over the internet to the widest possible group of users

Private Cloud
1. Services (equipment and data centres) are private to the organization
2. Users within the organization are given secure access

Hybrid Cloud
1. A mixture of both kinds of service
2. Some of the organization's services can be used by everyone, while others remain private to internal users

There are benefits to using cloud computing, but there are limitations too: will data integrity be maintained, will data be secure, will it stay private, and will services be available to everyone at all times?
This is where testing comes in.

Types of Testing in Cloud Computing:

Testing a Cloud

Functional Testing
1. System Verification Testing: functional requirements are tested
2. Acceptance Testing: users test the system to confirm it meets their requirements
3. Interoperability Testing: the application should function well anywhere, even when moved away from the cloud

Non-Functional Testing
1. Availability Testing: it is the cloud vendor's responsibility to ensure the cloud has no sudden downtime that affects the client's business
2. Security Testing: making sure there is no unauthorized access and that data integrity is maintained
3. Performance Testing: stress and load testing to make sure performance remains intact under both peak load and drops in load
4. Multi-Tenancy Testing: testing that services are available to multiple clients at the same time and that data is kept secure to avoid access-level conflicts
5. Disaster Recovery Testing: verifying that services are restored after a failure, with a short recovery time and no harm to the client's business
6. Scalability Testing: verifying that services can be scaled up or down as needed
7. Interoperability Testing: it should be easy and possible to move a cloud application from one environment/platform to another

How does a Cloud store and process data?

Hadoop and MapReduce:
Earlier, when data volumes were manageable, data was stored in databases with a defined schema and relations. As data grew into big data (terabytes and petabytes), with the unique characteristic of being "write once, read many" (WORM), Google introduced GFS (the Google File System), which was not open source, and developed a new programming model called MapReduce. MapReduce is a software framework that allows programs to process stupendous amounts of unstructured data in parallel across a distributed cluster of processors. Google also introduced BigTable: a distributed storage system for managing structured data that scales to very large sizes, petabytes of data across thousands of commodity servers.
Later, the Hadoop Distributed File System (HDFS) was developed; it is open source and distributed by Apache. The software framework used is MapReduce, and the whole project is called Hadoop.
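To make the programming model concrete, here is a minimal, self-contained sketch of the MapReduce idea in plain Python, entirely outside Hadoop: a map function emits key/value pairs, the framework groups them by key (the shuffle), and a reduce function aggregates each group. The word-count example and function names are illustrative, not Hadoop's API.

```python
from collections import defaultdict

def map_fn(line):
    # Map phase: emit a (word, 1) pair for every word in the input line.
    for word in line.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # Reduce phase: aggregate all values emitted for one key.
    return (word, sum(counts))

def map_reduce(lines):
    # Shuffle: group every emitted value by its key.
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            groups[key].append(value)
    # Reduce each group independently (in Hadoop these run in parallel).
    return dict(reduce_fn(k, v) for k, v in groups.items())

print(map_reduce(["the cat", "the dog"]))  # {'the': 2, 'cat': 1, 'dog': 1}
```

In a real Hadoop job the shuffle, parallelism, and fault tolerance are handled by the framework; only `map_fn` and `reduce_fn` are your code.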
MapReduce uses four entities:

1. Client: submits the MapReduce job
2. JobTracker: coordinates the job run; it is a Java application whose main class is JobTracker
3. TaskTrackers: run the tasks that the job has been divided into
4. Distributed file system (commonly HDFS): used to share job files among the other entities

Properties of HDFS:
1. It consists of thousands of server machines, each storing a fragment of the system's data
2. Each data block is replicated a number of times (default: 3)
3. Component failure is treated as the norm rather than the exception
4. Fault tolerance: faults are detected and recovered from quickly and automatically
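The replication property is what makes the fault tolerance work: each block lives on several data nodes, so losing one node never loses data, and the system can re-replicate from the survivors. A rough sketch of the idea in Python follows; the round-robin placement policy and node names here are simplified assumptions, not HDFS internals.

```python
REPLICATION_FACTOR = 3  # HDFS default

def place_replicas(block_id, nodes, factor=REPLICATION_FACTOR):
    # Simplified placement: pick `factor` distinct nodes, round-robin by block id.
    return [nodes[(block_id + i) % len(nodes)] for i in range(factor)]

def surviving_copies(placement, failed_node):
    # After a node failure, the block is still held by the remaining replicas,
    # from which the system can re-replicate.
    return [n for n in placement if n != failed_node]

nodes = ["node1", "node2", "node3", "node4", "node5"]
placement = place_replicas(0, nodes)         # ['node1', 'node2', 'node3']
print(surviving_copies(placement, "node2"))  # ['node1', 'node3']
```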

Hadoop doesn't waste time diagnosing slow-running tasks; instead, when it detects that a task is running slower than expected, it fires up a replica of it as a backup.

Apache HBase:
HBase is the Hadoop database: an open source implementation of BigTable. It is used when real-time, random (read/write) access to Big Data is needed. It hosts very large tables: billions of rows × millions of columns. It is an open source, distributed storage structure for structured data. It is a NoSQL database that stores data as key/value pairs in columns, whereas HDFS uses flat files. In short, it combines the scalability of Hadoop (by running on HDFS) with real-time, random data access through a key/value store and the problem-solving properties of MapReduce.
HBase uses a four-dimensional data model, and these four coordinates define each cell:

Row Key
Every row has a unique key; the row key has no data type and is treated internally as a byte array.
Column Family
Data inside a row is organized into column families; every row has the same set of column families, but across rows the same column family does not need to hold the same column qualifiers. HBase stores each column family in its own data file; column families must be defined upfront, and it is hard to change them later.
Column Qualifier
Column families contain columns, which are known as column qualifiers. Column qualifiers can be thought of as the columns themselves.
Version
Every column can have a configurable number of versions, and data can be accessed for a specific version of a column qualifier.

HBase allows two types of access: random access of rows through their row key, column family, column qualifier, and version; and offline or batch access through MapReduce queries. This dual approach makes it very powerful.
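The four-dimensional model above can be pictured as a nested key/value map: row key → column family → column qualifier → version → value. The tiny in-memory sketch below is plain Python, not the HBase API; the table, helpers, and sample data are all illustrative.

```python
from collections import defaultdict

# table[row_key][column_family][column_qualifier][version] = value
def make_table():
    return defaultdict(lambda: defaultdict(lambda: defaultdict(dict)))

table = make_table()

def put(row, family, qualifier, version, value):
    table[row][family][qualifier][version] = value

def get(row, family, qualifier, version=None):
    versions = table[row][family][qualifier]
    if version is None:          # default: return the latest version
        version = max(versions)
    return versions[version]

put("user1", "info", "email", 1, "old@example.com")
put("user1", "info", "email", 2, "new@example.com")
print(get("user1", "info", "email"))     # new@example.com (latest version)
print(get("user1", "info", "email", 1))  # old@example.com (specific version)
```

Note how a cell is only fully addressed once all four coordinates are given; omitting the version falls back to the newest value, mirroring HBase's default read behaviour.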


QA Testing your MR jobs: which, in effect, tests the whole Cloud
Traditional unit-testing frameworks, e.g. JUnit or PyUnit, can be used to get started testing MR jobs. Unit tests are a great way to test MR jobs at the micro level, although they don't test MR jobs as a whole inside Hadoop.

MRUnit is a tool that can be used to unit-test map and reduce functions. MRUnit tests work the same way as traditional unit tests, so it's simple and doesn't require Hadoop to be running. There are some drawbacks to using MRUnit, but the benefits far outweigh them.
MRUnit tests are simple: no external I/O files are needed, and tests run faster. Illustration of a test class:
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapReduceDriver;
import org.junit.Before;
import org.junit.Test;

public class DummyTest {
  private Dummy.MyMapper mapper;
  private Dummy.MyReducer reducer;
  private MapReduceDriver<Text, Text, Text, Text, Text, Text> driver;

  @Before
  public void setUp() {
    mapper = new Dummy.MyMapper();
    reducer = new Dummy.MyReducer();
    driver = new MapReduceDriver<Text, Text, Text, Text, Text, Text>(mapper, reducer);
  }

  @Test
  public void testMapReduce() throws Exception {
    driver.withInput(new Text("key"), new Text("val"))
          .withOutput(new Text("foo"), new Text("bar"))
          .runTest();
  }
}

Map and Reduce can also be tested separately (MRUnit provides MapDriver and ReduceDriver for this), and counters can be tested too.
During a job execution, counters record whether a particular event occurred and how often. Hadoop has four types of counters:
File system, Job, Framework, and Custom.
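Custom counters are essentially named tallies that map or reduce code increments as it runs; after the job finishes, their totals show how often each event occurred. The sketch below illustrates the idea in plain Python, with a `Counter` standing in for Hadoop's counter API; the counter names and the malformed-record example are made up for illustration.

```python
from collections import Counter

counters = Counter()

def map_with_counters(records):
    # A mapper that counts malformed records instead of failing on them.
    for record in records:
        if "," not in record:
            counters["MALFORMED_RECORDS"] += 1  # custom counter
            continue
        counters["GOOD_RECORDS"] += 1
        yield tuple(record.split(",", 1))

records = ["a,1", "bad", "b,2"]
output = list(map_with_counters(records))
print(output)                         # [('a', '1'), ('b', '2')]
print(counters["MALFORMED_RECORDS"])  # 1
```

In MRUnit, the driver exposes the job's counters after a test run, so assertions like these can be written against real map/reduce code as well.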
Traditional unit tests and MRUnit help in detecting bugs early, but neither can test MR jobs within Hadoop. The local job runner lets you run Hadoop on a local machine, in a single JVM, which makes failing MR jobs a little easier to debug.

A pseudo-distributed cluster consists of a single machine running all the Hadoop daemons. It tests integration with Hadoop better than the local job runner does.

Running MR Jobs on a QA Cluster: This is the most exhaustive, but also the most complex and challenging, way of testing MR jobs; it requires a QA cluster consisting of at least a few machines.

QA practices should be chosen based on organizational needs and budget. Unit tests, MRUnit, and the local job runner can test MR jobs extensively in a simple way, but running jobs on a QA or development cluster is obviously the best way to fully test them.

I hope this post has shown that the study of the cloud is as vast as a cloud itself.
