User Guide

Version 1.3

This guide provides step-by-step instructions, with sample data, on how to use BitYota. BitYota is a Data Warehouse as a Service (DWS) for Big Data analytics, designed to be accessible by anyone, anywhere.

This document was last updated on April 01, 2013.

Table of Contents

Step 1: Signup and Login

Step 2: Create your Cluster

Step 3: Load your Data

Step 4: Analyze your Data using SQL

Getting Started

Step 1:  Signup and Login

Sign up for our service if you have not already done so. Confirm your email address and log in.

Step 2: Create Your Cluster

Log in with your newly created account at http://service.bityota.com

Next, go to the Admin screen at the top right of the UI. Select the Cluster tab in the Admin screen and click the +New Cluster button. Depending on the Plan you signed up for, you have the option to add one or more nodes to your cluster. The BitYota DWS offers 2 types of nodes: Compute and Data. A Data node holds your data and has some processing power, whereas a Compute node stores no data and is used purely to improve performance by adding processing power to your cluster. You need at least 1 Data node to get started.

 

 Figure 1: Add a new Cluster

Step 3: Load Your Data

Click on the Data tab in the UI and click the +New Source button. Select the type of data source you want to load from, give it a name, and enter the details needed to connect to it. The example below shows how to configure an Amazon Simple Storage Service (S3) data source.

 

  Figure 2: Add a new Data Source

Next, tell us a bit more about the actual data you want to load. Click the +New Set button. Select the data source that holds the data and enter the relevant details to kick off the discovery process. See the example below. Base Path is the top-level S3 bucket/base path under which all the data files for that data set are located. Sample Path is an example file path for your data set.

 

 Figure 3: Add a new Data Set

S3 base paths take the form s3://<bucketname>/<path>/ (e.g. s3://bityota-demo-data/tpch/customers/). S3 sample paths take the form s3://<bucketname>/<path>/<filename>. If your data is copied to S3 at a regular interval, the path should include the date-time stamp in the following format:
s3://<bucketname>/<path>/<date-time>/<filename> (e.g. daily data from a TPCH customer feed is stored at s3://bityota-demo-data/tpch/customers/2012-11-05/customers.tbl.gz).
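
As a rough illustration of the path layout described above, the following Python sketch assembles a base path and a dated sample path. The helper names (s3_base_path, s3_dated_sample_path) are hypothetical and are not part of BitYota; the sketch only builds strings and does not contact S3.

```python
from datetime import date


def s3_base_path(bucket, path):
    """Return s3://<bucketname>/<path>/ with exactly one trailing slash."""
    return "s3://{}/{}/".format(bucket, path.strip("/"))


def s3_dated_sample_path(bucket, path, day, filename):
    """Return s3://<bucketname>/<path>/<date-time>/<filename>.

    Assumes a daily feed whose directory name uses the YYYY-MM-DD format.
    """
    return "{}{}/{}".format(s3_base_path(bucket, path), day.strftime("%Y-%m-%d"), filename)


base = s3_base_path("bityota-demo-data", "tpch/customers")
sample = s3_dated_sample_path(
    "bityota-demo-data", "tpch/customers", date(2012, 11, 5), "customers.tbl.gz"
)
print(base)
print(sample)
```

With the values from the example in this guide, the two printed strings match the base path and the dated sample path shown above.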

Files (JSON, CSV, or any other delimiter-separated format) must contain one record per line. Pretty-printed JSON documents in which a record spans multiple lines are not currently supported. For example, a supported JSON file looks like this:

{ "_id": "record 1", .... }
{ "_id": "record 2", .... }
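
If your JSON is currently pretty-printed, one possible way to prepare it before upload is to rewrite it with one record per line. The sketch below is only an illustration under the assumption that the source file holds either a JSON array of records or a single record; the function name to_json_lines is hypothetical and is not a BitYota tool.

```python
import json


def to_json_lines(pretty_text):
    """Convert pretty-printed JSON (an array of records, or one record)
    into text with exactly one compact JSON record per line."""
    records = json.loads(pretty_text)
    if isinstance(records, dict):
        # A single record: treat it as a one-element data set.
        records = [records]
    lines = [json.dumps(record, separators=(",", ":")) for record in records]
    return "\n".join(lines) + "\n"


pretty = """
[
  {
    "_id": "record 1",
    "value": 1
  },
  {
    "_id": "record 2",
    "value": 2
  }
]
"""
print(to_json_lines(pretty), end="")
```

Each output line is a complete JSON document, so every line can be parsed on its own.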

By default, all files under the base path directory are considered part of the data set. If you would like a data set to consist of one specific file, specify the same file path as both the base path and the sample path, e.g.
Base path: s3://bityota-demo-data/tpch/nations/nations.tbl.gz
Sample path: s3://bityota-demo-data/tpch/nations/nations.tbl.gz

 
Recurring loads (daily, hourly or by the minute) can be set up by organizing the files in subdirectories named after the date, hour or minute. BitYota currently supports several commonly used date/time formats for a recurring data set's directory name. The supported formats are listed below in Table 1.

Data Feed Frequency    Format
Daily                  YYYYMMDD, YYYY-MM-DD, DD-MM-YYYY
Hourly                 YYYYMMDDHH, YYYY-MM-DD-HH, DD-MM-YYYY-HH
Minute by minute       YYYYMMDDHHmm, YYYY-MM-DD-HH-mm, DD-MM-YYYY-HH-mm

Table 1: Date/time Formats Supported By BitYota
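
When generating directory names from a script, the formats in Table 1 can be expressed as strftime patterns. The sketch below assumes that HH is a zero-padded 24-hour value and mm a zero-padded minute, which Table 1 does not state explicitly; the names FORMATS and directory_names are hypothetical.

```python
from datetime import datetime

# strftime equivalents of the Table 1 formats (24-hour clock assumed).
FORMATS = {
    "daily": ["%Y%m%d", "%Y-%m-%d", "%d-%m-%Y"],
    "hourly": ["%Y%m%d%H", "%Y-%m-%d-%H", "%d-%m-%Y-%H"],
    "minute": ["%Y%m%d%H%M", "%Y-%m-%d-%H-%M", "%d-%m-%Y-%H-%M"],
}


def directory_names(when, frequency):
    """Return the directory name for `when` in each format of a frequency."""
    return [when.strftime(pattern) for pattern in FORMATS[frequency]]


stamp = datetime(2012, 11, 5, 9, 30)
for frequency in ("daily", "hourly", "minute"):
    print(frequency, directory_names(stamp, frequency))
```

Pick one format per data set and use it consistently for every directory in that set.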

Within a few minutes, BitYota's data integration adapter auto-discovers the format, schema and volume of your data set, as well as its rate of arrival or change. Before loading, you can preview these discovered settings and override any of them. When you override discovered settings, the system initiates a rediscovery to confirm the new settings.

  Figure 4: Preview Data

You can also override the schema with the Change Schema option by providing the schema in the following format:

{"schema": [{"column_map": [

 { "name":"id", "type": "int","load":true},

 { "name":"order_detail_id", "type": "int","load":false}

]}]}
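
For data sets with many columns, the override document can be generated rather than typed by hand. The sketch below reproduces only the structure shown above (schema, column_map, name, type, load); the function name build_schema_override is hypothetical, and any type names other than "int" would need to be checked against what the Change Schema option accepts.

```python
import json


def build_schema_override(columns):
    """Build the Change Schema JSON from (name, type, load) tuples.

    `load` is a boolean, mirroring the "load" flag in the format above.
    """
    column_map = [
        {"name": name, "type": col_type, "load": load}
        for name, col_type, load in columns
    ]
    return json.dumps({"schema": [{"column_map": column_map}]}, indent=1)


override = build_schema_override([
    ("id", "int", True),
    ("order_detail_id", "int", False),
])
print(override)
```

The printed text can then be pasted into the Change Schema option.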

By default, your data is loaded into a row-oriented table with a random partitioning scheme. You can override this from the Advanced Settings screen, where you can choose the table format (row or column), the partitioning scheme (Range, Random or Hash) and the sub-partitioning scheme (Range, Random or Hash) for your final table. If you select Range or Hash partitioning, you also need to select the column and the data ranges from which to create partitions.

Next, specify how often you want to load data. There are two options for data set load frequency.
a. One Time Load
This screen (see example below) sets the parameters for a one-time load. The Start Date and End Date parameters can be used to select a subset of a data set to load on the first load when the data set has the recurring time interval in its path. The first load will load the data from the Start Date to the current date.

Figure 5: One Time Load settings

b. Recurring Load

If your data is of a recurring nature (for example, event data that arrives every few minutes), you can set up an automated, recurring data load. This screen (see example below) sets the parameters for a recurring load. The Start Date parameter can be used to select a subset of a data set to load on the first load when the data publishing frequency is in the file path. The first load will load the data from the Start Date to the current date. The Data Load Frequency can be a multiple of the Data Publishing Frequency.
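
To make the Start Date behavior concrete, the sketch below lists the daily directories that a first load would cover, and checks whether a load frequency is a whole multiple of a publishing frequency. It rests on assumptions this guide does not confirm: that the current date is included in the range, that directories use the YYYY-MM-DD format, and that the multiple must be a whole number. Both function names are hypothetical.

```python
from datetime import date, timedelta


def first_load_directories(start_date, current_date):
    """List daily directory names from start_date through current_date.

    Assumes the range is inclusive at both ends.
    """
    days = (current_date - start_date).days
    return [
        (start_date + timedelta(days=offset)).strftime("%Y-%m-%d")
        for offset in range(days + 1)
    ]


def is_multiple_of_publishing(load_minutes, publish_minutes):
    """True if the load interval is a whole multiple of the publishing interval."""
    return load_minutes >= publish_minutes and load_minutes % publish_minutes == 0


print(first_load_directories(date(2012, 11, 3), date(2012, 11, 5)))
print(is_multiple_of_publishing(60, 15))
print(is_multiple_of_publishing(20, 15))
```

For example, data published every 15 minutes could be loaded every 15, 30 or 60 minutes under this reading.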

Figure 6: Recurring Load settings

Step 4: Analysis

Once your data is loaded into your cluster, it is immediately available for querying and analysis. To do this, go to the Analysis tab.

Click on the +New Query button and open up the workbench below. All the data sets you have loaded will be visible on the right side and the query builder will be on the left. Double-clicking on any data set will open up a window for you to see its column structure. This window can be moved anywhere on the screen. You can also drag-and-drop column names and the data set name into the query builder.

Type in any regular SQL statement in the query builder area. BitYota has native support for SQL-2003 OLAP operators over both structured and semi-structured data, so you can run any ad-hoc analysis without restrictions.

For JSON documents, you can access each attribute using syntax like json_column->'attribute'. For example,

SELECT json_column->'person.name' FROM people;

would extract the name from { person: { name: "Jones", … }}
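
To show what a dotted attribute path such as person.name selects, the following Python sketch walks a JSON document one key at a time. It only mimics the lookup outside the database and is not how BitYota evaluates the operator; the function name get_attribute is hypothetical, and returning None for a missing attribute is an assumption.

```python
import json


def get_attribute(document, dotted_path):
    """Follow a dotted path (e.g. "person.name") through a JSON document.

    Returns None if any key along the path is missing (assumed behavior).
    """
    value = json.loads(document)
    for key in dotted_path.split("."):
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value


row = '{"person": {"name": "Jones", "age": 41}}'
print(get_attribute(row, "person.name"))
print(get_attribute(row, "person.email"))
```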

For more examples of SQL over JSON syntax, see sample queries at http://www.bityota.com/examples/

 

Figure 7: Query Console

Once the query is executed, you can do several things: Schedule the query to run at a preset time and frequency, Publish the query for other members of your organization to use, or Download the query results in CSV format.

Figure 8: Query Operations

That's it! No hardware, no software, no hassle. In 4 easy steps, you've loaded millions of records of Big Data and analyzed them within minutes.