User Guide
Version 1.3
This guide provides easy-to-follow, step-by-step instructions, with sample data, on how to use BitYota. BitYota is a Data Warehouse as a Service (DWS) for Big Data analytics, designed to be accessible by anyone, anywhere.
This document was last updated on April 01, 2013.
Getting Started
Step 1: Signup and Login
Sign up for our service if you have not already. Confirm your email and log in.
Step 2: Create Your Cluster
Log in with your newly created account at http://service.bityota.com
Next, go to the Admin screen at the top right of the UI. Select the Cluster tab in the Admin screen and click on the +New Cluster button. Depending on the Plan you signed up for, you have the option to add one or more nodes to your cluster. The BitYota DWS offers two types of nodes: Compute and Data. A Data node holds your data and has some processing power, whereas a Compute node stores no data and is used purely to improve performance by adding processing power to your cluster. You will need at least one Data node to get started.

Figure 1: Add a new Cluster
Step 3: Load Your Data
Click on the Data tab in the UI and click on the +New Source button. Select the type of data source you want to load from, give it a name, and enter the relevant details to connect to it. The example below shows how to configure an Amazon Simple Storage Service (S3) data source.

Figure 2: Add a new Data Source
Next, tell us a bit more about the actual data you want to load. Click on the +New Set button. Select the data source corresponding to the data and enter the relevant details to kick off the discovery process. See the example below. Base Path is the top-level S3 bucket or base path under which all the data files for that data set are located. Sample Path is an example file path for your data set.
Figure 3: Add a new Data Set

Paths take the form s3://<bucketname>/<path>/<date-time>/<filename>. For example, daily data from a TPCH customer feed is stored at s3://bityota-demo-data/tpch/customers/2012-11-05/customers.tbl.gz.
Files (JSON, CSV or any other delimiter-separated format) must be organized to have one line per record. Pretty-printed JSON documents, where each record spans multiple lines, are not currently supported. For example:
{ "_id": "record 1", .... }
{ "_id": "record 2", .... }
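If your source files contain pretty-printed JSON, they need to be flattened to one record per line before loading. The following is a minimal Python sketch of that conversion, assuming the input is a JSON array of records (or a single record); the function name `to_json_lines` and the `qty` field are hypothetical and used only for illustration.

```python
import json


def to_json_lines(pretty_json_text):
    """Convert a JSON array of records into one compact JSON document per line."""
    records = json.loads(pretty_json_text)
    if not isinstance(records, list):
        # A single top-level object is treated as a one-record data set.
        records = [records]
    # json.dumps without indent emits no raw newlines; newlines inside
    # string values are escaped, so each record stays on one line.
    return "\n".join(json.dumps(record) for record in records) + "\n"


# Hypothetical sample input: two records spread over several lines.
pretty = """
[
  {
    "_id": "record 1",
    "qty": 2
  },
  {
    "_id": "record 2",
    "qty": 5
  }
]
"""

print(to_json_lines(pretty), end="")
```

Each line of the result is a complete JSON document, which matches the one-line-per-record layout described above.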
By default, all files under the base path directory are considered part of the data set. To load only a specific file, specify the same file path as both the base path and the sample path. For example:
Base path: s3://bityota-demo-data/tpch/nations/nations.tbl.gz
Sample path: s3://bityota-demo-data/tpch/nations/nations.tbl.gz
Recurring loads (daily, hourly or by the minute) can be done by organizing the files in a subdirectory with the date, hour or minute in commonly used formats.
| Data Feed Frequency | Supported Formats |
|---|---|
| Daily | YYYYMMDD, YYYY-MM-DD, DD-MM-YYYY |
| Hourly | YYYYMMDDHH, YYYY-MM-DD-HH, DD-MM-YYYY-HH |
| Minute by minute | YYYYMMDDHHmm, YYYY-MM-DD-HH-mm, DD-MM-YYYY-HH-mm |
Table 1: Date/time Formats Supported By BitYota
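When generating files for a recurring load, the dated subdirectory can be produced programmatically. The sketch below maps each format in Table 1 to a Python `strftime` pattern and builds a full path. It assumes that HH is a 24-hour value and that mm means minutes; the helper name `dated_path` is hypothetical.

```python
from datetime import datetime

# strftime equivalents of the formats in Table 1.
# Assumption: HH is the 24-hour clock and mm is minutes.
STRFTIME_FORMATS = {
    "YYYYMMDD": "%Y%m%d",
    "YYYY-MM-DD": "%Y-%m-%d",
    "DD-MM-YYYY": "%d-%m-%Y",
    "YYYYMMDDHH": "%Y%m%d%H",
    "YYYY-MM-DD-HH": "%Y-%m-%d-%H",
    "DD-MM-YYYY-HH": "%d-%m-%Y-%H",
    "YYYYMMDDHHmm": "%Y%m%d%H%M",
    "YYYY-MM-DD-HH-mm": "%Y-%m-%d-%H-%M",
    "DD-MM-YYYY-HH-mm": "%d-%m-%Y-%H-%M",
}


def dated_path(base_path, when, fmt, filename):
    """Build <base_path>/<date-time>/<filename> using one of the Table 1 formats."""
    subdirectory = when.strftime(STRFTIME_FORMATS[fmt])
    return f"{base_path.rstrip('/')}/{subdirectory}/{filename}"


print(dated_path(
    "s3://bityota-demo-data/tpch/customers",
    datetime(2012, 11, 5),
    "YYYY-MM-DD",
    "customers.tbl.gz",
))
```

With the values shown, the result is the same path as the TPCH customer feed example used earlier in this step.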
Within a few minutes, BitYota's data integration adapter auto-discovers the format, schema and volume of your data set, as well as its rate of arrival or change. Before loading, you have the option to preview these and to override any discovered settings. When you override discovered settings, the system initiates a rediscovery to confirm the new settings.

Figure 4: Preview Data
You can also override the schema with the Change Schema option by providing the schema in the following format:
{"schema": [{"column_map": [
  {"name": "id", "type": "int", "load": true},
  {"name": "order_detail_id", "type": "int", "load": false}
]}]}
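For data sets with many columns, writing this JSON by hand is error-prone. The following Python sketch generates a schema override in the format shown above from a list of column descriptions. The function name `build_schema_override` is hypothetical, and only the `name`, `type` and `load` keys from the example are used.

```python
import json


def build_schema_override(columns):
    """Build the schema override JSON from (name, type, load) tuples."""
    column_map = [
        {"name": name, "type": col_type, "load": bool(load)}
        for name, col_type, load in columns
    ]
    return json.dumps({"schema": [{"column_map": column_map}]}, indent=2)


override = build_schema_override([
    ("id", "int", True),                 # load this column
    ("order_detail_id", "int", False),   # skip this column during load
])
print(override)
```

The printed text can be pasted into the Change Schema option; Python's `True` and `False` are serialized as the JSON literals `true` and `false`.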
By default, your data will be loaded into a row-oriented table with a random partitioning scheme. You can override this from the Advanced Settings screen. From there, you can choose the table format (row or column), the partitioning scheme (Range, Random or Hash) and the sub-partitioning scheme (Range, Random or Hash) for your final table. If you select Range or Hash partitioning, you will need to select the column, and for Range partitioning the data ranges, to create partitions from.
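To help choose between the three schemes, the sketch below illustrates conceptually how each one assigns a row to a partition. It is not BitYota's implementation: the function names, the use of CRC32 as the hash, and the convention that range boundaries are exclusive upper bounds are all assumptions made for illustration.

```python
import bisect
import random
import zlib


def hash_partition(value, num_partitions):
    """Equal values always land in the same partition."""
    # CRC32 is stable across runs, unlike Python's built-in hash() for strings.
    return zlib.crc32(str(value).encode("utf-8")) % num_partitions


def range_partition(value, upper_bounds):
    """upper_bounds is a sorted list of exclusive upper bounds.

    Values at or above the last bound fall into one final partition.
    """
    return bisect.bisect_right(upper_bounds, value)


def random_partition(num_partitions, rng=random):
    """Rows are spread evenly on average, regardless of their values."""
    return rng.randrange(num_partitions)


# Hypothetical ranges on a numeric column: < 100, 100 to 199, >= 200.
bounds = [100, 200]
print([range_partition(v, bounds) for v in (50, 100, 150, 200, 250)])
print(hash_partition("customer-42", 4) == hash_partition("customer-42", 4))
```

Range partitioning keeps neighbouring values together, hash partitioning groups equal values, and random partitioning balances load without regard to content.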

Next, specify how often you want to load data. There are two options for data set load frequency.
a. One Time Load
This screen (see example below) sets the parameters for a one-time load. The Start Date and End Date parameters can be used to select a subset of a data set to load on the first load, in case the data set has the recurring time interval in the path. The first load will load the data from the Start Date to the current date.
Figure 5: One Time Load settings
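To preview which files a given date window would cover, you can filter your own file listing before configuring the load. The Python sketch below assumes the date is the directory immediately above the file name, uses the YYYY-MM-DD format, and treats both the start and end dates as inclusive; the function name `select_paths` is hypothetical and the inclusive end date is an assumption, not documented behaviour.

```python
from datetime import date, datetime


def select_paths(paths, start, end, fmt="%Y-%m-%d"):
    """Keep paths whose date subdirectory falls within [start, end]."""
    selected = []
    for path in paths:
        # Assumption: the date is the second-to-last path segment,
        # as in s3://<bucketname>/<path>/<date-time>/<filename>.
        segment = path.rstrip("/").split("/")[-2]
        try:
            day = datetime.strptime(segment, fmt).date()
        except ValueError:
            continue  # not a dated subdirectory; ignore it
        if start <= day <= end:
            selected.append(path)
    return selected


base = "s3://bityota-demo-data/tpch/customers"
paths = [f"{base}/2012-11-0{d}/customers.tbl.gz" for d in (4, 5, 6)]
print(select_paths(paths, date(2012, 11, 5), date(2012, 11, 6)))
```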
b. Recurring Load
If your data is of a recurring nature (for example events data that arrives every few minutes), you can set up an automated, recurring data load.

Figure 6: Recurring Load settings
Step 4: Analysis
Once your data is loaded into your cluster, it is immediately available for querying and analysis. To query it, go to the Analysis tab.
Click on the +New Query button to open the workbench shown below. All the data sets you have loaded are visible on the right side, and the query builder is on the left. Double-clicking on any data set opens a window showing its column structure. This window can be moved anywhere on the screen. You can also drag and drop column names and the data set name into the query builder.
Type in any regular SQL statement in the query builder area. BitYota has native support for SQL-2003 OLAP operators over both structured and semi-structured data, so you can run any ad-hoc analysis without restrictions.
For JSON documents, you can access each attribute using syntax like json_column->'attribute'.
For example,
SELECT json_column->'person.name' FROM people;
would extract the name from { person: { name: "Jones", … } }
For more examples of SQL over JSON syntax, see the sample queries at http://www.bityota.com/examples/
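To make the dotted-path lookup concrete, the following Python sketch mimics what a path such as person.name resolves to inside a nested document. It only illustrates the idea and is not how BitYota evaluates the operator; the function name `get_path`, the `age` field, and returning `None` for a missing attribute are assumptions.

```python
import json


def get_path(document, dotted_path):
    """Walk a nested dict one key at a time, following a dotted path."""
    current = document
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None  # assumption: a missing attribute yields no value
        current = current[key]
    return current


# Hypothetical record from the people data set.
record = json.loads('{"person": {"name": "Jones", "age": 41}}')
print(get_path(record, "person.name"))
```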

Figure 7: Query Console
Once the query is executed, you can do several things: Schedule the query to run at a preset time and frequency, Publish the query for other members of your organization to use, and Download the query results in CSV format.

Figure 8: Query Operations
That's it! No hardware, no software, no hassle. In 4 easy steps, you've loaded millions of records of Big Data and analyzed them within minutes.