User Guide

Version 1.5

This guide provides easy-to-follow, step-by-step instructions on how to use BitYota's Service. BitYota is a Data Warehouse Service (DWS) for semi-structured data, designed from the ground up for fast, low-latency analytics on data from fast-changing web and mobile applications.

This document was last updated on Jan 06, 2014.

Table of Contents

Step 1: Signup and Login

Step 2: Create your Cluster

Step 3: Load your Data

Step 4: Analyze your Data using SQL


Step 1: Signup and Login

Sign up for our service if you have not already done so. Confirm your email address and log in.


Step 2: Create Your Cluster

Log in with your newly created account at http://service.bityota.com

When you log in for the first time, you can launch your BitYota DWS cluster on your preferred Cloud Provider and pick a plan with the hardware type and hourly cost you want. BitYota bills your account based on usage, so you pay only for what you use. The default billing method is the credit card you provided.

Depending on the Plan you signed up for, you can add one or more nodes to your cluster. BitYota DWS offers two types of nodes: Compute and Data. A Data node holds your data and also provides processing power, whereas a Compute node stores no data and exists purely to add processing power to your cluster. You need at least one Data node to get started.

Provisioning your cluster usually takes about 8-10 minutes. You will incur the standard BitYota usage fees per your plan as soon as the cluster is available, until you terminate it.

 

 

Figure 1: Launch your Cluster


Step 3: Load Your Data

While your cluster is being provisioned, you can choose the data you want to load into your DWS cluster. To do this, go to the Data tab at the top and click on the +New Source button. Select the type of data source you want to load from, give it a name, and enter the relevant details needed to connect to it. In the example below, we show how to configure an Amazon Simple Storage Service (S3) data source.

 

Figure 2: Add a new Data Source

Next, tell us a bit more about the actual data you want to load. Click on the +New Dataset button. A dataset is a collection of related data (in this case, a set of files) from a data source. There are two fields to fill out so that BitYota can discover more information about this dataset. The Dataset bucket is the top-level S3 bucket where all the data files for the dataset are located. This bucket must be readable using the access keys you provided when defining the Data Source above. The second field on the New Dataset screen is the fully qualified URL of a sample file that accurately represents your dataset. This file is read to generate a suggested schema for your data before it is loaded into the DWS. If you want to provide your own schema instead, check the "Skip Dataset Discovery" box.

 

Figure 3: Add a new Dataset

Some guidelines for organizing your dataset files:
Files (JSON, CSV, or any other delimiter-separated format) must contain one record per line. Pretty-printed JSON documents in which a single record spans multiple lines are not currently supported. For example:

{ "_id": "record 1", .... }
{ "_id": "record 2", .... }

By default all files under the main dataset directory are considered part of the dataset.

 

Recurring loads (daily, hourly, or by the minute) can be set up by organizing the files under the main dataset directory into subdirectories with the date, hour, or minute as part of the sub-directory name. BitYota supports several commonly used date/time formats in a recurring dataset's sub-directory name; the supported formats are listed in Table 1 below, and an example layout follows the table.

Data Feed Frequency     Format
Daily                   YYYYMMDD, YYYY-MM-DD, DD-MM-YYYY
Hourly                  YYYYMMDDHH, YYYY-MM-DD-HH, DD-MM-YYYY-HH
Minute by minute        YYYYMMDDHHmm, YYYY-MM-DD-HH-mm, DD-MM-YYYY-HH-mm


Table 1: Date/time Formats Supported By BitYota
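
For example, an hourly dataset using the YYYY-MM-DD-HH format might be organized like this (the bucket and file names here are purely illustrative):

s3://my-bucket/events/2014-01-23-06/part-0001.json
s3://my-bucket/events/2014-01-23-06/part-0002.json
s3://my-bucket/events/2014-01-23-07/part-0001.json

Here s3://my-bucket/events/ is the main dataset directory, and each sub-directory contains the files for one hour.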

Once you've filled in the location of your dataset and the sample file to read for discovering your dataset, hit the Create button. Within a few minutes, you will see a screen containing important details about your dataset such as whether it is a recurring load, the pattern of the subdirectories, how the data is delimited within each file, how often the data arrives and the schema. You can override any of these details from the same screen.

Data Arrival Schedule: BitYota will discover how often your data arrives (e.g. every 60 minutes) based on the sub-directories found in your dataset. This information is displayed in the "New data generated every" field. If your data arrives at a fixed offset past the start of the hour (for example at 22 minutes past each hour, i.e. at 1:22, 2:22, 3:22, etc.) and you want the BitYota load scheduler to wait until all your data for a given time frame has been written to the appropriate sub-directory, then you must specify a value in the "Wait until New data is available" field (ideally a few minutes more than the arrival offset, so the writes have time to complete). A more realistic use of the wait time is when data arrives multiple times during the hour (like an event stream); in that case, set the "Wait until New data is available" field to 60 minutes to ensure that all event data for an interval has finished arriving. Otherwise, if your load frequency is much higher than your data arrival frequency, or you don't mind some latency in your data loading, you can leave this field as is. The final field in this section is the "Timezone" field: the timezone of your data arrival times (default: UTC).

You can also override the schema with the Override Schema option by providing the schema in the following format:

{"column-name":"data type", "column-name","data type"}

For example,

{"myjsoncol1":"json"}

Click Save to confirm these details. Next, click on the Load Data button to start loading data into tables. Your cluster must be up and available in order to load data.

Figure 4: Discovered Dataset Details

Destination Table Details: Enter the name of the destination table and choose the cluster to load into from the dropdown list. Note that you can load data from the same dataset into multiple tables by specifying different destination table names on the Load Data screen, and you can also override the default dataset schema for a specific load from the same screen. Finally, choose the table layout (row or column), the partitioning scheme (Range, Random, or Hash), and the sub-partitioning scheme (Range, Random, or Hash) for your final table. If you select Range or Hash partitioning, you will also need to select the column and the data ranges used to create the partitions (a brief sketch of how the partitioning choice affects queries follows Figure 5).

Figure 5: Load Data
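
As a rough sketch of why the partitioning choice can matter, consider a table of events that is range-partitioned on an event date column (the table and column names below are purely illustrative, and the query is ordinary SQL rather than anything BitYota-specific). In most data warehouses, a query that filters on the partitioning column can be answered by reading only the partitions whose ranges overlap the filter:

SELECT count(*)
FROM events
WHERE event_date BETWEEN '2014-01-01' AND '2014-01-07';

Hash and Random partitioning, by contrast, spread rows evenly across partitions, which helps balance work when queries do not filter on a single column.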

Pre and Post Load Processing Steps: This is a flexible and simple way to build a data pipeline around your data loads. Steps are regular SQL statements or script-based User Defined Functions (UDFs) called from within a SQL statement. For example, if you want to run data quality checks prior to the load, add those as one or more sequentially executed Pre Steps; the load will wait for the pre-processing steps to finish. Use Post Steps for post-load verification or processing (for example, generating aggregated data). Additionally, you can use any of the macros listed in Table 2 below as parameters within your SQL statements (an example follows the table). The available macros are:

Macro                     Description
<NOMINAL_START_YEAR>      Nominal start year of the current load
<NOMINAL_START_MONTH>     Nominal start month of the current load
<NOMINAL_START_DAY>       Nominal start day of the current load
<NOMINAL_START_HOUR>      Nominal start hour of the current load
<NOMINAL_START_MIN>       Nominal start minute of the current load
<NOMINAL_END_YEAR>        Nominal end year of the current load
<NOMINAL_END_MONTH>       Nominal end month of the current load
<NOMINAL_END_DAY>         Nominal end day of the current load
<NOMINAL_END_HOUR>        Nominal end hour of the current load
<NOMINAL_END_MIN>         Nominal end minute of the current load
<NOMINAL_START_TIME>      Nominal start time in epoch seconds for the current load
<NOMINAL_END_TIME>        Nominal end time in epoch seconds for the current load
<JOB_INSTANCE_ID>         Unique ID for the current load
<LOAD_SCHEMA>             Schema for the destination table of the current load
<LOAD_TABLE>              Destination table of the current load


Table 2: Available Macros

Note: Nominal time is the time at which a load is scheduled to happen. In theory the nominal time and the actual load time should match; in practice, delays may cause the actual load to happen later than the nominal time.
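
For example, a Post Step might record how many rows were loaded for the interval that was just processed. The sketch below is plain SQL using the macros from Table 2; the audit table and its columns are purely illustrative, and the exact quoting of substituted values may vary:

INSERT INTO load_audit (job_id, load_day, row_count)
SELECT '<JOB_INSTANCE_ID>',
       '<NOMINAL_START_YEAR>-<NOMINAL_START_MONTH>-<NOMINAL_START_DAY>',
       count(*)
FROM <LOAD_SCHEMA>.<LOAD_TABLE>;

Each macro is replaced with its value for the current load before the statement runs.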

Figure 6: Pre and Post Steps

Load Schedule: Finally, you need to specify how often you want to load data. There are two options for this:
a. One Time Load
You can specify a Start and End date/time range to load all data that arrived in that period. This is useful for loading historical data, loading a subset of data, or doing a one-time catch-up of data that arrived late.

Figure 7: One Time Load Settings

b. Recurring Load

If your data is of a recurring nature (for example events data that arrives every few minutes), you can set up an automated, recurring data load. The screen below sets up the schedule for a recurring load. From the start date and time specified here, the scheduler will wake up at the specified frequency to look for data to load. The scheduler looks backwards to find data that arrived in the intervals between its last run and current run.

Figure 8: Recurring Load settings

Let's look at an example of a recurring load. Data for this dataset arrives once every 60 minutes, 20 minutes past the start of the hour (UTC), and is written to S3 buckets named after the hour, for example /<2014-01-23-06>. If data can arrive late and you want to ensure that all data for a given interval is loaded, change the "Wait until New data is available" field on the Create Dataset screen to, say, 30 minutes, so that all data for the hour has finished arriving. If you don't need the latest data loaded right away, specify a load interval longer than your data arrival interval (as in Scenario 1 below).

Scenario 1: Load every 120 minutes at the start of the hour starting at 6:00am with no wait time specified. Then data arrival and loading will occur as below:

Bucket               Data Arrives at    Data Loaded at
/<2014-01-23-06>     06:20              @06:00: no data loaded
/<2014-01-23-07>     07:20
/<2014-01-23-08>     08:20              @08:00: load data in buckets /<2014-01-23-06>, /<2014-01-23-07>
/<2014-01-23-09>     09:20
/<2014-01-23-10>     10:20              @10:00: load data in buckets /<2014-01-23-08>, /<2014-01-23-09>

Scenario 2: Load every 60 minutes at the start of the hour starting at 6:00 am with no wait time specified. This assumes that the data starts arriving 20 minutes past the hour and the writes to the bucket are completed before the hour is up. Data arrival and loading will then occur as below:

Bucket               Data Arrives at    Data Loaded at
/<2014-01-23-06>     06:20              @06:00: no data loaded
/<2014-01-23-07>     07:20              @07:00: load data in bucket /<2014-01-23-06>
/<2014-01-23-08>     08:20              @08:00: load data in bucket /<2014-01-23-07>
/<2014-01-23-09>     09:20              @09:00: load data in bucket /<2014-01-23-08>
/<2014-01-23-10>     10:20              @10:00: load data in bucket /<2014-01-23-09>

Scenario 3: Load every 60 minutes, specifying a wait time of 30 minutes until data is available. Data arrival and loading will then occur as below:

Bucket               Data Arrives at    Data Loaded at
/<2014-01-23-06>     06:20              @06:30: no data loaded
/<2014-01-23-07>     07:20              @07:30: load data in bucket /<2014-01-23-06>
/<2014-01-23-08>     08:20              @08:30: load data in bucket /<2014-01-23-07>
/<2014-01-23-09>     09:20              @09:30: load data in bucket /<2014-01-23-08>
/<2014-01-23-10>     10:20              @10:30: load data in bucket /<2014-01-23-09>


Step 4: Analyze Your Data Using SQL

Once your data is loaded into your cluster, it is immediately available for querying and analysis. To do this, go to the Analysis tab.

Click on the +New Query button to open a query workbench. All the tables you have loaded data into will be visible in the left navigation panel. Clicking on the arrow next to any table shows all the columns that are part of the table definition. You can drag and drop column and table names into the query window.

Type in any regular SQL statement in the query builder area. BitYota has native support for SQL-2011 OLAP operators over both structured and semi-structured data, so you can run any ad-hoc analysis without restrictions.

For JSON documents, you can access each attribute using syntax like json_column->'attribute'. For example,

SELECT json_column->'person.name' FROM people;

would extract the name from { "person": { "name": "Jones", ... } }
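
Attributes extracted this way can be used anywhere an ordinary column can. As a small sketch that builds on the illustrative people table above (the person.city attribute is equally illustrative), the following query counts people per city and ranks the cities with a window function:

SELECT json_column->'person.city' AS city,
       count(*) AS num_people,
       rank() OVER (ORDER BY count(*) DESC) AS city_rank
FROM people
GROUP BY json_column->'person.city';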

For more examples of SQL over JSON syntax, see sample queries at http://www.bityota.com/examples/

 

Figure 9: Query Builder

Once the query has executed, you can do several things: you can Schedule the query to run at a preset time and frequency, Publish the query for other members of your organization to use, or Download the query results in CSV format.

Figure 10: Query Operations

That's it! In four easy steps, you've loaded your data and analyzed it within minutes.

BitYota User Guide
Copyright © 2014 BitYota Inc. All rights reserved.