Datasource Adapters
Flat File Reader
Overview
The AMI Flat File Reader Datasource Adapter is a highly configurable adapter designed to process extremely large flat files into tables at rates exceeding 100mb per second*. There are a number of directives which can be used to control how the flat file reader processes a file. Each line (delineated by a Line feed) is considered independently for parsing. Note the EXECUTE <sql> clause supports the full AMI sql language.
*Using Pattern Capture technique (_pattern) to extract 3 fields across a 4.080 gb text file containing 11,999,504 records. This generated a table of 11,999,504 records x 4 columns in 37,364 milliseconds (additional column is the default linenum). Tested on raid-2 7200rpm 2TB drive
Generally speaking, the parser can handle three (4) different methods of parsing files:
Delimited list or ordered fields
Example data and query:
11232|1000|123.20
12412|8900|430.90
CREATE TABLE mytable AS USE _file="myfile.txt" _delim="|"
_fields="String account, Integer qty, Double px"
EXECUTE SELECT * FROM file
Key value pairs
Example data and query:
account=11232|quantity=1000|price=123.20
account=12412|quantity=8900|price=430.90
CREATE TABLE mytable AS USE _file="myfile.txt" _delim="|" _equals="="
_fields="String account, Integer qty, Double px"
EXECUTE SELECT * FROM file
Pattern Capture
Example data and query:
Account 11232 has 1000 shares at $123.20 px
Account 12412 has 8900 shares at $430.90 px
CREATE TABLE mytable AS USE _file="myfile.txt"
_fields="String account, Integer qty, Double px"
_pattern="account,qty,px=Account (.*) has (.*) shares at \\$(.*) px"
EXECUTE SELECT * FROM fil
Raw Line
If you do not specify a _fields, _mapping nor _pattern directive then the parser defaults to a simple row-per-line parser. A "line" column is generated containing the entire contents of each line from the file
CREATE TABLE mytable AS USE _file="f.txt" EXECUTE SELECT * FROM FILE
Configuring the Adapter for first use
1. Open the datamodeler (In Developer Mode -> Menu Bar -> Dashboard -> Datamodel)
2. Choose the "Add Datasource" button
3. Choose Flat File Reader
4. In the Add datasource dialog:
Name: Supply a user defined Name, ex: MyFiles
URL: /full/path/to/directory/containing/files (ex: /home/myuser/files )
(Keep in mind that the path is on the machine running AMI, not necessarily your local desktop)
5. Click "Add Datasource" Button
Accessing Files Remotely: You can also access files on remote machines as well using an AMI Relay. First install an AMI relay on the machine that contains the files, or at least has access to the files, you wish to read ( See AMI for the Enterprise documentation for details on how to install an AMI relay). Then in the Add Datasource wizard select the relay in the "Relay To Run On" dropdown.
General Directives
File name Directive (Required)
Syntax
_file=path/to/file
Overview
This directive controls the location of the file to read, relative to the datasource's url. Use the forward slash (/) to indicate directories (standard UNIX convention)
Examples
_file="data.txt" (Read the data.txt file, located at the root of the datasource's url)
_file="subdir/data.txt" (Read the data.txt file, found under the subdir directory)
Field definitions Directive (Required)
Syntax
_fields=col1_type col_name, col2_type col2_name, ...
Overview
This directive controls the Column names that will be returned, along with their types. The order in which they are defined is the same as the order in which they are returned. If the column type is not supplied, the default is String. Special note on additional columns: If the line number (see _linenum directive) column is not supplied in the list, it will default to type integer and be added to the end of the table schema. Columns defined in the Pattern (see _pattern directive) but not defined in _fields will be added to the end of the table schema.
Types should be one of: String, Long, Integer, Boolean, Double, Float, UTC
Column names must be valid variable names.
Examples
_fields="String account,Long quantity" (define two columns)
_fields ="fname,lname,int age" (define 3 columns, fname and lname default to String)
Directives for parsing Delimited list of ordered Fields
_file=file_name (Required, see general directives)
_fields=col1_type col1_name, ... (Required, see general directives)
_delim=delim_string (Required)
_conflateDelim=true|false (Optional. Default is false)
_quote=single_quote_char (Optional)
_escape=single_escape_char (Optional)
The _delim indicates the char (or chars) used to separate each field (If _conflateDelim is true, then 1 or more consecutive delimiters are treated as a single delimiter). The _fields is an ordered list of types and field names for each of the delimited fields. If the _quote is supplied, then a field value starting with quote will be read until another quote char is found, meaning delims within quotes will not be treated as delims. If the _escape char is supplied then when an escape char is read, it is skipped and the following char is read as a literal.
Examples
_delim="|"
_fields="code,lname,int age"
_quote="'"
_escape="\\"
This defines a pattern such that:
11232-33|Smith|20
'1332|ABC'||30
Account\|112|Jones|18
Maps to:
code | lname | age |
---|---|---|
11232-33 | Smith | 20 |
1332|ABC | 30 | |
Account|112 | Jones | 18 |
Directives for parsing Key Value Pairs
_file=file_name (Required, see general directives)
_fields=col1_type col1_name, ... (Required, see general directives)
_delim=delim_string (Required)
_conflateDelim=true|false (Optional. Default is false)
_equals=single_equals_char (Required)
_mappings=from1=to1,from2=to2,... (Optional)
_quote=single_quote_char (Optional)
_escape=single_escape_char (Optional)
The _delim indicates the char (or chars) used to separate each field (If _conflateDelim is true, then 1 or more consecutive delimiters are treated as a single delimiter). The _equals char is used to indicate the key/value separator. The _fields is an ordered list of types and field names for each of the delimited fields. If the _quote is supplied, then a field value starting with quote will be read until another quote char is found, meaning delims within quotes will not be treated as delims. If the _escape char is supplied then when an escape char is read, it is skipped and the following char is read as a literal.
The optional _mappings directive allows you to map keys within the flat file to file names specified in the _fields directive. This is useful when a file has key names that are not valid field names, or a file has multiple key names that should be used to populate the same column.
Examples
_delim="|"
_equals="="
_fields="code,lname,int age"
_mappings="20=code,21=lname,22=age"
_quote="'"
_escape="\\"
This defines a pattern such that:
code=11232-33|lname=Smith|age=20
code='1332|ABC'|age=30
20=Act\|112|21=J|22=18 (Note: this row will work due to the _mappings directive)
Maps to:
code | lname | age |
---|---|---|
11232-33 | Smith | 20 |
1332|ABC | 30 | |
Act|112 | J | 18 |
Directives for Pattern Capture
_file=file_name (Required, see general directives)
_fields=col1_type col1_name, ... (Optional, see general directives)
_pattern=col1_type col1_name, ...=regex_with_grouping (Required)
The _pattern must start with a list of column names, followed by an equal sign (=) and then a regular expression with grouping (this is dubbed a column-to-pattern mapping). The regular expression's first grouping value will be mapped to the first column, 2nd grouping to the second and so on.
If a column is already defined in the _fields directive, then it's preferred to not include the column type in the _pattern definition.
For multiple column-to-pattern mappings, use the \n (new line) to separate each one.
Examples
This defines a pattern such that:
User John Smith is 20 years old
User Bobby Boy is 30 years old
Maps to:
fname | lname | age |
---|---|---|
John | Smith | 20 |
Bobby | Boy | 30 |
_pattern="fname,lname,int age=User (.*) (.*) is (.*) years old\n lname,fname,int weight=Customer (.*),(.*) weighs (.*) pounds"
This defines two patterns such that:
User John Smith is 20 years old
User Boy,Bobby weighs 130 pounds'
Maps to:
fname | lname | age | weight |
---|---|---|---|
John | Smith | 20 | |
Bobby | Boy | 130 |