About Column Information and Statistics
- Last Updated: January 7, 2016
- 4 minute read
- DataDirect Connectors
- JDBC
- MongoDB 6.0
- Documentation
Column information and statistics are based on the number of rows sampled. The
default sample size is set to 1000 rows. You can
adjust the sample size by specifying a different number of rows for the
ColumnDiscoverySampleSize value in the ConfigOptions connection property. Refer to
"Connection Property Descriptions" in your driver documentation for details on
setting this and other options.
When the sample size is less than the total number of rows in a table,
statistics are based on the sample and not all the values in the table. For example, the
table RESIDENTS has 20,000 rows of data. If the sample size
is 5000 rows, statistics are based on the first 5000 rows of data returned. Statistics are
not collected on the remaining 15,000 rows of data. In turn, when the sample size is equal
to or greater than the number of rows in a table, statistics are based on all the rows
within the table.
Information and Statistics Displayed in the Column Information Pane
- Data Type is the name of the native data type. (Refer to "Data Types" in the driver documentation for information on supported data types.)
- Max Display Length is the longest value found
in a column within the sample. For example, in a sample of the column
LAKES, the valuesJordanandDurgam Cheruvuare found. In this scenario, the maximum display length would be 14. - Min Array Size is the minimum number of
elements found in an array. For example, in a sample of the column
EMAIL, the Schema Tool finds 2 email addresses in one row and 5 email addresses in another. The Min Array Size would be 2. - Max Array Size is the maximum number of
elements found in an array. For example, in a sample of the column
EMAIL, the Schema Tool finds 2 email addresses in one row and 5 email addresses in another. The Max Array Size would be 5. - Occurrences/Density displays:
- The number of rows within the sample size containing a column value as the given data type (occurrences)
- The percentage of rows within the sample size containing the column as a given data type (density)
How Sample Size Affects Occurrences/Density
- In a scenario where the sample size has been set to
5000, the first 5000 rows of the columnEMAILcontain 1000 occurrences of the Array data type and 3000 occurrences of the String data type. When theEMAILcolumn is selected, the Density/Occurrences values forARRAYwould appear as1000 (20.00%)while the values forSTRINGwould appear as3000 (60.00%). - In a scenario where the sample size has been set to
10000rows, the columnNAMESis selected from a table with 8000 rows. In this case, column statistics are based on all the data in the column. If 2000 String data types are discovered, the Occurrences/Density values would appear as2000 (25.00%). In turn, if 4000 Object data types are discovered, the Occurrences/Density values would appear as4000 (50.00%).
How Sample Size Affects Mapping
Sample size affects column information as well as column statistics. For example, if the Schema Tool discovers that all the sampled rows in a column have the Double data type, that data type is mapped as Double in the relational view of the table. In contrast, if the Schema Tool discovers a column with multiple data types within the sampled rows, the SQL type is determined by the combination of data types detected within the column. For details, see "Default Mapping of Columns with Inconsistent Native Data Types."
Column Size for Varchar
For columns mapped to Varchar, the driver truncates values that exceed the column size defined for the column when constructing the relational map of your data. For example, if you have a column with a defined column size of 150 that contains a value with 200 characters, the driver returns only the first 150 characters of that value.
During the initial discovery and normalization process, you can use the DefaultVarcharSize configuration option to specify the default length of fields that are discovered and mapped as Varchar by the driver. If the driver discovers a field with String data of a greater length, the String data is truncated to the length of the specified value.
After the initial discovery and normalization process, you can define the column size of individual columns from the main display of the Schema Tool. In the main display, select the table from the Available Schemas pane. In the Column Information pane, you can specify a new column size in the VARCHAR Size field.
If you merely want to find out the size of a Varchar column, you can execute the getColumns function. The column size defined will be the value of the COLUMN_SIZE column.
Column Size for Varbinary
For columns mapped to Varbinary, the driver truncates values that exceed the column size defined for the column when constructing the relational map of your data. For example, if you have a column with a defined column size of 8000 that contains a value with 9000 bytes, the driver will return only the first 8000 bytes of that value. The default maximum size of Varbinary columns is set 8000 bytes.
After the initial discovery and normalization process, you can define the column size of individual columns from the main display of the Schema Tool. In the main display, select the table from the Available Schemas pane. In the Column Information pane, you can specify a new column size in the VARCHAR Size field.
If you merely want to find out the size of the Varbinary column, you can execute the getColumns function. The column size defined will be the value of the COLUMN_SIZE column.