About Column Information and Statistics

Column information and statistics are based on the number of rows sampled. The default sample size is set to 1000 rows. You can adjust the sample size by specifying a different number of rows for the ColumnDiscoverySampleSize value in the ConfigOptions connection property. Refer to "Connection Property Descriptions" in your driver documentation for details on setting this and other options.

When the sample size is less than the total number of rows in a table, statistics are based on the sample and not all the values in the table. For example, the table RESIDENTS has 20,000 rows of data. If the sample size is 5000 rows, statistics are based on the first 5000 rows of data returned. Statistics are not collected on the remaining 15,000 rows of data. In turn, when the sample size is equal to or greater than the number of rows in a table, statistics are based on all the rows within the table.

Information and Statistics Displayed in the Column Information Pane

When you select a column (or field) from the Column Information pane, you receive the following information on the data types found in the column:

Data Type is the name of the native data type. (Refer to "Data Types" in the driver documentation for information on supported data types.)
Max Display Length is the longest value found in a column within the sample. For example, in a sample of the column LAKES, the values Jordan and Durgam Cheruvu are found. In this scenario, the maximum display length would be 14.
Min Array Size is the minimum number of elements found in an array. For example, in a sample of the column EMAIL, the Schema Tool finds 2 email addresses in one row and 5 email addresses in another. The Min Array Size would be 2.
Max Array Size is the maximum number of elements found in an array. For example, in a sample of the column EMAIL, the Schema Tool finds 2 email addresses in one row and 5 email addresses in another. The Max Array Size would be 5.
Occurrences/Density displays:
1. The number of rows within the sample size containing a column value as the given data type (occurrences)
2. The percentage of rows within the sample size containing the column as a given data type (density)

How Sample Size Affects Occurrences/Density

The following examples show how sample size affects the Occurrences/Density values.

In a scenario where the sample size has been set to 5000, the first 5000 rows of the column EMAIL contain 1000 occurrences of the Array data type and 3000 occurrences of the String data type. When the EMAIL column is selected, the Density/Occurrences values for ARRAY would appear as 1000 (20.00%) while the values for STRING would appear as 3000 (60.00%).
In a scenario where the sample size has been set to 10000 rows, the column NAMES is selected from a table with 8000 rows. In this case, column statistics are based on all the data in the column. If 2000 String data types are discovered, the Occurrences/Density values would appear as 2000 (25.00%). In turn, if 4000 Object data types are discovered, the Occurrences/Density values would appear as 4000 (50.00%).

How Sample Size Affects Mapping

Sample size affects column information as well as column statistics. For example, if the Schema Tool discovers that all the sampled rows in a column have the Double data type, that data type is mapped as Double in the relational view of the table. In contrast, if the Schema Tool discovers a column with multiple data types within the sampled rows, the SQL type is determined by the combination of data types detected within the column. For details, see "Default Mapping of Columns with Inconsistent Native Data Types."

Column Size for Varchar

For columns mapped to Varchar, the driver truncates values that exceed the column size defined for the column when constructing the relational map of your data. For example, if you have a column with a defined column size of 150 that contains a value with 200 characters, the driver returns only the first 150 characters of that value.

During the initial discovery and normalization process, you can use the DefaultVarcharSize configuration option to specify the default length of fields that are discovered and mapped as Varchar by the driver. If the driver discovers a field with String data of a greater length, the String data is truncated to the length of the specified value.

After the initial discovery and normalization process, you can define the column size of individual columns from the main display of the Schema Tool. In the main display, select the table from the Available Schemas pane. In the Column Information pane, you can specify a new column size in the VARCHAR Size field.

If you merely want to find out the size of a Varchar column, you can execute the getColumns function. The column size defined will be the value of the COLUMN_SIZE column.

Column Size for Varbinary

For columns mapped to Varbinary, the driver truncates values that exceed the column size defined for the column when constructing the relational map of your data. For example, if you have a column with a defined column size of 8000 that contains a value with 9000 bytes, the driver will return only the first 8000 bytes of that value. The default maximum size of Varbinary columns is set 8000 bytes.

After the initial discovery and normalization process, you can define the column size of individual columns from the main display of the Schema Tool. In the main display, select the table from the Available Schemas pane. In the Column Information pane, you can specify a new column size in the VARCHAR Size field.

If you merely want to find out the size of the Varbinary column, you can execute the getColumns function. The column size defined will be the value of the COLUMN_SIZE column.

Progress DataDirect for JDBC for MongoDB User's Guide Release 6.0.2