adios_db.data_sources.env_canada.v3 package

This version of the Env. Canada data source importing modules is designed to import the file:

ests_data_03-03-2021.csv

Submodules

adios_db.data_sources.env_canada.v3.mapper module

class adios_db.data_sources.env_canada.v3.mapper.EnvCanadaCsvRecordMapper1999(record)

Bases: EnvCanadaCsvRecordMapper

A translation/conversion layer for the Environment Canada imported record object. The parser has already put the structure mostly in order, but because of the nature of the .csv measurement rows, some re-mapping is necessary to put it in the form that the Oil object expects.

get_ref_year(name, reference): Search the name and reference text looking for a year

measurement_is_ok(measurement, value_attr)

Determine if a measurement object is good or not.

Parameters:

measurement (JSON struct) – The JSON measurement object.
value_attr (str) – The name of the attribute that contains the value of the measurement

remap_choose_distillation_set()

We start with two attributes (‘cwf_cuts’, ‘tco_cuts’):
- cwf_cuts == cuts from Boiling Point Cumulative Weight Fraction
- tco_cuts == cuts from Boiling Point Temperature Cut Off

A record may have one or the other or both of these sets. Whichever set exists in the parsed object will be renamed ‘cuts’. In the case that both sets exist in the parsed object, the set with the most data points will be renamed ‘cuts’.
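The selection rule can be sketched as follows. This is only an illustration: the function name `choose_distillation_set` and the use of a plain dict for the parsed object are assumptions, and the tie-breaking behavior (the first set wins) is not specified by the description above.

```python
def choose_distillation_set(parsed: dict) -> dict:
    """Rename the larger of 'cwf_cuts' / 'tco_cuts' to 'cuts' (hypothetical helper)."""
    cwf = parsed.pop('cwf_cuts', None)
    tco = parsed.pop('tco_cuts', None)

    # Keep only the sets that actually exist in the parsed object.
    candidates = [cuts for cuts in (cwf, tco) if cuts is not None]

    if candidates:
        # max() returns the first maximal item, so 'cwf_cuts' wins a tie.
        # That tie-breaking choice is an assumption of this sketch.
        parsed['cuts'] = max(candidates, key=len)

    return parsed


both = choose_distillation_set({'cwf_cuts': [1, 2, 3], 'tco_cuts': [1, 2]})
only_tco = choose_distillation_set({'tco_cuts': [5]})
```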

remap_distillation_final_bp()

remap_distillation_sort_by_fraction()

remap_emulsions()

remap_interfacial_tension()

remap_oil_api()

remap_reference_codes(): The content of the reference will be either a single code or a pipe ‘|’ delimited sequence of codes that reference the full title(s) of the reference document. We will convert the sequence of codes into a sequence of full titles separated by a newline ‘

‘.
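A minimal sketch of the code-to-title conversion, assuming the lookup table is a plain dict. The helper name `expand_reference_codes`, the sample codes, and the choice to keep unknown codes unchanged are all hypothetical.

```python
def expand_reference_codes(reference: str, titles: dict) -> str:
    """Turn 'CODE1|CODE2' into 'Title 1\\nTitle 2' (hypothetical helper)."""
    codes = [code.strip() for code in reference.split('|') if code.strip()]

    # Unknown codes are passed through unchanged (an assumption of this sketch).
    return '\n'.join(titles.get(code, code) for code in codes)


# Made-up codes and titles, for illustration only.
titles = {'A01': 'First Report', 'B02': 'Second Report'}
expanded = expand_reference_codes('A01|B02', titles)
```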

reorder_methods(methods): This method receives a list of method names, and modifies the list in-place.

value_unit_is_ok(value_unit)

Determine if a value/unit object inside a measurement is good or not.

Parameters:

value_unit (JSON struct) – The JSON value/unit object.

vs_map = {'meso-stable': 'Mesostable', 'no': 'Did not form', 'non': 'Did not form', 'not stable': 'Unstable', 'yes': 'Unknown stability'}

adios_db.data_sources.env_canada.v3.parser module

class adios_db.data_sources.env_canada.v3.parser.BPCumulativeWeightFraction(property_group: str = None, property_name: str = None, value: float = None, min_value: float = None, max_value: float = None, unit_of_measure: str = None, temperature: str = None, condition_of_analysis: str = None, standard_deviation: float = None, replicates: float = None, method: str = None)

Bases: ECMeasurement

ref_temp_attr = 'vapor_temp'

value_attr = 'fraction'

class adios_db.data_sources.env_canada.v3.parser.BPTemperatureDistribution(property_group: str = None, property_name: str = None, value: float = None, min_value: float = None, max_value: float = None, unit_of_measure: str = None, temperature: str = None, condition_of_analysis: str = None, standard_deviation: float = None, replicates: float = None, method: str = None)

Bases: ECMeasurement

final_bp = False

py_json()

ref_temp_attr = 'vapor_temp'

value_attr = 'fraction'

class adios_db.data_sources.env_canada.v3.parser.ECAdhesion(property_group: str = None, property_name: str = None, value: float = None, min_value: float = None, max_value: float = None, unit_of_measure: str = None, temperature: str = None, condition_of_analysis: str = None, standard_deviation: float = None, replicates: float = None, method: str = None)

Bases: ECValueOnly

value_attr = 'adhesion'

class adios_db.data_sources.env_canada.v3.parser.ECCompound(property_group: str = None, property_name: str = None, value: float = None, min_value: float = None, max_value: float = None, unit_of_measure: str = None, temperature: str = None, condition_of_analysis: str = None, standard_deviation: float = None, replicates: float = None, method: str = None)

Bases: ECCompoundUngrouped

py_json()

class adios_db.data_sources.env_canada.v3.parser.ECCompoundUngrouped(property_group: str = None, property_name: str = None, value: float = None, min_value: float = None, max_value: float = None, unit_of_measure: str = None, temperature: str = None, condition_of_analysis: str = None, standard_deviation: float = None, replicates: float = None, method: str = None)

Bases: ECMeasurement

determine_unit_type()

py_json()

class adios_db.data_sources.env_canada.v3.parser.ECDensity(property_group: str = None, property_name: str = None, value: float = None, min_value: float = None, max_value: float = None, unit_of_measure: str = None, temperature: str = None, condition_of_analysis: str = None, standard_deviation: float = None, replicates: float = None, method: str = None)

Bases: ECMeasurement

value_attr = 'density'

class adios_db.data_sources.env_canada.v3.parser.ECDispersibility(property_group: str = None, property_name: str = None, value: float = None, min_value: float = None, max_value: float = None, unit_of_measure: str = None, temperature: str = None, condition_of_analysis: str = None, standard_deviation: float = None, replicates: float = None, method: str = None)

Bases: ECValueOnly

py_json()

value_attr = 'effectiveness'

class adios_db.data_sources.env_canada.v3.parser.ECEmulsion(property_group: str = None, property_name: str = None, value: float = None, min_value: float = None, max_value: float = None, unit_of_measure: str = None, temperature: str = None, condition_of_analysis: str = None, standard_deviation: float = None, replicates: float = None, method: str = None)

Bases: ECMeasurement

py_json()

class adios_db.data_sources.env_canada.v3.parser.ECEvaporationEq(property_group: str = None, property_name: str = None, value: float = None, min_value: float = None, max_value: float = None, unit_of_measure: str = None, temperature: str = None, condition_of_analysis: str = None, standard_deviation: float = None, replicates: float = None, method: str = None)

Bases: ECMeasurement

py_json()

class adios_db.data_sources.env_canada.v3.parser.ECInterfacialTension(property_group: str = None, property_name: str = None, value: float = None, min_value: float = None, max_value: float = None, unit_of_measure: str = None, temperature: str = None, condition_of_analysis: str = None, standard_deviation: float = None, replicates: float = None, method: str = None)

Bases: ECMeasurement

py_json(): What do we do when the value is ‘Too Viscous’?

value_attr = 'tension'

class adios_db.data_sources.env_canada.v3.parser.ECMeasurement(property_group: str = None, property_name: str = None, value: float = None, min_value: float = None, max_value: float = None, unit_of_measure: str = None, temperature: str = None, condition_of_analysis: str = None, standard_deviation: float = None, replicates: float = None, method: str = None)

Bases: ECMeasurementDataclass

classmethod from_obj(obj)

py_json()

ref_temp_attr = 'ref_temp'

value_attr = 'measurement'

class adios_db.data_sources.env_canada.v3.parser.ECMeasurementDataclass(property_group: str = None, property_name: str = None, value: float = None, min_value: float = None, max_value: float = None, unit_of_measure: str = None, temperature: str = None, condition_of_analysis: str = None, standard_deviation: float = None, replicates: float = None, method: str = None)

Bases: object

An incoming density will have the attributes:
- value
- unit_of_measure
- temperature
- condition_of_analysis
- standard_deviation
- replicates
- method

We will output an object with the attributes:
- measurement (Measurement type)
- method
- ref_temp (Temperature Measurement type)

condition_of_analysis: str = None

determine_unit_type()

fix_unit()

Some units are in the form ‘X or Y’. We will just choose the first one.

Temperature units (e.g. ‘°C’) need to be stripped of the degree character

Illegible unicode for ‘^2’ needs to be corrected.
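The first two rules might look something like the sketch below. The standalone function form and the sample unit strings are assumptions. The ‘^2’ repair is left out because the garbled character is not identified here.

```python
def fix_unit(unit: str) -> str:
    """Normalize a unit string (sketch covering two of the three rules)."""
    # 'X or Y' -> 'X': just choose the first alternative.
    unit = unit.split(' or ')[0].strip()

    # Strip the degree character, e.g. '°C' -> 'C'.
    return unit.replace('\N{DEGREE SIGN}', '')


celsius = fix_unit('°C')
first_choice = fix_unit('g/mL or kg/L')  # made-up unit string
```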

fix_value_if_min_max()

There are cases where our self.value contains a string indicating an interval representing a min-max. If this is the case, fix it by splitting it into self.min_value & self.max_value.

The datasheet represents interval data in many different ways, so here are the cases I have found:

- Case ‘N-N’: Split on the ‘-’. This makes it ambiguous whether the second number is negative or not, but we accept the data as it comes in from the data sheet.
- Case ‘N-N-N’: Split on the ‘-’. Take the first two items.
- Case ‘N - N’: Split on the ‘-’. Ignore the spaces.
- Case ‘N to N’: Split on the ‘to’. Ignore the spaces.
- Case ‘N, N’: Split on the ‘,’. Ignore the spaces.
- Case ‘-N’: Single negative number, don’t do anything.
- Case ‘>N’: min_value = N, max_value = None
- Case ‘<N’: min_value = None, max_value = N
- Case ‘N (min.)’: min_value = N
- Case ‘N (max.)’: max_value = N
- Case ‘N1 (min.), N2 (max.)’: Split on the ‘,’. min_value = N1, max_value = N2
- Case ‘min N1, max N2’: Split on the ‘,’. min_value = N1, max_value = N2
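Several of these cases can be sketched with a few regular expressions. This is not the module's implementation: the function name `parse_interval`, the tuple return value, and returning None for non-intervals are assumptions. The no-space dash cases (‘N-N’, ‘N-N-N’) are deliberately left out, since the sign ambiguity needs separate handling.

```python
import re

NUM = r'[-+]?\d+(?:\.\d+)?'


def parse_interval(text: str):
    """Return (min_value, max_value), or None if text is not an interval."""
    text = text.strip()

    m = re.fullmatch(rf'>\s*({NUM})', text)
    if m:
        return float(m.group(1)), None

    m = re.fullmatch(rf'<\s*({NUM})', text)
    if m:
        return None, float(m.group(1))

    # 'N to N', 'N, N', and 'N - N' (a dash with spaces on both sides).
    m = re.fullmatch(rf'({NUM})\s*(?:to|,|\s-\s)\s*({NUM})', text)
    if m:
        return float(m.group(1)), float(m.group(2))

    # Annotated bounds: 'N (min.)', 'N (max.)', 'N1 (min.), N2 (max.)',
    # and 'min N1, max N2'.
    lo = hi = None
    for part in text.split(','):
        part = part.strip()
        m = (re.fullmatch(rf'(?P<num>{NUM})\s*\((?P<kind>min|max)\.\)', part)
             or re.fullmatch(rf'(?P<kind>min|max)\s+(?P<num>{NUM})', part))
        if m is None:
            return None  # e.g. a single number such as '-5': leave it alone
        if m.group('kind') == 'min':
            lo = float(m.group('num'))
        else:
            hi = float(m.group('num'))

    return lo, hi
```

For example, `parse_interval('5 to 10')` yields `(5.0, 10.0)`, while a plain negative number such as `'-5'` yields `None` so the caller can treat it as an ordinary value.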

max_value: float = None

method: str = None

min_value: float = None

parse_temperature_string(): The temperature field can have varying content, like ‘15 °C’ or simply ‘15’, in which case we will assume it is Celsius.
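A rough sketch of that parsing rule. The function name `parse_temperature`, the returned dict shape, and the accepted unit letters are assumptions made for illustration; the real method works on the object's own temperature attribute.

```python
import re


def parse_temperature(text: str) -> dict:
    """Parse strings like '15 °C' or '15'; a missing unit means Celsius."""
    m = re.fullmatch(r'\s*([-+]?\d+(?:\.\d+)?)\s*(?:°?\s*([CFK]))?\s*', text)
    if m is None:
        raise ValueError(f'unrecognized temperature: {text!r}')

    value, unit = m.groups()

    # No unit in the field: assume Celsius, as described above.
    return {'value': float(value), 'unit': unit or 'C'}


with_unit = parse_temperature('15 °C')
bare_number = parse_temperature('15')
```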

property_group: str = None

property_name: str = None

replicates: float = None

set_ranged_values(min_value, max_value, *args)

Each value passed in is part of a min/max interval. There are also a few cases where the min/max quality of the value is annotated with a ‘ (min.)’ or a ‘ (max.)’ suffix.

set_value(value)

split_value_on_nospace_dashes()

Cases like ‘N-N’ and ‘N-N-N’ are problematic when parsed by regex because the ‘-’ character is also used as a sign indicator for the numbers.

A possible algorithm is to split the string on all dashes, and the empty items in our resulting list will indicate a dash was used as a sign character.

Ex. ‘5-4’ -> [‘5’, ‘4’]

Ex. ‘-5--4’ -> [‘’, ‘5’, ‘’, ‘4’] (each empty item marks a dash that was used as a sign)

Note: We will not try to turn these items into float values here, but we would like them to be numeric. Otherwise, we don’t have an interval.
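The split-on-dashes idea can be sketched as below. The function names are hypothetical, and the sketch goes one step further than the description by reattaching the sign to the item that follows an empty entry. Items stay as strings, and a separate check decides whether they form a numeric interval.

```python
def split_on_dashes(text: str) -> list:
    """Split 'N-N' style strings, treating empty items as sign markers."""
    parts = []
    negative = False

    for item in text.split('-'):
        if item == '':
            # An empty item means the following dash was a sign character.
            negative = True
            continue
        parts.append('-' + item if negative else item)
        negative = False

    return parts


def looks_like_interval(parts: list) -> bool:
    """True if there are at least two items and all of them are numeric."""
    if len(parts) < 2:
        return False
    try:
        for part in parts:
            float(part)
    except ValueError:
        return False
    return True


simple = split_on_dashes('5-4')
signed = split_on_dashes('-5--4')
```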

standard_deviation: float = None

temperature: str = None

treat_any_bad_initial_values()

unit_of_measure: str = None

value: float = None

class adios_db.data_sources.env_canada.v3.parser.ECValueOnly(property_group: str = None, property_name: str = None, value: float = None, min_value: float = None, max_value: float = None, unit_of_measure: str = None, temperature: str = None, condition_of_analysis: str = None, standard_deviation: float = None, replicates: float = None, method: str = None)

Bases: ECMeasurement

py_json()

class adios_db.data_sources.env_canada.v3.parser.ECViscosity(property_group: str = None, property_name: str = None, value: float = None, min_value: float = None, max_value: float = None, unit_of_measure: str = None, temperature: str = None, condition_of_analysis: str = None, standard_deviation: float = None, replicates: float = None, method: str = None)

Bases: ECMeasurement

py_json(): What do we do when the value is ‘Too Viscous’?

value_attr = 'viscosity'

class adios_db.data_sources.env_canada.v3.parser.EnvCanadaCsvRecordParser1999(values)

Bases: ParserBase

A record class for the Env. Canada .csv flat data file. This is intended to be used with a set of data representing a single oil record from the data file. This set is in the form of a list containing dict objects, each representing a single measurement for the oil we are processing.

There are a number of reference fields, i.e. fields that associate a particular measurement to an oil. They are:
- oil_id: ID of an oil record. This appears to be the camelcase name of the oil joined by an underscore with the ECCC oil ID. There is one common value per oil, but there are redundant copies of this field in every measurement.
- oil_index: ID of an oil record with one or more sub-samples. Similar to the ests field in the other EC data set. There is one common value per oil, but there are redundant copies of this field in every measurement.
There are also a number of fields that would not normally be used to link a measurement to an oil, but are clearly oil general properties. There is usually one actual field value per oil, but there are redundant copies in every measurement. Sometimes though, there are multiple names that show up in the measurements for an oil. Biodiesel records are an example of this.
- oil_name
- referenceID: A code to be used to look up the reference title in the
  EC Reference document “References-Catalogue_of_Crude_Oil_and_Oil_Product_Properties_(1999)-Revised_2022_En_and_Fr.pdf”
- sample_reference: The name of the reference. This almost always
  matches the lookup value correlated with the referenceID, so there is some redundancy here.
- reference: Would have expected something like the title of a document
  here, but mostly this contains a ‘N/A’ value. The fields that are actually filled in contain the same referenceID code.
- comments:
- origin: The location where the oil originates.
- synonyms
There are a number of fields that would intuitively seem to associate a measurement with a sub-sample. There is usually one common value per sub-sample, but there are redundant copies in every measurement.
- index_id: ID of an oil sample. This appears to be the concatenation
  of the oil_index and a sample number separated by a period “.”.
- weathering_fraction: This appears to be a percentage value in the
  range 0-100.
- grade
And finally, we have a set of fields that are used uniquely for the measurement
- property_name
- property_group
- property_id
- value
- value_id
- unit_of_measure
- temperature
- condition_of_analysis
- note

property API

API Gravity needs to be stored as an oil property, but it is in fact a sub-sample scoped property. So we need to figure out the fresh sample ID and get that specific API gravity property.

Note: API for Biodiesels shows a weathering value of ‘None’, but clearly it is the “fresh sample”. We need to allow it.

property fresh_sample_id

get_subsample(sample_id)

oil_common_props = ('oil_name', 'oil_index', 'synonyms', 'origin', 'referenceID', 'comments')

property product_type

prune_incoming(values): The Incoming objects contain some unwanted garbage from the spreadsheet that would be better handled before we start parsing anything.

sample_id_field_name = 'index_id'

property sample_ids: This function relies on dict preserving the order in which keys were inserted. This is an implementation detail of CPython 3.6 and a guaranteed language feature from Python 3.7 onward.

set_aggregate_oil_property(attr)

Oil scoped properties are redundantly stored in each measurement object in our list, so they need to be accumulated and treated in some way depending on the type of data we would like to set in the model.

Attributes to be treated as strings will have their values accumulated in a unique set to prune the redundant information, and then the unique strings in the set will be concatenated into a single string.
Attributes to be treated as integers will also be accumulated in a unique set to prune the redundant values. But multiple ints cannot be stored in another int the same way a string can. So we issue a warning and then use the first one in the set. This isn’t perfect, but there are only a handful of oil scoped attributes and we can make an exception if there is an obvious problem.
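The accumulation strategy might be sketched like this. The function name `aggregate_property`, the `as_int` flag, and the ‘, ’ separator are assumptions. The sketch also uses an order-preserving de-duplication rather than a plain set, so that "the first one" is well defined.

```python
import logging

logger = logging.getLogger(__name__)


def aggregate_property(measurements: list, attr: str, as_int: bool = False):
    """Collapse the redundant copies of an oil scoped attribute."""
    # dict.fromkeys() drops duplicates while keeping first-seen order.
    unique = list(dict.fromkeys(
        m[attr] for m in measurements if m.get(attr) is not None
    ))

    if not unique:
        return None

    if as_int:
        if len(unique) > 1:
            logger.warning('multiple values for %s: %s', attr, unique)
        return int(unique[0])

    # Strings: join the unique values (the separator is an assumption).
    return ', '.join(str(value) for value in unique)


# Made-up measurement rows, for illustration only.
rows = [
    {'oil_name': 'A', 'oil_index': 3},
    {'oil_name': 'A', 'oil_index': 3},
    {'oil_name': 'B', 'oil_index': 3},
]
name = aggregate_property(rows, 'oil_name')
index = aggregate_property(rows, 'oil_index', as_int=True)
```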

set_aggregate_oil_props()

These are properties commonly associated with an oil.

There is a copy of this information inside every measurement, so we need to reconcile them in order to come up with an aggregate value with which to set the oil properties.

set_aggregate_subsample_props()

These are properties commonly associated with a sub-sample. There is a copy of this information inside every measurement, so we need to reconcile them to determine the identifying properties of each sub-sample.

Sub-sample properties:

- ests_id: One common value per sub-sample. This could be numeric, so we force it to be a string.
- weathering_fraction: One value per sub-sample. These values look like some kind of code that EC uses. Probably not useful to us.
- weathering_percent: One common value per sub-sample. These values are mostly a string in the format ‘N.N%’. We will convert to a structure suitable for a Measurement type.
- weathering_method: One common value per sub-sample. This is information that might be good to save, but it doesn’t fit into the Adios oil model.
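As an illustration of the weathering_percent conversion, the sketch below turns an ‘N.N%’ string into a small value/unit dict. The function name and the exact output structure are assumptions. The real Measurement type may expect a different shape.

```python
def parse_percent(text: str):
    """Convert 'N.N%' into a value/unit dict, or None if it does not fit."""
    text = text.strip()
    if not text.endswith('%'):
        return None

    try:
        value = float(text[:-1])
    except ValueError:
        return None

    # The {'value', 'unit'} layout is an assumption of this sketch.
    return {'value': value, 'unit': '%'}


weathered = parse_percent('12.5%')
```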

set_measurement_property(obj_in)

Set a single measurement from an incoming measurement object

Basically we need to decide how to apply the property to our record

- Oil scoped properties are applied to the oil object.
- Sample scoped properties are applied to a particular sub-sample determined by the object.

The properties that describe the measurement are:

- value_id: This is a concatenation of the ests and property_id fields delimited with underscores ‘_’.
- property_id: This is a concatenation of the camel cased property_name and, as far as I can tell, the index value of the sequence in which the property appears.
- property_group: This is the name of a group or category with which a set of measurements might be associated.
- property_name: The prose name of the property that is measured.
- unit_of_measure: The units for which the measurement describes a quantity.
- temperature: The temperature at which the measurement was taken.
- condition_of_analysis: A reasonably free-form line of text that describes some special condition of the measurement, such as a prerequisite for measurement, a specification on the type of measurement, or its result.
- value: A number representing the quantity of the measurement.
- standard_deviation: The amount of variation in the set of measurements taken.
- replicates: A number representing the quantity of repeated experiments where measurements were taken.
- method: A line of text showing the name of the testing method.

set_measurement_props(): All objects in the incoming list have the primary function of describing a particular measurement of an oil. Here we iterate over these objects.

value_is_invalid(value)

class adios_db.data_sources.env_canada.v3.parser.SeparatorString

Bases: str

This is simply a way to specify a custom separator into our mapping list items that are of type str.

sep = '|'

adios_db.data_sources.env_canada.v3.reader module

class adios_db.data_sources.env_canada.v3.reader.EnvCanadaCsvFile1999(name, encoding='utf-8', **kwargs)

Bases: EnvCanadaCsvFile

A file reader for the Env. Canada .csv flat data file referencing the data from the year 1999. This is reasonably similar to the previous data set we received from them, but some of the columns are different and there are some minor differences in the data.

The name of the file is: Catalogue_of_Crude_Oil_and_Oil_Product_Properties_(1999)-Revised_2022_En.csv

The fields are comma separated ‘,’. This may prove to be problematic, as some fields, notably the reference field, could contain commas as well.
Each row represents a single measurement.
There are a number of reference fields, i.e. fields that associate a particular measurement to an oil. They are:
- oil_id: ID of an oil record. This appears to be the camelcase
  name of the oil joined by an underscore with its oil ID.
- oil_index: ECCC ID of an oil record with one or more sub-samples.
  This is similar to the ests field of the other data set.
There are also a number of fields that would not normally be used to link a measurement to an oil, but are clearly oil general properties.
- oil_name
- referenceID
- reference
- sample_reference
- comments
- origin
- synonyms
There are a number of fields that would intuitively seem to be used to link a measurement to a sub-sample
- index_id: ID of an oil sample. This appears to be the concatenation
  of the oil_index and a sample number separated by a period “.”.
- weathering_fraction: This appears to be a percentage value in the
  range 0-100.
- grade
And finally, we have a set of fields that are used uniquely for the measurement
- property_name
- property_group
- property_id
- value
- value_id
- unit_of_measure
- temperature
- condition_of_analysis
- note
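To illustrate the layout described above, the sketch below reads a few made-up rows with the standard library csv module and groups them by oil_index. It assumes that fields containing commas are quoted in the file, and it uses only a handful of the 22 columns. The sample data is invented and this is not the reader class itself.

```python
import csv
import io
from collections import defaultdict

# Invented sample rows; the real file has 22 columns.
sample = (
    'oil_index,oil_name,reference,property_name,value\n'
    '1,Example Crude,"Author A, Author B, 1999",Density,0.85\n'
    '1,Example Crude,"Author A, Author B, 1999",Viscosity,12\n'
    '2,Other Oil,N/A,Density,0.91\n'
)

# Each row is one measurement, so collect the rows belonging to each oil.
rows_by_oil = defaultdict(list)
for row in csv.DictReader(io.StringIO(sample)):
    rows_by_oil[row['oil_index']].append(row)
```

Because the reference field is quoted, its embedded commas stay inside a single field instead of shifting the remaining columns.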

number_of_columns = 22

oil_id_field_name = 'oil_index'

adios_db.data_sources.env_canada.v3.refcode_lu module

adios_db.data_sources.env_canada.v3.refcode_lu.get_text_content_lines(filename)

adios_db.data_sources.env_canada.v3.refcode_lu.parse_ref_entries(bufflines): Method for parsing the lines of text in the reference codes document from the organization: Environment and Climate Change Canada