adios_db.data_sources.env_canada.v3 package
Submodules
adios_db.data_sources.env_canada.v3.mapper module
adios_db.data_sources.env_canada.v3.parser module
- class adios_db.data_sources.env_canada.v3.parser.BPCumulativeWeightFraction(property_group: str = None, property_name: str = None, value: float = None, min_value: float = None, max_value: float = None, unit_of_measure: str = None, temperature: str = None, condition_of_analysis: str = None, standard_deviation: float = None, replicates: float = None, method: str = None)
Bases:
ECMeasurement
- ref_temp_attr = 'vapor_temp'
- value_attr = 'fraction'
- class adios_db.data_sources.env_canada.v3.parser.BPTemperatureDistribution(property_group: str = None, property_name: str = None, value: float = None, min_value: float = None, max_value: float = None, unit_of_measure: str = None, temperature: str = None, condition_of_analysis: str = None, standard_deviation: float = None, replicates: float = None, method: str = None)
Bases:
ECMeasurement
- final_bp = False
- py_json()
- ref_temp_attr = 'vapor_temp'
- value_attr = 'fraction'
- class adios_db.data_sources.env_canada.v3.parser.ECAdhesion(property_group: str = None, property_name: str = None, value: float = None, min_value: float = None, max_value: float = None, unit_of_measure: str = None, temperature: str = None, condition_of_analysis: str = None, standard_deviation: float = None, replicates: float = None, method: str = None)
Bases:
ECValueOnly
- value_attr = 'adhesion'
- class adios_db.data_sources.env_canada.v3.parser.ECCompound(property_group: str = None, property_name: str = None, value: float = None, min_value: float = None, max_value: float = None, unit_of_measure: str = None, temperature: str = None, condition_of_analysis: str = None, standard_deviation: float = None, replicates: float = None, method: str = None)
Bases:
ECCompoundUngrouped
- py_json()
- class adios_db.data_sources.env_canada.v3.parser.ECCompoundUngrouped(property_group: str = None, property_name: str = None, value: float = None, min_value: float = None, max_value: float = None, unit_of_measure: str = None, temperature: str = None, condition_of_analysis: str = None, standard_deviation: float = None, replicates: float = None, method: str = None)
Bases:
ECMeasurement
- determine_unit_type()
- py_json()
- class adios_db.data_sources.env_canada.v3.parser.ECDensity(property_group: str = None, property_name: str = None, value: float = None, min_value: float = None, max_value: float = None, unit_of_measure: str = None, temperature: str = None, condition_of_analysis: str = None, standard_deviation: float = None, replicates: float = None, method: str = None)
Bases:
ECMeasurement
- value_attr = 'density'
- class adios_db.data_sources.env_canada.v3.parser.ECDispersibility(property_group: str = None, property_name: str = None, value: float = None, min_value: float = None, max_value: float = None, unit_of_measure: str = None, temperature: str = None, condition_of_analysis: str = None, standard_deviation: float = None, replicates: float = None, method: str = None)
Bases:
ECValueOnly
- py_json()
- value_attr = 'effectiveness'
- class adios_db.data_sources.env_canada.v3.parser.ECEmulsion(property_group: str = None, property_name: str = None, value: float = None, min_value: float = None, max_value: float = None, unit_of_measure: str = None, temperature: str = None, condition_of_analysis: str = None, standard_deviation: float = None, replicates: float = None, method: str = None)
Bases:
ECMeasurement
- py_json()
- class adios_db.data_sources.env_canada.v3.parser.ECEvaporationEq(property_group: str = None, property_name: str = None, value: float = None, min_value: float = None, max_value: float = None, unit_of_measure: str = None, temperature: str = None, condition_of_analysis: str = None, standard_deviation: float = None, replicates: float = None, method: str = None)
Bases:
ECMeasurement
- py_json()
- class adios_db.data_sources.env_canada.v3.parser.ECInterfacialTension(property_group: str = None, property_name: str = None, value: float = None, min_value: float = None, max_value: float = None, unit_of_measure: str = None, temperature: str = None, condition_of_analysis: str = None, standard_deviation: float = None, replicates: float = None, method: str = None)
Bases:
ECMeasurement
- py_json()
What do we do when the value is ‘Too Viscous’?
- value_attr = 'tension'
- class adios_db.data_sources.env_canada.v3.parser.ECMeasurement(property_group: str = None, property_name: str = None, value: float = None, min_value: float = None, max_value: float = None, unit_of_measure: str = None, temperature: str = None, condition_of_analysis: str = None, standard_deviation: float = None, replicates: float = None, method: str = None)
Bases:
ECMeasurementDataclass
- classmethod from_obj(obj)
- py_json()
- ref_temp_attr = 'ref_temp'
- value_attr = 'measurement'
- class adios_db.data_sources.env_canada.v3.parser.ECMeasurementDataclass(property_group: str = None, property_name: str = None, value: float = None, min_value: float = None, max_value: float = None, unit_of_measure: str = None, temperature: str = None, condition_of_analysis: str = None, standard_deviation: float = None, replicates: float = None, method: str = None)
Bases:
object
An incoming density will have the attributes:
value
unit_of_measure
temperature
condition_of_analysis
standard_deviation
replicates
method
We will output an object with the attributes:
measurement (Measurement type)
method
ref_temp (Temperature Measurement type)
- condition_of_analysis: str = None
- determine_unit_type()
- fix_unit()
Some units are in the form ‘X or Y’. We will just choose the first one.
Temperature units (e.g. ‘°C’) need to be stripped of the degree character
Illegible unicode for ‘^2’ needs to be corrected.
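The three rules above can be sketched as a small standalone function. This is only an illustration, not the method's actual code: the name fix_unit_text is hypothetical, and since the garbled ‘^2’ sequence is not shown here, the superscript-two character is used as an assumed stand-in.

```python
def fix_unit_text(unit):
    """Normalize a raw unit string (hypothetical helper, for illustration)."""
    if unit is None:
        return None

    # 'X or Y' -> keep only the first alternative
    unit = unit.split(' or ')[0].strip()

    # '°C' -> 'C': strip the degree character
    unit = unit.replace('\u00b0', '')

    # Assumption: the damaged exponent shows up as a superscript two.
    # The real data may need a different source sequence here.
    unit = unit.replace('\u00b2', '^2')

    return unit
```

Returning None for a missing unit keeps the sketch safe for measurements that carry no unit at all.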
- fix_value_if_min_max()
There are cases where our self.value contains a string representing a min-max interval. If this is the case, fix it by splitting it into self.min_value and self.max_value.
The datasheet represents interval data in many different ways, so here are the cases I have found:
Case ‘N-N’: Split on the ‘-’. This makes it ambiguous whether the second number is negative or not, but we accept the data as it comes in from the data sheet.
Case ‘N-N-N’: Split on the ‘-’. Take the first two items.
Case ‘N - N’: Split on the ‘-’. Ignore the spaces.
Case ‘N to N’: Split on the ‘to’. Ignore the spaces.
Case ‘N, N’: Split on the ‘,’. Ignore the spaces.
Case ‘-N’: Single negative number, don’t do anything.
Case ‘>N’: min_value = N, max_value = None
Case ‘<N’: min_value = None, max_value = N
Case ‘N (min.)’: min_value = N
Case ‘N (max.)’: max_value = N
Case ‘N1 (min.), N2 (max.)’: Split on the ‘,’. min_value = N1, max_value = N2
Case ‘min N1, max N2’: Split on the ‘,’. min_value = N1, max_value = N2
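To make the cases concrete, here is a rough sketch that maps each documented form onto a (min_value, max_value) pair. It is an independent illustration and not the method's implementation: parse_interval and NUM are hypothetical names, it returns a tuple instead of setting attributes, and it uses regular expressions where the real code may split strings differently.

```python
import re

# A signed integer or decimal number (assumed number format).
NUM = r'[-+]?\d+(?:\.\d+)?'


def parse_interval(text):
    """Return (min_value, max_value), or None if text is not an interval."""
    text = text.strip()

    # Cases '>N' and '<N'
    m = re.fullmatch(rf'([<>])\s*({NUM})', text)
    if m:
        n = float(m.group(2))
        return (n, None) if m.group(1) == '>' else (None, n)

    # Cases 'N (min.)', 'N (max.)', 'N1 (min.), N2 (max.)', 'min N1, max N2'
    if 'min' in text or 'max' in text:
        lo = hi = None
        for part in text.split(','):
            number = re.search(NUM, part)
            if number is None:
                continue
            if 'min' in part:
                lo = float(number.group())
            elif 'max' in part:
                hi = float(number.group())
        return lo, hi

    # Cases 'N-N', 'N-N-N', 'N - N', 'N to N', 'N, N'.
    # Only the first two numbers are kept; a lone '-N' does not match.
    m = re.fullmatch(
        rf'({NUM})\s*(?:-|to|,)\s*({NUM})(?:\s*-\s*{NUM})*', text
    )
    if m:
        return float(m.group(1)), float(m.group(2))

    return None
```

A single value such as ‘-5’ yields None, which corresponds to the “don’t do anything” case.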
- max_value: float = None
- method: str = None
- min_value: float = None
- parse_temperature_string()
The temperature field can have varying content, like ‘15 °C’ or simply ‘15’, in which case we will assume it is Celsius.
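A minimal sketch of that behaviour, assuming the field holds a number optionally followed by a degree sign and a unit letter. The function name, the tuple return shape and the default_unit parameter are hypothetical and not part of the class.

```python
import re


def parse_temperature(text, default_unit='C'):
    """Split '15 °C' or '15' into (value, unit); illustration only."""
    m = re.fullmatch(
        r'\s*([-+]?\d+(?:\.\d+)?)\s*°?\s*([A-Za-z]*)\s*', text
    )
    if m is None:
        # Not a recognizable temperature string
        return None

    value = float(m.group(1))
    # A bare number carries no unit, so fall back to Celsius
    unit = m.group(2) or default_unit
    return value, unit
```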
- property_group: str = None
- property_name: str = None
- replicates: float = None
- set_ranged_values(min_value, max_value, *args)
Each value passed in is part of a min/max interval.
There are also a few cases where the min/max quality of the value is annotated with a ‘ (min.)’ or a ‘ (max.)’ suffix.
- set_value(value)
- split_value_on_nospace_dashes()
Cases like ‘N-N’ and ‘N-N-N’ are problematic when parsed by regex because the ‘-’ character is also used as a sign indicator for the numbers.
A possible algorithm is to split the string on all dashes, and the empty items in our resulting list will indicate a dash was used as a sign character.
Ex. ‘5-4’ -> [‘5’, ‘4’]
Ex. ‘-5--4’ -> [‘’, ‘5’, ‘’, ‘4’], where each empty item marks a dash that was used as a sign.
Note: We will not try to turn these items into float values here, but we would like them to be numeric. Otherwise, we don’t have an interval.
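The splitting idea can be sketched as follows. The function name and the choice to return signed strings are assumptions made for illustration; the method itself may keep the raw split result.

```python
def split_on_dashes(text):
    """Split 'N-N' style text into signed number strings (illustrative)."""
    items = []
    negative = False

    for part in text.split('-'):
        part = part.strip()
        if part == '':
            # An empty item means the dash was a sign for the next number
            negative = True
            continue

        items.append('-' + part if negative else part)
        negative = False

    return items
```

As in the note, the items stay strings; checking that they are numeric is left to the caller.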
- standard_deviation: float = None
- temperature: str = None
- treat_any_bad_initial_values()
- unit_of_measure: str = None
- value: float = None
- class adios_db.data_sources.env_canada.v3.parser.ECValueOnly(property_group: str = None, property_name: str = None, value: float = None, min_value: float = None, max_value: float = None, unit_of_measure: str = None, temperature: str = None, condition_of_analysis: str = None, standard_deviation: float = None, replicates: float = None, method: str = None)
Bases:
ECMeasurement
- py_json()
- class adios_db.data_sources.env_canada.v3.parser.ECViscosity(property_group: str = None, property_name: str = None, value: float = None, min_value: float = None, max_value: float = None, unit_of_measure: str = None, temperature: str = None, condition_of_analysis: str = None, standard_deviation: float = None, replicates: float = None, method: str = None)
Bases:
ECMeasurement
- py_json()
What do we do when the value is ‘Too Viscous’?
- value_attr = 'viscosity'
- class adios_db.data_sources.env_canada.v3.parser.EnvCanadaCsvRecordParser1999(values)
Bases:
ParserBase
A record class for the Env. Canada .csv flat data file. This is intended to be used with a set of data representing a single oil record from the data file. This set is in the form of a list containing dict objects, each representing a single measurement for the oil we are processing.
There are a number of reference fields, i.e. fields that associate a particular measurement to an oil. They are:
oil_id: ID of an oil record. This appears to be the camelcase name of the oil joined by an underscore with the ECCC oil ID. There is one common value per oil, but there are redundant copies of this field in every measurement.
oil_index: ID of an oil record with one or more sub-samples. Similar to the ests field in the other EC data set. There is one common value per oil, but there are redundant copies of this field in every measurement.
There are also a number of fields that would not normally be used to link a measurement to an oil, but are clearly oil general properties. There is usually one actual field value per oil, but there are redundant copies in every measurement. Sometimes though, there are multiple names that show up in the measurements for an oil. Biodiesel records are an example of this.
oil_name
referenceID: A code to be used to look up the reference title in the EC Reference document “References-Catalogue_of_Crude_Oil_and_Oil_Product_Properties_(1999)-Revised_2022_En_and_Fr.pdf”
sample_reference: The name of the reference. This almost always matches the lookup value correlated with the referenceID, so there is some redundancy here.
reference: Would have expected something like the title of a document here, but mostly this contains a ‘N/A’ value. The fields that are actually filled in contain the same referenceID code.
comments:
origin: The location where the oil originates.
synonyms
There are a number of fields that would intuitively seem to associate a measurement with a sub-sample. There is usually one common value per sub-sample, but there are redundant copies in every measurement.
index_id: ID of an oil sample. This appears to be the concatenation of the oil_index and a sample number separated by a period “.”.
weathering_fraction: This appears to be a percentage value in the range 0-100.
grade
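If index_id really is oil_index plus a sample number joined by a period, the two parts could be recovered as sketched below. This is only an illustration: split_index_id is a hypothetical helper, the IDs in the usage note are invented, and the sketch assumes the value is read as a string so that trailing digits are not lost to float conversion.

```python
def split_index_id(index_id):
    """Split 'oil_index.sample_number' into its two parts (illustrative)."""
    text = str(index_id)

    # Split on the last period only, in case oil_index itself contains one
    oil_index, sep, sample_number = text.rpartition('.')
    if not sep:
        # No period: treat the whole value as the oil_index
        return text, None

    return oil_index, sample_number
```

For example, split_index_id('2713.1') gives ('2713', '1').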
And finally, we have a set of fields that are used uniquely for the measurement
property_name
property_group
property_id
value
value_id
unit_of_measure
temperature
condition_of_analysis
note
- property API
API Gravity needs to be stored as an oil property, but it is in fact a sub-sample scoped property. So we need to figure out the fresh sample ID and get that specific API gravity property.
Note: API for Biodiesels shows a weathering value of ‘None’, but clearly it is the “fresh sample”. We need to allow it.
- property fresh_sample_id
- get_subsample(sample_id)
- oil_common_props = ('oil_name', 'oil_index', 'synonyms', 'origin', 'referenceID', 'comments')
- property product_type
- prune_incoming(values)
The Incoming objects contain some unwanted garbage from the spreadsheet that would be better handled before we start parsing anything.
- sample_id_field_name = 'index_id'
- property sample_ids
This function relies on dict having keys ordered by the sequence of insertion into the dict. This is an implementation detail of CPython 3.6 and a language guarantee from Python 3.7 onward.
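The ordering dependency can be illustrated with dict.fromkeys, which keeps the first occurrence of each key in insertion order. This is a sketch of the general technique and not necessarily how the property is written; the helper name and the sample IDs are made up.

```python
def unique_in_order(values):
    """Drop repeats while keeping first-seen order (relies on dict order)."""
    return list(dict.fromkeys(values))


# Hypothetical measurements, each carrying a redundant sample ID
records = [
    {'index_id': '2713.1'},
    {'index_id': '2713.1'},
    {'index_id': '2713.2'},
    {'index_id': '2713.1'},
]

sample_ids = unique_in_order(r['index_id'] for r in records)
```

A plain set would also remove the repeats, but it would not preserve the order in which the sub-samples appear in the file.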
- set_aggregate_oil_property(attr)
Oil scoped properties are redundantly stored in each measurement object in our list, so they need to be accumulated and treated in some way depending on the type of data we would like to set in the model.
Attributes to be treated as strings will have their values accumulated in a unique set to prune the redundant information, and then the unique strings in the set will be concatenated into a single string.
Attributes to be treated as integers will also be accumulated in a unique set to prune the redundant values. But multiple ints cannot be combined into a single int the way strings can be concatenated, so we issue a warning and then use the first one in the set. This isn’t perfect, but there are only a handful of oil scoped attributes and we can make an exception if there is an obvious problem.
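A rough sketch of this aggregation, written as a free function over a list of dicts. The name aggregate_property, the ‘, ’ separator and the warning text are assumptions. The sketch also collects unique values with dict.fromkeys instead of a set so that “the first one” is well defined.

```python
import warnings


def aggregate_property(records, attr):
    """Reduce the redundant copies of attr to one value (illustrative)."""
    # Unique, non-missing values in first-seen order
    unique = list(dict.fromkeys(
        r[attr] for r in records if r.get(attr) is not None
    ))

    if not unique:
        return None

    if all(isinstance(v, int) for v in unique):
        if len(unique) > 1:
            warnings.warn(
                f'multiple values for {attr!r}: {unique}; using the first'
            )
        return unique[0]

    # Strings: join the unique values into a single string
    return ', '.join(str(v) for v in unique)
```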
- set_aggregate_oil_props()
These are properties commonly associated with an oil.
There is a copy of this information inside every measurement, so we need to reconcile them in order to come up with an aggregate value with which to set the oil properties.
- set_aggregate_subsample_props()
These are properties commonly associated with a sub-sample. There is a copy of this information inside every measurement, so we need to reconcile them to determine the identifying properties of each sub-sample.
Sub-sample properties:
ests_id: One common value per sub-sample. This could be numeric, so we force it to be a string.
weathering_fraction: One value per sub-sample. These values look like some kind of code that EC uses. Probably not useful to us.
weathering_percent: One common value per sub-sample. These values are mostly a string in the format ‘N.N%’. We will convert to a structure suitable for a Measurement type.
weathering_method: One common value per sub-sample. This is information that might be good to save, but it doesn’t fit into the Adios oil model.
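The ‘N.N%’ conversion might look roughly like the sketch below. The returned dict is only a stand-in for whatever structure the Measurement type actually expects, and parse_weathering_percent is a hypothetical name.

```python
def parse_weathering_percent(text):
    """Turn 'N.N%' into a value/unit mapping (illustrative stand-in)."""
    if text is None:
        return None

    text = text.strip()
    if not text.endswith('%'):
        # Not in the expected 'N.N%' form
        return None

    try:
        value = float(text[:-1])
    except ValueError:
        return None

    return {'value': value, 'unit': '%'}
```

Values that do not match the format come back as None, leaving the caller to decide how to handle them.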
- set_measurement_property(obj_in)
Set a single measurement from an incoming measurement object.
Basically we need to decide how to apply the property to our record:
Oil scoped properties are applied to the oil object.
Sample scoped properties are applied to a particular sub-sample determined by the object.
The properties that describe the measurement are:
value_id: This is a concatenation of the ests and property_id fields delimited with underscores ‘_’.
property_id: This is a concatenation of the camel cased property_name and, as far as I can tell, the index value of the sequence in which the property appears.
property_group: This is the name of a group or category with which a set of measurements might be associated.
property_name: The prose name of the property that is measured.
unit_of_measure: The units for which the measurement describes a quantity.
temperature: The temperature at which the measurement was taken.
condition_of_analysis: A reasonably free-form line of text that describes some special condition of the measurement, such as a prerequisite for measurement, a specification on the type of measurement, or its result.
value: A number representing the quantity of the measurement.
standard_deviation: The amount of variation in the set of measurements taken.
replicates: A number representing the quantity of repeated experiments where measurements were taken.
method: A line of text showing the name of the testing method.
- set_measurement_props()
All objects in the incoming list have the primary function of describing a particular measurement of an oil. Here we iterate over these objects.
- value_is_invalid(value)
adios_db.data_sources.env_canada.v3.reader module
- class adios_db.data_sources.env_canada.v3.reader.EnvCanadaCsvFile1999(name, encoding='utf-8', **kwargs)
Bases:
EnvCanadaCsvFile
A file reader for the Env. Canada .csv flat data file referencing the data from the year 1999. This is reasonably similar to the previous data set we received from them, but some of the columns are different and there are some minor differences in the data.
The name of the file is: Catalogue_of_Crude_Oil_and_Oil_Product_Properties_(1999)-Revised_2022_En.csv
The fields are comma separated ‘,’. This may prove to be problematic, as some fields, notably the reference field, could contain commas as well.
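Whether embedded commas are a problem depends on whether the file quotes such fields, which is not stated here. If it does, Python’s csv module keeps them intact, as this small sketch with invented sample data shows. The column subset and values are made up for illustration.

```python
import csv
import io

# Invented two-line sample; the real file has 22 columns
sample = io.StringIO(
    'oil_name,reference,value\n'
    '"Example Oil","Smith, J., 1999",0.85\n'
)

rows = list(csv.DictReader(sample))

# The quoted reference field survives with its commas
reference = rows[0]['reference']

# A naive split on ',' breaks the same line into too many pieces
naive = '"Example Oil","Smith, J., 1999",0.85'.split(',')
```

Here csv.DictReader yields three fields for the row, while the naive split yields five.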
Each row represents a single measurement
There are a number of reference fields, i.e. fields that associate a particular measurement to an oil. They are:
oil_id: ID of an oil record. This appears to be the camelcase name of the oil joined by an underscore with its oil ID.
oil_index: ECCC ID of an oil record with one or more sub-samples. This is similar to the ests field of the other data set.
There are also a number of fields that would not normally be used to link a measurement to an oil, but are clearly oil general properties.
oil_name
referenceID
reference
sample_reference
comments
origin
synonyms
There are a number of fields that would intuitively seem to be used to link a measurement to a sub-sample
index_id: ID of an oil sample. This appears to be the concatenation of the oil_index and a sample number separated by a period “.”.
weathering_fraction: This appears to be a percentage value in the range 0-100.
grade
And finally, we have a set of fields that are used uniquely for the measurement:
property_name
property_group
property_id
value
value_id
unit_of_measure
temperature
condition_of_analysis
note
- number_of_columns = 22
- oil_id_field_name = 'oil_index'