rectorch.data

Class list

rectorch.data.DataProcessing(data_config)

Class that manages the pre-processing of raw data sets.

rectorch.data.DataReader(data_config)

Utility class for reading pre-processed data sets.

rectorch.data.DatasetManager(config_file)

Helper class for handling data sets.

The data module manages the reading, writing and loading of the data sets.

The supported data set format is standard csv. For more information about the expected data set format please visit Data sets CSV format. The data processing and loading configurations are managed through the configuration files as described in Configuration files format. The data pre-processing phase is highly inspired by the VAE-CF source code, which has later been used in several other research works.

Examples

This module is mainly meant to be used in the following way:

>>> from rectorch.data import DataProcessing, DatasetManager
>>> dproc = DataProcessing("/path/to/the/config/file")
>>> dproc.process()
>>> man = DatasetManager(dproc.cfg)

See also

Research paper: Variational Autoencoders for Collaborative Filtering

Module: configuration

class rectorch.data.DataProcessing(data_config)[source]

Class that manages the pre-processing of raw data sets.

Data sets are expected of being csv files where each row represents a rating. More details about the allowed format are described in Data sets CSV format. The pre-processing is performed following the parameters settings defined in the data configuration file (see Configuration files format for more information).

Parameters
  • data_config : rectorch.configuration.DataConfig or str
    • Represents the data pre-processing configurations. When type(data_config) == str, it is expected to be the path to the data configuration file. In that case, a configuration.DataConfig object is contextually created.

    Raises
  • TypeError
    • Raised when the type of the input parameter is incorrect.

    Attributes
  • cfg : rectorch.configuration.DataConfig
    • Object containing the pre-processing configurations.

  • i2id : dict (key - str, value - int)
    • Dictionary which maps the raw item id, i.e., as in the raw csv file, to an internal id, which is an integer between 0 and the total number of items - 1.

  • u2id : dict (key - str, value - int)
    • Dictionary which maps the raw user id, i.e., as in the raw csv file, to an internal id, which is an integer between 0 and the total number of users - 1.
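For intuition, each mapping can be thought of as an enumeration of the unique raw ids. The following minimal sketch uses hypothetical raw ids and ordering; the actual order depends on the pre-processing:

```python
# Hypothetical raw item ids, in the order they receive internal ids
raw_item_ids = ["item_42", "item_7", "item_99"]

# i2id maps each raw id to an integer in [0, number of items - 1]
i2id = {raw: internal for internal, raw in enumerate(raw_item_ids)}
# i2id == {"item_42": 0, "item_7": 1, "item_99": 2}
```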

    process(self)[source]

    Perform the entire pre-processing.

    The pre-processing relies on the configurations provided in the data configuration file. The full pre-processing follows a specific pipeline (the meaning of each configuration parameter is defined in Configuration files format):

    1. Reading the CSV file named data_path;

    2. Filtering the ratings on the basis of the threshold;

    3. Filtering the users and items according to u_min and i_min, respectively;

    4. Splitting the users in training, validation and test sets;

    5. Splitting the validation and test set user ratings in training and test items according to test_prop;

    6. Creating the id mappings (see i2id and u2id);

    7. Saving the pre-processed data set files in proc_path folder.
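Steps (2) and (3) can be sketched in plain Python as follows. The (user, item, rating) triplets and configuration values are illustrative assumptions; the actual implementation works on the configured csv file and may apply the count filters iteratively:

```python
from collections import Counter

# Illustrative (user, item, rating) triplets, standing in for csv rows
ratings = [("u1", "i1", 5.0), ("u1", "i2", 2.0),
           ("u2", "i1", 4.0), ("u3", "i3", 1.0)]
threshold, u_min, i_min = 3.0, 1, 2  # hypothetical configuration values

# Step (2): keep only ratings at or above the threshold
kept = [(u, i, r) for u, i, r in ratings if r >= threshold]

# Step (3): keep users/items with at least u_min/i_min ratings
u_cnt = Counter(u for u, _, _ in kept)
i_cnt = Counter(i for _, i, _ in kept)
kept = [(u, i, r) for u, i, r in kept
        if u_cnt[u] >= u_min and i_cnt[i] >= i_min]
# kept == [("u1", "i1", 5.0), ("u2", "i1", 4.0)]
```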

    Warning

    In step (4) there is the possibility that users in the validation or test set have less than 2 ratings, making step (5) inconsistent for those users. For this reason, this set of users is simply discarded.

    Warning

    In step (5) there is the possibility that users in the validation or test set have a number of items which could cause problems in applying the division between training items and test items (e.g., users with 2 ratings and test_prop = 0.1). In these cases, it is always guaranteed that there is at least one item in the test part of the users.
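The guarantee described above can be pictured with a toy split function. This is a sketch under the assumption that at least one item is always reserved for the test part; the real splitting logic may differ (e.g., it may sample items instead of slicing):

```python
import math

def split_user_items(items, test_prop):
    """Split a user's items, always reserving at least one test item."""
    n_test = max(1, math.floor(len(items) * test_prop))
    return items[:-n_test], items[-n_test:]

# Even a user with only 2 ratings and test_prop = 0.1 gets a test item
tr, te = split_user_items(["i1", "i2"], test_prop=0.1)
# tr == ["i1"], te == ["i2"]: the test part is never empty
```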

    The output consists of a series of files saved in proc_path:

    • train.csv : (csv file) the training ratings corresponding to all ratings of the training users;

    • validation_tr.csv : (csv file) the training ratings corresponding to the validation users;

    • validation_te.csv : (csv file) the test ratings corresponding to the validation users;

    • test_tr.csv : (csv file) the training ratings corresponding to the test users;

    • test_te.csv : (csv file) the test ratings corresponding to the test users;

    • unique_uid.txt : (txt file) the user id mapping. Line numbers represent the internal id, while the string on the corresponding line is the raw id;

    • unique_iid.txt : (txt file) the item id mapping. Line numbers represent the internal id, while the string on the corresponding line is the raw id.

    class rectorch.data.DataReader(data_config)[source]

    Utility class for reading pre-processed data sets.

    The reader assumes that the data set has been previously pre-processed using DataProcessing.process(). To avoid malfunctioning, the same configuration file used for the pre-processing should be used to load the data set. Once a reader is created, it is possible to load the training, validation and test sets using load_data().

    Parameters
  • data_config : rectorch.configuration.DataConfig or str
    • Represents the data pre-processing configurations. When type(data_config) == str, it is expected to be the path to the data configuration file. In that case, a DataConfig object is contextually created.

    Raises
  • TypeError
    • Raised when the type of the input parameter is incorrect.

    Attributes
  • cfg : rectorch.configuration.DataConfig
    • Object containing the loading configurations.

  • n_items : int
    • The number of items in the data set.

    load_data(self, datatype='train')[source]

    Load (part of) the pre-processed data set.

    Load the data set, or part of it, from the pre-processed files, according to datatype.

    Parameters
  • datatype : str in {'train', 'validation', 'test', 'full'} [optional]
    • String representing the type of data that has to be loaded, by default 'train'. When datatype is 'full', the entire data set is loaded into a single sparse matrix.

    Returns
  • scipy.sparse.csr_matrix or tuple of scipy.sparse.csr_matrix
    • The data set or part of it. When datatype is 'full' or 'train', a single sparse matrix is returned, representing the full data set or the training set, respectively. If datatype is 'validation' or 'test', a pair of sparse matrices is returned: the first matrix is the training part (i.e., for each user its training set of items), and the second matrix is the test part (i.e., for each user its test set of items).

    Raises
  • ValueError
    • Raised when datatype does not match any of the valid strings.
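The return convention can be summarized with a small stub. This is a sketch of the return arity only, with placeholder values; the real method returns scipy.sparse.csr_matrix objects built from the pre-processed csv files:

```python
def load_data_stub(datatype="train"):
    """Mimic load_data's return arity (placeholder values, not real data)."""
    if datatype not in {"train", "validation", "test", "full"}:
        raise ValueError("Invalid datatype: %s" % datatype)
    if datatype in ("train", "full"):
        return "X"            # one csr_matrix in the real method
    return ("X_tr", "X_te")   # pair of csr_matrix in the real method
```

For example, `load_data_stub("validation")` returns a pair, while `load_data_stub("train")` returns a single object.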

    load_data_as_dict(self, datatype='train', col='timestamp')[source]

    Load the data as a dictionary.

    The loaded dictionary has users as keys and lists of items as values. Each entry of the dictionary represents the list of items rated by the user (i.e., the key), sorted by col.

    Parameters
  • datatype : str in {'train', 'validation', 'test'} [optional]
    • String representing the type of data that has to be loaded, by default 'train'.

  • col : str or None [optional]
    • The name of the column on which items are ordered, by default "timestamp". If None, no ordering is applied.

    Returns
  • dict (key - int, value - list of int) or tuple of dict
    • When datatype is 'train', a single dictionary is returned, representing the training set. If datatype is 'validation' or 'test', a pair of dictionaries is returned: the first dictionary is the training part (i.e., for each user its training set of items), and the second dictionary is the test part (i.e., for each user its test set of items).
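For intuition, the per-user ordering by col can be sketched in plain Python. The rows and column values below are illustrative; the real method reads them from the pre-processed csv files:

```python
from collections import defaultdict

# Illustrative (user, item, timestamp) rows
rows = [(0, 10, 300), (0, 11, 100), (1, 12, 200)]

# Group items per user, ordered by the col value (here, the timestamp)
by_user = defaultdict(list)
for user, item, _ in sorted(rows, key=lambda r: r[2]):
    by_user[user].append(item)
# dict(by_user) == {0: [11, 10], 1: [12]}
```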

    class rectorch.data.DatasetManager(config_file)[source]

    Helper class for handling data sets.

    Given the configuration file, DatasetManager automatically loads the training, validation, and test sets, which are accessible through its attributes. It also offers the possibility of loading the data set into only a training and a test set. In this latter case, the training set, the validation set and the training part of the test set are merged together to form a bigger training set; the test set then consists only of the test part of the original test set.

    Parameters
  • config_file : rectorch.configuration.DataConfig or str
    • Represents the data pre-processing configurations. When type(config_file) == str, it is expected to be the path to the data configuration file. In that case, a DataConfig object is contextually created.

    Attributes
  • n_items : int
    • Number of items in the data set.

  • training_set : tuple of scipy.sparse.csr_matrix
    • The first element is the sparse training set matrix, while the second element of the tuple is None.

  • validation_set : tuple of scipy.sparse.csr_matrix
    • The first matrix is the training part of the validation set (i.e., for each user its training set of items), and the second matrix is the test part of the validation set (i.e., for each user its test set of items).

  • test_set : tuple of scipy.sparse.csr_matrix
    • The first matrix is the training part of the test set (i.e., for each user its training set of items), and the second matrix is the test part of the test set (i.e., for each user its test set of items).

    get_train_and_test(self)[source]

    Return a training and a test set.

    Load the data set into only a training and a test set. The training set, the validation set and the training part of the test set are merged together to form a bigger training set; the test set consists only of the test part of the original test set. The training ratings of the test users are the last t rows of the training matrix, where t is the number of test users.

    Returns
  • tuple of scipy.sparse.csr_matrix
    • The first matrix is the training set, the second one is the test set.
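The merge can be pictured with plain lists standing in for user rows of the sparse matrices. This is a sketch of the stacking order only; the real implementation stacks scipy.sparse matrices, and the exact composition follows the description above:

```python
# Each inner list stands in for one user's row of the rating matrix
train_rows = [[1, 0, 0], [0, 1, 0]]          # training users
val_tr, val_te = [[1, 1, 0]], [[0, 0, 1]]    # validation users (two halves)
test_tr, test_te = [[1, 0, 1]], [[0, 1, 1]]  # test users (two halves)

# Merge: training set + validation set + training part of the test users;
# the test users' training rows come last (the last t rows).
big_train = train_rows + val_tr + val_te + test_tr
test_set = test_te
```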