rectorch.data¶
Class list¶
| DataProcessing | Class that manages the pre-processing of raw data sets. |
| DataReader | Utility class for reading pre-processed dataset. |
| DatasetManager | Helper class for handling data sets. |
The data module manages the reading, writing and loading of the data sets.
The supported data set format is standard csv. For more information about the expected data set format please visit Data sets CSV format. The data processing and loading configurations are managed through the configuration files as described in Configuration files format. The data pre-processing phase is highly inspired by the VAE-CF source code, which has lately been used in several other research works.
Examples
This module is mainly meant to be used in the following way:
>>> from rectorch.data import DataProcessing, DatasetManager
>>> dproc = DataProcessing("/path/to/the/config/file")
>>> dproc.process()
>>> man = DatasetManager(dproc.cfg)
class rectorch.data.DataProcessing(data_config)[source]¶
Class that manages the pre-processing of raw data sets.
Data sets are expected to be csv files where each row represents a rating. More details about the allowed format are described in Data sets CSV format. The pre-processing is performed following the parameter settings defined in the data configuration file (see Configuration files format for more information).
- Parameters
  - data_config : rectorch.configuration.DataConfig or str
    Represents the data pre-processing configurations. When type(data_config) == str it is expected to be the path to the data configuration file. In that case a configuration.DataConfig object is contextually created.
- Raises
  - TypeError
    Raised when the type of the input parameter is incorrect.
- Attributes
  - cfg : rectorch.configuration.DataConfig
    The rectorch.configuration.DataConfig object containing the pre-processing configurations.
  - i2id : dict (key - str, value - int)
    Dictionary which maps the raw item id, i.e., as in the raw csv file, to an internal id which is an integer between 0 and the total number of items - 1.
  - u2id : dict (key - str, value - int)
    Dictionary which maps the raw user id, i.e., as in the raw csv file, to an internal id which is an integer between 0 and the total number of users - 1.
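As an alternative to passing the configuration file path directly (as in the Examples section above), a configuration object can be created first. The snippet below is only a sketch: it assumes that rectorch.configuration.DataConfig can be constructed from the path to the configuration file.

>>> from rectorch.configuration import DataConfig
>>> from rectorch.data import DataProcessing
>>> cfg = DataConfig("/path/to/the/config/file")  # assumption: DataConfig takes the config file path
>>> dproc = DataProcessing(cfg)                   # same as DataProcessing("/path/to/the/config/file")
>>> dproc.process()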
process(self)[source]¶
Perform the entire pre-processing.
The pre-processing relies on the configurations provided in the data configuration file. The full pre-processing follows a specific pipeline (the meaning of each configuration parameter is defined in Configuration files format):
1. Reading the CSV file named data_path;
2. Filtering the ratings on the basis of the threshold;
3. Filtering the users and items according to u_min and i_min, respectively;
4. Splitting the users in training, validation and test sets;
5. Splitting the validation and test set user ratings in training and test items according to test_prop;
6. Creating the id mappings (see i2id and u2id);
7. Saving the pre-processed data set files in the proc_path folder.
Warning
In step (4) there is the possibility that users in the validation or test set have less than 2 ratings, making step (5) inconsistent for those users. For this reason, this set of users is simply discarded.
Warning
In step (5) there is the possibility that users in the validation or test set have a number of items which could cause problems in applying the division between training items and test items (e.g., users with 2 ratings and test_prop = 0.1). In these cases, it is always guaranteed that there is at least one item in the test part of the users.
The output consists of a series of files saved in proc_path:
- train.csv : (csv file) the training ratings corresponding to all ratings of the training users;
- validation_tr.csv : (csv file) the training ratings corresponding to the validation users;
- validation_te.csv : (csv file) the test ratings corresponding to the validation users;
- test_tr.csv : (csv file) the training ratings corresponding to the test users;
- test_te.csv : (csv file) the test ratings corresponding to the test users;
- unique_uid.txt : (txt file) with the user id mapping. Line numbers represent the internal id, while the string on the corresponding line is the raw id;
- unique_iid.txt : (txt file) with the item id mapping. Line numbers represent the internal id, while the string on the corresponding line is the raw id.
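Once process() has finished, the produced files can be inspected directly. The following is a minimal sketch, assuming pandas is installed and that /path/to/proc_path is a placeholder for the proc_path folder set in the configuration:

>>> import pandas as pd
>>> proc_path = "/path/to/proc_path"                 # placeholder for the proc_path folder
>>> train = pd.read_csv(proc_path + "/train.csv")    # training ratings expressed with internal ids
>>> with open(proc_path + "/unique_iid.txt") as f:
...     raw_iids = [line.strip() for line in f]
>>> raw_iids[0]                # raw id of the item with internal id 0
>>> dproc.i2id[raw_iids[0]]    # i2id maps it back to the internal id 0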
class rectorch.data.DataReader(data_config)[source]¶
Utility class for reading pre-processed data sets.
The reader assumes that the data set has been previously pre-processed using DataProcessing.process(). To avoid malfunctioning, the same configuration file used for the pre-processing should be used to load the data set. Once a reader is created, it is possible to load the training, validation and test sets using load_data().
- Parameters
  - data_config : rectorch.configuration.DataConfig or str
    Represents the data pre-processing configurations. When type(data_config) == str it is expected to be the path to the data configuration file. In that case a DataConfig object is contextually created.
- Raises
  - TypeError
    Raised when data_config is neither a str nor a rectorch.configuration.DataConfig.
- Attributes
  - cfg : rectorch.configuration.DataConfig
    Object containing the loading configurations.
  - n_items : int
    The number of items in the data set.
load_data(self, datatype='train')[source]¶
Load (part of) the pre-processed data set.
Load the data set, or part of it, from the pre-processed files according to datatype.
- Parameters
  - datatype : str in {'train', 'validation', 'test', 'full'} [optional]
    String representing the type of data that has to be loaded, by default 'train'. When datatype is equal to 'full' the entire data set is loaded into a sparse matrix.
- Returns
  - scipy.sparse.csr_matrix or tuple of scipy.sparse.csr_matrix
    The data set or part of it. When datatype is 'full' or 'train' a single sparse matrix is returned, representing the full data set or the training set, respectively. If datatype is 'validation' or 'test' a pair of sparse matrices is returned: the first matrix is the training part (i.e., for each user its training set of items), and the second matrix is the test part (i.e., for each user its test set of items).
- Raises
  - ValueError
    Raised when datatype does not match any of the valid strings.
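A minimal usage sketch (the configuration path is a placeholder and should be the same file used for the pre-processing):

>>> from rectorch.data import DataReader
>>> reader = DataReader("/path/to/the/config/file")
>>> train = reader.load_data('train')                  # single csr_matrix
>>> val_tr, val_te = reader.load_data('validation')    # pair of csr_matrix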
load_data_as_dict(self, datatype='train', col='timestamp')[source]¶
Load the data as a dictionary.
The loaded dictionary has users as keys and lists of items as values. An entry of the dictionary represents the list of items rated (sorted by col) by the user, i.e., the key.
- Returns
  - dict (key - int, value - list of int) or tuple of dict
    When datatype is 'train' a single dictionary is returned, representing the training set. If datatype is 'validation' or 'test' a pair of dictionaries is returned: the first dictionary is the training part (i.e., for each user its training set of items), and the second dictionary is the test part (i.e., for each user its test set of items).
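Continuing the sketch above, the same reader can return the data in dictionary form:

>>> train_dict = reader.load_data_as_dict('train')   # {user_id: [item_id, ...]}
>>> val_tr_dict, val_te_dict = reader.load_data_as_dict('validation', col='timestamp')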
class rectorch.data.DatasetManager(config_file)[source]¶
Helper class for handling data sets.
Given the configuration file, DatasetManager automatically loads the training, validation, and test sets, which are accessible through its attributes. It also gives the possibility of loading the data set into only a training and a test set. In this latter case, training, validation and the training part of the test set are merged together to form a bigger training set. The test set will then consist only of the test part of the test set.
- Parameters
  - config_file : rectorch.configuration.DataConfig or str
    Represents the data pre-processing configurations. When type(config_file) == str it is expected to be the path to the data configuration file. In that case a DataConfig object is contextually created.
- Attributes
  - n_items : int
    Number of items in the data set.
  - training_set : tuple of scipy.sparse.csr_matrix
    The first matrix is the sparse training set matrix, while the second element of the tuple is None.
  - validation_set : tuple of scipy.sparse.csr_matrix
    The first matrix is the training part of the validation set (i.e., for each user its training set of items), and the second matrix is the test part of the validation set (i.e., for each user its test set of items).
  - test_set : tuple of scipy.sparse.csr_matrix
    The first matrix is the training part of the test set (i.e., for each user its training set of items), and the second matrix is the test part of the test set (i.e., for each user its test set of items).
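A short usage sketch (the configuration path is a placeholder), mirroring the Examples section at the top of this page:

>>> from rectorch.data import DatasetManager
>>> man = DatasetManager("/path/to/the/config/file")
>>> train, _ = man.training_set          # second element of the tuple is None
>>> val_tr, val_te = man.validation_set  # training and test parts of the validation users
>>> test_tr, test_te = man.test_set      # training and test parts of the test users
>>> man.n_items                          # number of items in the data set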
get_train_and_test(self)[source]¶
Return a training and a test set.
Load the data set into only a training and a test set. Training, validation and the training part of the test set are merged together to form a bigger training set. The test set will consist only of the test part of the test set. The training part of the test users corresponds to the last t rows of the training matrix, where t is the number of test users.
- Returns
  - tuple of scipy.sparse.csr_matrix
    The first matrix is the training set, the second one is the test set.
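Continuing the sketch above, the merged training/test split is obtained with a single call:

>>> train, test = man.get_train_and_test()
>>> train   # csr_matrix: training, validation and the training part of the test users
>>> test    # csr_matrix: test part of the test users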