save
Module save
A module containing the Saver class, used for storing DataFrames with molecules on disk.
- class npfc.save.SafeHDF5Store(*args, **kwargs)
Implements a safe HDFStore by obtaining a file lock. Concurrent writes queue until the lock can be obtained.
Adapted from: https://stackoverflow.com/questions/41231678/obtaining-a-exclusive-lock-when-writing-to-an-hdf5-file
Initialize and obtain file lock.
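The locking idea described above can be sketched with the standard library alone. This is a minimal illustration of an exclusive lock file, not the actual SafeHDF5Store code: the helper name `exclusive_lock`, the `.lock` suffix and the `probe_interval` parameter are all assumptions made for the example.

```python
import os
import time
from contextlib import contextmanager


@contextmanager
def exclusive_lock(target_path, probe_interval=0.1):
    """Hold a lock file next to target_path while the with-block runs (hypothetical helper)."""
    lock_path = target_path + ".lock"  # assumed naming convention
    while True:
        try:
            # O_EXCL makes the creation fail if the lock file already exists
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            # another writer holds the lock: wait, then try again
            time.sleep(probe_interval)
    try:
        yield lock_path
    finally:
        # release the lock so that queued writers can proceed
        os.close(fd)
        os.remove(lock_path)
```

A writer would wrap its HDF5 write in `with exclusive_lock("data.hdf"):` so that only one process touches the file at a time. A stale lock file left by a crashed process would block every later writer in this sketch; handling that case is left out.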
- npfc.save.chunk(df, chunk_name_template, chunk_size, shuffle=False, random_seed=None, col_mol='mol', col_idm='idm', csv_sep='|', encode=True)
Save an input DataFrame as several chunks written to disk. The whole input must first be loaded into a single DataFrame.
- Parameters
  - df (DataFrame) – input DataFrame
  - chunk_name_template (str) – path of the output file as if there were only one (e.g. dir/file.csv). It is modified to add chunk IDs (e.g. dir/file_001.csv, dir/file_002.csv, etc.).
  - chunk_size (int) – number of records in each chunk (the last chunk might contain fewer)
  - shuffle (bool) – shuffle the records before splitting into chunks
  - random_seed (Optional[int]) – random seed to use for shuffling records
  - col_mol (str) – for SDF format only, column with the RDKit Mol objects
  - col_idm (str) – for SDF format only, column with the molecule ids; if not None, the id is saved both as a property and as the molecule title
  - csv_sep (str) – for CSV format only, delimiter to use
  - encode (bool) – encode RDKit Mol and other predefined objects into base64 strings
- Return type
- Returns
a list of tuples containing each chunk name and its number of records
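The naming scheme and the shape of the return value can be illustrated without touching any data. The sketch below is an assumption-laden illustration, not the npfc implementation: `plan_chunks` is a hypothetical helper, and the three-digit zero padding is inferred only from the `file_001.csv` example above.

```python
from pathlib import Path


def plan_chunks(chunk_name_template, n_records, chunk_size):
    """Return (chunk name, number of records) pairs for a hypothetical chunking run."""
    template = Path(chunk_name_template)
    plan = []
    for chunk_id, start in enumerate(range(0, n_records, chunk_size), start=1):
        # the last chunk holds whatever is left over
        n_in_chunk = min(chunk_size, n_records - start)
        # insert the chunk ID between the file stem and its extension
        name = template.with_name(f"{template.stem}_{chunk_id:03d}{template.suffix}")
        plan.append((name.as_posix(), n_in_chunk))
    return plan
```

For example, `plan_chunks("dir/file.csv", 5, 2)` gives two chunks of 2 records and a final chunk of 1 record, named `dir/file_001.csv` to `dir/file_003.csv`.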
- npfc.save.chunk_sdf(input_sdf, output_dir, chunk_size=None, prefix=None, keep_uncompressed=False)
Split an input SDF file into SDF chunks using memory-efficient line by line text parsing, suitable for large files. Molecules are not parsed, no change is made to the molblocks.
- Parameters
  - input_sdf (str) – input SDF
  - output_dir (str) –
  - chunk_name_template – path of the output file as if there were only one (e.g. dir/file.csv). It is modified to add chunk IDs (e.g. dir/file_001.csv, dir/file_002.csv, etc.).
  - chunk_size (Optional[int]) – number of records in each chunk (the last chunk might contain fewer)
  - prefix (Optional[str]) – prefix to use for chunks. If left as None, the input SDF filename is used.
  - keep_uncompressed (bool) – in case of gzip input, keep the uncompressed file instead of deleting it as a temp file
- Return type
- Returns
a list of tuples containing each chunk name and its number of records
TODO: support for gzip outputs
TODO: support for gzip input
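Line-by-line splitting of an SDF relies on the `$$$$` line that terminates each record, so no molecule has to be parsed. The following is a rough sketch of that technique under stated assumptions, not the code of `chunk_sdf`: the function name `split_sdf_sketch`, the default prefix and the `_001.sdf` naming are hypothetical, and gzip handling is omitted.

```python
import os


def split_sdf_sketch(input_sdf, output_dir, chunk_size, prefix="chunk"):
    """Copy an SDF into chunks of chunk_size records, reading one line at a time."""
    os.makedirs(output_dir, exist_ok=True)
    results = []      # (chunk path, number of records)
    out = None        # currently open chunk file, if any
    path = None
    n_in_chunk = 0
    with open(input_sdf) as fh:
        for line in fh:
            if out is None:
                # start a new chunk lazily, only when there is something to write
                path = os.path.join(output_dir, f"{prefix}_{len(results) + 1:03d}.sdf")
                out = open(path, "w")
            out.write(line)  # molblock lines are copied unchanged
            if line.rstrip("\r\n") == "$$$$":  # SDF record delimiter
                n_in_chunk += 1
                if n_in_chunk == chunk_size:
                    out.close()
                    results.append((path, n_in_chunk))
                    out, n_in_chunk = None, 0
    if out is not None:
        out.close()
        if n_in_chunk:
            results.append((path, n_in_chunk))
        else:
            # only trailing lines after the last record: drop the empty chunk
            os.remove(path)
    return results
```

Only one line and one open output file are held in memory at any time, which is what makes this approach suitable for large inputs.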
- npfc.save.file(df, output_file, col_mol='mol', col_idm='idm', csv_sep='|', encode=True)
Save an input DataFrame into a single file.
- Parameters
  - df (DataFrame) – input DataFrame
  - col_mol (str) – for SDF format only, column with the RDKit Mol objects
  - col_idm (str) – for SDF format only, column with the molecule ids; if not None, the id is saved both as a property and as the molecule title
  - csv_sep (str) – for CSV format only, delimiter to use
  - encode (bool) – encode mols and objects into base64 strings based on predefined column names
- Returns
a tuple containing the output file name and its number of records
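To show why base64 encoding is useful when objects are written to delimited text, here is a small round-trip sketch. It assumes pickle as the serialization step, which the documentation above does not state; `encode_obj` and `decode_obj` are hypothetical helpers and not part of npfc.

```python
import base64
import pickle


def encode_obj(obj):
    """Serialize an object and return it as an ASCII base64 string (pickle is an assumption)."""
    return base64.b64encode(pickle.dumps(obj)).decode("ascii")


def decode_obj(text):
    """Inverse of encode_obj: rebuild the object from its base64 string."""
    return pickle.loads(base64.b64decode(text))
```

The base64 alphabet contains neither `|` nor line breaks, so an encoded value cannot collide with the default `csv_sep='|'` delimiter or split a CSV row. As with any pickle-based scheme, strings from untrusted sources should not be decoded.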
- npfc.save.random()
Return x in the interval [0, 1).