save

Module save

A module containing the Saver class, used for storing DataFrames with molecules on disk.

class npfc.save.SafeHDF5Store(*args, **kwargs)[source]

A safe HDFStore that obtains a file lock before opening the store. Concurrent writers wait in a queue until the lock becomes available.

Adapted from: https://stackoverflow.com/questions/41231678/obtaining-a-exclusive-lock-when-writing-to-an-hdf5-file

Initialize and obtain file lock.
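The locking idea described above can be sketched with the standard library alone. This is not the npfc implementation: the helper name exclusive_lock, the ".lock" sidecar file and the probe_interval parameter are assumptions made for illustration. The sketch relies on os.open with O_CREAT | O_EXCL, which fails if the lock file already exists, so a second writer keeps retrying until the first one releases the lock.

```python
import os
import time
from contextlib import contextmanager


@contextmanager
def exclusive_lock(path, probe_interval=0.05):
    """Hold an exclusive lock on `path` through a sidecar `<path>.lock` file.

    Hypothetical helper, for illustration only.
    """
    lock_path = f"{path}.lock"
    while True:
        try:
            # O_CREAT | O_EXCL raises FileExistsError if the lock file is
            # already there, so only one process can create it at a time.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            # Another writer holds the lock: wait, then probe again.
            time.sleep(probe_interval)
    try:
        yield lock_path
    finally:
        # Release the lock so that the next waiting writer can proceed.
        os.close(fd)
        os.remove(lock_path)
```

A writer would wrap its store access in `with exclusive_lock("data.hdf"):`. Note that a process that crashes while holding the lock leaves the lock file behind, which then has to be removed manually.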

npfc.save.chunk(df, chunk_name_template, chunk_size, shuffle=False, random_seed=None, col_mol='mol', col_idm='idm', csv_sep='|', encode=True)[source]

Split an input DataFrame into several chunks and write them to disk. The input data must first be loaded as a whole into a DataFrame.

Parameters
  • df (DataFrame) – input DataFrame

  • chunk_name_template (str) – path the output file would have if there were only one (e.g. dir/file.csv). Chunk IDs are added to it (e.g. dir/file_001.csv, dir/file_002.csv, etc.).

  • chunk_size (int) – number of records per chunk (the last chunk may contain fewer)

  • shuffle (bool) – shuffle the records before splitting into chunks

  • random_seed (Optional[int]) – random seed to use for shuffling records

  • col_mol (str) – for SDF format only, column with the RDKit Mol objects

  • col_idm (str) – for SDF format only, column with the molecule IDs; if not None, the ID is saved both as a property and as the molecule title

  • csv_sep (str) – for CSV format only, delimiter to use

  • encode (bool) – encode RDKit Mol and other predefined objects into base64 strings.

Return type

List[Tuple[str, int]]

Returns

a list of tuples containing each chunk name and its number of records
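The naming scheme and the return value can be illustrated with a small sketch. It only plans the chunks and writes nothing to disk. The helper plan_chunks is hypothetical and not part of npfc, and the three-digit zero padding is inferred from the dir/file_001.csv example above.

```python
from pathlib import Path


def plan_chunks(n_records, chunk_name_template, chunk_size):
    """Return [(chunk_name, n_records_in_chunk), ...] without writing files.

    Hypothetical helper, for illustration only.
    """
    template = Path(chunk_name_template)
    plan = []
    # Walk over the start index of each chunk; chunk IDs start at 1.
    for chunk_id, start in enumerate(range(0, n_records, chunk_size), start=1):
        # The last chunk holds whatever remains, possibly fewer records.
        size = min(chunk_size, n_records - start)
        # Insert the zero-padded chunk ID between the stem and the extension.
        name = template.with_name(f"{template.stem}_{chunk_id:03d}{template.suffix}")
        plan.append((name.as_posix(), size))
    return plan


print(plan_chunks(5, "dir/file.csv", 2))
# [('dir/file_001.csv', 2), ('dir/file_002.csv', 2), ('dir/file_003.csv', 1)]
```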

npfc.save.chunk_sdf(input_sdf, output_dir, chunk_size=None, prefix=None, keep_uncompressed=False)[source]

Split an input SDF file into SDF chunks using memory-efficient, line-by-line text parsing, suitable for large files. Molecules are not parsed and the molblocks are left unchanged.

Parameters
  • input_sdf (str) – input SDF

  • output_dir (str) – directory where the chunk files are written

  • chunk_name_template – path the output file would have if there were only one (e.g. dir/file.csv). Chunk IDs are added to it (e.g. dir/file_001.csv, dir/file_002.csv, etc.).

  • chunk_size (Optional[int]) – number of records per chunk (the last chunk may contain fewer)

  • prefix (Optional[str]) – prefix to use for chunks. If left as None, the input SDF filename is used.

  • keep_uncompressed (bool) – for gzip input, keep the uncompressed file instead of deleting it as a temporary file

Return type

List[Tuple[str, int]]

Returns

a list of tuples containing each chunk name and its number of records

TODO: support for gzip outputs

TODO: support for gzip input
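Line-by-line splitting of an SDF can be sketched as follows. This is a minimal illustration of the technique, not the npfc code: the function split_sdf and its chunk naming are assumptions, it handles uncompressed input only, and it expects the file to end with a "$$$$" record delimiter. Each line is copied verbatim and only the delimiter lines are counted, so no molecule is ever parsed.

```python
from pathlib import Path


def split_sdf(input_sdf, output_dir, chunk_size, prefix=None):
    """Split an SDF into chunks of `chunk_size` records, copying lines as-is.

    Hypothetical helper, for illustration only.
    """
    input_sdf = Path(input_sdf)
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    prefix = prefix or input_sdf.stem  # fall back to the input file name
    chunks = []  # (chunk path, number of records)
    out, path, n_in_chunk = None, None, 0
    # newline="" keeps the original line endings untouched.
    with open(input_sdf, newline="") as src:
        for line in src:
            if out is None:
                # Open the next chunk lazily, only when there is data for it.
                path = output_dir / f"{prefix}_{len(chunks) + 1:03d}.sdf"
                out = open(path, "w", newline="")
                n_in_chunk = 0
            out.write(line)
            if line.strip() == "$$$$":  # end of one SDF record
                n_in_chunk += 1
                if n_in_chunk == chunk_size:
                    out.close()
                    chunks.append((str(path), n_in_chunk))
                    out = None
    if out is not None:  # last, possibly smaller, chunk
        out.close()
        chunks.append((str(path), n_in_chunk))
    return chunks
```

Because only one line is held in memory at a time, memory use stays flat regardless of the input size.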

npfc.save.file(df, output_file, col_mol='mol', col_idm='idm', csv_sep='|', encode=True)[source]

Save an input DataFrame into a single file.

Parameters
  • df (DataFrame) – input DataFrame

  • output_file (Union[str, Path]) – output file path

  • col_mol (str) – for SDF format only, column with the RDKit Mol objects

  • col_idm (str) – for SDF format only, column with the molecule IDs; if not None, the ID is saved both as a property and as the molecule title

  • csv_sep (str) – for CSV format only, delimiter to use

  • encode (bool) – encode mols and objects into base64 strings based on predefined column names.

Returns

a tuple containing the output file name and its number of records

npfc.save.random()

Return a random float x in the interval [0, 1).