load

Module load

A module for loading files in different formats into DataFrames.

npfc.load.count_mols(input_file, buffer_size=10240, keep_uncompressed=False)

Count the number of molecules in an input file. The method varies depending on the format:

  • SDF: count the $$$$ pattern

  • CSV: count the number of lines, minus 1 for column headers

  • HDF: load file into memory using Pandas and then count number of rows

  • PARQUET: not implemented yet

This function is optimized for memory, so it should handle very large files, except for HDF files, which are loaded fully into memory.

Parameters
  • input_file (str) – input file

  • buffer_size (int) – buffer size in bytes to use for scanning the input text file (SDF and CSV). Default is 10240 bytes (10 KiB).

  • keep_uncompressed (bool) – if the input is a gzip file, keep the uncompressed file after execution

Returns

count of molecules
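
The buffered counting strategies described above for SDF and CSV can be sketched as follows. This is a minimal illustration using only the standard library, not the actual npfc implementation: the helper names `count_sdf_records` and `count_csv_rows` are hypothetical, and the sketch assumes an already uncompressed binary stream, SDF delimiters that each sit on their own line, and CSV fields without embedded newlines.

```python
import io


def count_sdf_records(stream, buffer_size=10240):
    """Count b'$$$$' delimiters, reading buffer_size bytes at a time."""
    pattern = b"$$$$"
    count = 0
    tail = b""  # last bytes of the previous chunk, in case a delimiter is split
    while True:
        chunk = stream.read(buffer_size)
        if not chunk:
            break
        data = tail + chunk
        count += data.count(pattern)
        # Carry over fewer bytes than the pattern length, so a delimiter that
        # was already counted cannot be counted again (assumes delimiters are
        # separated by other bytes, as in a valid SDF).
        tail = data[-(len(pattern) - 1):]
    return count


def count_csv_rows(stream, buffer_size=10240):
    """Count data rows: number of lines minus 1 for the column headers."""
    newlines = 0
    last_byte = b""
    while chunk := stream.read(buffer_size):
        newlines += chunk.count(b"\n")
        last_byte = chunk[-1:]
    # A final line without a trailing newline still counts as a line.
    lines = newlines + (1 if last_byte and last_byte != b"\n" else 0)
    return max(lines - 1, 0)


sdf = b"mol\n  fake block\nM  END\n$$$$\n" * 3
csv = b"idm|smiles\nA|C\nB|CC\n"
print(count_sdf_records(io.BytesIO(sdf), buffer_size=5))
print(count_csv_rows(io.BytesIO(csv), buffer_size=4))
```

Because only one buffer is held in memory at a time, memory use stays bounded regardless of file size, which is the property the description above relies on.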

npfc.load.file(input_file, col_mol=None, col_idm=None, mol_format='rdkit', keep_props=True, decode=True, csv_sep='|')

Load a file into a DataFrame.

Parameters
  • input_file (str) – the input file to load

  • col_idm (Optional[str]) – the column/property to use for molecule ids. If left at the default and no idm column is found, then _Name is used instead. If that property is not set either, then a sequential idm is generated (MOL_0000001, etc.).

  • col_mol (Optional[str]) – the column to use for molecules (irrelevant for SDF)

  • csv_sep – the column separator to use for parsing the input file (CSV)

  • mol_format (str) – the input format for molecules

  • out_id – the column name used for storing molecule ids

  • out_mol – the column name used for storing molecules

  • keep_props (bool) – keep all properties found in the input file. If False, then only out_id and out_mol are kept.

  • decode (bool) – decode base64 strings into objects. Columns with encoded objects are labelled with a leading ‘_’. For molecules, reserved names are ‘mol’ and ‘mol_frag’.

Returns

a DataFrame

Warning: if an ‘idm’ property exists in the input file but the user picks another property for in_id, the pre-existing ‘idm’ will be renamed to ‘idm.1’ (and ‘idm.1’ will be overwritten if already present).
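
The id resolution rules above (fallback to _Name, sequential ids, and the ‘idm’ to ‘idm.1’ rename) can be illustrated with a small sketch. It works on plain dictionaries rather than a DataFrame, and the helper `assign_ids` is hypothetical; the numbering of generated ids by record position and the choice to keep the source column are assumptions, not confirmed npfc behavior.

```python
def assign_ids(records, col_idm=None):
    """Return copies of records (dicts of properties) with an 'idm' key set."""
    out = []
    for position, record in enumerate(records, start=1):
        rec = dict(record)  # do not modify the caller's data
        # The user picked another id property: keep the old 'idm' as 'idm.1',
        # overwriting any 'idm.1' that is already present.
        if col_idm is not None and col_idm != "idm" and "idm" in rec:
            rec["idm.1"] = rec.pop("idm")
        # Default behaviour: use 'idm' if present, otherwise fall back to '_Name'.
        if col_idm is not None:
            source = col_idm
        else:
            source = "idm" if "idm" in rec else "_Name"
        value = rec.get(source)
        # No usable id: generate a sequential one (assumed to follow record order).
        rec["idm"] = value if value else f"MOL_{position:07d}"
        out.append(rec)
    return out


records = [{"idm": "X1", "name": "aspirin"}, {"_Name": "caffeine"}, {}]
print(assign_ids(records, col_idm="name")[0])
print([rec["idm"] for rec in assign_ids(records)])
```

In the first call the original ‘idm’ value X1 survives under ‘idm.1’, which is the situation the warning describes.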