load

Module load

A module for loading files in different formats into DataFrames.

npfc.load.count_mols(input_file, buffer_size=10240, keep_uncompressed=False)

Count the number of molecules in an input file. The method varies depending on the format:

SDF: count the $$$$ pattern

CSV: count the number of lines, minus 1 for column headers

HDF: load file into memory using Pandas and then count number of rows

PARQUET: not implemented yet

This function is optimized for memory, so it should handle very large files. The exception is HDF files, which are loaded into memory in full.

Parameters

input_file (str) – input file
buffer_size (int) – buffer size in bytes to use for scanning the input text file (SDF and CSV). Default is 10240 bytes (10 KiB).
keep_uncompressed (bool) – if the input is a gzip file, keep the uncompressed file on disk after execution

Returns

count of molecules
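The buffered counting described above for SDF and CSV files can be sketched as follows. This is a minimal illustration of the technique, not the npfc implementation: the function names `count_sdf_records` and `count_csv_rows` are hypothetical, gzip handling is left out, and the sketch assumes that `$$$$` appears only as a record delimiter and that CSV fields contain no embedded newlines.

```python
def count_sdf_records(path, buffer_size=10240):
    """Count occurrences of the SDF record delimiter '$$$$' (hypothetical helper)."""
    pattern = b"$$$$"
    count = 0
    tail = b""  # last bytes of the previous round, kept to catch a split delimiter
    with open(path, "rb") as fh:
        while chunk := fh.read(buffer_size):
            data = tail + chunk
            count += data.count(pattern)
            # Keep len(pattern) - 1 bytes: enough to complete a delimiter that
            # straddles two chunks, too short to count one delimiter twice.
            tail = data[-(len(pattern) - 1):]
    return count


def count_csv_rows(path, buffer_size=10240):
    """Count CSV data rows: number of lines minus one header line (hypothetical helper)."""
    newlines = 0
    last_byte = b""
    with open(path, "rb") as fh:
        while chunk := fh.read(buffer_size):
            newlines += chunk.count(b"\n")
            last_byte = chunk[-1:]
    # A final line without a trailing newline still counts as a line.
    lines = newlines + (1 if last_byte and last_byte != b"\n" else 0)
    return max(lines - 1, 0)
```

Only one buffer is held in memory at a time, so memory use is independent of file size. That is the property the description above relies on for very large SDF and CSV inputs.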

npfc.load.file(input_file, col_mol=None, col_idm=None, mol_format='rdkit', keep_props=True, decode=True, csv_sep='|')

Load a file into a DataFrame.

Parameters

input_file (str) – the input file to load
col_idm (Optional[str]) – the column/property to use for molecule ids. If left at its default and no idm column is found, then _Name is used instead. If _Name is not set either, a sequential idm is generated (MOL_0000001, etc.).
col_mol (Optional[str]) – the column to use for molecules (irrelevant for SDF)
csv_sep (str) – the column separator to use for parsing the input file (CSV)
mol_format (str) – the input format for molecules
out_id – the column name used for storing molecule ids
out_mol – the column name used for storing molecules
keep_props (bool) – keep all properties found in the input file. If False, then only out_id and out_mol are kept.
decode (bool) – decode base64 strings into objects. Columns with encoded objects are labelled with a leading ‘_’. For molecules, reserved names are ‘mol’ and ‘mol_frag’.

Returns

a DataFrame

Warning: if an ‘idm’ property exists in the input file but the user picks another property for molecule ids (col_idm), the pre-existing ‘idm’ is renamed to ‘idm.1’, overwriting any ‘idm.1’ already present.
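The fallback order documented for col_idm can be illustrated with a small standalone sketch. It is only one reading of the description above, not the library's code: `resolve_idm` is a hypothetical helper working on a per-molecule property dict, and numbering the generated ids by position in the file, starting at 1, is an assumption.

```python
def resolve_idm(props, position, col_idm=None):
    """Pick a molecule id from a property dict (hypothetical helper).

    props:    properties of one molecule, e.g. SDF fields or a CSV row
    position: 1-based position of the molecule in the input (assumption)
    col_idm:  property explicitly chosen for ids, or None for the default
    """
    # An explicitly chosen property takes precedence when it is present.
    if col_idm is not None and props.get(col_idm):
        return str(props[col_idm])
    # Default behaviour: look for 'idm' first, then fall back to '_Name'.
    for key in ("idm", "_Name"):
        value = props.get(key)
        if value:
            return str(value)
    # Nothing usable found: generate a zero-padded sequential id.
    return f"MOL_{position:07d}"


records = [{"idm": "NP001"}, {"_Name": "aspirin"}, {}]
ids = [resolve_idm(props, i) for i, props in enumerate(records, start=1)]
# ids == ["NP001", "aspirin", "MOL_0000003"]
```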