load
Module load
A module for loading files in different formats into DataFrames.
- npfc.load.count_mols(input_file, buffer_size=10240, keep_uncompressed=False)
Count the number of molecules in an input file. The method varies depending on the format:
SDF: count the $$$$ pattern
CSV: count the number of lines, minus 1 for column headers
HDF: load file into memory using Pandas and then count number of rows
PARQUET: not implemented yet
This function is optimized for memory, so it should handle very large files, except for HDF files, which are loaded entirely into memory.
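The SDF and CSV counting strategies listed above can be sketched as a streaming, line-by-line pass over the file. This is a minimal illustration, not the npfc implementation: the helper names `count_sdf_records` and `count_csv_rows` are hypothetical, compressed inputs are not handled, and the `buffer_size` and `keep_uncompressed` arguments of `count_mols` are not modelled.

```python
def count_sdf_records(path):
    """Count SDF records by counting lines that hold only the '$$$$' delimiter.

    The file is iterated line by line, so memory use stays constant
    regardless of file size.
    """
    with open(path, "r", encoding="utf-8") as handle:
        return sum(1 for line in handle if line.rstrip() == "$$$$")


def count_csv_rows(path):
    """Count CSV data rows as the number of lines minus one header line.

    Assumption: no field contains an embedded newline, otherwise a plain
    line count overestimates the number of rows.
    """
    with open(path, "r", encoding="utf-8") as handle:
        n_lines = sum(1 for _ in handle)
    # An empty file has no header to subtract.
    return max(n_lines - 1, 0)
```

Both helpers read one line at a time, which is why this style of counting scales to very large files, whereas loading an HDF file through Pandas requires the whole table in memory.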
- npfc.load.file(input_file, col_mol=None, col_idm=None, mol_format='rdkit', keep_props=True, decode=True, csv_sep='|')
Load a file into a DataFrame.
- Parameters
input_file (str) – the input file to load
col_idm (Optional[str]) – the column/property to use for molecule ids. If left by default and no idm column is found, then _Name is used instead. If this property is not set, then a sequential idm is generated (MOL_0000001, etc.).
col_mol (Optional[str]) – the column to use for molecules (irrelevant for SDF)
csv_sep – the column separator to use for parsing the input file (CSV)
mol_format (str) – the input format for molecules
out_id – the column name used for storing molecule ids
out_mol – the column name used for storing molecules
keep_props (bool) – keep all properties found in the input file. If False, then only out_id and out_mol are kept.
decode (bool) – decode base64 strings into objects. Columns with encoded objects are labelled with a leading ‘_’. For molecules, reserved names are ‘mol’ and ‘mol_frag’.
- Returns
a DataFrame
Warning: if an ‘idm’ property exists in the input file but the user picks another property for in_id, the pre-existing ‘idm’ will be renamed to ‘idm.1’ (overwriting any ‘idm.1’ already present).
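The id fallback described for col_idm can be illustrated with a small standalone function. This is a sketch under stated assumptions rather than the library's code: `resolve_idm` is a hypothetical helper, the fallback is applied per record on a plain dict of properties, empty values are treated as "not set", and a user-supplied col_idm that is missing from the record falls straight through to the sequential id.

```python
def resolve_idm(props, position, col_idm=None):
    """Pick a molecule id for one record.

    props    -- dict of properties read from the input file (hypothetical)
    position -- 1-based position of the record in the file
    col_idm  -- property chosen by the user, or None for the default order
    """
    # Default order assumed from the documentation: 'idm', then '_Name'.
    candidates = [col_idm] if col_idm is not None else ["idm", "_Name"]
    for key in candidates:
        value = props.get(key)
        if value:  # missing or empty values count as "not set"
            return str(value)
    # Sequential id, zero-padded to seven digits as in MOL_0000001.
    return f"MOL_{position:07d}"
```

For example, a record carrying only a `_Name` property keeps that name as its id, while a record with no usable property at position 1 receives `MOL_0000001`.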