Chemistry::OpenSMILES - OpenSMILES format reader and writer


    use Chemistry::OpenSMILES::Parser;

    my $parser = Chemistry::OpenSMILES::Parser->new;
    my @moieties = $parser->parse( 'C#C.c1ccccc1' );

    $\ = "\n";
    for my $moiety (@moieties) {
        #  $moiety is a Graph::Undirected object
        print scalar $moiety->vertices;
        print scalar $moiety->edges;
    }

    use Chemistry::OpenSMILES::Writer qw(write_SMILES);

    print write_SMILES( \@moieties );


Chemistry::OpenSMILES provides support for SMILES chemical identifiers conforming to the OpenSMILES v1.0 specification.

Chemistry::OpenSMILES::Parser reads SMILES strings and returns them parsed into arrays of Graph::Undirected objects. Each atom is represented by a hash.

Chemistry::OpenSMILES::Writer performs the inverse operation. Generated SMILES strings are by no means optimal.

Molecular graph

Disconnected parts of a compound are represented as separate Graph::Undirected objects. Atoms are represented as vertices, and bonds are represented as edges.


Atoms, or vertices of a molecular graph, are represented as hash references:

    {
        "symbol"    => "C",
        "isotope"   => 13,
        "chirality" => "@@",
        "hcount"    => 3,
        "charge"    => 1,
        "class"     => 0,
        "number"    => 0,
    }

Except for symbol, class and number, all keys of the hash are optional. Per the OpenSMILES specification, the default values for hcount and class are 0.

For chiral atoms, the order of their neighbours in the input is preserved in an array stored under the chirality_neighbours key of the atom hash.
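A minimal sketch of reading such an atom hash follows. It uses only the keys listed above and applies the default of 0 when hcount is absent. The helper describe_atom is hypothetical and is not part of Chemistry::OpenSMILES.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Chemistry::OpenSMILES):
# formats an atom hash as a short human-readable string.
sub describe_atom {
    my ($atom) = @_;

    my $text = $atom->{symbol};
    $text = $atom->{isotope} . $text if exists $atom->{isotope};

    # hcount is optional; its default value is 0
    my $hcount = exists $atom->{hcount} ? $atom->{hcount} : 0;
    $text .= " H$hcount";

    $text .= " charge $atom->{charge}" if exists $atom->{charge};
    $text .= " $atom->{chirality}"     if exists $atom->{chirality};
    return $text;
}

my $atom = {
    symbol => 'C', isotope => 13, chirality => '@@',
    hcount => 3,   charge  => 1,  class     => 0, number => 0,
};

print describe_atom( $atom ), "\n";    # prints "13C H3 charge 1 @@"
print describe_atom( { symbol => 'O', class => 0, number => 1 } ), "\n";    # prints "O H0"
```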


Bonds, or edges of a molecular graph, rely completely on the internal representation of Graph::Undirected. Bond orders other than single (-, which is also the default) are stored as values of the edge attribute bond. The values correspond to the bond symbols used in the OpenSMILES specification.
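The bond attribute can be read with the edge attribute methods of Graph::Undirected. The sketch below is illustrative only: bond_symbols is a hypothetical helper, and it assumes that an edge without a bond attribute is a single bond, as described above.

```perl
use strict;
use warnings;
use Chemistry::OpenSMILES::Parser;

# Hypothetical helper: returns the bond symbol of every edge in a moiety.
sub bond_symbols {
    my ($moiety) = @_;
    my @symbols;
    for my $bond ($moiety->edges) {
        my ($u, $v) = @$bond;
        # Single bonds may carry no 'bond' attribute, so fall back to '-'.
        push @symbols,
             $moiety->has_edge_attribute( $u, $v, 'bond' )
                 ? $moiety->get_edge_attribute( $u, $v, 'bond' )
                 : '-';
    }
    return @symbols;
}

my $parser = Chemistry::OpenSMILES::Parser->new;
my ($moiety) = $parser->parse( 'C=CC#N' );

my @symbols = bond_symbols( $moiety );
print join( ' ', sort @symbols ), "\n";
```

The result is copied into an array before sorting because Perl would otherwise parse a function name placed directly after sort as a comparison subroutine.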


parse accepts the following options as key-value pairs in an anonymous hash passed as its second parameter:


In the OpenSMILES specification, the number of attached hydrogen atoms for atoms in square brackets is limited to 9. IUPAC SMILES+ has increased this limit to 99. The max_hydrogen_count_digits option instructs the parser to accept more than one digit in the attached hydrogen count.
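As a sketch, the option might be passed like this. The input '[CH10]' is chosen only to exercise a two-digit hydrogen count and is not a chemically meaningful species.

```perl
use strict;
use warnings;
use Chemistry::OpenSMILES::Parser;

my $parser = Chemistry::OpenSMILES::Parser->new;

# Accept up to two digits in the attached hydrogen count,
# in line with the IUPAC SMILES+ limit of 99.
my @moieties = $parser->parse( '[CH10]',
                               { max_hydrogen_count_digits => 2 } );

print scalar( @moieties ), "\n";
```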


With raw set to anything evaluating to true, the parser will convert neither implicit nor explicit hydrogen atoms in square brackets to atom hashes of their own. Moreover, it will not attempt to unify the representations of chirality. Note, though, that many subroutines of Chemistry::OpenSMILES expect non-raw data structures, so processing raw output may produce unexpected results.
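The effect of raw can be seen by comparing the atom counts of the same input parsed with the option set to a true value and without it. This is a minimal sketch. A separate parser object is created for each call only as a precaution, not because the module is known to require it.

```perl
use strict;
use warnings;
use Chemistry::OpenSMILES::Parser;

my $smiles = '[CH4]';

# Raw: hydrogen atoms stay recorded in the hcount of the carbon atom.
my ($raw) = Chemistry::OpenSMILES::Parser->new->parse( $smiles, { raw => 1 } );

# Default: hydrogen atoms become atom hashes (vertices) of their own.
my ($cooked) = Chemistry::OpenSMILES::Parser->new->parse( $smiles );

printf "raw: %d vertices, default: %d vertices\n",
       scalar( $raw->vertices ), scalar( $cooked->vertices );
```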


Element symbols in square brackets are not limited to the ones known to chemistry. Currently any one- or two-letter symbol is allowed.

Deprecated charge notations (-- and ++) are supported.

The OpenSMILES specification mandates a strict order of ring bonds and branches:

    branched_atom ::= atom ringbond* branch*

Chemistry::OpenSMILES::Parser supports both the mandated structure and the inverted one, in which ring bonds follow branch descriptions.
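For illustration, the two strings below describe methylcyclopentane with the ring bond placed before and after the branch. The sketch parses both and compares atom counts. The raw option is used here only so that hydrogen atoms are not added to the graphs.

```perl
use strict;
use warnings;
use Chemistry::OpenSMILES::Parser;

my %atom_count;
for my $smiles ( 'C1(C)CCCC1',      # mandated: ring bond, then branch
                 'C(C)1CCCC1' ) {   # inverted: branch, then ring bond
    my $parser = Chemistry::OpenSMILES::Parser->new;
    my ($moiety) = $parser->parse( $smiles, { raw => 1 } );
    $atom_count{$smiles} = scalar $moiety->vertices;
    print "$smiles: $atom_count{$smiles} atoms\n";
}
```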

Whitespace is not supported yet. It must be removed from SMILES descriptors before they are read with Chemistry::OpenSMILES::Parser.
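One way to clean the input is a plain regular expression substitution, sketched below. The helper strip_whitespace is hypothetical, and it assumes the string holds nothing but the SMILES descriptor. If a name or title follows the descriptor after a space, it has to be split off first.

```perl
use strict;
use warnings;

# Hypothetical helper: removes all whitespace from a SMILES string.
sub strip_whitespace {
    my ($smiles) = @_;
    $smiles =~ s/\s+//g;
    return $smiles;
}

print strip_whitespace( " C#C . c1ccccc1\n" ), "\n";    # prints "C#C.c1ccccc1"
```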

The derivation of implicit hydrogen counts for aromatic atoms is not unambiguously defined in the OpenSMILES specification. Thus only aromatic carbon is accounted for, and it is treated as having a valence of 3.




Andrius Merkys, <>