User Guide v5.4.2

Concept Searching – Product Guide

Last Modified 03/02/2017 (v5.4.2)

The following product guide details the configuration options and functionality exposed via the conceptQS administration interface.

Key Terminology:

  • QS – The main administration interface controlling the conceptSearch product suite
  • Source – Any external system being crawled/processed
  • Taxonomy/Termset – Synonymous; Termset is the SharePoint terminology for a hierarchical set of metadata
  • Class/Term – Synonymous; both describe a node in a taxonomy/termset (Term is the SharePoint terminology)
  • Workflow – A rule-based system in which custom actions are run according to rules configured in the registered taxonomies/termsets

Dashboard

Introduction

The Dashboard administration area provides a selection of tools to review application health.

The area can be accessed at the following URL: http://conceptsearchingserver/conceptQS/Stats.

The default screen shows a high-level overview of service statistics.
The last active times of each of the core Windows services are shown, with inactive services shown in red.
You can view more details on the status of these services by clicking on ‘Service Viewer’ in the top links.

You can view the current status of your content.
New content will be shown as awaiting collection, and progress through to fully processed once it has been classified.

Dashboard

System Health

The health service provides a traffic light based reporting system.
Colour-coded traffic lights will appear in the top menu bar when issues are detected. The traffic lights link to this page to display more detailed information.

You will then see the list of reported issues, with the ability to view a detailed description of the problem and suggested resolution steps.

It is also possible to configure notifications of system issues, along with daily reports of outstanding system issues.
Please see the Health Service Notifications configuration for more details.

A list of known environmental/setup issues and resolutions can also be found within our Knowledge base, available here: https://www.conceptsearching.com/resources/knowledge-base-archive/.

Health Service

Service Viewer

From the Service Viewer it is possible to view a live stream of the work carried out by the Windows services.
As the services process work, the display will change. Once all work is complete “Idle …” will be displayed.

It is possible to use this to check which sources are currently being processed, as well as to ensure that the Windows services are currently running.

Service Viewer

Sources

Introduction

The Sources administration area provides a web-based console for adding and managing external systems
to be crawled and classified. The screen below shows a few configured SharePoint sources as well as two source groups.
Source groups are used as logical containers for sources. There is no limit to the number of sources that can be configured.

The area can be accessed at the following URL: http://conceptsearchingserver/conceptQS/Sources.

Usage can be restricted to selected users based on either their Windows identity or using non-Windows based access controls, as required. See the Users area for more information on how to restrict access.

Manage

When on the main management screen you can perform the following actions:

  • Delete – Removes the source from processing; its content will be removed from search results in due course. This does not delete content from the external system
  • Re-collect – Queues the source for re-processing; crawled items will be deleted and the entire source re-crawled
  • Re-index – Queues a source or item to be re-processed, with a check for changes. If changes are found the item will be re-classified
  • Re-classify – Queues a source or item to be re-classified against the latest configured classification rules
  • Pause – Temporarily pauses a source from processing
  • Resume – Resumes a source from a temporary pause
  • Add to Group – Allows a source to be moved into a logical container (Source Group), either an existing group or a newly created one.

Clicking on the chart icon on a source row will show source specific statistics, in a similar manner to the main dashboard.
It is also possible to search within a source using the find icon, and change the connection information using the cog icon.

Sources

Clicking on a source row will show the crawled data directly below it with slightly reduced options.

By clicking through the possible levels within the sources area, you can traverse the whole structure of the crawled content.
Alternatively, you can use the page view icons to switch between the structured (parent/child) view and a flat view that shows all content within a source.
It is also possible to filter the grid by status and URL to assist with finding specific content.

Page Options

Eventually, by using the flat view, or by traversing the source structure, you will find the documents being crawled within the source (shown below).

Pages

The documents each have an icon indicating the file type, as well as links for the extracted text (“Text”), the extracted metadata (“Info”) and the applied classifications (“Classifications”). Each of these links will open in a popup:
Classifications

Each document has an associated status shown in numeric form; clicking the number will display a textual representation of the status:
Status
For content sources that support the writing of classifications back to the source system, such as writing classifications to SharePoint managed metadata fields, a tick will also be displayed if the write was successful, or a cross if the write failed. As with the status, clicking the icon will provide a text description.

Source Groups

Source groups provide a way of logically grouping specific sources, perhaps by type, or perhaps by an internal business specification.

Selecting the option “Add to Group” on the main sources grid screen will present the following popup:
Add to group
A group can either be “mixed”, which allows it to contain all source types, or source specific. In the example above a group would
be created entitled “New Group”, which only supports the addition of SharePoint sources. If a supporting source group already exists,
this can be selected from the dropdown list provided.

Selecting the cog icon on the main sources grid screen for a source group allows you to amend the group settings:
Edit Group
Here you can amend the group name, delete the group, or supply regular expression rules to automatically assign sources
to a specific group. Deleting a group will remove all existing items from the group, leaving them unassigned. You can also remove
specific sources from a group by selecting the source group in the grid and then selecting “Remove from Group” for the required sources.
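
As a rough illustration of how regular expression rules could route sources into groups, the Python sketch below tests each source URL against an ordered list of patterns and takes the first match. The group names, the patterns and the first-match-wins behaviour are all assumptions made for illustration; this guide does not specify how conceptQS evaluates the rules.

```python
import re

# Hypothetical rules: (group name, regular expression tested against the source URL).
GROUP_RULES = [
    ("HR SharePoint", r"^https://intranet\.example\.com/sites/hr(/|$)"),
    ("File Shares", r"^\\\\fileserver\d+\\"),
]

def assign_group(source_url, rules=GROUP_RULES):
    """Return the first group whose pattern matches, or None if unassigned."""
    for group, pattern in rules:
        if re.search(pattern, source_url, flags=re.IGNORECASE):
            return group
    return None
```

Testing candidate patterns in this way before entering them in the Edit Group form can help confirm that a rule captures only the intended sources.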

Adding a Source

To add a new content source, first navigate to the Sources area and select “Add Source”.

Sources

Each source type has specific configuration options described in the following sections.

File System

There are two ways to add file system content: as individual files or as folders.

Folders

The folders section can be used to add either Windows directories or Samba shares to the index.

Add Folder

Enter the UNC path of the root folder where collection is to start.

The “include sub-folders” option should be checked if sub-folders are to be collected.

The “Depth Limit” field specifies how many levels the indexing should process (if “include sub-folders” is checked).

The “allow anonymous access” checkbox is used to disable security filtering for selected sources. This option should only be selected for public file shares. If this checkbox is not checked, then the indexing processes will collect Windows Access Control Lists (ACLs) for the files and search results will be filtered based upon the end user’s Windows identity.

The “enable duplicate detection” option should be checked if documents that contain the same text content should be excluded from the index.
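
One way to picture duplicate detection is as a fingerprint comparison over the extracted text. The Python sketch below is an illustration only: the whitespace normalisation, the use of SHA-256 and the function names are assumptions, not a description of how conceptSearching compares documents.

```python
import hashlib

def text_fingerprint(text):
    """Hash the extracted text after collapsing whitespace (an assumed normalisation)."""
    normalised = " ".join(text.split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()

def filter_duplicates(documents):
    """Yield (path, text) pairs, skipping any whose text was already seen."""
    seen = set()
    for path, text in documents:
        fingerprint = text_fingerprint(text)
        if fingerprint in seen:
            continue  # same text content as an earlier document
        seen.add(fingerprint)
        yield path, text
```

Under this scheme two files with different names or formats are treated as duplicates whenever their extracted text is the same.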

The “write classifications to files” option should be selected if you wish to write classifications directly into the document properties (DOC/DOCX/XLS/XLSX/PPT/PPTX/PDF).
The configuration of which classifications are to be written, as well as the write format, is detailed in the Manage File System section.

For information on “Text Patterns” please see the Text Patterns Configuration section.

The “Re-index Period” field specifies how often the source should be checked for changes. The number specifies the period in days.

The “Max Collector Retries” field specifies how many retries are attempted before automatically removing items from the index when incremental collection indicates that the file has been deleted.

The “Doc Types” field can be used to specify a value which can be used to restrict queries when utilising the conceptSearching search index.

Files

Alternatively, individual files can be added by using the “Files” section:

Add File

When “Upload Files” is selected the file will be uploaded into the conceptSearching SQL database. This allows an application to present the file to users even if they do not have access to the original file location.

SharePoint

The SharePoint section allows for one or more site collections to be queued for processing that share the same set of crawling credentials.

The following versions of SharePoint are supported: 2007, 2010, 2013, 2016 and SharePoint Online.

Add SharePoint

The “SharePoint URL(s)” should be the root of the site collections to be added.

The “Username” can be entered in the following formats: DOMAIN\USERNAME and USERNAME@DOMAIN.
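
The two accepted username formats can be told apart by their separator. The helper below is a hypothetical Python sketch (the function name and the sample account are invented) showing how a script that prepares source definitions might validate a username before it is entered.

```python
def split_username(username):
    """Split DOMAIN\\USERNAME or USERNAME@DOMAIN into (domain, user)."""
    if "\\" in username:
        domain, user = username.split("\\", 1)
    elif "@" in username:
        user, domain = username.split("@", 1)
    else:
        raise ValueError("expected DOMAIN\\USERNAME or USERNAME@DOMAIN")
    return domain, user
```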

Selecting “Write Classifications to SharePoint” will enable a synchronisation of classifications back to the SharePoint
managed metadata fields. The written classifications will be subject to the classification configuration for the site collection
(see Farm/App conceptClassifier User Guides).

The “Re-Index Period” field specifies how often the source should be checked for changes. The number specifies the period in days.

The “Doc Types” field can be used to specify a value which can be used to restrict queries when utilising the conceptSearching search index.

SharePoint Online (OneDrive for Business)

Add SharePoint Online

Office 365 customers can configure conceptSearching to automatically detect and queue their employees’ OneDrive sites (Personal Sites) hosted in Office 365.
An account with Tenant administration rights must be supplied, and the frequency of the detection of new OneDrive sites must be set.
It is also possible to provide a filter expression to ensure that certain OneDrive paths are included and others excluded as required.

Optionally, it is also possible to set up the resources necessary to ensure conceptClassifier is enabled and configured on the detected OneDrive sites.
This is achieved by allowing the user to select the “Classification Template”, which details the necessary configuration. Please see the “conceptClassifier – Templated Configuration Guide” for more information.

SQL

It is also possible to index a wide variety of other sources, including:

  • Microsoft SQL Server
  • Oracle Databases
  • PostgreSQL Databases
  • EMC Documentum DMS
  • Interwoven Worksite DMS
  • Hummingbird DMS

Content must either be configured/crawled using the configured service accounts (IIS Application Pool User, Windows Services) or by using specific connection details.

For PostgreSQL connections the username/password must be specified.

Once connected it is possible to create an intelligent content mapping, crawling certain fields as unstructured index text, and other fields as mapped metadata.
For more information please see the Manage SQL section.

Add SQL

The “Server” field should specify the server name of the database system to be crawled (“.” can be used to indicate the local server).

The “Database Name” field should specify the database that will be crawled. It is possible to configure multiple databases from the same server.

When the connection configuration has been completed you will be redirected to the Source Configuration, which allows you to define how the
database will be crawled. It is possible to crawl either specific tables, or crawl custom queries (defined select statements, which may use JOIN statements across multiple tables).

Websites

The web source configuration may be used to add web sites (or single web pages) to the index.

Add Website

Enter the starting point for the crawl in the first edit box. All linked pages will then be discovered and added to the index, unless the “this page only” checkbox is checked.

The “Sub Domain” field may be used to restrict links that are followed to the selected sub-domain.

The “Username” and “Password” fields may be used to allow crawling of authenticated sites. The Collector supports the following authentication modes: BASIC, NTLM and Lotus Notes (DOMINO) SSO.

For information on “Text Patterns” please see the Text Patterns Configuration section.

The “Re-index Period” field specifies how often the source should be checked for changes. The number specifies the period in days.

The “Depth Limit” field specifies how many levels the indexing should process.

The “Max Collector Retries” field specifies how many retries are attempted before automatically removing items from the index when incremental collection indicates that the file has been deleted.

The “Doc Types” field can be used to specify a value which can be used to restrict queries when utilising the conceptSearching search index.

Manage Sources

The connection configuration of a source can be amended by clicking on the cog shown in each source row:
Sources
Source specific configuration can be found under each sub heading; for example, file source specific settings are found under the sub heading “File”.

File System

Write Configuration

The “Write Configuration” options define how classifications should be written directly to supported files.

The following file types are currently supported:

  • DOC/DOCX
  • PPT/PPTX
  • XLS/XLSX
  • PDF

File Write Configuration

Settings can be configured at a global level (default), or at a source level by selecting “Write Configuration” from the default sources screen for a folder.

Each registered taxonomy can be configured separately with the following options:

  • Field Name – The property name to persist the classifications with (document property name)
  • Format – How the classifications should be formatted, either a custom delimited combination of the labels/GUIDs, or, the SharePoint specific format.
  • Migration Destination – When the “SharePoint” format is selected you can select a SharePoint site collection destination. Selecting this will support pre-population of SharePoint WSS Ids into the document properties.
    This allows for the metadata to be automatically promoted into SharePoint fields upon upload.
  • Write – Enables/Disables the writing of classifications for the selected taxonomy row

Files Excluded

When indexing files from a file system the list of file locations that will be ignored is defined by the “Files Excluded” list. The definitions in this list may be viewed and modified via the Files Excluded form:
Files Excluded
Any file with a path that matches one of these patterns will be ignored.

Wildcards may be used anywhere in the pattern definition, with:

  • The asterisk character (*) matching any sequence of characters
  • The question mark character (?) matching any single character
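
The wildcard rules above can be approximated by translating each pattern into a regular expression. The Python sketch below is a minimal illustration under stated assumptions: that a pattern must match the whole path, that matching ignores case, and that the sample patterns (which are invented) resemble real entries. The actual matching is performed by the product and may differ.

```python
import re

def wildcard_to_regex(pattern):
    """Translate * (any sequence) and ? (any single character) into a regex."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    # Assumption: the whole path must match, ignoring case.
    return re.compile("".join(parts), re.IGNORECASE | re.DOTALL)

def is_excluded(path, patterns):
    """Return True if the path matches any exclusion pattern."""
    return any(wildcard_to_regex(p).fullmatch(path) for p in patterns)
```

For example, a pattern such as `*.tmp` would exclude any path ending in `.tmp`, while `*\backup_20??\*` would exclude anything beneath a folder named `backup_20` followed by exactly two characters.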

Files Included

When indexing files from a file system the list of file extensions that will be processed is defined in the “Files Included” list. The definitions in this list may be viewed and modified via the Files Included form:
Files Included
Any file with a file extension not listed in this list will be ignored.

SharePoint

Dashboard

The SharePoint dashboard provides the same dashboard display as the main reporting dashboard, with the results filtered to SharePoint types.
Classification coverage identifies the percentage of content that has had classifications applied, and the percentage that has not.
SharePoint Dashboard

SharePoint Excluded

When indexing files from SharePoint the list of file locations that will be ignored is defined in the “SharePoint Excluded” list. The definitions in this list may be viewed and modified via the SharePoint Excluded form:
SharePoint Excluded
Any file with a path that matches one of these patterns will be ignored.

Wildcards may be used anywhere in the pattern definition, with:

  • The asterisk character (*) matching any sequence of characters
  • The question mark character (?) matching any single character

Templating

Templating allows an administrator to pre-configure classification configurations for site collections. For more information please review the following guide: “conceptClassifier – Templated Configuration Guide”.

SQL

Dashboard

The SQL dashboard provides the same dashboard display as the main reporting dashboard, with the results filtered to SQL types.
Classification coverage identifies the percentage of content that has had classifications applied, and the percentage that has not.
SQL Dashboard

Source Configuration

The “Source Configuration” screen allows you to define which tables/views/queries will be crawled, with the following options available:

  • Add Source – Add a new SQL database connection
  • Edit Connection – Amend the connection details of the currently selected source
  • Add Query – Add a custom method for crawling content (custom SELECT statements). Templates are provided for Hummingbird, Worksite and Documentum.

Selecting one of the tables/queries on the list will redirect you to the entity level configuration, which identifies how content will be mapped into
the conceptSearching index.
SQL Configuration

Selecting the “Add Query” option will present a popup allowing you to select a unique name for the query, as well as a template to use for the queries (such as Documentum, Worksite, etc.).
Add SQL Query

Adding the query will take you to the custom query configuration. Here you must define the primary key query and the content query; all other
configuration options are described in the Table Configuration section:

Primary Key Query
The primary key query should return a set of values that uniquely identify each row to be crawled. In the event that JOINs
are used, you should JOIN from the largest dataset to the smallest to ensure that each row is unique.

Example: SELECT PageID FROM Pages

Content Query
The content query must return all fields to be indexed/classified on, as well as the fields included in the primary key query.

Example: SELECT * FROM Pages
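
To show how the two queries relate, the Python sketch below runs the example queries against an in-memory SQLite table. It is an illustration only: the table layout, the sample rows and the way the content query is filtered per key (as a sub-select) are assumptions, and the product's own crawl logic is not documented here.

```python
import sqlite3

PRIMARY_KEY_QUERY = "SELECT PageID FROM Pages"
CONTENT_QUERY = "SELECT * FROM Pages"

def crawl(conn):
    """Fetch one content row per primary key value; return {key: row}."""
    conn.row_factory = sqlite3.Row
    keys = [row["PageID"] for row in conn.execute(PRIMARY_KEY_QUERY)]
    if len(keys) != len(set(keys)):
        raise ValueError("primary key query returned duplicate keys")
    rows = {}
    for key in keys:
        # Assumption: the content query is filtered by key as a sub-select.
        sql = f"SELECT * FROM ({CONTENT_QUERY}) WHERE PageID = ?"
        rows[key] = dict(conn.execute(sql, (key,)).fetchone())
    return rows

# Hypothetical sample data standing in for a real source database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Pages (PageID INTEGER PRIMARY KEY, Title TEXT, Body TEXT)")
conn.executemany("INSERT INTO Pages VALUES (?, ?, ?)",
                 [(1, "Home", "Welcome"), (2, "Jobs", "Vacancies")])
pages = crawl(conn)
```

The sketch only works because the content query returns the `PageID` column, which is why the content query must include the fields used in the primary key query.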

Amend SQL Query

Table Configuration

The table configuration allows you to choose how each specific entity will be crawled:

Include
When checked the table/entity will be enabled in the collection schema.

Upload Content
When checked the Content fields will be uploaded into the conceptSearching database.
Uploaded content can be retrieved after collection by passing the PageId for the record to the QS API call “GetDownload”.

PK – Primary Key
Please select the fields which uniquely identify the row to be crawled.
In the event that multiple rows are returned by the Primary Key, the query will be aborted.
Custom queries do not require the primary key to be defined; it is set automatically from the primary key query.

Content
Identifies the fields that will be crawled as searchable text in the conceptSearching search index.
Multiple fields can be mapped to Content, each will be appended with a line break.

It is also possible to configure a single binary field type that contains a document;
the collection process will load the binary and attempt to convert and extract text from the document.
When this functionality is used we recommend setting the “ContentFilename” or “ContentType” index mapping to aid the
process of text extraction.

Metadata
Identifies the fields that will be crawled as mapped metadata rather than as unstructured index text.

Index Mappings
Index mappings identify mappings between the entity’s fields and the internal conceptSearching database.
Each row also contains an information icon identifying its purpose within the crawling process.

Modified Filter (Incremental Crawls)
This should be set to a field that defines when a row has changed (the modified date for the row).
When set, the collection process will automatically restrict re-indexing to rows that have a modified date
later than the last crawl time.
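
The effect of the modified filter can be sketched as a date comparison added to the crawl query. The Python example below uses an in-memory SQLite table; the table, the `Modified` column name and the sample timestamps are hypothetical, and the SQL that the collection process actually generates is not documented here.

```python
import sqlite3

def changed_rows(conn, last_crawl_time):
    """Return the keys of rows modified after the previous crawl."""
    sql = "SELECT PageID FROM Pages WHERE Modified > ? ORDER BY PageID"
    return [row[0] for row in conn.execute(sql, (last_crawl_time,))]

# Hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Pages (PageID INTEGER PRIMARY KEY, Body TEXT, Modified TEXT)")
conn.executemany("INSERT INTO Pages VALUES (?, ?, ?)", [
    (1, "Welcome", "2017-01-10 09:00:00"),
    (2, "Vacancies", "2017-02-20 14:30:00"),
])
# Timestamps in this fixed format compare correctly as text in SQLite.
recent = changed_rows(conn, "2017-02-01 00:00:00")
```

Only row 2 is returned, so an incremental crawl would re-process that row and leave row 1 untouched.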

Re-Index Period
This value is the number of days/hours/minutes that will pass between Re-Indexing. The Re-Indexing process involves querying the table(s) to find new and changed records.

SQL Table Configuration

Websites

Pages Excluded

When indexing files from a web site the list of file locations that will be ignored is defined in the “Pages Excluded” list. The definitions in this list may be viewed and modified via the Pages Excluded form:
Pages Excluded
Any file whose path matches one of these patterns will be ignored.

Wildcards may be used anywhere in the pattern definition, with:

  • The asterisk character (*) matching any sequence of characters
  • The question mark character (?) matching any single character

Pages Included

When indexing files from a web site the list of content types that will be processed is defined in the “Pages Included” list. The definitions in this list may be viewed and modified via the Pages Included form:
Pages Included
Any file with a content type not listed in this table will be ignored.

Taxonomies

Introduction

The Taxonomies administration area provides a web-based console for creating and managing taxonomies.

The area can be accessed at the following URL: http://conceptsearchingserver/conceptQS/Taxonomies.

Usage can be restricted to selected users based on either their Windows identity or using non-Windows based access controls, as required. See the Users area for more information on how to restrict access.
Taxonomies

Product Tour

You can view an interactive, taxonomy-specific product tour at any time. To run this, go to the “Help” tab and select the link under “Tour”.

Manage Taxonomies

When working with taxonomies the hierarchical structure is displayed on the left-hand side of the page, allowing for
specific terms to be selected and managed.

The first dropdown list allows you to select which group of taxonomies you want to work with; if there is only one group, the dropdown will be hidden.
The second dropdown allows you to select which taxonomy you are working with.
Treeview
Right-clicking the treeview nodes provides a number of management options at both the term and termset level, including:

  • Add Child Term
  • Rename Term
  • Delete Term
  • Re-Classify Term
  • Re-Classify Tagged
  • Pin Term With Children
  • Reuse Terms
  • Export CSV

You can also drag-and-drop a node from one location on the treeview to another; once you have dropped the node you can choose to either move or copy it.

Searching for Taxonomy Terms

A search facility is provided to locate terms that contain specified text:

Click the magnifying glass icon to the right of the taxonomy dropdown and a new edit box appears where search text may be entered:
Taxonomy Search

Source Filter

A filter facility is also provided to restrict all search/browse results to a specific source:

Click the source filter link in the top right of the display, then select a source:
Source Filter

Add Taxonomies

Create a blank taxonomy

SQL taxonomies reside within the conceptSearching database. They are fully functional, with the exception of writing metadata back to SharePoint.

To add a SQL taxonomy select the option “[new taxonomy]” from the taxonomies dropdown list and supply a taxonomy name. The taxonomy will then be loaded ready for management.
Create SQL Taxonomy
When there are multiple groups available you will also need to select “Default Group” from the group drop down list.

If “Default Group” is not available in the drop down list then please load the “Settings” tab and enable the option “Always show Default Group”.

Importing Taxonomies

To import an existing taxonomy go to the “Global Settings” tab and select “Add Taxonomies”.
Taxonomies can be imported in 3 different ways:

  • Connect to SharePoint TermStore – The URL should be set to any site collection within the farm or tenancy, such as: https://conceptsearching.sharepoint.com.
    The supplied credentials must have access to both the site collection specified, as well as the termstore (preferably as a term store administrator).
  • Load XML file to SQL – Imports an XML file directly into the conceptSearching database. Large taxonomies will be imported by the background services.
  • Load Default Taxonomies – 4 taxonomies are provided out of the box. These can be used in full, or simply as a reference for regular expression and metadata clues.

Add Taxonomies

Backup/Delete Taxonomies

Existing taxonomies can be managed via the “Global Settings” tab:

Manage Taxonomies
Taxonomies can be exported as XML regardless of the taxonomy type, as well as removed from the conceptSearching database.
When removing SharePoint termset registrations the source termset remains intact; all that is removed is the link from conceptSearching to the termstore.

Classification Rules (Clues)

Clue Types

Introduction

Clues for each class can be viewed and managed by selecting the class in the treeview and then selecting the Clues tab:
Clues
Clues are used to describe the language found in documents that makes them about a particular topic. Selecting the “Doc Counts” checkbox gives an indication of the number of documents
that match the word/phrase used within the clue.

There may be any number of words up to a maximum of 200 characters per clue. However, most clues will consist of one, two or three words. Use double quotes around single words to disable stemming. Use double quotes around phrases to invoke exact phrase matching.

Example: A class called “Global Warming” may have the following clues:

Global Warming
Greenhouse Gases
CO2 Emissions
Pollution

Use the Mandatory checkbox to indicate that a clue is required. A document cannot be classified against a category unless it matches all of the mandatory clues.
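
The mandatory rule acts as a gate: if any mandatory clue is missing, the document cannot be classified against the class, whatever else it matches. The Python sketch below illustrates only that gate. Its matching is a plain case-insensitive substring test chosen for brevity, which is an assumption and not the product's stemmed matching, and the clue list is invented.

```python
def passes_mandatory_clues(text, clues):
    """clues: list of (clue_text, mandatory) pairs.

    Returns False if any mandatory clue is absent. Matching here is a plain
    case-insensitive substring test, not the product's stemmed matching.
    """
    lowered = text.lower()
    return all(clue.lower() in lowered for clue, mandatory in clues if mandatory)
```

Passing the gate does not by itself mean the document is classified; it only means the document has not been ruled out.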

The following types of clue are available; each is described in detail below:

  • Standard Clues
  • Case-Sensitive Clues
  • Phrase match (Wildcard) Clues
  • Metadata Clues
  • Phonetic Clues
  • Regex Clues
  • Required Term Clues
  • Term Boost Clues
  • Language Clues
  • Static Clues
  • Hierarchical Clues

Standard Clues

A single word, multi-word concepts or phrases.

The text can include wildcard characters. Valid wildcards are:
? = any single character
* = any sequence of characters

Use quotes around standard clues to invoke a case-insensitive exact match on entered text, including any punctuation. Putting quotes around the clue text will disable wildcard processing, allowing exact matching against text that contains wildcard characters.

Examples:

A standard clue matched on a fuzzy basis with word stemming enabled:
training
will match against:
train, training, trains.

A standard clue matched on a fuzzy basis with word stemming enabled:
Train timetables
will match against:
train timetables, training timetables, trains timetable.

A standard clue enclosed in double quotes will be matched on an exact match basis:
“Train timetables in the U.K.”
will match only against:
Train timetables in the U.K. (Case-insensitive)

A standard clue containing wildcards:
Train time* in the U.K.
will match against:
Train timetables in the U.K. (Case-insensitive)

A standard clue containing wildcards enclosed in double quotes will be matched on an exact match basis:
“Train time* in the U.K.”
will only match against:
Train time* in the U.K. (Case-insensitive)
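
The difference between wildcard and quoted clues in the examples above can be sketched in Python as a translation into regular expressions. This is an illustration under assumptions: it ignores word stemming entirely, treats `*` as a non-greedy run of any characters, and accepts both straight and typographic double quotes; the function name is invented.

```python
import re

QUOTES = ('"', "\u201c", "\u201d")

def clue_to_regex(clue):
    """Build a case-insensitive regex for a standard clue (no stemming)."""
    if len(clue) >= 2 and clue[0] in QUOTES and clue[-1] in QUOTES:
        # Quoted: exact match, wildcard characters are taken literally.
        body = re.escape(clue[1:-1])
    else:
        body = "".join(
            ".*?" if ch == "*" else "." if ch == "?" else re.escape(ch)
            for ch in clue
        )
    return re.compile(body, re.IGNORECASE)

text = "See the train timetables in the U.K. for details."
```

With this sketch, the unquoted clue `Train time* in the U.K.` finds the sample text, while the same clue in double quotes matches only text that contains a literal asterisk.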

Case-Sensitive Clues

A case-sensitive phrase match clue, including any punctuation. There is no need to put double quotes around the text (double quotes at the start and/or end of the text will be removed).

Phrase match (Wildcard) Clues

A phrase match clue that supports the use of ‘*’ and ‘?’ wildcards when matching document text (see Regex Clues for full regular expression support).

Metadata Clues

A clue based on document metadata, with matching based on:

  • Exact string matches – Such as: AUTHOR=JOHN SMITH
  • Wildcard string matches – Such as: AUTHOR*=john sm?th*
  • Date Range matches – Such as: FIELD > VALUE
  • Dynamic Date Range matches – Such as: FIELD>TODAY OR FIELD>TODAY-14 (Matching the last 2 weeks)
  • Integer Range matches – Such as FIELD > VALUE or FIELD < VALUE

Both field and value are case-insensitive for metadata matches. Wildcard matches must include a * character before the equals sign (as shown in the example above).
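
A dynamic date range clue such as FIELD>TODAY-14 can be read as "the field value is later than the date 14 days before today". The Python sketch below shows one way to evaluate such an expression. It is an assumption-laden illustration: the parsing rules, the function names and the metadata dictionary are hypothetical, and only the > and < operators are handled.

```python
import re
from datetime import date, timedelta

def resolve_date(value, today):
    """Turn 'TODAY', 'TODAY-14' or 'YYYY-MM-DD' into a date object."""
    match = re.fullmatch(r"TODAY(?:-(\d+))?", value.strip(), re.IGNORECASE)
    if match:
        return today - timedelta(days=int(match.group(1) or 0))
    return date.fromisoformat(value.strip())

def date_clue_matches(clue, metadata, today):
    """Evaluate 'FIELD>VALUE' or 'FIELD<VALUE' against a metadata dict."""
    field, op, value = re.fullmatch(r"\s*([^<>]+?)\s*([<>])\s*(.+)", clue).groups()
    actual = metadata[field.upper()]  # field names are case-insensitive
    limit = resolve_date(value, today)
    return actual > limit if op == ">" else actual < limit
```

Passing `today` explicitly keeps the sketch deterministic; a real evaluation would use the current date at classification time.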

The following special metadata fields can be used:

CSE-CONTENTTYPE
The raw content type, for example:

  • text/*
  • text/html; charset=utf-8
  • pdf
  • application/pdf

Most applications should use the CSE-TYPE field or the FILE TYPE field (see below) rather than the CSE-CONTENTTYPE field due to the highly variable nature of the raw values.

Examples:

A clue based on PDF documents would look like this:
cse-type = application/pdf

A clue based on a specific author would look like this:
author=john smith

CSE-DOCTYPE
The DocType integer field

CSE-FILENAME
The document filename (e.g. “Pensions.doc”)

CSE-FILEPATH
The document path not including the filename (e.g. “http://www.bbc.co.uk/sport/”)

CSE-FOLDERS
Used to match folders including sub-folders. For example:

CSE-FOLDERS=http://www.abc.com/jobs/
matches:
and also:

A clue based on a right-truncated path would look like this:
CSE-FOLDERS=c:\myfolder\subfolder\
or
CSE-FOLDERS=http://www.abc.com/jobs/
Note that when using CSE-FOLDERS with a right-truncated path, the path must always end with a slash character.

A clue based on selected folders within the path would look like this:
CSE-FOLDERS=myfolder/myfolder2
Note that when using CSE-FOLDERS with subfolder matches, the value must not begin or end with a slash character.

CSE-FOLDER
Used to match folders without including sub-folders. For example:

CSE-FOLDER=http://www.abc.com/jobs/
matches:
does not match:
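
The distinction between CSE-FOLDERS (sub-folders included) and CSE-FOLDER (sub-folders not included) can be sketched as a prefix test versus an equality test on the document's folder path. The Python functions below are a simplified illustration: the function names and sample paths are invented, only the right-truncated form of CSE-FOLDERS is covered, and the product's real matching may differ.

```python
def matches_folders(file_path, value):
    """CSE-FOLDERS style: the folder itself or any sub-folder beneath it."""
    return file_path.lower().startswith(value.lower())

def matches_folder(file_path, value):
    """CSE-FOLDER style: the folder itself only, not its sub-folders."""
    return file_path.lower() == value.lower()
```

Here `file_path` stands for the document path without the filename, in the style of the CSE-FILEPATH field.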

CSE-LASTMODIFIEDDATE
The LastModifiedDate from the collected content in the format “YYYY-MM-DD HH:MM:SS”.

This field can only be matched using the greater than or less than operators, for example:
CSE-LASTMODIFIEDDATE < 2010-01-01
CSE-LASTMODIFIEDDATE > 2010-01-01
Only the date can be specified, not the time.

CSE-LANG
The dominant language of the document, using ISO 639-1 two-letter codes. See Language Detection settings for more information.

CSE-METADATACOLLECTIONONLY
This value will be set to “1” if the document was too large for the conceptSearching index (max 500MB), but was processed using metadata only.

CSE-PAGETITLE
The Title extracted from the document itself.

CSE-TEXTLENGTH
The length of the plain text extracted from the document, in characters.

This field can only be matched using the equals, greater than or less than operators, for example:
CSE-TEXTLENGTH = 50000
CSE-TEXTLENGTH > 50000
CSE-TEXTLENGTH < 50000

CSE-TITLE
The Title extracted from metadata.

CSE-URL
The document Url, including the filename (e.g. “http://www.bbc.co.uk/sport/Pensions.doc”)

FILE TYPE
The short normalised content type, always one of the following:

Adobe PDF files:

  • PDF

Corel WordPerfect files:

  • WPD

Microsoft Excel files:

  • XLS
  • XLSX

Microsoft Outlook MSG files:

  • MSG

Microsoft PowerPoint files:

  • PPT
  • PPTX

Microsoft Rich Text Format files:

  • RTF

Microsoft Word files:

  • DOC
  • DOCX

Text files (including HTML, XML, CSV, etc.):

  • TXT
  • HTML
  • XML

All other file types

  • OTHER

FILE SIZE
The length of the document, in bytes.
This field can be matched using the equal, greater than or less than operators, for example:
FILE SIZE = 10000
FILE SIZE < 10000
FILE SIZE > 10000

MODIFIED
The Modified date from the document metadata in the format “YYYY-MM-DD HH:MM:SS”.

This field can be matched using the equal, greater than or less than operators, for example:
MODIFIED = 2010-01-01
MODIFIED < 2010-01-01
MODIFIED > 2010-01-01
Only the date can be specified, not the time.
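
The comparison clues shown for CSE-TEXTLENGTH, FILE SIZE, MODIFIED and CSE-LASTMODIFIEDDATE all follow the same “FIELD operator VALUE” shape. The Python sketch below shows one way such a clue could be evaluated against a single document value. The function name and parsing pattern are hypothetical, and the sketch does not enforce the rule that CSE-LASTMODIFIEDDATE only accepts the greater than and less than operators.

```python
import re
from datetime import date

# Hypothetical pattern for clues such as "FILE SIZE > 10000".
CLUE_PATTERN = re.compile(
    r"^\s*(?P<field>[A-Z\- ]+?)\s*(?P<op>[<>=])\s*(?P<value>\S+)\s*$"
)


def evaluate_clue(clue: str, document_value):
    """Evaluate a 'FIELD op VALUE' clue against one document value."""
    match = CLUE_PATTERN.match(clue)
    if match is None:
        raise ValueError(f"Unrecognised clue: {clue!r}")
    raw = match.group("value")
    # Dates are given as YYYY-MM-DD (date only, no time); anything else is numeric.
    target = date.fromisoformat(raw) if "-" in raw else int(raw)
    op = match.group("op")
    if op == "=":
        return document_value == target
    if op == "<":
        return document_value < target
    return document_value > target
```

For example, evaluate_clue("MODIFIED < 2010-01-01", date(2009, 12, 31)) returns True. Because only the date can be specified, a document timestamp would need to be reduced to its date part before comparison.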

Phonetic Clues

A case-insensitive fuzzy/phonetic phrase match clue. There is no need to put double quotes around the text (double quotes at the start and/or end of the text will be removed).

Phonetic clues ignore all non-alphanumeric characters. Words that contain no digits are matched using a phonetic algorithm so that words that sound similar will be matched.

Phonetic clues do not use word stemming in the matching process.

For example, the following clue:
Intelligence Organisations in the Middle East

Would match any of the following:
Intelligence Organizations in the Middle East
Intelligence Organisations in the Middle-East
Inteligence organisations, in the “middle east”.

But not any of the following:
Intelligence Organisations located in the Middle East
Intelligence Organisations in the Mid-East

Regex Clues

A Regular Expression match clue.

Definitions of Regular Expression Syntax can be found in many places, including here:
http://msdn.microsoft.com/en-us/library/ae5bf541.aspx
and here:
http://www.regular-expressions.info/reference.html

For example, the following clue matches US Social Security Numbers found anywhere in the document text:

[/,,/.,/=,\s]((?!000)[0-6]\d{2}|7[0-6]\d|77[0-2])-((?!00)\d{2})-((?!0000)\d{4})[/,,/.,\s]

This RegEx clue ensures that:

  • The SSN must consist of 11 characters in this format: NNN-NN-NNNN
  • The SSN must be preceded by a white space or a dot or a comma or an equals sign.
  • The SSN must be followed by a white space or a dot or a comma.
  • The two hyphens must be present.
  • None of the three sections can be equal to zero.
  • The first section must be in the range 001 – 772

Any regular expression matches found will be extracted and added to the conceptSearching index automatically. For example, if we have a document that contains this text:

Here is one SSN: 407-54-8831
And here is another 407-54-8832 in the middle of this sentence.

Then the following metadata entries will be generated automatically:

Regex-SSN:407-54-8831;Regex-SSN:407-54-8832;

These can be seen in the Metadata field in the Pages table. Note that the metadata field name is the Term name prefixed with “Regex-”.
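
The SSN example can be reproduced outside the product with a short Python sketch. Python's re module stands in here for the product's regular expression engine, which is an assumption, and the helper name regex_metadata is hypothetical. The sketch joins the three captured sections so that the surrounding delimiter characters are not included; how the product itself trims them is not documented here.

```python
import re

# The clue text from this guide, used unchanged.
SSN_CLUE = re.compile(
    r"[/,,/.,/=,\s]((?!000)[0-6]\d{2}|7[0-6]\d|77[0-2])-((?!00)\d{2})-((?!0000)\d{4})[/,,/.,\s]"
)


def regex_metadata(term_name: str, text: str) -> str:
    """Build 'Regex-<term>:<match>;' entries for every SSN found in the text."""
    entries = []
    for match in SSN_CLUE.finditer(text):
        # Join the three captured sections, dropping the delimiter characters.
        ssn = "-".join(match.groups())
        entries.append(f"Regex-{term_name}:{ssn};")
    return "".join(entries)


sample = (
    "Here is one SSN: 407-54-8831\n"
    "And here is another 407-54-8832 in the middle of this sentence."
)
print(regex_metadata("SSN", sample))
# prints Regex-SSN:407-54-8831;Regex-SSN:407-54-8832;
```

Note that the clue requires a delimiter character before and after the number, so an SSN at the very start or very end of the text would not be matched by this expression.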

Required Term Clues

The Required Term clue type can be used to require another class to be classified as a pre-requisite for this class. This is most often used when the children of a class require the parent to also be classified.

The valid entries for this type of clue are:

  • Parent
  • Grandparent
  • Any specific term in any taxonomy

A tree view control makes selecting the required class easy:
Required Term
For example, suppose that we have a topic “Pensions” with two children:

  • Pensions
      • USA
      • Canada

The purpose of the two child classes is to identify documents that are about pensions in the USA or about pensions in Canada. Rather than adding clues that identify pensions documents to each child, you can simply require documents to be about Pensions by using a Required Term clue.

Term Boost Clues

The Term Boost clue type can be used to specify that a Class Score is to be boosted from another term. This is most often used when a complex class is implemented using several child (or even grandchild) classes.

A tree view control makes selection of boosting classes easy.
Term Boost
The score may be entered as a number (if a fixed boost is required regardless of the source term’s score) or as a percentage (if the boost score is to be calculated as a percentage of the source term’s score).
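
The difference between a fixed boost and a percentage boost can be illustrated with a small Python sketch. The function name and the use of a trailing “%” to mark a percentage entry are assumptions made for the illustration, not a description of the product's input format.

```python
def term_boost(boost_entry: str, source_term_score: float) -> float:
    """Return the boost for an entry such as '20' (fixed) or '50%' (percentage)."""
    entry = boost_entry.strip()
    if entry.endswith("%"):
        # Percentage boost: scales with the score achieved by the source term.
        return source_term_score * float(entry[:-1]) / 100.0
    # Fixed boost: applied regardless of the source term's score.
    return float(entry)
```

With a source term score of 80, an entry of “20” gives a boost of 20, while an entry of “50%” gives a boost of 40.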

Language Clues

The language clue type can be used to require documents to be written primarily in a specified language as a filter on classification.

For example, if you create a new class and want documents to be classified only if they are written in a Scandinavian language then you would create a Language clue, like this:
Language Clue

Static Clues

The static clue applies a score to the class without any pre-conditions. This can be useful when creating NOT functionality.

For example:
If you want to classify any document where a word does NOT exist (such as “Pensions”), you could first add a static clue with a score of 50, and then add a standard clue looking for “Pensions” with a negative score (-50).

Hierarchical Clues

Hierarchical clues support a parent-child clue hierarchy: if the child clues achieve the parent clue threshold then the hierarchical score will be applied.

This can be useful when you only want to apply a score if two or more conditions match, or perhaps to apply a small static score only if a word appears X times within a document.

Scoring

Higher scores indicate a stronger association with the topic.

Example: Global Warming with a score of 50 will cause a document with this concept to be matched.

Example: Pollution with a score of 20 (on its own) will not be sufficient to classify the document as being about global warming.

Clues can also be assigned a negative value, which will prevent incorrect associations.

Example: Noise pollution should not be associated with Global Warming. So Noise pollution would be added with a negative value.

Scores are interpreted relative to the Threshold. For example, if the Threshold is 50 then:

  • 50 = guarantees that this term alone will be sufficient to classify the document
  • 25 = this term will get half way to the target
  • 10 = this term is of low importance but its presence should boost a document score
  • 0 = zero weight – use to disable a clue
  • -10 = this term is a small negative indicator
  • -50 = this term is a strong negative indicator
  • -1000 = the presence of this term should force the document to not be classified
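
The relationship between clue scores and the Threshold can be sketched in Python as follows. This is a deliberate simplification: it assumes the scores of the matched clues are simply added together and compared with the Threshold, whereas the product's actual calculation (visible under “Calculations”) also involves boosts and filters. The function name is hypothetical.

```python
def is_classified(matched_clue_scores, threshold=50):
    """Sum the scores of the clues that matched and compare with the threshold."""
    return sum(matched_clue_scores) >= threshold


# "Global Warming" (50) alone reaches the threshold.
print(is_classified([50]))          # True
# "Pollution" (20) alone does not.
print(is_classified([20]))          # False
# A -1000 clue overrides any realistic positive total.
print(is_classified([50, -1000]))   # False
# NOT example: static clue (50) with "Pensions" absent, then present (-50).
print(is_classified([50]))          # True
print(is_classified([50, -50]))     # False
```

The last two lines mirror the Static Clues example: the static score classifies every document unless the negative “Pensions” clue cancels it out.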

When viewing documents in the “Search”, “Browse” and “Working Set” tabs, the following facilities are available: “Show document movements”, “Classification” and “Calculations”.

Document Movements

To view how recent changes to the term will affect the document classifications, select ‘Show document movements’ and the “movement” of the document since the last classification will be shown. Possible scenarios are:
Movements Key
Document Movements

Classifications

To see the current classifications for a selected document click the “classification” link:
Classifications
Classifications are clickable – clicking the link will select the relevant term in the taxonomy tree view.

Calculations

To see how the classification scores are calculated click the “calculations” link:
Calculations
This will show the classification calculation using the latest clues definition.

There are three sections:

  • Clues – Shows how each clue contributed to the total score.
  • Boosts – Shows what boosts were added to the clue scores when related terms were processed (e.g. Parent and Child terms).
  • Filters – Shows any conditions that were not satisfied, such as a Mandatory Clue or a Required Class. If any filters are listed then the document will not be classified against this term regardless of the score achieved.

Synonyms

The “Synonyms” link beside each clue can be used to enter synonym definitions.

In general, the use of this facility is not recommended and the preferred approach is to enter each synonym as a separate clue. Entering each synonym as a separate clue will generally result in more accurate scoring and therefore better classification results.

The Synonyms link is only available with SQL taxonomies and can be disabled/enabled in the Core Configuration.

Languages

Each clue can be restricted to documents written in a subset of the available languages. This is useful if a word in one language also appears in another language but has a different meaning.

In this case you can click the “Languages” link beside each clue and select any subset of the available languages:
Clue Languages

Bulk Edit

The Bulk Edit link can be used to make changes to several clues at one time:
Bulk Edit
When this link is used the form changes into a grid editor and many values can be changed and saved in a single operation.

It is also possible to preview the changes made whilst in the bulk editor. The “Preview” functionality provides an indication of the number of documents affected, and the resultant score change:
Bulk Edit Preview

Bulk Import

Clues can also be imported in bulk from an Excel Spreadsheet (or input in bulk manually).
The spreadsheet should contain 3 columns: Type (Standard, Case-Sensitive, Wildcard Phrasematch or Metadata), Clue Text and Score:
Clue Import
The “Bulk Insert” link is available on the “Clues” tab below the main entry grid.

Suggestions

Clues can be used to statistically produce a list of suggested clues that can be assigned to the term.
Bulk Edit
Clues can be suggested for a term via the following methods:

Suggest Clues for whole term: Click on the ‘Suggest Clues for class’ link under the class heading to produce a list of suggestions, based on all existing clues in the class.

Single Clue: Click on the ‘Suggest’ link against each clue to produce a list of suggestions, based on only this clue.

Class Document: Click on the ‘Suggest’ link against each class document to produce a list of suggestions, based on the document (See Class Documents).

Once the list of suggested clues has been generated they can be selected and added to the term clues:

Please note: Changes made to a class will have no effect unless documents are re-classified using conceptClassifier.

The clue type can be set to one of the following:

  • Standard
  • Case-Sensitive
  • Phonetic
  • Create Tree Node

If “Create Tree Node” is selected then these topics shall be added as children of the currently selected node in the taxonomy structure.

Search

It is also possible to search for documents based on the class clues.

This can be done by clicking on the name of any single clue in the clue management screen (or even any suggested clue) and it will be used as the basis of a search against the current corpus. The type of the clue is also taken into account when a search is carried out (See Clues).
Search Tab
This can be helpful when evaluating the usefulness of a clue by quickly examining its usage within the corpus.

The following facilities are available: Show document movements, Classifications and Calculations.

Browse

To view the documents classified for each term click on the ‘Browse’ tab, which will display a list of documents achieving the minimum score set for classification in the term (See Clues).
Search Tab
The list may be filtered by entering a starting URL and only documents with a URL that starts with this value will be returned. The URL filter, if used, must end on a folder boundary.

This list only displays the current classification status of each document and any changes made to the class, since the last classification, are not taken into account.

The following facilities are available: Show document movements, Classifications and Calculations.

If a new class is selected in the treeview menu the view will remain in ‘Browse’ mode and will show the documents for the selected class.

Export Search Results

Search/Browse Results can be exported quickly and easily by selecting the “Export to CSV” option below the search results:
Search Tab Export
If there are fewer than 1000 results, or you wish to have access to the results immediately, you can select the “Quick Export” option. Otherwise the export results will be created in the background and made available later via the Queued Reports area. A notification will be sent to the selected email group upon the completion of report processing.

Working Set

A Working Set of documents can be defined and used to test the accuracy of classification rules against a controlled set of documents.

The Working Set mode can be selected in the Core Settings.

If Class Level is selected then a different Working Set can be defined for every class.
If Taxonomy Level is selected then the same Working Set will be used for all classes.

Documents can be added to the Working Set from the Search or Browse tabs by using the “Add to Working Set” links:
Search Tab
The following facilities are available: Show document movements, Classifications and Calculations.

Related

The ‘Related’ tab allows you to view and modify the non-hierarchical relationships between preferred terms. This tab will only appear if the taxonomy is in SQL, as the SharePoint Term Store does not support this functionality.

Graph

The ‘Graph’ tab shows a graphical representation of classification intersection points.
Graph
In the example above, 12 documents are tagged with “Environment”, and 12 of these documents are also tagged with “Communications”. It is also possible to see that there are 19 documents that are tagged with both “Communications” and “Equipment” (highlighted by the green links).

Info

The ‘Info’ tab displays the term description (aka Scope Notes) for each preferred term.

The Description field is often populated automatically from the Scope Notes when an external taxonomy is imported.

Logs

All changes made to a term are recorded. The change history may be viewed from the Logs Tab:
Logs

User Edits

When auto-classifications are amended in SharePoint, the user edits are recorded in the conceptSearching database. These edits can later be examined to identify terms that require review:
User Edits

User Suggestions

An optional interface can be enabled to allow users to suggest new terms for the termset hierarchy (http://conceptsearchingserver/conceptQS/Taxonomies/TermSuggest.aspx).
Suggestions can trigger automatic notifications to taxonomy administrators, as well as being recorded in the conceptSearching database for later review on the “User Suggestions” tab:
User Suggestions

Settings

Taxonomy/TermSet Level

When the root node (the termset) is selected in the treeview, the “Settings” tab will display top level taxonomy settings, as well as global settings applicable to the Taxonomies area.
Taxonomy Settings
Content Filters:
This field allows the taxonomy to be restricted based on a booleanfilter (e.g. using the “CSE-FOLDERS” field) or any of the 8 documentidfilters. See the “conceptClassifier – Design Guide” for more information about the ContentFilter field in the Taxonomies table.

Max Categories:
Sets the maximum number of classes from this taxonomy that will be allocated to each document. To set the Max Categories value across all taxonomies use the Settings tab in Index Manager.

Create Default Clues:
This setting controls the creation of default clues. If enabled, a default clue based on the title of the class is added to all Classes.

Default Clue Score:
Sets the default score value for new clues.

Count Mode:
Sets the display mode for counts in the treeview.

Show Empty Nodes:
Sets the display mode for empty nodes in the treeview.

Synchronise Termset:
Enables/Disables automatic synchronisation through conceptTermStoreManager for the whole termset.

Class/Term Level

When a child node is selected in the treeview the “Settings” tab will display settings for the selected term:
Term Settings
Available for Tagging:
The “Available for Tagging” field can be used to prevent any documents getting classified against a class. This would normally only be set to “No” when a class is being used to boost another class – see Term Boosts for information on terms that use the “Term Boost” type clues.

Synchronise Term:
Enables/Disables automatic synchronisation through conceptTermStoreManager for the term and its children.

Relevance Threshold:
The threshold for each Class defaults to 50 – but can be raised (to reduce the number of documents that get classified) or lowered (to increase the number of documents that get classified).

Boosts:
The Weighting Boosts can also be adjusted for each Class. Based on the values above you would expect a 10% score boost if one of its child terms was classified.

It is possible to set the “Child” boost to 100%. Doing so will, in effect, ensure that the parent is always tagged if the child is tagged. An example of this would be a taxonomy containing regions: if a document was tagged as “England” it should also be tagged as “Europe”.

Help

The ‘Help’ tab displays a list of clue type information, and also allows you to run the product tour specific to the Taxonomies area.

Multi-User Environments

When several users are maintaining the taxonomy structure simultaneously there is a need to prevent concurrent access to individual classes so that one user’s work is not overwritten by another user working in the same area of the taxonomy.

In order to allow multiple users to work simultaneously we provide a locking facility that allows each user to reserve one or more classes for private editing. When they have finished a batch of work they can unlock the classes to release them.

In order to enable this facility the administrator should “Enable User Locking” under the Taxonomy Settings.

The administrator should also ensure that Anonymous Access is disabled for the conceptQS web application in IIS so that individual Windows identities are available within Taxonomy Manager for locking purposes.

When this facility has been enabled then you will see a Lock Class button in the Clues tab for all Classes:
Lock Term
A checkbox allows users to Lock a selected class and optionally all of its children in a single operation.

If you click the “Lock Class” button then:

  • The button text changes to “Unlock Class”
  • The folder icon in the treeview changes to red

Lock Term

Note that other users see:

  • A grey folder icon, indicating that the class has been reserved by someone else
  • Text above the button indicates who has reserved the class

Other users are unable to alter or unlock a class that has been locked by another user. However, super-users are able to unlock a class that has been locked by someone else.

Workflows

Introduction

The Workflows administration area provides a web based console for creating and managing actions that follow a classification decision.

The area can be accessed at the following URL: http://conceptsearchingserver/conceptQS/Workflows

Usage can be restricted to selected users based on either their Windows identity or using non-Windows based access controls, as required. See the Users area for more information on how to restrict access.
Workflows

Enabling “Workflows”

The Workflows area is disabled by default. To enable, please set the “Workflow Manager” setting to “Enabled” in the QS section of the Core Configuration.

What are Workflows?

Workflows can perform an action on a document following a classification decision.

Taxonomy Workflow is not a general purpose workflow engine and therefore does not compete with Microsoft’s Workflow Foundation, Nintex or K2.

A number of pre-configured actions are available, such as:

  • Send an email message to an administrator
  • Move a document from one location to another
  • Copy a document from one location to another
  • Change the document’s access rights in SharePoint (2010+)
  • Change the document’s Content Type in SharePoint (2010+)
  • Generate additional metadata in SharePoint (2010+)

In the case of moving/copying documents, there are two methods employed:

  • Movement within a SharePoint site collection can be achieved with a SharePoint dynamic action provider (see SharePoint Actions)
  • Moving a document between sources can be achieved with a Migration Action Provider (see SharePoint Migration)

Configuring a Workflow

Add/Edit a Workflow

Workflows are configured from the default administration screen.
The display is grouped by source type and source (where the source has been specified). Expanding a group allows you to edit or delete the workflows within that group.

Workflows can also be renamed (purely for display purposes), and temporarily paused/resumed.

To add a workflow select the button “Add Workflow”:
Workflows
First, select the type of documents the workflow should run against:

  • All types – Rules will operate on all document types
  • FILE – Rules will operate on FILE document types only
  • HTTP – Rules will operate on HTTP document types only
  • SharePoint – Rules will operate on SharePoint document types only
  • conceptSQL – Rules will operate on SQL document types only

Migration Actions are only available if the “Workflow Source Type” is set to “FILE”, “conceptSQL” or “SharePoint”.

SharePoint Actions are only available if the “Workflow Source Type” is set to “SharePoint”.

If “FILE”, “HTTP” or “SharePoint” is selected as the source then the individual sources and source groups of that type are listed. Selecting an individual source or source group will restrict the workflow to processing content from that source only, and will also allow you to use the actions configured for that specific source. If the workflow source is not selected then the actions available will be restricted to the global actions available for the source type.
Workflow Type
Give the new workflow a suitable descriptive name and click “Add Workflow”:
Workflow Name
You will then be redirected to the workflow configuration screen to specify the taxonomy conditions, and workflow actions.

Taxonomy Conditions

Taxonomy Conditions are used to define the classification decisions that this workflow will act upon.

Click “Add new Taxonomy Condition” to add a condition:
Edit Workflow
There are three different types of condition:

Any Taxonomy
The condition will match if any classifications are found.

Any Document
The condition will match all documents.

Term Conditions
Choose a single entry from the taxonomy if this rule should apply to a single topic. Select “All Nodes” if this rule should run against all topics for this taxonomy.

Multiple conditions may be added, if required. If multiple conditions are specified then the workflow will only run if all conditions are met.
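
The way multiple conditions combine can be sketched in Python. The Condition class and function names below are hypothetical, a document's classifications are modelled simply as a set of term names, and the “All Nodes” option is left out for brevity.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass(frozen=True)
class Condition:
    kind: str                   # "any_taxonomy", "any_document" or "term"
    term: Optional[str] = None  # only used when kind == "term"


def condition_met(condition: Condition, classifications: Set[str]) -> bool:
    if condition.kind == "any_document":
        return True  # matches all documents
    if condition.kind == "any_taxonomy":
        return len(classifications) > 0  # matches if any classification exists
    return condition.term in classifications  # single-term condition


def workflow_should_run(conditions, classifications) -> bool:
    # Every configured condition has to hold for the workflow to run.
    return all(condition_met(c, classifications) for c in conditions)
```

For instance, a workflow with an “Any Taxonomy” condition and a term condition for “Pensions” would run for a document classified as “Pensions”, but not for a document classified only as “Equipment”.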

Actions

Actions define what will happen when one or more Taxonomy Conditions are met.

Here are some examples of action types:

  • Migration
  • Classification
  • Email
  • SharePoint
  • Plugin

Each type of action provider is documented in the following sub-sections:

Migration

The Migration actions can be used to copy or move a document between sources.

There are two types of migration:

  • Migrate to File System
  • Migrate to SharePoint

Migration actions are currently supported against the following source types:

  • File Systems
  • conceptSQL
  • SharePoint (2010+)

Migrate to File System

This allows the document to be moved or copied to a selected file system:
File System Migrate
The list of Migration Destinations is populated using the Migration Config settings – see File System Migration Configurations for details.

If “Maintain Folder Structure” is selected then the folder structure of the source document will be preserved, with additional folders being created as required.

Select “Delete Original Item” if a Move is required, rather than a Copy operation.

Select “Mark Original Item as Read Only” if the original document should be set to “read-only” on the file system. This option will only appear for FILE sources.

When files are migrated to a File System destination then the following fields are preserved from the source document:

  • Created
  • Modified

Migrate to SharePoint

This allows the document to be moved or copied to a selected SharePoint server:
Migrate SharePoint
The list of Migration Destinations is populated using the Migration Config settings – see SharePoint Migration Configurations for details.

After selecting a Migration Destination you can then select the required location in SharePoint using the SharePoint Destination dropdown.

The “Mode” allows you to define how complete the migration should be. The default option is to move only the binary file; the other options also copy the metadata and content type detail (as well as versioning detail where appropriate).

The “Dynamic Destination Field Name” provides functionality to move a document back to a dynamic destination (not the library specified). If specified, the migration logic will look for the field name in the metadata of the document being processed; if the field is available it will use the value as the destination for the file. For this functionality to operate there are two requirements:

  • The dynamic destination must be within a crawled source
  • The workflow migration credentials must have permission to access the dynamic destination source/site collection.

In the event that the dynamic destination cannot be found the migration will fall back to the default (the selected library).
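
The fallback behaviour can be summarised with a short Python sketch. All names here are hypothetical, and the two requirements listed above are reduced to a simple membership check against a set of destinations that are both crawled and accessible to the migration credentials.

```python
def resolve_destination(metadata: dict, dynamic_field_name: str,
                        default_library: str, reachable_destinations: set) -> str:
    """Pick the dynamic destination if it is usable, else the selected library."""
    if dynamic_field_name:
        # Look for the configured field in the document's metadata.
        candidate = metadata.get(dynamic_field_name)
        if candidate and candidate in reachable_destinations:
            return candidate
    # Field not configured, not present, or destination not found: fall back.
    return default_library
```

A document whose metadata lacks the configured field, or whose field value points outside the reachable destinations, is therefore migrated to the library selected in the action.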

If “Maintain Folder Structure” is selected then the folder structure of the source document will be preserved, with additional folders being created as required.

Select “Delete Original Item” if a Move is required, rather than a Copy operation.

Select “Mark Original Item as Read Only” if the original document should be set to “read-only” on the file system. This option will only appear for FILE sources.

When files are migrated to a SharePoint destination then the following fields are automatically preserved from the source document:

  • Created
  • Modified
  • Created By (if known)
  • Modified By (if known)

Migration Actions

If the migration is to a SharePoint destination then further actions can be applied to the document after it has been migrated. These actions are called Migration Actions.

Initially there will be no migration actions configured:
No Migration Actions
However, by clicking on the link that says “No Migration Actions” it is possible to define actions that will be applied to the document after it is moved:
Add Migration Action
For SharePoint actions the behaviour is exactly the same as it would be if it had been configured as a normal action, except that the action will be applied to the document after it has been migrated.
This is especially useful if the migrated document has had its content type changed, since the migrated document may then have different SharePoint properties available compared to the original document.

Classification

Manually Classify

Add the specified node/nodes to the classification metadata for a document. The nodes selected must be from a single taxonomy/termset. Note: The additional classification will not trigger other workflows or be passed back to the source item (i.e. conceptClassifier for SharePoint fields), as the workflow actions are executed at the final stage of the document processing.

Remove Classifications

Permanently removes all existing classifications on a document and disables future auto-classification for it.

Email

The only action available when Email is selected is “Email Alert”:
Email Action
The Email message can be sent to any named address.

The title and body can include templates that select data from the conceptSearching Pages table for the current document. Simply specify the required field in this format:
[cs:fieldname]
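
The template substitution can be pictured with the following Python sketch. The function name is hypothetical, the Pages table row is modelled as a plain dictionary, and the field names in the usage example are illustrative only. Leaving unknown fields untouched is an assumption, since the product's behaviour in that case is not described here.

```python
import re

# Matches placeholders of the form [cs:fieldname].
PLACEHOLDER = re.compile(r"\[cs:([^\]]+)\]")


def render_template(template: str, page_row: dict) -> str:
    """Replace each [cs:fieldname] with the matching value from the Pages row."""
    def substitute(match):
        field = match.group(1)
        # Unknown fields are left as-is (assumed behaviour).
        return str(page_row.get(field, match.group(0)))
    return PLACEHOLDER.sub(substitute, template)
```

For example, rendering “New document: [cs:title]” against a row whose hypothetical title field is “Pensions” produces “New document: Pensions”.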

If the Workflow Source Type is “SharePoint” then the email can also be sent to the:

  • Created By User, or
  • Modified By User

If one of these options is selected then the relevant user’s email address will be retrieved automatically from Active Directory.

In all cases, a valid SMTP server must be configured – see Workflow Action Configurations for details.

SharePoint

If the Workflow Source Type is “SharePoint” then the following SharePoint Static Actions will always be presented:

  • Filtered Targeted Meta Update
  • Send crawled value
  • Send classification value(s)
  • Send fixed value

If the Workflow Source Type is “SharePoint” and a single Site Collection is selected then all SharePoint Dynamic Actions configured on this site collection will be presented, in addition to the static actions.
For more information on the dynamic actions, please refer to the appropriate conceptClassifier guide (Farm or App).

Plugin

If plugin actions have been configured then these will also appear in the list of available actions. See Workflow Plugins for more information.

Configuring Multiple Actions

Multiple actions may be added, if required. If multiple rules have been configured then they will be processed in the order listed on screen.

Use the red down arrow or green up-arrow to change the processing sequence as required:
Re-Order Actions
Check the “Processing stops if this rule is run” checkbox if you would like the workflow to stop once this rule has run. If this checkbox is not checked then further actions may be applied if additional conditions are met.
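
The ordering and stop behaviour can be sketched in Python as follows. The dictionary keys and function name are hypothetical, and each action is assumed to carry its own condition check; the product's internal processing may differ.

```python
def run_actions(actions, document):
    """Run actions in listed order; stop after one flagged to stop processing."""
    executed = []
    for action in actions:
        if not action["condition"](document):
            continue  # conditions not met, move on to the next action
        action["run"](document)
        executed.append(action["name"])
        if action.get("stop_if_run"):
            break  # "Processing stops if this rule is run" is checked
    return executed
```

If the first action in the list has the checkbox set and its conditions are met, no later action runs; re-ordering the list with the arrows therefore changes which actions can be reached.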

External Action Configuration

The external action configuration is completed under the “Configs” sub heading:
Email Config
On first load the only option available will be the “Email Config”; however, if plugins are installed there may be more options.

Migration Destination Configuration

The external migration configuration is completed under the “Configs” sub heading:
Email Config
On first load the only options available will be “File System” and “SharePoint”; however, if plugins are installed there may be more options.

File System

An entry should be made for each possible migration destination that resides on a file system:
File System Migration Config

SharePoint

An entry should be made for each possible migration destination that resides on a SharePoint site collection:
SharePoint Migration Config
We recommend specifying an account that is a site collection administrator.

SharePoint Content Type Hubs

SharePoint 2010+ supports Enterprise Content Types allowing Content Types to be defined on a Publishing SharePoint site with one or more secondary sites consuming the Enterprise Content Types.

Once conceptClassifier for SharePoint is installed on the SharePoint Farm it is possible to define SharePoint workflow actions on the SharePoint Content Type Hub site. Any actions of type “Content Type Update” may be run on the site collection itself; however, they may also be run on consuming SharePoint site collections.

This form is used to register the SharePoint Hub Sites that are available in the SharePoint Farm. Workflows set up on sites that consume Content Types from the configured SharePoint Hub Site will also be able to trigger the “Content Type Update” functions of the hub site.

To configure such a workflow first select “Add a workflow” then choose the Workflow Source Type as “SharePoint” and the Workflow Source as “Content Type Hub”.

Workflow Run Log

All workflows that result in action being taken are logged. Click the “Logs” menu item to view the audit trail:
Workflow Logs
You can click the “Details” link to see the detailed information related to the workflow action as well as sort and filter the log.

Workflow Plugins

A range of Workflow actions are provided with the product, but the product can also be extended by writing additional actions using the plugin interfaces.

Plugins are implemented as DLLs and are placed in the plugins folder, which is typically located here:

C:\Program Files\ConceptSearching\Plugins\

The following sample plugins are provided with the product (complete with code):

  • FTP Migration action
  • Http Save Files action
  • Twitter action
  • SQL Lookup

Click the “Detect New Plugins” button to search the plugins folder for new plugins.

Click the “Enable” link to enable selected plugins.
Workflow Plugins

Configuration

Introduction

The Config administration area provides a web based console for altering global system configuration settings.

The area can be accessed at the following URL: http://conceptsearchingserver/conceptQS/Configuration.

The default screen shows the most commonly amended settings. Each setting is described in the Core Configuration section.

Config

Core Configuration

Collector

  • Collector User Agent – is used by the Collector service to identify itself to crawled sites.
  • Max Doc Size – can be set to any value up to a maximum of 500MB. Files larger than this setting are not indexed.
    If “Collect metadata of excluded item(s)” is checked then oversize documents will be processed using their metadata.
  • Collector Threads – The number of background collection threads (overall). It is recommended that this setting is not changed without advice from support staff.
  • Collector Domain Threads – The number of background collection threads (per domain). It is recommended that this setting is not changed without advice from support staff.
  • Collector File Threads – The number of threads used to crawl file system content. It is recommended that this setting is not changed without advice from support staff.
  • Collector Reader Process Pool Size – The number of external processes utilised for iFilter conversion. It is recommended that this setting is not changed without advice from support staff.
  • Enable OCR for file types – When enabled, supported (and configured) image types will be processed through a built-in OCR engine. The extracted text will be available for search and classification.
  • Process Document Images – Enables the extraction of images from supported types (DOCX/PPTX/XLSX/PDF). The images are subsequently pushed through the built-in OCR engine (subject to configuration).
  • Document Set Mode – Defines how SharePoint Document Sets will be treated:
    • Process as Folder – Classifications will only be written to the child items.
    • Process Set and Children – Classifications will be written to both the root item, and the individual children.
    • Process Set as Document – Classifications will only be written to the root item.
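
The three Document Set Mode options can be summarised as a small decision function. The following is a minimal sketch for illustration only: the function name and the way items are represented are assumptions, not part of the product.

```python
def classification_targets(mode, root, children):
    """Return the items that would receive classifications for a Document Set.

    `mode` is one of the three Document Set Mode labels from the list above.
    `root` and `children` are hypothetical item identifiers.
    """
    if mode == "Process as Folder":
        return list(children)            # child items only
    if mode == "Process Set and Children":
        return [root, *children]         # root item and every child
    if mode == "Process Set as Document":
        return [root]                    # root item only
    raise ValueError(f"Unknown Document Set Mode: {mode}")


targets = classification_targets(
    "Process Set and Children", "Contract Set", ["terms.docx", "quote.pdf"]
)
```

Here `targets` contains the root item followed by both children.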

Indexer

  • DocumentID Mappings – Used to map fields into the search index ID references, allowing custom queries based on external IDs.
  • Indexer Threads – The number of background threads to be used for the Indexer’s processing. It is recommended that this setting is not changed without advice from support staff.

Classifier

  • Classifier Mode – Enables/Disables the classification engine.
  • Max Categories – The maximum number of classifications to be allocated to each document. If a document matches so many categories that this value would be exceeded then conceptClassifier will select the required number of categories based on those that have achieved the highest match score. The maximum value for this setting is 256 on 32-bit Windows and 1024 on 64-bit Windows. Higher values use more RAM and so the default value should be used unless it is essential that more categories are allocated to each document.
  • Retain existing metadata mode – When enabled, the classification engine will leave existing classifications in place in SharePoint if no auto-classifications have been generated, i.e. it will not clear managed metadata fields.
  • SharePoint EMM No Classify Mode – When enabled, if a user updates the “No Classify” setting in the Taxonomies area it will also update the “Deprecated” flag in SharePoint.
  • Classifier Threads – The number of background threads used for classification. It is recommended that this setting is not changed without advice from support staff.
  • Classifier Write Threads – The number of background threads used for writing classifications back to source systems. It is recommended that this setting is not changed without advice from support staff.
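
The Max Categories behaviour described above (keep the highest scoring matches when a document exceeds the limit) can be sketched as follows. This is an illustrative sketch only: the function name, the category names and the scores are hypothetical, and the product's actual tie-breaking rules are not documented here.

```python
import heapq


def select_categories(scores, max_categories):
    """Keep at most `max_categories` entries, preferring the highest match scores."""
    if len(scores) <= max_categories:
        return dict(scores)
    top = heapq.nlargest(max_categories, scores.items(), key=lambda item: item[1])
    return dict(top)


# Hypothetical match scores for a single document.
matches = {"Finance": 82, "Legal": 47, "HR": 65, "Marketing": 12}
selected = select_categories(matches, 2)
```

With a limit of 2, only the “Finance” and “HR” categories are retained.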

Query Server

  • Workflow Mode – Set to “On” if you want to use Workflow Manager. Choose the “API Only” setting if you want workflows to be disabled unless fired via a call to the RunWorkflow API.
  • Create Default Clues – Used to control whether new taxonomy terms have a default clue generated automatically based on the term name.
  • Synonyms Enabled – Used to control whether synonyms are configurable for taxonomies residing in the conceptSearching database.
  • Working Set Mode – Configures the Working Set functionality under “Taxonomies”. The “Term Level” is used if a different working set is required for each class. If “Taxonomy Level” is selected then the same working set shall be used for all classes in the taxonomy.
  • Default Page Size – Configures the number of search results returned per page in the Taxonomies area.
  • Taxonomy Batch Size – Disabled by default, allows the taxonomy treeview control to load data in partial batches.
  • User term suggestions – Enables/Disables a custom form designed to allow end-users to make suggestions on new terms for administrators.
  • # Doc Metadata Fields Shown – Configures the number of metadata field names shown when selecting from a metadata field dropdown list.
  • Active Directory Group Lookups Enabled – Enables/Disables the use of AD groups in User Manager. It is recommended that this option is disabled unless AD group support is specifically required.

Logging

  • Event Log Mode – Defines which levels of logging should be persisted to the Windows Event Viewer.
  • Tracing – Used to assist with problem resolution and should be left blank unless working with the Concept Searching support team.

Utilities

The utilities subsection provides a number of diagnostic tools which can be used when attempting to resolve
system issues.

You can also:

  • Reset QS Cache – Force the QS caches to be reset.
  • Run Product Tour – Runs a product tour, taking you around the key areas of the product.

Backup/Restore

Only available for ‘Superusers’.

The Backup utility allows for the migration of complex conceptSearching instance configurations.

This allows a user to safely design and test a conceptSearching configuration within a development environment and then copy the configuration, or specific parts of the configuration, to a different environment (for example, production).

The tool supports text replacement to allow user-defined URLs to be replaced by the equivalent destination URL. The following configuration options are available for import/export:

  • Source Registrations
  • SharePoint Termset Registrations
  • Workflow Configurations
  • Core Configuration Options:
    • Files Excluded
    • Files Included
    • Mapped Metadata Fields
    • Mapped Metadata Values
    • Supported Languages
    • Pages Excluded
    • Pages Included
    • SharePoint Excluded
    • Text Patterns

Backup
To create a backup, select “Create Backup” and select the elements that you wish to include. The backup password will be required if you export a backup to XML and re-import it into a different environment.

Upon import any items that already exist will be skipped.

Cleaner

Only available for ‘Superusers’.

The Cleaner utility allows you to reprocess content or clean the environment on a large scale. This can be useful after a large amount of content has been deleted or after configuring a DQS environment:
Cleaner
Each action behaves as follows:

  • Rebuild – Running a rebuild will retain all existing collected content (text/metadata), but will truncate the search index. Once the operation has completed the services will begin re-processing all indexing/classification work.
  • Re-Collect – (Post DQS configuration) Running a re-collect will delete all leaf level items (documents/records). After completion the services will begin re-crawling all configured sources.
  • Delete – The delete operation removes all content from both the search index, and the SQL database.

DQS

Only available for ‘Superusers’.

The Distributed Query Server (DQS) is a component of conceptSearching that allows an index to be distributed across multiple servers.

A distributed index means that there are multiple servers running the Collector, Indexer and QueryServer, each with its own set of “.cse” files. All servers share a single SQL database.

When an existing environment is converted to a DQS configuration, all of the existing documents will be ‘allocated’ to the first server. To re-distribute the content you can use the “Cleaner” utility to re-collect the index.

The APIs used by the application are unchanged by a Distributed Query Server configuration.
The application simply communicates normally with any one of the servers running the conceptQS Web Service and this server will automatically communicate with the other servers to assemble the required results.

If you are considering improving your environment’s performance by creating a DQS configuration, we recommend contacting our Support team to assist in the process.

Metadata Configuration

Document Metadata Fields

This list specifies which internally generated fields are to be used:
Config

Metadata Field Mappings

This table allows additional metadata fields to be generated by mapping an already existing field name to a new name.
For example, if we create an entry with Source=Author and Target=Publisher then a document with this metadata:

“Author: John Challis;”

Will generate an index with this metadata:

“Author: John Challis; Publisher: John Challis;”

This facility can be useful when you need to align metadata field names across a variety of sources and/or document types.
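
The effect of a field mapping can be sketched in a few lines. This is an illustration of the Author/Publisher example above, not the product's implementation; the function name and the dictionary representation of document metadata are assumptions.

```python
def apply_field_mappings(metadata, field_mappings):
    """Copy each mapped source field into its target field, keeping the original."""
    result = dict(metadata)
    for source, target in field_mappings:
        if source in metadata:
            result[target] = metadata[source]
    return result


document = {"Author": "John Challis"}
indexed = apply_field_mappings(document, [("Author", "Publisher")])
```

After the mapping is applied, `indexed` holds both the original “Author” field and the generated “Publisher” field with the same value.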

Metadata Value Mappings

This list allows metadata values to be mapped from a source value to a new target value.
For example, if we create an entry for the field “Modified By”, with Source=“Cheryl Tweedy” and Target=“Cheryl Cole”, then a document with this metadata:

“Modified By: Cheryl Tweedy;”

Will generate an index with this metadata:

“Modified By: Cheryl Cole;”

This facility can be useful when you need to align metadata field values, for example when employees change their name or are replaced by different people.
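
A value mapping can be pictured as a lookup keyed on the field name and the source value. The sketch below reproduces the “Modified By” example above; the function name and data structures are assumptions made for illustration.

```python
def apply_value_mappings(metadata, value_mappings):
    """Replace a field's value when a (field, source value) mapping exists."""
    return {
        field: value_mappings.get((field, value), value)
        for field, value in metadata.items()
    }


mappings = {("Modified By", "Cheryl Tweedy"): "Cheryl Cole"}
indexed = apply_value_mappings({"Modified By": "Cheryl Tweedy"}, mappings)
```

Values in other fields, or values with no matching entry, pass through unchanged.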

Email Configuration

Email Servers

Email servers can be configured to enable external communication from the conceptSearching product, for instance when the health service identifies an issue.

Existing servers can be amended by selecting “Edit”, and new SMTP servers can be added by selecting “Add Email Server Configuration”.
Config

The SMTP details should be entered based on the values provided by your network team.
Each configuration supports both SSL enabled SMTP servers, and those without SSL enabled.

It is also possible to supply a test email address which will be used to test the configuration settings.
Config

Email Groups

Email groups are used to define a logical group of people to email, essentially – a mailing list.

Each email group is linked to an SMTP server, so, before configuring an email group, you must configure your Email Servers.

To add a new group select “Add Email Server Group”, or select “Edit” on each row to configure the group members.
Config

Each group can have one or more members, and can be assigned a friendly name, which will be displayed when selecting an email group:
Config

Health Service Notifications

Health Service Notifications can be configured to email a specific group of people when something goes wrong within the conceptSearching product.

Each notification configuration is linked to an email group, so, before configuring notifications, you must configure your Email Groups.

To add a new notification configuration select “Add Notification Configuration”, or select “Edit” on each row to change the configuration.
Config

Notifications can be set to trigger on warnings, or just on errors – by default problems of any level will be reported.

The “Daily Summary” can also be enabled or disabled. This functionality sends out a summary email of outstanding problems each morning.
Config

Text Handling

Best Bets

Sometimes an application may wish to push selected documents to the top of a hitlist for specific queries.
This may be implemented by specifying “Best Bets” for specific query text.
First, enter the search term that you wish to match and then click the Add button.

Next, click on the term, and specify one or more URLs that should appear at the top of the hit list.
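
Conceptually, a Best Bet promotes configured URLs ahead of the normal results for a matching query. The following sketch shows that idea only. The function name, the example URLs and the case-insensitive exact match on the query text are all assumptions; the product's actual matching rules are not described in this guide.

```python
def apply_best_bets(query, hits, best_bets):
    """Move any Best Bet URLs configured for this query to the front of the hit list."""
    promoted = best_bets.get(query.strip().lower(), [])
    remainder = [url for url in hits if url not in promoted]
    return list(promoted) + remainder


# Hypothetical configuration: search term -> URLs to promote.
best_bets = {"holiday policy": ["http://intranet/hr/holiday.docx"]}
hits = [
    "http://intranet/news/1",
    "http://intranet/hr/holiday.docx",
    "http://intranet/news/2",
]
ranked = apply_best_bets("Holiday Policy", hits, best_bets)
```

The promoted URL appears first, followed by the remaining hits in their original order.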

Content Type Extension Methods

Sometimes an organisation may wish to process certain file types as a different content type.

The primary use case for this is internal content types that map to a content type already understood by conceptSearching.

In the example shown, a .rpt file is treated as a text file: the file is copied to a temporary location as a .txt file and processed as if it were any other text file.
Config

Content Type Extraction Methods

The Content Type Extraction Methods describe how documents will be handled by the APIs and the core services. A number of built-in processing methods are available. Where there is no available method, processing will default to standard Microsoft Search iFilter processing.

The methods can be easily altered by clicking “Edit” and then selecting the preferred processing method.
Config

Language Detection

The language detection list specifies which languages will be considered for auto-detection.
If a language is excluded then it cannot be used to identify the language of a document and it will be removed from the language options in Taxonomy Manager.

Synonyms

Often it is important to submit a query and have synonyms automatically included. A generic set of synonyms may be configured by using the Synonyms form.
Config

Text Patterns

Many HTML web pages contain navigation information and other extraneous information that is the same for all pages and/or not relevant to the individual page content.
If all of the text is indexed from these HTML pages then this can lead to unwanted search results where a match is made, for example, to an entry in a standard page navigation area.

The Text Patterns feature is provided to assist with the cleanup of HTML documents. TextPatterns can also be used to index terms that would normally be discarded.
Config

The StartTag and EndTag values are case sensitive strings used to identify the content to be managed. The content is then managed based on the filter type.

There are three tag types that can be used to assist in the cleanup:

  • FILTER – Extracts a subset of the HTML page, prior to extracting the plain text. Only a single section will be extracted for each TextFilter processed.
  • DELETE – Deletes sections of the HTML page, prior to extracting the plain text.
  • INDEX TERM (EndTag ignored) – Creates index terms that would otherwise not be formed. For example, the term “E.ON” is useful for people interested in energy companies, but it would not normally be created because a full stop normally acts as a term separator. If we create an INDEX TERM for this pattern then it will be detected and indexed as required.
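
The FILTER and DELETE types can be illustrated with plain string operations on case sensitive start and end markers. This is a rough sketch under stated assumptions, not the product's implementation: the function names are hypothetical, the markers themselves are assumed to be dropped from the output, and a page with no matching markers is assumed to be left unchanged.

```python
def filter_section(html, start_tag, end_tag):
    """FILTER: keep only the first section found between start_tag and end_tag."""
    start = html.find(start_tag)
    if start == -1:
        return html
    end = html.find(end_tag, start + len(start_tag))
    if end == -1:
        return html
    return html[start + len(start_tag):end]


def delete_sections(html, start_tag, end_tag):
    """DELETE: remove every section from start_tag through end_tag."""
    while True:
        start = html.find(start_tag)
        if start == -1:
            return html
        end = html.find(end_tag, start + len(start_tag))
        if end == -1:
            return html
        html = html[:start] + html[end + len(end_tag):]


page = "<div id='nav'>Home | About</div><!--body-->Quarterly results<!--/body-->"
without_nav = delete_sections(page, "<div id='nav'>", "</div>")
body_only = filter_section(page, "<!--body-->", "<!--/body-->")
```

Because `str.find` is case sensitive, a StartTag of `<DIV id='nav'>` would not match the page above, which mirrors the case sensitivity noted for StartTag and EndTag.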

See Section 5.3.13 of the “conceptSearching – Application Design Guide” for more detailed information on how to use Text Patterns.

System Configuration

AD Domains Excluded

The AD Domains Excluded list is used to disable Active Directory expansion for certain domain names.
This is useful in a multi-Domain forest, where the conceptSearching server does not have access to all
domains within the forest.
Config

Attachments Excluded

When indexing items that potentially contain attachments (SharePoint List Items), the list of file locations that will be ignored is defined by the “Attachments Excluded” list.
The definitions in this list may be viewed and modified via the Attachments Excluded form:
Attachments Excluded
Any file with a path that matches one of these patterns will be ignored.

Wildcards may be used anywhere in the pattern definition, with:

  • The asterisk character (*) matching any sequence of characters
  • The question mark character (?) matching any single character
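
The two wildcard characters can be sketched by translating a pattern into a regular expression. This is an illustration only: the function names and example paths are hypothetical, and matching against the whole path with case sensitivity is an assumption rather than documented product behaviour.

```python
import re


def wildcard_to_regex(pattern):
    """Translate a pattern using * and ? into a compiled regular expression."""
    parts = []
    for char in pattern:
        if char == "*":
            parts.append(".*")             # any sequence of characters
        elif char == "?":
            parts.append(".")              # exactly one character
        else:
            parts.append(re.escape(char))  # everything else is literal
    return re.compile("".join(parts), re.DOTALL)


def is_excluded(path, patterns):
    """True if the whole path matches at least one exclusion pattern."""
    return any(wildcard_to_regex(p).fullmatch(path) for p in patterns)


patterns = ["*/Attachments/*.tmp", "*/draft-?.docx"]
first = is_excluded("http://sp/sites/hr/Lists/Tasks/Attachments/7/notes.tmp", patterns)
second = is_excluded("http://sp/docs/draft-10.docx", patterns)
```

In this sketch `first` is true, while `second` is false because the question mark matches a single character and “10” is two characters.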

No Index

Sometimes an application may wish to remove selected documents from all search results. This may be implemented by specifying “No Index” entries.
Any number of URLs (or Filenames) may be entered and none of these will ever appear in search results.

Wildcards may be used anywhere in the pattern definition, with:

  • The asterisk character (*) matching any sequence of characters
  • The question mark character (?) matching any single character

Proxy Server

The Proxy Server form may be used to define a proxy server to be used when crawling websites. The proxy server is not used for SharePoint crawling.
Set Bypass Local to “Yes” to bypass the proxy server for local addresses (localhost etc).

Any other exclusions that should not go through the proxy server should be defined in the “Exceptions” list.

Suspend Services (Scheduler)

The conceptCollector, conceptIndexer and conceptClassifier run as Windows services.
These services are responsible for building the search index and classifying documents against the registered taxonomies.

It can be useful to suspend these services from running so that they do not impact query performance during the peak hours of the working day.
Sometimes it may be useful to suspend these services for some lower priority sources but have them continue to process higher priority sources.

Config

Service suspensions can be configured in the following ways:

  • Source – Which source types the suspension is in place for: all source types, specific source types (SharePoint, Web etc) or specifically against Re-Indexing operations.
  • Service – Which services are affected by the suspension: All Services, or, a choice of: Collector, Indexer, Classifier.
  • Day/Times – Allows the configuration of which days and times the suspension will be in place.
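
The three dimensions of a suspension (source, service, day/time) can be combined into a single check. The sketch below is illustrative only: the `Suspension` class, its field names and the label values are assumptions, and it does not handle time windows that span midnight or the Re-Indexing option.

```python
from dataclasses import dataclass
from datetime import datetime, time


@dataclass
class Suspension:
    source: str       # "All" or a source type such as "SharePoint" or "Web"
    service: str      # "All Services", "Collector", "Indexer" or "Classifier"
    days: frozenset   # weekday numbers, Monday = 0
    start: time
    end: time


def is_suspended(rule, source, service, now):
    """True if `rule` suspends `service` for `source` at the moment `now`."""
    if rule.source not in ("All", source):
        return False
    if rule.service not in ("All Services", service):
        return False
    if now.weekday() not in rule.days:
        return False
    return rule.start <= now.time() < rule.end


# Suspend the Collector for Web sources on weekdays between 09:00 and 17:00.
rule = Suspension("Web", "Collector", frozenset({0, 1, 2, 3, 4}), time(9), time(17))
monday_morning = datetime(2017, 2, 6, 10, 30)
web_paused = is_suspended(rule, "Web", "Collector", monday_morning)
sharepoint_paused = is_suspended(rule, "SharePoint", "Collector", monday_morning)
```

With this rule, Web collection is paused on Monday morning while SharePoint collection continues, which corresponds to suspending lower priority sources while higher priority sources keep processing.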

Security

Introduction

The Users administration area provides a web based console for creating and managing users
who are authorised to use the conceptSearching administrative functions. It also provides a central mechanism to manage passwords
used by conceptSearching to crawl content, as well as the ability to restrict access to the conceptSearching APIs.

The area can be accessed at the following URL: http://conceptsearchingserver/conceptQS/Users.

By default no users are defined and usage of the administrative functions built into the QS is unrestricted.

You must add at least one user in order to restrict the access to the QS administrative functions.

The QS supports the following types of authentication mechanisms: Windows, ADFS and Forms.

Users

User Management

Authentication Mechanisms

On first install the QS will be configured for Windows authentication. To set up the QS to use an ADFS server, please follow the “ADFS” section of the “conceptSearching Installation and Configuration” guide. To use forms based authentication, please disable all authentication methods in IIS other than Anonymous and Forms:
IIS

Super Users

Super Users always have access to all Query Server administrative functions.

Non-Super Users must have their access rights specifically configured and all rights are disabled by default. See Permission Management for details about configuring the access rights for non-Super Users.

Regardless of the authentication mode selected the usage of the QS administrative functions will continue to be unrestricted until at least one user is added. The first user must be a Super User.

If Windows or ADFS Authentication is being used then the first user will default to the currently logged in user, although this can be changed if required.

If Non-Windows Authentication is enabled then additional information must be entered to define the non-Windows user:

Adding/Removing Users

More users can be added, and existing users removed, at any time from the default Users screen.
Add User

Additional Windows users can be validated using Integrated Windows Authentication.

Additional non-Windows users can only be added if the Non-Windows Authentication mode is enabled.

If the only user defined is a Super User and that user is deleted then all security is removed and usage of the QS administrative functions reverts to unrestricted.

Permission Management

In order to allocate granular permissions to a user (non-Super Users), simply select their username from the main grid.

At a top level each checkbox defines whether or not a user has access to each of the top level administrative areas.

When an area is enabled there are typically more granular permissions that can be enabled.
Within the “Taxonomies” area it is also possible to assign permissions at a specific termset or term branch level.

Permissions

Password Manager

Password Manager can be used to automatically schedule password changes for service accounts that conceptSearching uses to access external systems. This is particularly useful when there are business policies in place to change passwords on a rolling basis.

Password Manager

To amend the passwords for a username record, first select “Passwords” from the main display. Then either click “Edit” on a particular password row, or click “Add Password” to add a new password for the account. It is not possible to have overlapping date ranges for the defined passwords, nor is it possible to remove all passwords from a user record.
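
The rule that password date ranges must not overlap can be expressed as a simple interval check. This is a sketch for illustration: the function names are hypothetical, and treating both the start and end dates as inclusive is an assumption.

```python
from datetime import date


def ranges_overlap(first, second):
    """True if two inclusive (start, end) date ranges share at least one day."""
    return first[0] <= second[1] and second[0] <= first[1]


def can_add_password(existing, candidate):
    """A new password range is acceptable only if it overlaps no existing range."""
    return not any(ranges_overlap(candidate, current) for current in existing)


existing = [(date(2017, 1, 1), date(2017, 3, 31))]
next_quarter_ok = can_add_password(existing, (date(2017, 4, 1), date(2017, 6, 30)))
overlapping_ok = can_add_password(existing, (date(2017, 3, 15), date(2017, 5, 1)))
```

A range that starts the day after the existing one ends is accepted, while a range that starts in mid-March is rejected.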

Web Service Security

Web Service Security can be used to restrict external access to the conceptSearching APIs. When using this functionality, we recommend that you list the conceptSearching service account under the “Allow Only Listed” records. When “Block All” is selected, any functionality within conceptSearching that relies on the APIs will be impacted.

There are three modes available:

  • Allow All – No restrictions, all users have access to the APIs
  • Block All – No API use supported
  • Allow Only Listed – Blocks all API use except for those users (or groups) listed

Each mode is assigned to a specific grouping of service methods, you can see which API functions are affected by clicking the “Show” link next to “Group Members”.
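
The three modes amount to the following access decision for a single group of service methods. The sketch is illustrative only: the function name, its parameters and the example account names are hypothetical.

```python
def is_api_call_allowed(mode, caller, caller_groups=(), listed=()):
    """Decide whether a caller may use an API group under the given mode."""
    if mode == "Allow All":
        return True
    if mode == "Block All":
        return False
    if mode == "Allow Only Listed":
        allowed = set(listed)
        return caller in allowed or any(group in allowed for group in caller_groups)
    raise ValueError(f"Unknown mode: {mode}")


# Hypothetical "Allow Only Listed" records: one service account and one group.
listed = {"DOMAIN\\svc-concept", "DOMAIN\\Search Admins"}
service_ok = is_api_call_allowed("Allow Only Listed", "DOMAIN\\svc-concept", listed=listed)
group_ok = is_api_call_allowed(
    "Allow Only Listed", "DOMAIN\\jsmith", ["DOMAIN\\Search Admins"], listed
)
stranger_ok = is_api_call_allowed("Allow Only Listed", "DOMAIN\\guest", listed=listed)
```

Listing the service account, as recommended above, keeps it inside the allowed set when “Allow Only Listed” is in force.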

Web Service Security

Reporting

Introduction

The Reports administration area helps a user extract a wealth of information from the conceptSearching index.

The area can be accessed at the following URL: http://conceptsearchingserver/conceptQS/Reports.

The main dashboard has three high level graphs highlighting the current state of processing:

  • Document Progress – A graphical view of the main statistics display. Once processing is complete, documents will be allocated to either “Fully Processed” or “Errors”
  • Index Size – Shows the percentage of each source type being processed: Files, SharePoint, SQL and Web sources
  • Classification Coverage – Shows the percentage of classified content, broken down by type, and the percentage of content that has not received any auto-classifications

Reports Dashboard

The Classification Distribution graph highlights areas of classification overlap. In the example below the classification “Communications” has been found to be the most highly scoring term
that overlaps on all 3 of the site collections displayed. However, the classification “Data” applies very strongly to the “2013” site collection.

It is possible to filter and refine this display, to look for the areas that contain the largest amount of documents tagged with a particular term, or to only review specific content.
Classification Distribution

Built-in Reports

There are a number of reports provided that can be run in the browser as well as exported to Excel. These are described below:

  • Classification Coverage – Provides a list of documents that have been tagged with X or fewer classifications. Can help with locating documents that have a low number of auto classifications. Supports filtering by URL.
  • Duplicate Documents – Provides a list of documents that are considered “duplicates” within the index, either because they contain the same text (regardless of location), or, because they contain the same text and reside within the same source location. Supports filtering by URL.
  • Failed Write Classifications – Provides a list of documents in the Concept Searching Index that failed to have their classifications written to the source system (such as SharePoint Managed Metadata Columns). Supports filtering by URL.
  • Files Skipped – Provides a list of documents that have been excluded from processing because they were not explicitly included, or were specifically excluded. See Files Included and Files Excluded for more information on file inclusion/exclusion.
  • iFilters Detected – Provides a list of detected iFilters per server. iFilters are the Microsoft standard for implementing text extraction from binary files. They are used by many search engines (including Microsoft Search) to obtain the plain text required to build a search index.
  • Page Statuses – Provides a list of documents at a given status within the index. Supports filtering by URL.
  • Provides a list of documents in the Concept Searching Index that failed text extraction (granular iFilter error codes).

Duplicate Documents

Document Classifications

Provides a list of documents tagged with a particular term or terms (using either an AND or OR operator). Supports filtering by URL.
Document Classifications

Queued Reports

When large search exports are run, the report may take some time to compile. In this instance the background processes create the report and make it available for download via the “Queued Reports” dashboard. Reports can be deleted before or after processing, and can be downloaded as many times as necessary.
Queued Reports

Custom Reports

While there are a number of reports included in the product by default, it is also expected that specific business
needs may arise that require reporting not covered by the default reports.

With this in mind it is also possible to create custom report “Plugins”. Once the custom report plugin is deployed
the report will appear in the main reports list with the built-in reports. For more information on how to create
a report plugin, or help in creating one, please reach out to Concept Searching support.
Report Plugins
