How to get indexed

Do you run a Movable Type related resource website, and do you want your articles to be found with MTLookup? This page gives you all the information needed for cooperating with MTLookup.

Important Information

My original idea for MTLookup was based on an XML-file created by authors, listing the articles that are to be included in MTLookup. The following text describes this XML-file.

As development of MTLookup proceeded, the MTLookupBot - the central component for spidering websites - got better and better. Today it is able to spider an entire website, extract reasonable excerpts, and categorize the content. It has reached a state where the XML-file is no longer needed.

So if you want your website to be included in MTLookup, just send me a note by email. You do not have to create the XML-file.

In case you are interested, here is the original specification for the XML-file.

Two Choices

One of MTLookup's central components is the MTLookupBot. Its task is to read pages from other websites, and save information about the extracted text into the central database, which will later be queried with fulltext searching.

Basically, the same happens when a website is included in the index of Google, MSN, or Yahoo. However, there is an important difference: MTLookup does not spider the internet on its own. It is not a job that runs continuously day and night, hopping from website to website in search of Movable Type related content.

Somebody has to tell MTLookup about a website's pages. Only then will it start doing its job. Currently, there are two ways of telling MTLookup...

  • I might find an interesting site, or you might send me an email about it. Then I will point MTLookup to this site and have it spidered.
  • You, as the author of a Movable Type resource website, create a file for MTLookup. This file - let us call it MTLookup.xml - contains information about the web pages that are to be included in the database. This might just be a list of URLs. However, if you wish, you can also give additional information.

You won't be surprised that I prefer the latter. First, it means less work on my side. Furthermore, in most cases the result will be better.

Advantages of MTLookup.xml

Let us look at some reasons, why using MTLookup.xml is the better way.

  • New web pages from existing websites can be found much quicker and easier. Even new websites can be inserted into the MTLookup database right from their start.
  • Sometimes web pages have to be changed. It is much easier to read an XML file for locating the files that must be read again than continuously visiting all websites again and again.
  • Some authors mix personal content with technical articles on a single website. The MTLookupBot has a difficult job, deciding whether the article should be included into the database or not. The author knows best what is interesting for the Movable Type community.
  • An MTLookup result list shows the articles' titles and excerpts. If the MTLookupBot has to extract these two items just by reading the HTML file, the quality will not always be that good. Just look at the excerpts that are created by the big search engines. Most of the time, these are just a couple of words, connected by three dots. I prefer to have readable text that really informs me about the article.

The work that has to be done by an author is not that big. In Movable Type terms, it is just an index template that creates the XML file. Moreover, not all elements of the file are mandatory. If you want, you can leave them empty.

General Remarks

Let us start with some formal details.

Encoding

The MTLookup.xml is a standard XML file. To work with most languages, it should be encoded as UTF-8. However, "iso-8859-1" is acceptable as well.

Mandatory / Optional Elements

Not all elements are mandatory. If you do not want to fill an element, either leave its content empty or remove it completely. The following two examples are equivalent.

<ID>4711</ID>
<Title></Title>

<ID>4711</ID>

Date Format

Sometimes a date has to be given. The value must be formatted as "yyyy-mm-dd hh:nn:ss". You can omit the seconds, the minutes, or the hours, working from the right. Omitted parts are treated as zero.

<PublishedOn>2005-07-13 22:05:15</PublishedOn>
<PublishedOn>2005-07-13 22:05</PublishedOn>
<PublishedOn>2005-07-13 22</PublishedOn>
<PublishedOn>2005-07-13</PublishedOn>
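
The four accepted forms above can be read with one small routine. The following Python sketch is only an illustration of the rule, not MTLookupBot's actual code; the function name parse_mtlookup_date is made up for this example.

```python
from datetime import datetime

# Accepted layouts, from most to least specific.
# Parts that are left out end up as zero in the result.
_FORMATS = (
    "%Y-%m-%d %H:%M:%S",
    "%Y-%m-%d %H:%M",
    "%Y-%m-%d %H",
    "%Y-%m-%d",
)


def parse_mtlookup_date(value):
    """Parse a "yyyy-mm-dd hh:nn:ss" value whose time parts may be omitted."""
    text = value.strip()
    for fmt in _FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue  # try the next, shorter layout
    raise ValueError("not a valid MTLookup date: %r" % value)
```

Trying the layouts from longest to shortest keeps the rule "omitted parts are zero" in one place, because strptime fills missing time fields with zero.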

Multi-Value Elements

Sometimes an element can be given several values. MTLookup allows two ways for specifying that. You can either list all values within one element, putting the delimiter "|" (vertical bar) in between. Or you can repeat the element.

As the XML-file is typically created by the Movable Type tag language, one or the other technique might fit better into your weblog.

<TitleZap>first term | second term | third term</TitleZap>

<TitleZap>first term</TitleZap>
<TitleZap>second term</TitleZap>
<TitleZap>third term</TitleZap>

Spaces next to a delimiter are ignored.
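
Both notations can be handled by the same reader. The Python sketch below shows one possible way to do it, assuming the standard library's ElementTree parser; the helper name read_multi_value is hypothetical, and dropping empty values is an assumption of this example, not something the specification states.

```python
import xml.etree.ElementTree as ET
from typing import List


def read_multi_value(parent: ET.Element, tag: str) -> List[str]:
    """Collect the values of a multi-value element in either notation."""
    values = []
    for element in parent.findall(tag):       # repeated elements
        for part in (element.text or "").split("|"):  # "|"-delimited values
            part = part.strip()                # spaces next to "|" are ignored
            if part:                           # assumption: skip empty values
                values.append(part)
    return values
```

With this helper, the single-element form and the repeated-element form of the TitleZap example above both yield the same list of three terms.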

Duplicate Entries

With the help of the XML-file you tell MTLookup about the pages that are to be included in MTLookup. When creating that file, you do not have to make sure that each page is mentioned only once.

If your way for creating the XML-file generates the same entry several times (e.g. because it is in several categories), MTLookup will not complain. The last definition will be used.
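
The "last definition wins" rule can be pictured as follows. This Python sketch is an illustration only; it assumes each entry has already been read into a dictionary with an "ID" key, which is a choice made for this example.

```python
def last_definition_wins(entries):
    """Keep one entry per ID; a later entry replaces an earlier one."""
    by_id = {}
    for entry in entries:
        by_id[entry["ID"]] = entry  # overwrites any earlier definition
    return list(by_id.values())
```

Because later entries simply overwrite earlier ones, a template that emits the same article once per category needs no extra logic to remove duplicates.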

Accessing your XML-File

The XML-file does not have to be named MTLookup.xml, nor does it have to be placed at some defined URL. If you want, you can keep the file completely private. Just tell me by private email where MTLookup can find it.

Overview

As you can see from the basic structure below, the XML-file contains three different parts: one »Meta« element, one »Weblog« element, and many »Entry« elements.

<?xml version="1.0" encoding="utf-8"?>

<MovableTypeLookup>

<Meta>...</Meta>

<Weblog>...</Weblog>

<Entry>...</Entry>
<Entry>...</Entry>
<Entry>...</Entry>
<Entry>...</Entry>

</MovableTypeLookup>

The »Meta« element gives meta information about the XML file. The »Weblog« describes your weblog, and each »Entry« element describes an article that is to be included into MTLookup.
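
To make the three-part structure concrete, here is a Python sketch that splits such a file into its parts using the standard library's ElementTree. It is not MTLookupBot's code. The function name split_document and the second entry's URL are invented for this example.

```python
import xml.etree.ElementTree as ET

# A small sample file; the second entry is made up for illustration.
SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<MovableTypeLookup>
<Meta><Version>1.0.0</Version><CreatedOn>2005-07-13 12:45:16</CreatedOn></Meta>
<Weblog><ID>123</ID><ShortName>XMP</ShortName><LongName>Example Weblog</LongName></Weblog>
<Entry><ID>3</ID><URL>http://www.example.com/css-vs-tables.html</URL></Entry>
<Entry><ID>4</ID><URL>http://www.example.com/another-article.html</URL></Entry>
</MovableTypeLookup>
"""


def split_document(xml_bytes):
    """Return the Meta element, the Weblog element and the list of Entry elements."""
    root = ET.fromstring(xml_bytes)
    if root.tag != "MovableTypeLookup":
        raise ValueError("unexpected root element: %s" % root.tag)
    meta = root.find("Meta")
    weblog = root.find("Weblog")
    if meta is None or weblog is None:
        raise ValueError("Meta and Weblog elements are required")
    return meta, weblog, root.findall("Entry")
```

Passing the file as bytes lets the parser honour the encoding named in the XML declaration, which matters if a file uses "iso-8859-1" instead of UTF-8.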

Meta Element

The »Meta« element gives information about the XML file. It contains two other elements, both of which are mandatory.

Version (mandatory)

This element tells MTLookup which version of the specification the file is based on. Currently the value has to be "1.0.0".

Future versions of this specification might offer additional features, which will be recognized with the help of this element.

CreatedOn (mandatory)

This element gives the date and time when the XML file was created. MTLookupBot will use the information for deciding whether the file is more recent than the one stored in its database.

An example is...

<Meta>
<Version>1.0.0</Version>
<CreatedOn>2005-07-13 12:45:16</CreatedOn>
</Meta>
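
How the two »Meta« values might be used can be sketched in Python as follows. This is an assumption about the general idea, not a description of MTLookupBot's internals; the function name file_is_newer is hypothetical, and for brevity the sketch accepts only the full "yyyy-mm-dd hh:nn:ss" date form.

```python
from datetime import datetime
import xml.etree.ElementTree as ET

SUPPORTED_VERSION = "1.0.0"


def file_is_newer(meta, stored_created_on):
    """Check the version, then compare CreatedOn with the stored timestamp."""
    version = (meta.findtext("Version") or "").strip()
    if version != SUPPORTED_VERSION:
        raise ValueError("unsupported version: %r" % version)
    created_text = (meta.findtext("CreatedOn") or "").strip()
    if not created_text:
        raise ValueError("CreatedOn is mandatory")
    # Simplification: only the full date-and-time form is accepted here.
    created_on = datetime.strptime(created_text, "%Y-%m-%d %H:%M:%S")
    # A file that was never stored before always counts as newer.
    return stored_created_on is None or created_on > stored_created_on


meta = ET.fromstring(
    "<Meta><Version>1.0.0</Version>"
    "<CreatedOn>2005-07-13 12:45:16</CreatedOn></Meta>"
)
```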

Weblog Element

The »Weblog« element is used for describing your weblog. It contains the following elements.

ID (mandatory)

Each weblog whose articles are to be included in MTLookup is given a unique ID. I will choose the ID and tell you by email. This ID will never change.

ShortName (mandatory)

As the name implies, the ShortName should be a short name for your weblog. It might be an abbreviation, or simply a short name.

In result lists, the ShortName is used for describing the owner of an article. The ShortName should be about 3-10 characters long.

LongName (mandatory)

The LongName gives the full name of your weblog. It is used in situations where the name of a weblog is given as a header - for example in a list of articles, grouped by website.

Icon (optional)

Result lists show articles from different websites. Each entry can be accompanied by a small icon. Please create a 16x16 JPG for that.

For example, the favicon converted to a JPG is a good choice.

TitleZap (optional, multi-valued)

As described in the Entry Element section, you can choose the title for an entry yourself. If you do not, MTLookup will use the title from the title tag in the HTML header.

Sometimes, a page's title is prefixed or suffixed with some constant value (for example the name of the weblog). This is fine if the page is shown in a browser. In MTLookup's result lists, however, the repeated text would only add clutter.

So if you use this element, MTLookup will remove the given text from the title.

An example is...

<Weblog>
<ID>123</ID>
<ShortName>XMP</ShortName>
<LongName>Example Weblog</LongName>
<Icon>http://www.example.com/MTLookupIcon.jpg</Icon>
<TitleZap>Example Weblog -</TitleZap>
</Weblog>
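
The effect of TitleZap on a page title can be sketched in Python like this. The function name zap_title is made up, and tidying up the leftover whitespace is an assumption of this example; the specification only says that the given text is removed.

```python
def zap_title(title, zap_terms):
    """Remove every TitleZap term from an HTML page title."""
    cleaned = title
    for term in zap_terms:
        cleaned = cleaned.replace(term, "")
    # Assumption: collapse the whitespace left behind by the removal.
    return " ".join(cleaned.split())
```

With the weblog from the example above, the page title "Example Weblog - CSS vs Tables" and the term "Example Weblog -" give "CSS vs Tables".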

Entry Element

Each »Entry« describes one page that is to be inserted into MTLookup. It contains the following elements.

ID (mandatory)

This element holds a unique ID for your article. MTLookup might read your files many times. It is the ID that identifies a specific article over time.

The Movable Type tag »MTEntryID« can be used for this element. It is guaranteed to never change.

URL (mandatory)

The URL is the address of your article. MTLookup will use it for reading the article. Moreover, a row in the result list will contain a link to the value given here.

Title (optional)

You can specify an article's title yourself. The Movable Type tag »MTEntryTitle« can be used for that.

If you do not define a title, MTLookup will use the title tag from within the HTML file, and optionally modify it according to the TitleZap element.

Excerpt (optional)

A good excerpt is important for searching. After having entered a search phrase, the user will be shown a list of titles and excerpts.

Which tag should be chosen for this depends on your weblog. Maybe it is the »MTEntryBody« or the »MTEntryExcerpt«.

If your excerpt is too long, it will be truncated. Currently the maximum number of characters is 500. Most HTML within the excerpt will be removed - links will survive.

If you do not give an excerpt, MTLookup will try to find a good excerpt itself. However, sometimes it is not easy to skip the "noise words" on a page - like navigation and sidebar.

Category (optional)

The element can be one of the values »Tutorials«, »News«, »Links«, »Templates«, or »NonMT«. This element is discussed in detail below.

If the element is missing, the MTLookupBot will try to find out the category itself.

PublishedOn (optional)

The element should be set to the date when the entry was published. See below for some more remarks.

You can use the Movable Type tag »MTEntryDate« for that.

If the element is missing, the date when the article was inserted into MTLookup is assumed.

UpdatedOn (optional)

The element should be set to the date when the entry was last updated. See below for some more remarks.

You can use the Movable Type tag »MTEntryModifiedDate« for that.

If the element is missing, MTLookup will try to find out by "asking your server" via HTTP. If that fails, the PublishedOn date is used.

An example is...

<Entry>
<ID>3</ID>
<URL>http://www.example.com/css-vs-tables.html</URL>
<Title>CSS vs Tables</Title>
<Excerpt>For some time, tables have been the primary method for page layout. However, recently...</Excerpt>
<Category>Tutorials</Category>
<PublishedOn>2005-03-20 12:15:16</PublishedOn>
<UpdatedOn>2005-07-11 19:20:21</UpdatedOn>
</Entry>
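
Reading one »Entry« according to the rules above might look like the following Python sketch. It is an illustration under stated assumptions, not MTLookupBot's code: the function name read_entry is invented, empty and missing elements are treated alike, and the excerpt is cut at 500 characters without the HTML clean-up mentioned above.

```python
import xml.etree.ElementTree as ET

MAX_EXCERPT_CHARS = 500
MANDATORY = ("ID", "URL")
OPTIONAL = ("Title", "Excerpt", "Category", "PublishedOn", "UpdatedOn")


def read_entry(entry):
    """Turn an <Entry> element into a dict; unset optional values are None."""
    data = {}
    for tag in MANDATORY + OPTIONAL:
        text = (entry.findtext(tag) or "").strip()
        data[tag] = text or None  # an empty element counts as not given
    missing = [tag for tag in MANDATORY if data[tag] is None]
    if missing:
        raise ValueError("missing mandatory element(s): " + ", ".join(missing))
    if data["Excerpt"] is not None:
        # Plain truncation only; removing HTML is not part of this sketch.
        data["Excerpt"] = data["Excerpt"][:MAX_EXCERPT_CHARS]
    return data
```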

Entry Element - Category

MTLookup is a tool for finding articles about Movable Type. Each of these articles should be categorized by one of the following categories...

  • The »Tutorials« are articles that teach some subject concerning Movable Type. There is no distinction between a "big article" and a "small tip".
  • Often, weblogs also carry so-called »News«. For example, SixApart announces the availability of a new version, and several weblogs tell their readers about it.
  • Some authors publish templates and styles, which can be downloaded by readers. Such articles should be categorized as »Templates«.
  • Many Movable Type resource websites also have »Links« pages. Most of the time, these pages contain very little text.

A user, searching with MTLookup, will be able to specify which categories should be returned. It will be possible to search for one category or a combination of categories.

NonMT

Many authors run weblogs, where not only Movable Type related information is given. They mix technical articles with personal pages.

If you have such a weblog, and you want MTLookup to also read the other articles and put them into the database, you can do so with the »NonMT« type.

If a user searches with MTLookup, these entries will not be part of result lists. However, you will be able to put a dialog on your website, which uses MTLookup to search all your articles.

A separate document will describe this subject further. If you are interested, read Movable Type Weblog powered by MTLookup, where this technique is mentioned and shown.
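
The interplay of the four public categories and »NonMT« can be pictured with a small Python sketch. This is a guess at the general idea, not the real search code; the function name filter_by_category and its include_non_mt switch (standing for a search dialog on the author's own website) are assumptions of this example.

```python
PUBLIC_CATEGORIES = {"Tutorials", "News", "Links", "Templates"}
NON_MT = "NonMT"


def filter_by_category(entries, wanted=None, include_non_mt=False):
    """Return the entries whose category may appear in a result list."""
    allowed = set(wanted) if wanted else set(PUBLIC_CATEGORIES)
    unknown = allowed - PUBLIC_CATEGORIES
    if unknown:
        raise ValueError("unknown categories: " + ", ".join(sorted(unknown)))
    if include_non_mt:
        # Only a weblog's own search dialog also sees its NonMT entries.
        allowed.add(NON_MT)
    return [entry for entry in entries if entry["Category"] in allowed]
```

A general MTLookup search would call this without include_non_mt, so personal pages never show up for other users.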

Entry Element - PublishedOn and UpdatedOn

The »UpdatedOn« value is used by MTLookup for deciding whether the article should be read again and updated in the database.

The »PublishedOn« value is the date on which the article was published. The user searching for an article will be able to sort by this field, so recent entries can be found more easily at the top of the list.

The situation is easy if an article has just been published: both dates should be the same.

However, some months later you might modify the article. Of course, the »UpdatedOn« value should be modified. Otherwise MTLookup would not know about the change. Do you also have to modify the »PublishedOn«? Sometimes yes, sometimes no.

  • If the change has only been minor, maybe just correcting some typos, the »PublishedOn« value should remain the same.
  • If the change has been big, maybe writing some new chapters, the »PublishedOn« value should be modified. It is no longer an article that is months old.

If you use the Movable Type tags »MTEntryDate« and »MTEntryModifiedDate« the »UpdatedOn« will be changed automatically. The »PublishedOn« will be set to the value of the »Authored On« textbox in the Movable Type dialog.
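
The fallback rules for missing dates and the re-read decision can be summarized in a short Python sketch. It only restates the rules from this page and is not MTLookupBot's code; the function names and the http_last_modified parameter (standing for whatever "asking your server" via HTTP returns, or None if that fails) are made up for this example.

```python
from datetime import datetime


def resolve_dates(published_on, updated_on, inserted_on, http_last_modified=None):
    """Fill in missing dates using the documented fallbacks."""
    # Missing PublishedOn: assume the date of insertion into MTLookup.
    published = published_on or inserted_on
    # Missing UpdatedOn: ask the server, else fall back to PublishedOn.
    updated = updated_on or http_last_modified or published
    return published, updated


def needs_reread(updated, stored_updated):
    """An article is read again if it is new or changed since the last visit."""
    return stored_updated is None or updated > stored_updated
```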

Important: With Movable Type's current version v3.17, the output of the »MTEntryModifiedDate« is broken, if MySQL is used. You do not have to do anything regarding this problem. Just leave the garbage inside the XML. MTLookup will take care of the value.

Examples

The first example shows a very basic XML-file. The author has only given the mandatory data.

The second example shows an XML-file, where the author has given more information, thus enhancing the result list quality.

The third example shows a Movable Type Index Template for creating the XML-file. The sample code assumes that the author has created special categories for deciding whether the articles are to be included in MTLookup.

It depends on your weblog, how the condition for inclusion into MTLookup has to be coded. Maybe you already have a category hierarchy, with all Movable Type related entries being under a certain top-level category. If I can help you with that, please let me know.

XML-File or Not?

So what should you do now? Create an XML-file or let MTLookup do its job without any work on your side?

During the last weeks, I made major improvements to the MTLookupBot. When spidering a website, it will now find interesting articles on its own, categorize these articles and extract a reasonable excerpt (which is a very difficult task in general).

If you have a Movable Type related resource website and you want your articles to appear in MTLookup, tell me by email. I will make the MTLookupBot spider your website. We can then look at the summary file and decide whether an extra XML-file is necessary.

Comment

If you want to comment on this article, please use my email address, which can be found on Contact.

mgs | September 11th 2005