# String Matching Research Paper

**Pages:** 6 (1884 words) · **Bibliography Sources:** 4 · **File:** .docx · **Level:** College Senior · **Topic:** Education - Computers

String Matching Algorithm

String searching algorithms are an important class of algorithms that try to find the place where one or more strings (also called patterns) occur within a larger string or text. Let Σ be a finite alphabet; both the pattern and the searched text are sequences of elements of Σ. Σ may be an ordinary human-language alphabet, in which case each letter belongs to Σ; in other applications it may be the binary alphabet (Σ = {0, 1}) or, in bioinformatics, the DNA alphabet (Σ = {A, C, G, T}).

The string can be encoded in various ways, and the encoding affects the search. In particular, with a variable-width encoding, finding the Nth character takes time proportional to N, which is slow, and this slowness carries over to the more advanced search algorithms that depend on it. An alternative is to search for the sequence of code units instead, but this may produce false matches unless the encoding is specifically designed to avoid them.
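The cost of locating the Nth character can be illustrated with a short sketch. It assumes UTF-8 as the variable-width encoding (the text does not name one), and the function name `nth_char_offset` is hypothetical. Because character boundaries are not at fixed positions, the function has to scan from the start of the data.

```python
def nth_char_offset(data: bytes, n: int) -> int:
    """Return the byte offset of the n-th (0-based) character in UTF-8 data."""
    count = 0
    for offset, byte in enumerate(data):
        # Bytes of the form 10xxxxxx continue a character; all others start one.
        if byte & 0xC0 != 0x80:
            if count == n:
                return offset
            count += 1
    raise IndexError("character index out of range")


# Assumed sample text: "naive cafe" with two accented (two-byte) letters.
sample = "na\u00efve caf\u00e9".encode("utf-8")
offset_of_v = nth_char_offset(sample, 3)            # found by a linear scan
cafe_at = sample.find("caf\u00e9".encode("utf-8"))  # search over code units
```

UTF-8 is one encoding designed so that a search over code units (here, bytes) does not report a match that begins in the middle of a character, which is the kind of false match the paragraph above warns about.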

There are various string-searching algorithms, each named after its authors and each with its own advantages and disadvantages. The Boyer-Moore string search algorithm has been the most frequently used and preferred.

Origin

The history of string search algorithms started with the beginning of computing and is ongoing. In the continuous effort to produce a more effective search algorithm, various methods have been devised, and more are being developed all the time.

The first was naive string search, which simply ran through the text position by position looking for a match. Slow and inefficient, it was succeeded by the Knuth-Morris-Pratt method, which was first devised in 1970 and published in 1977. Boyer and Moore published theirs in 1977, and Karp and Rabin introduced a hash-based method in 1987. Alongside these are the Bitap algorithms, which date back to 1964 and were later extended, and more recent methods that focus on finding matches in mathematical models and similar graphics.

To date, the Boyer-Moore algorithm is generally regarded as the most effective string-search algorithm.

Detailed Description

String-matching algorithms can be categorized in one of two ways: (a) by a basic classification, i.e., according to the number of patterns each algorithm uses, or (b) according to their preprocessing methods.

A. Basic Classification

String matching covers a family of algorithms, each of which uses a different search strategy; some run faster than others, and each has specific advantages and disadvantages. They are listed in Fig. 1 as follows:

| Algorithm | Preprocessing time | Matching time¹ |
|---|---|---|
| Naive string search algorithm | 0 (no preprocessing) | Θ((n−m+1) m)² |
| Rabin-Karp string search algorithm | Θ(m) | average Θ(n+m), worst Θ((n−m+1) m) |
| Finite-state automaton-based search | Θ(m \|Σ\|) | Θ(n) |
| Knuth-Morris-Pratt algorithm | Θ(m) | Θ(n) |
| Boyer-Moore string search algorithm | Θ(m + \|Σ\|) | Ω(n/m), O(n) |
| Bitap algorithm (shift-or, shift-and, Baeza-Yates-Gonnet) | Θ(m + \|Σ\|) | O(mn) |

¹ Asymptotic times (i.e., growth rates as the input size increases) are expressed using O, Ω, and Θ notation.

² n = length of the text being searched (an array of characters); m = length of the pattern being searched for.

Naive string search algorithm

Naive string search is the simplest and least efficient way to conduct a search. It checks every position in the text (the haystack), one after another, to see whether the pattern (the needle) occurs there. In the normal case, only one or two characters need to be examined at each wrong position to establish that it is wrong, so the average case takes O(n+m) steps, where n is the length of the haystack and m is the length of the needle. In the worst case, however, searching for a string like xxxxy in a string like xxxxxxxxxxxy takes O(nm) steps, which shows why the naive algorithm is inefficient.
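The procedure described above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation; the function name `naive_search` is an assumption made for the example.

```python
def naive_search(haystack: str, needle: str) -> list[int]:
    """Return every start index at which needle occurs in haystack."""
    n, m = len(haystack), len(needle)
    matches = []
    for start in range(n - m + 1):           # try every alignment
        k = 0
        while k < m and haystack[start + k] == needle[k]:
            k += 1                            # extend the match one character
        if k == m:                            # the whole needle matched
            matches.append(start)
    return matches


# Worst-case style input from the text: long runs of x before the final y.
hits = naive_search("x" * 11 + "y", "x" * 4 + "y")  # → [7]
```

On this input almost every alignment matches four characters before failing on the fifth, which is where the O(nm) behavior comes from.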

Rabin-Karp string search algorithm

This search algorithm, created by Michael O. Rabin and Richard M. Karp in 1987, uses a hash function to find patterns in the text. A hash function converts every string into a numeric value, for example hash("goodbye") = 5. If two strings are equal, their hash values are also equal. Rabin-Karp therefore computes the hash value of the string it is searching for and then looks for a substring with the same hash value. The problem is that different strings can share a hash value, so every hash match must be verified character by character, and in the worst case this makes the search very slow. Because of this slow worst-case behavior, it is inferior to faster single-pattern searching algorithms such as the Knuth-Morris-Pratt algorithm or the Boyer-Moore string search algorithm. However, it is often used for multiple-pattern searches.
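A minimal sketch of the hashing idea follows. It assumes a polynomial rolling hash, which lets the hash of each window be updated in constant time from the previous one. The `base` and `mod` values and the function name `rabin_karp` are illustrative choices, not taken from the original paper.

```python
def rabin_karp(text: str, pattern: str,
               base: int = 256, mod: int = 1_000_003) -> list[int]:
    """Return every start index of pattern in text, using a rolling hash."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    high = pow(base, m - 1, mod)       # weight of the window's first character
    p_hash = t_hash = 0
    for i in range(m):                 # hash the pattern and the first window
        p_hash = (p_hash * base + ord(pattern[i])) % mod
        t_hash = (t_hash * base + ord(text[i])) % mod
    matches = []
    for s in range(n - m + 1):
        # Equal hashes only suggest a match; compare to rule out collisions.
        if p_hash == t_hash and text[s:s + m] == pattern:
            matches.append(s)
        if s < n - m:                  # slide: drop text[s], add text[s + m]
            t_hash = ((t_hash - ord(text[s]) * high) * base
                      + ord(text[s + m])) % mod
    return matches
```

For multiple-pattern search, the same window hash could be looked up in a set of pattern hashes instead of being compared with a single value.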

Finite-state automaton-based search

Backtracking is avoided in this case because a deterministic finite automaton (DFA) is constructed that recognizes strings containing the desired search string. These automata are expensive to construct (they are usually created using the powerset construction) but very quick to use.
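As a rough illustration, the sketch below builds the transition table for a single pattern directly, in time proportional to m |Σ|, rather than through the powerset construction mentioned above. The function names and the requirement that the caller supply an `alphabet` containing every pattern character are assumptions made for the example.

```python
def build_dfa(pattern: str, alphabet: str) -> list[dict[str, int]]:
    """Transition table: dfa[state][char] -> next state (states 0..m)."""
    m = len(pattern)
    dfa = [{c: 0 for c in alphabet} for _ in range(m + 1)]
    if m == 0:
        return dfa
    dfa[0][pattern[0]] = 1
    x = 0                                    # state to fall back to on mismatch
    for j in range(1, m + 1):
        for c in alphabet:
            dfa[j][c] = dfa[x][c]            # mismatch: behave like state x
        if j < m:
            dfa[j][pattern[j]] = j + 1       # match: advance one state
            x = dfa[x][pattern[j]]
    return dfa


def dfa_search(text: str, pattern: str, alphabet: str) -> list[int]:
    """Scan text once, never backing up, and report every match start."""
    m = len(pattern)
    if m == 0:
        return []
    dfa = build_dfa(pattern, alphabet)
    state, matches = 0, []
    for i, c in enumerate(text):
        state = dfa[state].get(c, 0)         # unknown characters reset to 0
        if state == m:                       # accepting state reached
            matches.append(i - m + 1)
    return matches
```

Each text character triggers exactly one table lookup, which is why matching takes Θ(n) time once the table exists.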

Knuth-Morris-Pratt algorithm (KMP)

The Knuth-Morris-Pratt algorithm, published by Donald Knuth, Vaughan Pratt and James H. Morris in 1977, computes a deterministic finite automaton that recognizes inputs having the search string as a suffix. Far preferable to a naive search, it never re-examines a text character that has already been matched: after a mismatch, it uses what it knows about the pattern to decide how far the pattern can be shifted. The less it has to fall back within the pattern, the faster it goes. A pattern with no repeated prefix, such as ABCEG, therefore allows the largest shifts, whereas a highly repetitive pattern such as GGGG allows only small ones, although the matching time remains linear in the length of the text.
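One common way to realize this idea is with a failure table (also called a prefix function) instead of an explicit automaton. The sketch below takes that approach; the names `failure_table` and `kmp_search` are assumptions made for the example.

```python
def failure_table(pattern: str) -> list[int]:
    """fail[i] = length of the longest proper prefix of pattern[:i + 1]
    that is also a suffix of it."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]              # fall back to a shorter border
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_search(text: str, pattern: str) -> list[int]:
    """Return every start index of pattern in text in one left-to-right pass."""
    m = len(pattern)
    if m == 0:
        return []
    fail = failure_table(pattern)
    matches, k = [], 0                   # k = pattern characters matched so far
    for i, c in enumerate(text):
        while k > 0 and c != pattern[k]:
            k = fail[k - 1]              # shift the pattern, keep text position
        if c == pattern[k]:
            k += 1
        if k == m:
            matches.append(i - m + 1)
            k = fail[k - 1]              # allow overlapping matches
    return matches


table_distinct = failure_table("ABCEG")     # → [0, 0, 0, 0, 0]
table_repetitive = failure_table("GGGG")    # → [0, 1, 2, 3]
```

The two tables show the contrast drawn in the text: ABCEG has no repeated prefix, so a mismatch discards everything matched so far, while GGGG can only shift by one position at a time.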

Boyer-Moore string search algorithm

The Boyer-Moore algorithm, created by Bob Boyer and J. Strother Moore in 1977, is the preferred search algorithm. It preprocesses the key being searched for (the needle) but not the string being searched in (the haystack). It compares characters starting from the end of the needle, so it can usually jump ahead a whole needle-length at each step, skipping over many characters. As the key becomes longer, the algorithm generally becomes faster. Its efficiency comes from the fact that each failed comparison rules out several possible alignments at once, and these accumulated eliminations shorten the search.
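The right-to-left comparison and the large jumps can be seen in the sketch below. Note that it implements the Boyer-Moore-Horspool simplification, which keeps only a bad-character shift table and omits the good-suffix rule of the full algorithm; it is offered as an illustration of the idea, not as the authors' method. The function name is an assumption.

```python
def horspool_search(text: str, pattern: str) -> list[int]:
    """Return every start index of pattern in text (Horspool variant)."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    # Preprocess the pattern only: distance from each character's last
    # occurrence (excluding the final position) to the end of the pattern.
    shift = {c: m - 1 - i for i, c in enumerate(pattern[:-1])}
    matches = []
    pos = 0
    while pos <= n - m:
        j = m - 1
        while j >= 0 and text[pos + j] == pattern[j]:
            j -= 1                           # compare from the end of the needle
        if j < 0:
            matches.append(pos)
        # Characters absent from the pattern allow a jump of the full length m.
        pos += shift.get(text[pos + m - 1], m)
    return matches
```

When the text character aligned with the end of the needle does not occur in the needle at all, the window moves forward by the full needle length, which is why longer needles tend to search faster.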

Bitap algorithm

The Bitap algorithm, invented by Bálint Dömölki in 1964 and extended by various researchers, is adaptable to fuzzy string searching because it keeps track of whether the previous x characters were a prefix of the search string. In other words, the algorithm reports whether a given text contains a substring that is "approximately equal" to a certain pattern. If the substring and pattern are within a given distance k of one another, the algorithm considers them equal. For a given alphabet and word length, its running time is predictable: it runs in O(mn) operations regardless of the structure of the text or pattern. Because of the data structures the algorithm requires, however, it works best on patterns shorter than a certain length and on inputs over a small alphabet.
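The prefix-tracking idea can be sketched with the exact-matching shift-and variant, where bit j of a state word records whether the last j + 1 text characters equal the first j + 1 pattern characters. The approximate (distance k) extension is not shown here. The function name is an assumption, and Python integers stand in for the machine word a real implementation would use.

```python
def bitap_search(text: str, pattern: str) -> list[int]:
    """Return every start index of pattern in text (exact shift-and bitap)."""
    m = len(pattern)
    if m == 0:
        return []
    # masks[c] has bit i set when pattern[i] == c.
    masks: dict[str, int] = {}
    for i, c in enumerate(pattern):
        masks[c] = masks.get(c, 0) | (1 << i)
    accept = 1 << (m - 1)                 # bit set when the full pattern matches
    state = 0
    matches = []
    for i, c in enumerate(text):
        # Extend every live prefix by one character and start a new one.
        state = ((state << 1) | 1) & masks.get(c, 0)
        if state & accept:
            matches.append(i - m + 1)
    return matches
```

With fixed-width machine words the pattern length is limited by the word size, which matches the remark above that the algorithm favors patterns below a certain length.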

B. Preprocessing classification

Some of the algorithms preprocess the text whereas others do not. Naive or elementary algorithms preprocess neither the patterns nor the text, whereas constructed search engines preprocess the patterns but not the text. Index methods, on the other hand, preprocess the text but not the patterns, whereas signature methods preprocess both text and patterns.

The whole can be demonstrated in this table:

| | Text not preprocessed | Text preprocessed |
|---|---|---|
| Patterns not preprocessed | Elementary algorithms | Index methods |
| Patterns preprocessed | Constructed search engines | Signature methods |

Classes of String searching algorithms

Index methods

These are the faster search algorithms that are based on preprocessing the text. Building a string index over the text, for example a suffix tree, lets each subsequent search proceed faster. A suffix tree can be constructed in Θ(n) time, and all occurrences of a pattern can then be found in O(m+a) time, where m is the length of the pattern and a is the number of occurrences.
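A linear-time suffix tree is too long to show here, so the sketch below substitutes a suffix array, a simpler index over the same sorted suffixes. It is built naively (so construction is slower than Θ(n)) and queried by binary search, which costs O(m log n) per lookup rather than O(m+a). The function names are assumptions made for the example.

```python
def build_suffix_array(text: str) -> list[int]:
    """Preprocess the text: start indices of all suffixes in sorted order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_all(text: str, suffix_array: list[int], pattern: str) -> list[int]:
    """Return every start index of pattern, using only the prebuilt index."""
    m = len(pattern)
    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[first:lo])


index = build_suffix_array("banana")        # → [5, 3, 1, 0, 4, 2]
positions = find_all("banana", index, "ana")  # → [1, 3]
```

The text is preprocessed once and the patterns are not preprocessed at all, which is what places this approach among the index methods.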

Other search methods, sometimes called "fuzzy string" or "fuzzy" searches, look for closeness between the search string and the text rather than for a strict "match/non-match" result.
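One widely used way to measure such closeness is the edit (Levenshtein) distance, the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. The text does not name a specific measure, so this choice, like the function name, is an assumption made for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b, computed row by row."""
    prev = list(range(len(b) + 1))        # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        curr = [i]                        # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


distance = edit_distance("kitten", "sitting")  # → 3
```

A fuzzy search would then accept any candidate whose distance from the search string is at most some chosen threshold.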

Primary Uses

String search algorithms are primarily used in database searches over alphabets that range from the ordinary alphabet to the binary alphabet to the DNA alphabet in bioinformatics. One common application is on the World Wide Web, where search engine software such as Yahoo, Google, or LexisNexis uses string-matching algorithms to find a particular alphanumeric string and produce output that is prioritized and relevant. In DNA applications and DNA engineering, string-matching algorithms match similar DNA sequences and patterns.

The efficiency of a search engine, and thus of the algorithm it uses, must be extremely high, particularly when an internet search engine, for instance, receives thousands of search requests each…



### How to Cite "String Matching" Research Paper in a Bibliography:

APA Style

String Matching. (2011, April 10). Retrieved October 16, 2021, from https://www.essaytown.com/subjects/paper/string-matching/7236

MLA Format

"String Matching." 10 April 2011. Web. 16 October 2021. <https://www.essaytown.com/subjects/paper/string-matching/7236>.

Chicago Style

"String Matching." Essaytown.com. April 10, 2011. Accessed October 16, 2021. https://www.essaytown.com/subjects/paper/string-matching/7236.