DSB's Quest to Build a Chatterbot

Started by DontSayBanana, January 12, 2012, 02:50:02 PM


DontSayBanana

Posting this here in the "nerdy" subforum, partly to serve as a "build log" of sorts, and also hopefully to spark some discussion on the subject.

Day One: While this isn't really day one, as I've been working with some of these problems for quite a while, it just seems like a good place to (re)start.  Also, it's been suggested that I work in secrecy a little bit less, as I might be able to get further if I get some input here and there. ;)

At this point, I'm just starting to construct a lexer.  The lexer's the part that breaks the input stream into smaller tokens (words and punctuation, for my purposes), which the system can then analyze and work on.  My current rule set sets up two candidates to be recognized as "words" by the lexer.  I've also included Java-style regex as a shorthand and for reference.

- An unbroken stream of alphabetic characters between whitespace or terminated by punctuation ("\s[A-Za-z]+(\s|\p{Punct}\s|\p{Punct}[A-Z])")
- A stream of alphabetic characters broken only by a single apostrophe (contractions; "\s[A-Za-z]+'[A-Za-z]+(\s|\p{Punct}\s|\p{Punct}[A-Z])")
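
For reference, here is a minimal sketch of how those two rules might look as a working Java lexer. It is only an illustration, not the actual implementation: the class name SketchLexer and the tokenize method are made up for this example, and it assumes punctuation is emitted as its own single-character token rather than used purely as a word terminator. Anything that is neither a letter, an apostrophe inside a word, nor punctuation (digits, for instance) is silently skipped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: names and token rules are assumptions for illustration.
class SketchLexer {
    // Alternation order matters: the contraction form is tried first so that
    // "don't" is not split into "don", "'", "t".
    private static final Pattern TOKEN = Pattern.compile(
        "[A-Za-z]+'[A-Za-z]+"   // letters broken by a single apostrophe
        + "|[A-Za-z]+"          // unbroken run of letters
        + "|\\p{Punct}");       // one ASCII punctuation character

    // Scans the input left to right and collects every match as a token.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [Don't, stop, ,, please, .]
        System.out.println(tokenize("Don't stop, please."));
    }
}
```

Scanning with find() instead of anchoring each pattern on surrounding whitespace means the first word of the input is picked up even though nothing precedes it.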

I originally considered treating hyphenated words as a third candidate, but decided to treat the hyphen as a grouping character instead, along with parentheses and brackets.  The components of a hyphenated compound word wouldn't be thrown out in the first pass anyway, and I think I could handle hyphens a little more efficiently by bundling them in with garbage collection.  That leads me to the second part of my design considerations for today: What parts of an input string should be thrown out?  What doesn't look like a word, but conveys meaning closely enough that it should still be handled?  At the moment, I'm thinking of certain things as "avatars" for words.

- Ultra-compacted contractions like " 'n' ", as in "chicken 'n' rice."  Depending on how it's handled, this mechanism could also help with computer parsing of "txtspeak."
- Arabic numerals.  "40" in natural language is an avatar for "forty," as it would be used as an adjective within a sentence.
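
One way the "avatar" idea could be sketched in Java is as a normalization pass that swaps an avatar token for the word it stands in for. Everything here is an assumption made for illustration: the AvatarNormalizer class, its method names, the one-entry contraction table, and the restriction to numbers from 0 to 99. It also assumes some upstream lexer already hands over " 'n' " and runs of digits as single tokens.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: maps "avatar" tokens to the words they stand in for.
class AvatarNormalizer {
    private static final String[] ONES = {
        "zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"};
    private static final String[] TENS = {
        "", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"};

    // Assumed lookup table; a real one would need many more entries.
    private static final Map<String, String> CONTRACTIONS = Map.of("'n'", "and");

    // Returns the spelled-out word for an avatar, or the token unchanged.
    static String normalize(String token) {
        String mapped = CONTRACTIONS.get(token.toLowerCase(Locale.ROOT));
        if (mapped != null) {
            return mapped;
        }
        if (token.matches("\\d{1,2}")) {   // only 0..99 in this sketch
            return numberToWords(Integer.parseInt(token));
        }
        return token;
    }

    // Spells out an integer in the range 0..99.
    static String numberToWords(int n) {
        if (n < 20) {
            return ONES[n];
        }
        String tens = TENS[n / 10];
        return n % 10 == 0 ? tens : tens + "-" + ONES[n % 10];
    }

    public static void main(String[] args) {
        System.out.println(normalize("40"));   // prints forty
        System.out.println(normalize("'n'"));  // prints and
    }
}
```

Tokens that are not recognized as avatars, including numbers of three or more digits, pass through unchanged, so later stages can still decide whether to keep them or throw them out.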

The issue I'm running into is that, ironically, my program so far wouldn't be very computer-savvy, so I'll leave you guys with this bombshell: how would you deal with computer and Internet usernames, which are notorious for containing any character that can be typed?

This has been a pet project of mine for a while, so partly I'm looking for a sounding board, and partly we argue semantics enough here- it's about time it was used for something constructive. :P
Experience bij!

mongers

Quote from: DontSayBanana on January 12, 2012, 02:50:02 PM
Posting this here in the "nerdy" subforum, partly to serve as a "build log" of sorts, and also hopefully to spark some discussion on the subject.

This has been a pet project of mine for a while, so partly I'm looking for a sounding board, and partly we argue semantics enough here- it's about time it was used for something constructive. :P

Did you not work it out?  That's the whole point of Languish: 90% of the posters are bots, hence the inflexibility of views, never admitting they're wrong, and arguing in the most obtuse way.  The other 10% are gullible humans who form the subject matter of the real experiment.    :ph34r:
"We have it in our power to begin the world over again"

DontSayBanana

Experience bij!