Building content
Ya. I know. Boring. But, I’m laying the groundwork for something interesting. Well, I think it’s interesting.
In the last episode I talked about what content is, where to find it, and what it is made up of. In this edition I want to explore the way words come together to make our content.
There are a couple of terms that should be defined before I get much further into this mess.
- Lemma. From wikipukia:
In a dictionary, the lemma “go” represents the inflected forms “go”, “goes”, “going”, “went”, and “gone”. The relationship between an inflected form and its lemma is usually denoted by an angle bracket, e.g. “went” < “go”…
Lemmas are used often in corpus linguistics for determining word frequency. In such usage the specific definition of “lemma” is flexible depending on the task it is being used for.
- Corpus:
From Apple Developer.
A collection of one or more documents, typically related, and available to an information retrieval system. Plural: corpora.
From Macmillan English Dictionary.
a collection of written and/or spoken language stored on a computer and used for language research and writing dictionaries
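To make the lemma idea a bit more concrete, here is a minimal Python sketch of counting word frequency by lemma instead of by surface form. The `LEMMAS` table is a hand-made assumption built from the “go” example above, and `lemma_of` and `lemma_counts` are names I made up for illustration. A real lemmatizer would need a far bigger table.

```python
from collections import Counter

# Hypothetical, hand-made lookup table built from the "go" example above.
# A real lemmatizer would cover far more words than this.
LEMMAS = {
    "go": "go",
    "goes": "go",
    "going": "go",
    "went": "go",
    "gone": "go",
}


def lemma_of(word):
    """Return the lemma for a word, or the lowercased word if it is unknown."""
    lowered = word.lower()
    return LEMMAS.get(lowered, lowered)


def lemma_counts(words):
    """Count how often each lemma occurs in a list of words."""
    return Counter(lemma_of(w) for w in words)


words = "She went home and he goes out while they are going nowhere".split()
counts = lemma_counts(words)
print(counts["go"])  # "went", "goes" and "going" all count toward "go"
```

Three different surface forms end up under one entry, which is why a frequency list by lemma is much shorter than a list of raw words.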
Yep, pretty heavy stuff. There’s a point to all of this, so stay with me.
I made a reference to the Oxford site in the last post, and it’s time to revisit that page. The link will open a new window and you may want to leave it open as we study a few things on that page. AskOxford: Language Facts
In the fifth paragraph, the author cites some interesting facts. For instance, “Just ten different lemmas (the, be, to, of, and, a, in, that, have, and I) account for a remarkable 25% of all the one billion words used in the Oxford English Corpus. If you were to read through the corpus, one word in four would be an example of one of these ten lemmas. Similarly, the 100 most common lemmas account for 50% of the corpus, and the 1,000 most common lemmas account for 75%. But to account for 90% of the corpus you would need a vocabulary of 7,000 lemmas, and to get to 95% the figure would be around 50,000 lemmas.” Each word in the remaining 5% might show up only once in several million words.
My takeaway from this is the statement that the 1,000 most common lemmas cover 75% of the corpus. The author states that the Oxford corpus is in the range of one billion words!
If my ‘old’ math is working, that means that about one thousand base words would cover 750 million of the one billion words in their corpus. Impressed the hell outa me.
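For anyone who wants to check that math, here is a quick sketch in Python. The percentages are the ones quoted from the Oxford page, and the corpus size is the round one billion figure. The names `CORPUS_SIZE`, `COVERAGE` and `words_covered` are mine, invented for the example.

```python
# Round figure quoted above: the Oxford corpus is about one billion words.
CORPUS_SIZE = 1_000_000_000

# Number of most common lemmas -> share of the corpus they account for,
# as quoted from the Oxford page.
COVERAGE = {
    10: 0.25,
    100: 0.50,
    1_000: 0.75,
    7_000: 0.90,
    50_000: 0.95,
}


def words_covered(lemma_count, corpus_size=CORPUS_SIZE):
    """Running words accounted for by the top `lemma_count` lemmas."""
    return round(COVERAGE[lemma_count] * corpus_size)


for lemma_count in COVERAGE:
    print(f"{lemma_count:>6,} lemmas -> {words_covered(lemma_count):>13,} words")
```

The 1,000 row is the 750 million figure, and the jump from 7,000 to 50,000 lemmas buys only another 5% of the corpus.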
Further, the author states that 25,000 lemmas will represent the set of most significant words in English: those which occur reasonably frequently and which account for all but a small part of everything we may encounter in speech or writing. It includes all the words that we actively use in general everyday life.
Now, go take a gander at the table the author has under the heading ‘What is the commonest word?’. Those are the 100 most common English words. Are you beginning to see some light in this deep dark cave I’ve built for you?
In the next table down he shows us the most used ‘content words’. And he shows them by nouns, verbs, and adjectives. Are there any gears spinning at your place?
What if we could build our own corpus? Our corpus would be a collection of text articles related to a website subject. It would contain the words others use most when writing about that subject, along with our chosen keywords. Each webmaster would have a different corpus, and would very likely have several different corpora.
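Here is a rough sketch, in Python, of what looking inside such a corpus could look like: count the words across a handful of articles, then see how much of the corpus the most common ones cover. The two sample sentences are placeholders I made up, and `tokenize`, `word_counts` and `coverage` are hypothetical helper names. It counts raw words, not lemmas, so treat it as a starting point only.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and pull out simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def word_counts(articles):
    """Count every word across all articles in the corpus."""
    counts = Counter()
    for article in articles:
        counts.update(tokenize(article))
    return counts


def coverage(counts, top_n):
    """Share of all running words taken up by the top_n most common words."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    top = sum(count for _, count in counts.most_common(top_n))
    return top / total


# Placeholder articles; a real corpus would be the text files for one niche.
articles = [
    "The cat sat on the mat.",
    "The dog sat on the cat.",
]

counts = word_counts(articles)
print(counts.most_common(3))
print(coverage(counts, 2))
```

Run it against a real pile of niche articles and the top of the list will be mostly the same common words from the Oxford table, with the subject words showing up further down.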
Most of you do have your own corpus on your hard drive. That is, a collection of articles (scraped or written, doesn’t matter) which you use to generate content on demand. You very likely have several corpora devoted to different niches. Some of you will leave the corpus on the web and retrieve it as/when you need it.
You will use your corpus depending on how you want your content to read. If you are only interested in having some content for spider food, you may leave your corpus intact. Perhaps injecting a keyword here and a link there, but basically intact.
If you are after more than keeping a spider busy, you may elect to massage the corpus a little. Here is where the fun begins. And it’s a good enough place to end this edition.
Stay tuned. Next time I’ll give you my most famous recipe. Fricasseed corpora served on crepe Suzieanne with a vintage (Nov.) tokay. Mmmmmm. Won’t want to miss that one. lol
~dink


