We aren’t born with fantastic survival instincts. Parents need to fuss over us until we learn. Billion-dollar companies cater to new parents with everything from childproofing tools to butt wipes. We do eventually learn to wipe our own butts, though, don’t we?
Think of search engines the same way. They needed to crawl before they could walk, and they needed to walk before they could run. In this metaphor, it’s fair to argue modern search engines are transitioning from flying to space travel.
Imagine your young self sitting in a high chair. A thick stream of drool runs down your face as your parents urge you to transition from aimless babble into your first word.
“Mama,” your mom mouths, coy smile on her face.
“Data.” Your father nudges her with an elbow, accompanied by a wink.
You sit there aghast. Confused eyes darting back and forth between your parents’ faces. You think you’ve figured out what a word is, but you’re pretty sure your dad hasn’t figured out how to make a joke funny.
It’s exactly what he said, though. The only blockers to your first word are:
1. Enough data to understand what a word means.
2. Enough experience babbling to know how to make the sound.
Your first word will surely be ‘mama’, to spite your father’s pun. It’ll continue a time-honored tradition of not acknowledging or encouraging dad jokes.
Unsure of yourself, your crusty nose wrinkles as you form the word.
Ma…ma.
Fast forward 15 years and you’ve collected much more data. The butt wiping came in time. You’ve written off many more dad jokes. The dishes are still a challenge, but you’ve figured out how to feed yourself using a microwave. Not only are you somewhat capable of self-sufficiency, you could even get a job and help other people.
What’s my point here? Search engines used to need a lot of help. They could bring in a lot of data. An entire internet’s worth of data. They just couldn’t understand it very well. They also couldn’t share it very coherently. Imagine yourself as an upgrade upon your parents. Out of the entire available gene pool, they each picked a great person to procreate with. But you didn’t come out as an instant upgrade. It took decades of education and tuning to get you where you are. You at sixteen were barely able to deliver a pizza. Now you’re reading a book on SEO.
That’s no knock on pizza delivery. It’s an admission your younger self could’ve been on time more and had better people skills.
Search engines were a massive leap in technology. Still, they needed time to learn about humans before they knew how to communicate with them.
Search Google in 1998 and it could show you anything it found on the internet. On paper, that’s not very different from what Google does today. What was less impressive in 1998? How well it understood what you actually wanted.
That’s where we get into how search engines work: how they collect data, understand it, and deliver it to a searcher.
NLP vs Boolean
Speaking of early career experiences: my first job in tech ended up being in search science. To be specific, a lesser-known aspect called “user intent”.
Remember that awkward teenager version of yourself? Capable of delivering a pizza, but not the best at taking an order via the phone? Well, these were the awkward teenage years of search engines.
Search engines could take your search and give you results, but there was no guarantee they’d be what you wanted. You could type in “lifesavers” and get a mix of results between Baywatch-style life preservers and fruity candy. A search for “spirit” could get you half ghosts, half booze.
Booze would help you recover from a ghost sighting. LifeSavers would be a good snack while watching Baywatch. Yet these aren’t the results you’re looking for.
Natural Language Processing (NLP)
To fix this experience, the search industry, among many others, has been working on natural language processing (NLP): the science of seeing words and understanding their context.
Take the sentence, “I didn’t say we should kill him.”
It’s dark, but follow me. This sentence means something different, depending on which word we emphasize:
*I* didn’t say we should kill him.
I *didn’t* say we should kill him.
I didn’t *say* we should kill him.
I didn’t say *we* should kill him.
I didn’t say we *should* kill him.
I didn’t say we should *kill* him.
I didn’t say we should kill *him*.
To basic software, there’s only one thing to understand here. You didn’t say we should kill him. But that ignores the fact that the context of how it’s said matters. In this case, an emphasis on “kill” is the difference between being an accomplice and being innocent. An emphasis on “him” is the difference between a mob hit on the right or wrong guy.
Imagine the ways Alexa or Siri could mess that one up.
Order
How about order? Let’s say you’re searching for mint-flavored chocolate. A guilty pleasure of mine. Typing “mint chocolate” would get you different results than “chocolate mint”. One is chocolate with a mint flavor and the other is a mint with chocolate flavor.
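Here’s a tiny sketch of why this trips up simple software (the phrases are just for illustration): a bag-of-words approach throws away order and sees the two searches as identical, while an order-aware comparison does not.

```python
# A bag-of-words model discards order: both searches look identical.
search_a = "mint chocolate".split()
search_b = "chocolate mint".split()

print(set(search_a) == set(search_b))  # True: same words, order ignored
print(search_a == search_b)            # False: order-aware, two different searches
```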
Homonyms
Then there’s the challenge of the homonym: one word, multiple meanings. The greatest curse of the English language.
Imagine telling a robot you’re armed. It could think, “technically, you have two of them.” How about the quandary of a dog growing bark? Or why baseball players use a nocturnal mammal to hit baseballs?
We have to rely on a machine to see the word “glasses” and think, “According to data so far, it’s 75% likely the user is looking for corrective eyewear.” It would write off the lower likelihood of kitchenware and the even lower odds of a search for the 2019 M. Night Shyamalan film Glass.
The ability to see the word “glasses” and know the difference between “glasses” and the word “glass” differentiates the film from the product. That’s the easy part. Understanding most users are looking for eyewear over glassware only comes with time and data. Once there’s enough data, the search will deliver results like Warby Parker, instead of the kitchen department at Target.
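If it helps, here’s roughly how that weighing could look in code. The numbers are invented for illustration, not Google’s actual figures:

```python
# Invented numbers: the share of past searchers who wanted each
# meaning of the word "glasses".
glasses_intents = {
    "corrective eyewear": 0.75,
    "drinking glasses": 0.20,
    "the 2019 film Glass": 0.05,
}

# With enough data, the engine can simply rank meanings by likelihood.
best_guess = max(glasses_intents, key=glasses_intents.get)
print(best_guess)  # corrective eyewear
```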
I think of user intent in four aspects:
1. How a machine understands you when you mess up.
2. How happy you are with a machine when it messes up.
3. How wrong everyone else is when they search for similar information.
4. How malicious a website is when it writes content.
Search engines are working on all these scenarios. They want to understand your search, present you with better results, and filter out malicious content.
Malicious content
That’s right. Malicious content. We’ll have a whole chapter on that. Remember I mentioned ways the industry treats SEO like a gimmick? Keyword stuffing, link farms, invisible content, toxic backlinks. There was a time when these were industry-staple “hacks”. Now they’re defunct and considered malicious. They’re also behaviors that will get your site blocked from Google. Don’t fall for the hack, because today’s hack is tomorrow’s folly.
Boolean search
How many times have you taken a moment on Google’s home page to think, “what should I type to find what I want?”
Imagine how many searches Google sees every day for things like, “guy with the abs who did that movie with Al Pacino.” In the age of NLP, search engines are pretty likely to figure out what you wanted. I just typed it in. “Al Pacino stars with Matthew McConaughey and Rene Russo in Two for the Money, a 2005 film directed by D.J. Caruso,” is the second result.
In the early days of search, engines relied on something called Boolean operators. They were a very manual way of explaining your search. Here’s an example:
“Will Smith” AND (Jeff Goldblum OR Bill Paxton) AND welcome to earth* NOT “After Earth”
First of all: annoying. Second, any ideas?
It’s 1996’s Independence Day, starring Will Smith and Jeff Goldblum. Thankfully, my “OR” compensated for me confusing the late, great Bill Paxton with the legendary Bill Pullman. The movie had a line like “welcome to earth” and was not the poorly-received Will Smith film After Earth.
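For the curious, here’s a rough sketch of what a machine does with a query like that. The Boolean operators translate almost directly into code (I’ve simplified the * wildcard into a plain phrase match):

```python
def matches(doc):
    """Evaluate the Independence Day query against a document's text."""
    text = doc.lower()
    return (
        "will smith" in text
        and ("jeff goldblum" in text or "bill paxton" in text)
        and "welcome to earth" in text   # the * wildcard, simplified
        and "after earth" not in text
    )

print(matches("Will Smith and Jeff Goldblum say welcome to Earth"))  # True
print(matches("Will Smith stars in After Earth"))                    # False
```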
You’d rather search for “Will Smith movie where Randy Quaid flies a fighter jet up a UFO’s butt crack,” wouldn’t you? We all would. Which is why companies like Google have spent insane amounts of money teaching their search engines to be smarter. Boolean search isn’t user friendly, and it also doesn’t compensate for a much larger problem: users very rarely search for what they actually want.
Circling back to the idea of user intent. It’s an admission that any search could have many interpretations. In order for search engines to improve, they need massive amounts of correct examples to learn from. My first job in search was to oversee the collection of those examples. We’d collect huge rosters of ‘judges’ around the world, give them big lists of search results, and ask them which results were best. Sometimes the search engine did great. Sometimes (more often than not) we’d find places it needed to improve. It happens behind the scenes, but very real people are feeding knowledge into engines like Google every day. Also, search engines remember every search ever completed and any indications a user liked or disliked the results.
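To make that concrete, here’s a toy version of what happens to those judgments, with made-up searches, results, and scores. Each judge rates a result, and the ratings get averaged into one training label the algorithm can learn from:

```python
from statistics import mean

# Made-up judgments: three judges rate each (search, result) pair
# from 0 (bad result) to 3 (great result).
judgments = {
    ("lifesavers", "candy-shop.example"): [3, 2, 3],
    ("lifesavers", "pool-safety.example"): [1, 0, 1],
}

# Average the judges' scores into one label per pair.
labels = {pair: mean(scores) for pair, scores in judgments.items()}
print(labels)  # the candy page wins for this search
```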
This is why quality content is so important to your SEO journey. First, to compensate for any shortcomings of search engines as they continue to grow. Second, because the best content will get more and more recognition as engines improve. And third, because user happiness matters and search engines are watching.
What are web crawlers?
Entering the world of the optimizers, you’re going to hear some less-than-pleasant words. Things like “crawlers” and “scraping”. I promise, these are not bad things. After deep-diving into why search engines need data, it only makes sense to then dig into how they get data.
One, which we’ve already talked about, is paying real humans to rate search results. That gives search engines great, highly-accurate, highly-specific data. We’ve also touched on when that’s appropriate: a specific search where they see less engagement from users.
There are limitations, though. The logistics of collecting this information aren’t easy. I can testify to that. It’s also expensive for the same reason. A search engine has to source a real company to source real people to judge real data. Someone has to put together the data they want judged, someone has to format it into a tool for judgment, and someone has to rate it. At the end of the circle of life, someone has to feed it to a search engine’s algorithm.
That’s where we hit the second flaw of this data source: it’s not all that big. Yes, to one, one hundred, or even one thousand people, these datasets can feel massive. But compared to the entire scope and scale of the internet, they’re mere grains of sand.
A geologist may find it helpful to bring a microscope to the beach and take a peek at a couple sand grains. A tiny sample could give them a good idea of what percent is granular glass and shells versus how much is silica quartz. Their study of a beach may say it’s 40 percent silica, 40 percent shells, and 20 percent worn glass.
That’s all fine and dandy. But if you want to understand how much beach there is, you may want a tractor. That study would return a measurement like, “this beach has three acres of land.”
If you used a tool like a backhoe to dig past the surface layer, you could come up with a measurement of how much sand there is:
“This beach averages sand to a depth of a couple meters.”
These are all very valid ways to measure a beach. But only when you combine them can you tell someone how much of the beach’s volume is silica, how much is shells, and how much is glass.
Enter the crawler: a digital robot search engines like Google use to scrape the internet for data.
How do crawlers work?
To understand how crawlers work, let me take you back to 2002. The video game Halo was all the rage, and had certainly taken over the weekends at the Garves household. To avoid any brawls over Xbox time, my brother and I would co-op the game’s campaigns together. For those reading this book who weren’t even born in 2002, Halo was a first-person shooter. For those who aren’t video game nerds, that means you run around and shoot aliens. For any aliens reading this, it was a game and we didn’t mean you any real harm.
My brother and I had a system whenever we’d enter a new level: Righty Rule or Lefty Law. That meant we’d either explore by taking all left turns until we ran out of lefts, or take all right turns until we ran out of rights. While somewhat inefficient, the math checked out and we’d meander our way safely through tough maps.
Web crawlers work in a very similar way. But, instead of left or right turns, they follow links. A crawler will start with a piece of content it knows, note all the words and other content on the page, and make a list of any links the page shares. That’s the scrape: the logging of any important information from a webpage. Yes, it’ll log the links and the words used in the links, but it also does something super smart with them: those links become the items the crawler will scrape next. The crawler will scrape those pages, find more links, scrape those pages, find more links, etc.
It’s the Righty Rule, but with links. I wasn’t lying. I guess we can call it the Linky Law? I’m oversimplifying the process a bit. A lot, actually. For example, crawlers may not scrape those links immediately. If a crawler recently visited a link, it would be inefficient to visit the page again.
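In code, the Linky Law is surprisingly small. Here’s a minimal sketch of a crawler using only Python’s standard library. A real crawler adds politeness rules, revisit schedules, and a lot of error handling, but the loop is the same: scrape a page, queue its links, repeat.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_url, max_pages=10):
    """Breadth-first crawl: scrape a page, queue its links, repeat."""
    frontier = deque([seed_url])   # pages waiting to be scraped
    visited = set()                # pages already scraped (don't revisit)
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except Exception:
            continue  # unreachable pages get skipped, not retried
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            frontier.append(urljoin(url, link))  # resolve relative links
        print(f"scraped {url}: found {len(parser.links)} links")
    return visited
```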
Indexing
Pages the crawlers stumble across get stored in a process we call indexing. There are countless ways to describe this. Think of it as logging in a way which makes it easy to find again. A new book at a library gets a barcode and a librarian files it with similar content by topic using the Dewey Decimal System.
UPDATE: I’ve Googled it and apparently some libraries are phasing out the Dewey Decimal System in favor of “Dewey-free” systems? I should go to libraries more.
For those who remember phone books, the indexing system in the white pages sorted by last name. The yellow pages sorted by business category. All these are forms of indexing.
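A web index works the same way, just word by word instead of name by name. Here’s a minimal sketch of an “inverted index”, the data structure behind this idea (the pages and text are made up):

```python
from collections import defaultdict

def build_index(pages):
    """Map each word to the set of pages it appears on."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

# Made-up pages, echoing the earlier "lifesavers" example.
pages = {
    "example.com/candy": "lifesavers are a fruity ring-shaped candy",
    "example.com/pool": "lifeguards throw lifesavers to swimmers",
}
index = build_index(pages)
print(index["lifesavers"])  # both pages, found instantly without rescanning
```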
It’s also possible to ask a search engine to not index your content. Pages can have a “noindex” directive (think of it as a little line of code) which asks search engines to skip you. Also, links can have a “nofollow” attribute, which asks the search engine to not follow the link you share on your page.
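Here’s what those directives look like in practice, plus a sketch of how a polite crawler might read them. The page markup below is hypothetical, but the robots meta tag and rel="nofollow" attribute are the real conventions:

```python
from html.parser import HTMLParser

# Hypothetical page markup: a robots meta tag plus a nofollow link.
PAGE = """
<html><head><meta name="robots" content="noindex, nofollow"></head>
<body><a href="/secret" rel="nofollow">please don't follow me</a></body></html>
"""

class RobotsDirectives(HTMLParser):
    """Reads the directives a polite crawler is asked to respect."""
    def __init__(self):
        super().__init__()
        self.index_page = True    # may we store this page?
        self.skip_links = []      # links we were asked not to follow

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name") == "robots":
            if "noindex" in (attrs.get("content") or ""):
                self.index_page = False
        if tag == "a" and "nofollow" in (attrs.get("rel") or ""):
            self.skip_links.append(attrs.get("href"))

parser = RobotsDirectives()
parser.feed(PAGE)
print(parser.index_page, parser.skip_links)  # False ['/secret']
```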
It’s not a flawless process, though. You can ask a search engine to ignore your pages and links, but they’re still publicly available. It’s then up to the search engine to choose to respect your noindex and nofollow directives. Some search engines may scrape and store the data anyway. Some may even share it if they deem the content valuable.
Which is why we’re always saying to be careful about what you put on the internet, folks.
That should provide you with a good understanding of crawler and scraping basics. Now that Google has your data, let’s dig (yeah…) into how they decide if the content is good.