Japanese Keyword Research & Discovery

Keyword Research

Choosing Japanese Keywords for SEO

How do you determine your keywords for the Japanese market?

You start the same way that you do with English websites. You think about your website and jot down words that you think your market would enter into Yahoo or Google when searching for sites such as yours. Then you enter those words into Yahoo! Japan and Google (although Google can deal with Japanese in the U.S. version, it’s better to start from http://google.co.jp/), and see what comes up. These websites may be your competition.

Ignoring websites that are more broadly focused than yours or otherwise less relevant, gather the URLs of your potential competitors into a spreadsheet or a list. At some point it may be worthwhile to enter them into a project in an SEO application like Advanced Web Ranking (for Windows or Macintosh), which automates some of the process and works fine with Japanese keywords.

These competing sites are ranking high for your draft keywords, so presumably they’re doing something right. Examine the words and phrases used on the pages, especially within the HTML title tags and in headings. Also view the source code in your browser and check the meta keywords. They no longer really affect search engine ranking, but superstitious Japanese web developers often hide their keyword crown jewels there, just in case.
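If you have saved the HTML of a competitor's page, the inspection described above can be partly automated. The following is a minimal sketch using only Python's standard library; the class name `KeywordSourceParser`, the helper `extract_keyword_sources`, and the choice to look at h1 through h3 are assumptions made for illustration, not features of any particular SEO tool.

```python
from html.parser import HTMLParser


class KeywordSourceParser(HTMLParser):
    """Collects the title, heading text, and meta keywords from one page."""

    HEADINGS = {"h1", "h2", "h3"}  # assumption: deeper headings are ignored

    def __init__(self):
        super().__init__()
        self.title = ""
        self.headings = []
        self.meta_keywords = []
        self._current = None  # tag whose text is currently being collected
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title" or tag in self.HEADINGS:
            self._current = tag
            self._buffer = []
        elif tag == "meta" and (attrs.get("name") or "").lower() == "keywords":
            content = attrs.get("content") or ""
            # Japanese pages sometimes separate keywords with full-width commas.
            for sep in ("，", "、"):
                content = content.replace(sep, ",")
            self.meta_keywords = [k.strip() for k in content.split(",") if k.strip()]

    def handle_data(self, data):
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            text = "".join(self._buffer).strip()
            if tag == "title":
                self.title = text
            else:
                self.headings.append(text)
            self._current = None


def extract_keyword_sources(html):
    """Parse an HTML string and return the parser holding the collected fields."""
    parser = KeywordSourceParser()
    parser.feed(html)
    parser.close()
    return parser
```

After calling `extract_keyword_sources(html)`, the `title`, `headings`, and `meta_keywords` attributes can be pasted into the same spreadsheet as the competitor URLs.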

Add these words to your list, organize it, and refine it.

At this point with English keyword research you have a number of options, but with Japanese keywords, as far as we know, you’re pretty much stuck with the Google Keyword Tool, which you can access from your AdWords account or, in a slightly simplified form, via a public page. A warning to those who are experienced with the Google Keyword Tool: it seems a bit less useful for Japanese than it is for English, perhaps because Google is not quite as dominant in Japan as it is in the United States, so its Japanese keyword data may be somewhat thinner.

Besides the Google Keyword Tool, the other major overseas keyword tools and services work to a certain extent, and if you do contextual advertising, the Yahoo and Microsoft advertiser centers can help with keyword research.

Both Yahoo! Japan and Google Japan offer suggested search keywords when you start to type in a query word, and they both offer kanji and kana keyword phrases even before you convert your roman letters using the henkan key, so this is another source of popular keywords. Since it’s believed that these suggestions are ordered by popularity and are based on actual user keyword entries, they can be valuable, especially for rarer keywords (since they only show a few choices). And most keywords displayed begin with the word you are typing. Sometimes a keyword will be a compound with the typed portion in the middle or at the end, but that is rarer. So the suggestions are not useful for finding patterns like “[some query word] [fixed geographical word].”

An indigenous online Japanese keyword tool called Ferret+ promises to help you discover Japanese language keywords. We have not tried it. An online tool called Keyword Hunter went belly up in late 2009.

Entering your keywords into the Google Keyword Tool and playing around with the options, you can get suggestions from Google for other keywords, as well as estimates of how frequently the keywords are searched for, along with how much competition there is among advertisers for the keywords. Add what you learn here to your list and refine it.

Finally, you will want to organize and polish your keyword list. For instance, if you were doing keyword research for a ramen shop in Shibuya, you may have come up with “ラーメン 渋谷” (ramen shibuya) as a keyword. How about the reverse order of the words? Would people search for that? You need to decide, but the Google Keyword Tool, with the proper settings, will help. Will the result be different? Try it in Yahoo and Google and see. Would some people add 区 to 渋谷? You have to decide about whether to include those versions. The more formal name might be used more in searches for government offices, but more savvy searchers might also use it to filter out some noise in the results.
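Listing the word-order and spelling variants by hand gets tedious once a phrase has more than two terms. Here is a small sketch that enumerates them so that each one can be checked in the keyword tool and the search engines; the function name `keyword_variants` and its input format are assumptions made for this example.

```python
from itertools import permutations, product


def keyword_variants(term_options, separator=" "):
    """Return every ordering of every combination of alternative spellings.

    term_options is a list of lists; each inner list holds the alternative
    forms of one term, e.g. ["渋谷", "渋谷区"].
    """
    variants = []
    for choice in product(*term_options):
        for ordering in permutations(choice):
            phrase = separator.join(ordering)
            if phrase not in variants:  # keep first-seen order, drop duplicates
                variants.append(phrase)
    return variants


for phrase in keyword_variants([["ラーメン"], ["渋谷", "渋谷区"]]):
    print(phrase)
```

For the ramen shop example this yields four candidates: both word orders, with and without 区. Which of them deserve a place on the final list is still a judgment call.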

Later on, when you’ve gotten into your competitive back link research, you’ll probably start to discover that some of your competition has been doing link building, perhaps legitimately, perhaps using house spam-blogs. You want to pay attention to the anchor text of these back links, particularly the spam, because by using these keywords as anchor text your competitors are telling you what they think are the important keywords. It goes without saying that you should evaluate these keywords for addition to your list.

What do you do with these keywords? You can:

  • Cluster them by topic and use them to write web pages, as explained above.
  • Enter them into an application like Advanced Web Ranking to track your website pages’ rankings over time for each keyword. Programs like this are crawlers that grab results pages from Yahoo and Google and track the positions of your pages and your competition’s pages in the search results. There are also online web applications with similar capabilities.
  • Use them as anchor text in links that you solicit or obtain during link building efforts (see below).

What is a Japanese Word?

On the subject of words in Japanese, here’s a little bit of Information Retrieval 101. The way a simple internet search engine works is as follows:

  1. Automated software connects to a URL and downloads the text/HTML portion only of a web page. This part is the crawler, robot or spider.
  2. The retrieved text is divided into words based on spaces and punctuation.
  3. The words are further processed by stemming (changing them to standard base forms, such as “stemming” to “stem”) and by removing extremely common words, such as “the,” that appear in the “stoplist.”
  4. The position of each word on the page is noted (in the title, in a heading, at the beginning of the page or at the end) as well as information like how many times each word appears on the page.
  5. Based on all the words processed out of a page and their positional and other characteristics, a relevancy algorithm is applied, and a relevancy score is given to each unique word (on-page, internal ranking).
  6. Based on other, external off-page information known about the site (such as incoming links), a factor is applied to increase or reduce the on-page relevancy score.
  7. The software at this point checks the “index” (or “inverted file”), an alphabetized list of words in all pages crawled, to see if each word is listed yet; if not, it lists it, and it adds the URL of the newly crawled web page and the word’s relevancy score for its appearance on that page.
  8. When the word is searched for, the search engine simply finds the word in the index, sorts the URLs by relevancy score, and returns the list of URLs.

This is very simplified. For instance, it doesn’t deal with searches for multi-word phrases.
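To make the steps concrete, here is a toy version of steps 2 through 8 for space-delimited text. It is only a sketch: the stoplist, the title bonus of 2, and the function names are assumptions made for illustration, lowercasing stands in for real stemming, and the off-page factor of step 6 is left out.

```python
import re
from collections import defaultdict

STOPLIST = {"the", "a", "an", "of", "and", "in"}  # assumed, tiny stoplist


def tokenize(text):
    # Steps 2-3: split on spaces and punctuation, lowercase, drop stoplist words.
    return [w for w in re.split(r"[^\w]+", text.lower()) if w and w not in STOPLIST]


def build_index(pages):
    """pages maps url -> (title, body). Returns word -> {url: score}."""
    index = defaultdict(dict)
    for url, (title, body) in pages.items():
        title_words = set(tokenize(title))
        counts = defaultdict(int)
        for word in tokenize(title + " " + body):
            counts[word] += 1
        for word, count in counts.items():
            # Steps 4-5: frequency plus an assumed bonus for appearing in the title.
            score = count + (2 if word in title_words else 0)
            # Step 7: record the URL and score under the word in the inverted file.
            index[word][url] = score
    return index


def search(index, word):
    # Step 8: look the word up and return URLs sorted by descending score.
    postings = index.get(word.lower(), {})
    return sorted(postings, key=lambda url: (-postings[url], url))


pages = {
    "http://a.example/": ("Ramen in Shibuya", "The best ramen shops. Ramen maps."),
    "http://b.example/": ("Tokyo food", "Sushi and ramen in Tokyo."),
}
index = build_index(pages)
print(search(index, "ramen"))
```

Note that everything hinges on `tokenize` being able to find word boundaries, which here means finding spaces and punctuation.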

But you can see the problem with building a Japanese search engine: Sentences in Japanese cannot be broken up into words by looking for spaces, since all the words are run together. How do the developers of search software for Japanese, and for other languages whose writing systems do not put spaces between words, get around this problem?

There are two ways:

  • Lexical analysis: Breaking up sentences into words by using grammatical analysis and word dictionaries.
  • N-gram search algorithms: Ignoring word boundaries and indexing by arbitrary sequences of n characters, where n is usually 2 or 3. Example: “japan” might be indexed three times as “jap,” “apa,” “pan.”
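The n-gram decomposition itself is a one-liner. In this sketch the function name `char_ngrams` is just an illustrative choice:

```python
def char_ngrams(text, n=3):
    """Return every run of n consecutive characters in text, left to right."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


print(char_ngrams("japan", 3))   # ['jap', 'apa', 'pan']
print(char_ngrams("スキー", 2))   # ['スキ', 'キー']
```

Because the function works on characters rather than words, it behaves the same for Japanese text as for roman letters.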

Many search engines in Japan in the mid- to late 1990s used n-gram searches. The problem with n-gram searching is that, because the search engine is not aware of words, you occasionally get quirky results. A famous example, which you would have encountered if you searched the internet in Japan during the winter months of the late 1990s: a search for スキー (ski or skiing) would also return results for ウィスキー (whiskey), because the Japanese word for ski/skiing is contained within the word for whiskey. Here’s another that you would have encountered in the hot summer months: a search for “cola” would bring up any page containing the word “chocolate.”
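The ski/whiskey problem is easy to reproduce with a toy bigram index. This is a sketch under stated assumptions, not a reconstruction of any 1990s engine: the three sample documents and all function names are invented for the example.

```python
from collections import defaultdict


def bigrams(text):
    return [text[i:i + 2] for i in range(len(text) - 1)]


def build_bigram_index(docs):
    """docs maps doc_id -> text. Returns bigram -> set of doc_ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in bigrams(text):
            index[gram].add(doc_id)
    return index


def ngram_search(index, docs, query):
    grams = bigrams(query)
    if not grams:
        return []
    # A document is a candidate if it contains every bigram of the query.
    candidates = set.intersection(*(index.get(g, set()) for g in grams))
    # Confirm that the bigrams actually appear as one contiguous run.
    return sorted(d for d in candidates if query in docs[d])


docs = {
    "ski": "冬はスキーに行こう",      # "Let's go skiing in winter"
    "whiskey": "ウィスキーの水割り",  # "Whiskey and water"
    "tea": "お茶を飲む",              # "Drinking tea"
}
index = build_bigram_index(docs)
print(ngram_search(index, docs, "スキー"))  # ['ski', 'whiskey']
```

The whiskey page matches because nothing in the index records where one word ends and the next begins.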

So what the early search engine developers eventually shifted to was software that did the lexical analysis and broke the text up into words. A few commercial search packages had this, but they were not available under open source licenses. However, Kyoto University had a lexical analysis module that was in the public domain, and that’s what everyone used as a start. So no embarrassing ski/whiskey gaffes anymore, right? Not quite.

Getting a computer to understand a natural language is very hard, and there will always be misinterpretations. But the biggest problem with lexical analysis is that it is dependent on a complete and up-to-date dictionary. If a word is not in the dictionary, the indexer will have problems.

Here’s an example: When Infoseek first hit Japan they kludged together a Japanese-capable indexer using the Kyoto University software. The dictionary was really old and creaky, as might be expected from a piece of software written by busy and low-paid academic researchers. The proof of concept was done, and the researchers didn’t see any benefit in wasting time in the boring task of keeping the dictionary up to date.

It turns out that one of the words that was not in the Infoseek dictionary was インターネット (internet). Wow, that was embarrassing! Right? Actually, it wasn’t so embarrassing, because the words インター (inter) and ネット (net) were in the dictionary, and the indexer used proximity analysis to rank pages higher where the words appeared close to one another. So the pages that were ranked highest were those that had the word internet in them. Below that were pages that had the words inter and net in them, separated by other words. After that were pages that had only inter or only net, because Infoseek was also returning Boolean OR results to pad its search result counts.
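The behavior described above can be imitated with a greedy longest-match segmenter and a crude proximity tier. This is purely an illustration: it is not the Kyoto University software or Infoseek's ranking, and the dictionary, the tier numbering, and the function names are all assumptions.

```python
def segment(text, dictionary):
    """Greedy longest-match segmentation.

    Characters that start no dictionary word become single-character tokens.
    """
    max_len = max(map(len, dictionary))
    tokens = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if piece in dictionary:
                tokens.append(piece)
                i += length
                break
        else:  # no dictionary word starts at position i
            tokens.append(text[i])
            i += 1
    return tokens


def proximity_tier(tokens, first, second):
    """0: first immediately followed by second, 1: both present, 2: one present, 3: neither."""
    if (first, second) in list(zip(tokens, tokens[1:])):
        return 0
    if first in tokens and second in tokens:
        return 1
    if first in tokens or second in tokens:
        return 2
    return 3


# Hypothetical dictionary that lacks インターネット but has its two halves.
dictionary = {"インター", "ネット", "の", "歴史"}

tokens = segment("インターネットの歴史", dictionary)
print(tokens)  # ['インター', 'ネット', 'の', '歴史']
print(proximity_tier(tokens, "インター", "ネット"))  # 0
```

Sorting pages by tier puts the pages that really contain インターネット first, which is why the missing dictionary entry went unnoticed on the first few pages of results.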

But as is the case today, nobody ever looked beyond the first two or three pages of search results. So after an initial period during which the Infoseek search engine was wowing the internet media because it had eleventy-seven zillion results for internet, the Japanese search engine tracking site Search Desk broke the news that it was all fake (although probably not intentional on Infoseek Japan’s part), and Infoseek’s index was no larger than anyone else’s.

Today everyone uses lexical analysis, but they have staff maintaining the dictionaries, and they have software that discovers potential new words automatically. Had such software been available in the mid-1990s, Infoseek would have been alerted that the two words inter and net were appearing next to one another to an extent that was unnatural, and perhaps they should be examined to see if the new unit should be added to the dictionary.
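One simple way such new-word discovery could work is to count how often two dictionary words appear side by side, relative to how often they appear at all. The sketch below assumes documents that have already been segmented into tokens; the thresholds `min_count` and `min_ratio` and the function name are arbitrary choices for illustration, and production systems presumably use more robust statistics.

```python
from collections import Counter


def candidate_new_words(tokenized_docs, min_count=3, min_ratio=0.8):
    """Flag adjacent token pairs that co-occur suspiciously often.

    Returns (joined_word, pair_count) tuples, most frequent first.
    """
    unigram = Counter()
    bigram = Counter()
    for tokens in tokenized_docs:
        unigram.update(tokens)
        bigram.update(zip(tokens, tokens[1:]))

    candidates = []
    for (a, b), n in bigram.items():
        # Share of the rarer token's occurrences that fall inside this pair.
        ratio = n / min(unigram[a], unigram[b])
        if n >= min_count and ratio >= min_ratio:
            candidates.append((a + b, n))
    return sorted(candidates, key=lambda c: (-c[1], c[0]))


docs = [
    ["インター", "ネット", "の", "歴史"],
    ["インター", "ネット", "は", "便利"],
    ["インター", "ネット", "と", "社会"],
    ["歴史", "の", "本"],
]
print(candidate_new_words(docs))  # [('インターネット', 3)]
```

A human editor would then review the flagged candidates before adding them to the dictionary.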