Enhanced OCR for Japanese

  • Status: Closed
  • Prize: £300
  • Entries received: 1
  • Winner: WangJing0612

Contest Brief

I have a printed Japanese-language intonation dictionary that I would like to convert into a machine-readable file format.

One difficulty that prevents simple OCR is the special intonation 'guides' printed above the readings. I've attached a scan of some pages from the dictionary as examples.

Here are the first four entries from the dictionary, including the marks printed above the text that indicate intonation.

 ─
アー (~言う, ~した) →61

─┐
アー (~は行かない, ~だこうだ) →76, 86a 【感】(~驚いた) →66

 ─────
アークトー arc 灯 →15

 ───┐ ─┐
アーケード,アーケード arcade →9

And one more (from page 1) to highlight the use of handakuten (カ゚キ゚ク゚ケ゚コ゚):

アイキョーケ゚ン (間狂言) →15

Each entry in the dictionary has the following pattern:

1- one or more readings (with intonation above) separated by commas
2- a space
3- one or more word data sections

Word data is:

a- the word
b- descriptive data about the word
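The entry pattern above could be split mechanically once a line of text has been recognized. The following is a minimal sketch, not a definitive parser: `split_entry` is a hypothetical helper name, and it assumes the first run of whitespace ends the reading list and that readings are separated by plain commas, as in the samples.

```python
def split_entry(line: str) -> tuple[list[str], str]:
    """Split one entry into its readings and the raw word data.

    Assumption: the first run of whitespace separates the comma-separated
    readings from everything that follows (the word data sections).
    """
    parts = line.strip().split(None, 1)
    if not parts:
        return [], ""
    readings = [r for r in parts[0].split(",") if r]
    word_data = parts[1].strip() if len(parts) > 1 else ""
    return readings, word_data


readings, word_data = split_entry("アーケード,アーケード arcade →9")
print(readings)   # ['アーケード', 'アーケード']
print(word_data)  # arcade →9
```

The word data is returned untouched here, so separating the word from its descriptive data stays a separate step.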

I would need the above turned into a file with three columns:

1- the reading with intonation
2- the intonation pattern (H: High, L: Low)
3- the word(s) only (no descriptive data)
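As a rough illustration of how the second column might be derived, the sketch below maps a row of marks to H/L letters. It rests on a strong assumption: that the marks have already been detected and aligned so that there is exactly one mark cell per printed kana, with the horizontal bar meaning high and anything else (a blank or the ┐ hook) meaning low. The plain-text samples in this brief only line up that neatly for the two アー entries, so on real scans the alignment would have to come from the positions of the marks on the page. `marks_to_pattern` and `unit_count` are hypothetical names.

```python
HIGH_MARK = "\u2500"  # "─", the horizontal bar used in the samples


def marks_to_pattern(marks: str, unit_count: int) -> str:
    """Turn one aligned row of marks into an H/L string.

    Assumption: cell i of `marks` sits above kana i of the reading.
    Missing trailing cells are treated as low.
    """
    cells = marks.ljust(unit_count)[:unit_count]
    return "".join("H" if cell == HIGH_MARK else "L" for cell in cells)


print(marks_to_pattern(" ─", 2))  # LH
print(marks_to_pattern("─┐", 2))  # HL
```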

For the five entries above, the result would be:

アー {LH} (~言う, ~した)
アー {HL} (~は行かない, ~だこうだ)
アークトー {LHHHH} arc 灯
アーケード,アーケード {HHHLL,HLLLL} arcade
アイキョーケ゚ン {HHHHLLL} (間狂言)
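Rows in this shape can be sanity-checked automatically. The sketch below is one possible check, based on what the five sample rows suggest rather than on a stated rule: each reading has one pattern, and each pattern has one letter per printed kana, with small kana such as ョ counted separately and the handakuten mark not counted. `check_row`, `kana_units` and the list of attached marks are assumptions made for illustration.

```python
import re

# Voicing marks that attach to the preceding kana (combining and spacing
# forms). Not counting them as separate kana is an assumption.
ATTACHED_MARKS = {"\u3099", "\u309A", "\u309B", "\u309C"}

# reading(s), a space, {patterns}, a space, then the word(s)
ROW_RE = re.compile(r"^(?P<readings>\S+) \{(?P<patterns>[HL,]+)\} (?P<words>.+)$")


def kana_units(reading: str) -> int:
    """Count printed kana, ignoring attached voicing marks."""
    return sum(1 for ch in reading if ch not in ATTACHED_MARKS)


def check_row(row: str) -> bool:
    """Return True if every reading has a pattern of matching length."""
    match = ROW_RE.match(row)
    if match is None:
        return False
    readings = match.group("readings").split(",")
    patterns = match.group("patterns").split(",")
    if len(readings) != len(patterns):
        return False
    return all(kana_units(r) == len(p) for r, p in zip(readings, patterns))


print(check_row("アーケード,アーケード {HHHLL,HLLLL} arcade"))  # True
print(check_row("アー {LHH} (~言う, ~した)"))                  # False
```

A check like this would not prove a pattern is correct, but it could flag rows where the OCR lost or duplicated a kana or a mark.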


I've tried to keep the above simple, but I'm sure I am missing some edge cases.

Please don't hesitate to ask if anything is not clear or you have questions.

I don't know much about OCR, so I am hoping you will be able to help me figure out the following:

1- Is what I am asking for possible?
2- How long will it take (or how much will it cost)?

I have two main questions for this project:

Firstly, I understand the basics of OCR, but I don't see how OCR can get the intonation information from the scans. Can you describe how you can achieve this a bit more?

Secondly, I'm worried about extracting data for the third column.

From the samples you can see that there is a lot of extra information in the dictionary's third 'column'. For me it's garbage since I just want the word(s).

Will it be possible to extract the words from all that 'garbage', and if so how do you propose to do it?
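One simple heuristic for the third column, offered only as a starting point: keep the text before the first → reference marker and discard the rest. On the five sample entries in this brief that yields the expected third column, including dropping the later 【感】(~驚いた) section of the second entry, but it is an assumption that the rest of the dictionary follows the same layout. `extract_words` is a hypothetical helper name.

```python
ARROW = "\u2192"  # "→", which precedes the reference numbers in the samples


def extract_words(word_data: str) -> str:
    """Keep only the text before the first arrow.

    Assumption: the word(s) always come first and the first arrow starts
    the descriptive data. Any later word data sections are dropped.
    """
    head, _, _ = word_data.partition(ARROW)
    return head.strip()


print(extract_words("arc 灯 →15"))  # arc 灯
print(extract_words("(~は行かない, ~だこうだ) →76, 86a 【感】(~驚いた) →66"))
# (~は行かない, ~だこうだ)
```

Entries that break this assumption, for example word data without any arrow, would pass through unchanged and would need manual review.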
