Discussion:
[Wikitech-l] Book scans from Tuebingen Digital Library to Wikimedia Commons
Shiju Alex
2018-12-02 09:52:13 UTC
Permalink
Hello,

Recently Tuebingen University
<https://uni-tuebingen.de/en/university.html> (with
support from the German Research Foundation) ran a project titled the *Gundert
Legacy project* to digitize close to 137,000 pages from *850 public domain
books*.

All these public domain books are in the South Indian languages *Malayalam,
Kannada, Tamil, Tulu, and Telugu*. Of these, 293 books are in Malayalam, 187
in Kannada, 25 in Tamil, and 4 in Telugu and Tulu.

There was also a separate sub-project, run as part of this
project, to convert 136 Malayalam titles to Malayalam Unicode. The number
of pages converted to Unicode is close to *25,700*. The
Unicode conversion project was run only for Malayalam; for the other
languages it is just the scanning of the books.

The project is now complete and its results are available in
the Hermann Gundert Portal https://www.gundert-portal.de/?language=en which
was released on Nov 20. A news report is available here:
<https://timesofindia.indiatimes.com/city/kochi/german-scholars-malayalam-mission-all-set-to-get-a-digital-makeover/articleshow/66633108.cms>

To view the books in each language you can navigate through the various
links in the portal. For example, Malayalam books are available here:
https://www.gundert-portal.de/?page=malayalam

Now we need to upload these scans to Wikimedia Commons and the Unicode text to
Malayalam Wikisource (25,700 Unicode-converted pages).

The first priority is the scans that have been converted to Unicode. Is it
possible to write a script to migrate the scans from the Tuebingen Digital
Library to Wikimedia Commons? (I can share the exact details of the books
converted to Unicode if needed.)

All the digitized files are heavy; their sizes range from 100 MB to 1.5 GB
depending on the number of pages in the books, so managing this manually is
going to be a big challenge.
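To make the ask concrete, here is a minimal sketch of what the metadata-preparation half of such a migration script could look like. The book id, title, and license tag below are hypothetical placeholders (the real identifiers and per-book license status would come from the portal's metadata), and only the landing-page URL pattern from this thread is used; the actual file transfer could then be done with a tool such as pywikibot, with chunked uploads advisable given the file sizes.

```python
# Sketch: prepare Commons upload tasks for the Gundert Legacy scans.
# The book ids/titles and the license tag are HYPOTHETICAL placeholders;
# the real values would come from the Tuebingen portal's metadata.
import re

# Characters that are not allowed in Commons file titles.
FORBIDDEN = re.compile(r'[#<>\[\]|{}:/\\]')

def commons_filename(title, book_id):
    """Turn a book title into a legal Commons file name."""
    clean = FORBIDDEN.sub('-', title)
    clean = re.sub(r'\s+', ' ', clean).strip()
    return "%s (%s).pdf" % (clean, book_id)

def description_wikitext(title, lang, source_url):
    """Build an {{Information}} description page for the upload."""
    lines = [
        "=={{int:filedesc}}==",
        "{{Information",
        "|description={{" + lang + "|" + title + "}}",
        "|source=" + source_url,
        "|author=see source",
        "}}",
        "",
        "=={{int:license-header}}==",
        "{{PD-old}}",  # assumed license tag; verify the correct PD tag per book
    ]
    return "\n".join(lines) + "\n"

def build_tasks(books):
    """books: iterable of (book_id, title, language_code) tuples."""
    tasks = []
    for book_id, title, lang in books:
        # Landing-page URL pattern taken from the example in this thread;
        # the direct download URL for each scan would still need to be found.
        url = "http://idb.ub.uni-tuebingen.de/opendigi/" + book_id
        tasks.append({
            "target": commons_filename(title, book_id),
            "source_url": url,
            "text": description_wikitext(title, lang, url),
        })
    return tasks
```

Each task in the resulting list would then be handed to the uploader (e.g. pywikibot) as the target file name, the source to fetch, and the file-description wikitext.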

Can someone help with this?

Shiju Alex
Andre Klapper
2018-12-02 10:37:03 UTC
Permalink
Hi,
Post by Shiju Alex
Recently Tuebingen University
<https://uni-tuebingen.de/en/university.html> (with
the support from German Research Foundation) ran a project titled *Gundert
Legacy project* to digitize close to 137,000 pages from *850 public domain
books*.
All these public domain books are in the South Indian languages *Malayalam,
Kannada, Tamil, Tulu, and Telugu*. In this 293 books are in Malayalam, 187
in Kannada, 25 in Tamil, 4 in Telugu and Tulu.
Also there was a separate sub-project which was run as part of this
project to convert 136 titles in Malayalam to Malayalam Unicode. The number
of pages that were converted to Unicode is close to *25,700* pages .The
Unicode conversion project was ran only for Malayalam. For the other
languages it is just the scanning of books
What does "converted to Unicode" mean? Converted from what exactly? Do
you maybe mean "converted via OCR (Optical character recognition) from
images in file formats (JPG, PNG, images in a PDF) which don't allow
marking text, to a file format which allows marking text in those files"?
Post by Shiju Alex
The project is complete now and the results of the project is available in
the Hermman Gundert Portal https://www.gundert-portal.de/?language=en which
was released on Nov 20. A news report is available here.
<https://timesofindia.indiatimes.com/city/kochi/german-scholars-malayalam-mission-all-set-to-get-a-digital-makeover/articleshow/66633108.cms>
To view the books in each language you can navigate through the various
https://www.gundert-portal.de/?page=malayalam
Now we need to upload these scans to Wikimedia Commons and Unicode text to
Malayalam Wikisource (25,700 Unicode converted pages)
The first priority is for the scans that are converted to Unicode. Is it
possible to write a script to migrate the scans from Tuebingen Digital
library to Wikimedia Commons? (I can share the exact details of books
converted to Unicode if needed)
What would you want the script to do exactly? Pull the files from the
Tuebingen Digital Library and then mass-upload these files to Commons?
OCR (identifying letters in pure images and converting those letters to
text which can be marked and copied)? Something else?

To convert image files available on Wikimedia Commons to recognized
text, see https://tools.wmflabs.org/ws-google-ocr/ for example. There
is also https://phabricator.wikimedia.org/T120788 for more info/tools.
Post by Shiju Alex
All the digitized files are heavy and the size ranges from 100 MB to 1.5 GB
depending on the number of pages in the books. So manually managing this is
going to be a big challenge.
Can some one help with this?
Cheers,
andre
--
Andre Klapper | Bugwrangler / Developer Advocate
https://blogs.gnome.org/aklapper/
Shiju Alex
2018-12-02 10:54:23 UTC
Permalink
Hi

Here are the answers

Post by Andre Klapper
What does "converted to Unicode" mean? Converted from what exactly? Do
you maybe mean "converted via OCR (Optical character recognition) from
images in file formats (JPG, PNG, images in a PDF) which don't allow
marking text to a file format which allows marking text in those files?
There is no good OCR for languages like Malayalam, so each scanned image is
manually typed and proofread. For example, see the 7th page of this book
<http://idb.ub.uni-tuebingen.de/opendigi/CiXIV125b#p=7&tab=transcript>. You
can see the scanned image on the right and the transcribed text for that page
on the left in the *Transcript* tab. This was done for 136 books, and the
total number of pages in these books is close to 25,700.

Post by Andre Klapper
What would you want the script to do exactly? Pull the files from the
Tuebingen Digital Library and then mass-upload these files to Commons?
Yes, that is exactly what is required. We will handle the Unicode migration
separately.


Shiju Alex
_______________________________________________
Wikitech-l mailing list
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Ryan Kaldari
2018-12-03 04:31:11 UTC
Permalink
Post by Shiju Alex
There is no good OCR for languages like Malayalam.
Google Drive will do OCR on Malayalam, Kannada, and Telugu. Google Vision
API (which is usable from a Wikisource gadget
<https://wikisource.org/wiki/Wikisource:Google_OCR> built by Sam Wilson)
will do OCR on Tamil. I can't vouch for these being "good", but they do
exist.
Shiju Alex
2018-12-03 05:21:21 UTC
Permalink
Post by Ryan Kaldari
Google Drive will do OCR on Malayalam, Kannada, and Telugu. Google Vision
API (which is usable from a Wikisource gadget
<https://wikisource.org/wiki/Wikisource:Google_OCR> built by Sam Wilson)
will do OCR on Tamil. I can't vouch for these being "good", but they do
exist.
The request in this post is not about creating OCR for any language
script, but about migrating certain public domain book scans from the
Tuebingen digital library to Wikimedia Commons.

There is also another task of migrating the *already proofread Unicode text* to
Wikisource. But to take up the Unicode migration, the scans first need to be
on Commons.

I am making this request only because of the huge number of pages that we
need to handle. If it were just a few hundred pages, volunteers would have
done it manually.


Shiju
bawolff
2018-12-03 08:25:22 UTC
Permalink
Have you seen https://commons.wikimedia.org/wiki/Commons:Batch_uploading
? I think the folks at commons are more likely to be able to give you
the help you need than wikitech-l would be.

--
Brian
Shiju Alex
2018-12-03 16:06:09 UTC
Permalink
Post by bawolff
Have you seen https://commons.wikimedia.org/wiki/Commons:Batch_uploading
? I think the folks at commons are more likely to be able to give you
the help you need than wikitech-l would be.


Thank you. I was not aware of this option. Let me try it.

Shiju Alex
Shrinivasan T
2018-12-05 09:58:25 UTC
Permalink
We used this script
https://github.com/tshrinivasan/tools-for-wiki/tree/master/pdf-upload-commons

to upload some 2,000 public domain Tamil books to Commons.

Explore batch uploading to Commons.
If it is not apt for you, I can help customize this script.

Regards,
T. Shrinivasan
