Crawl Indian language websites

CommonCrawl builds and maintains an open repository of web crawl data that can be accessed and analyzed by anyone.

Here are 3 easy steps to download the data in any language, e.g. Marathi or Gujarati:

1) Clone the repo: git clone https://github.com/qburst/common-crawl-malayalam.git

2) Change the language code from mal to mar (or whichever code you need) in the line:
AND content_languages LIKE 'mal%'"

The relevant codes are: hin (Hindi), mal (Malayalam), mar (Marathi), guj (Gujarati)
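As a rough illustration of what the edit in step 2 amounts to, here is a minimal Python sketch that builds the filter clause for a given language code. The helper name language_filter and the LANGUAGE_CODES table are hypothetical and not part of the repo; only the codes and the content_languages clause come from the text above.

```python
# Language codes listed above (hypothetical lookup table, not from the repo).
LANGUAGE_CODES = {
    "hin": "Hindi",
    "mal": "Malayalam",
    "mar": "Marathi",
    "guj": "Gujarati",
}


def language_filter(code: str) -> str:
    """Return the filter clause for one of the language codes above."""
    if code not in LANGUAGE_CODES:
        raise ValueError(f"unknown language code: {code!r}")
    # Same shape as the line edited in step 2, with the code swapped in.
    return f"AND content_languages LIKE '{code}%'"


if __name__ == "__main__":
    print(language_filter("mar"))
```

The trailing % is kept so that pages whose content_languages value starts with the chosen code still match.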

Run this shell script after replacing XXX with your AWS access key ID and secret access key.

./extract_malayalam_warcs.sh XXX XXX 2021-10 s3://nlp-malayalam

3) Change the Unicode range from mal: range(3328, 3456) to the range for your desired script, e.g. Devanagari (used for Marathi and Hindi): range(2304, 2432), or Gujarati: range(2688, 2816).
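To make the role of these ranges concrete, here is a minimal sketch of a code-point filter in Python. It is not the repo's filter script: the function names, the SCRIPT_RANGES table and the 0.5 threshold are assumptions for illustration. Only the three ranges are taken from the text above.

```python
# Code-point ranges quoted above; the dictionary itself is hypothetical.
SCRIPT_RANGES = {
    "malayalam": range(3328, 3456),   # U+0D00 to U+0D7F
    "devanagari": range(2304, 2432),  # U+0900 to U+097F
    "gujarati": range(2688, 2816),    # U+0A80 to U+0AFF
}


def script_ratio(text: str, script: str) -> float:
    """Fraction of non-whitespace characters whose code point is in the script's range."""
    code_range = SCRIPT_RANGES[script]
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    in_script = sum(1 for ch in chars if ord(ch) in code_range)
    return in_script / len(chars)


def is_mostly_script(text: str, script: str, threshold: float = 0.5) -> bool:
    """Keep a line only if at least `threshold` of its characters are in the script."""
    return script_ratio(text, script) >= threshold


if __name__ == "__main__":
    sample = "\u0aa8\u0aae\u0ab8\u0acd\u0aa4\u0ac7"  # a Gujarati word
    print(is_mostly_script(sample, "gujarati"))
    print(is_mostly_script("hello world", "gujarati"))
```

A ratio threshold rather than an all-or-nothing check is one possible design choice: web text in Indian languages often mixes in Latin digits, punctuation and brand names, so a strict check would discard many usable lines.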

Run this shell script after replacing XXX with your AWS access key ID and secret access key.

./filter_malayalam.sh XXX XXX s3://nlp-malayalam/2021-10/warcs s3://nlp-malayalam/2021-10/output2

Filtered and unfiltered data will be available in the output2 sub-folder of the S3 bucket.

_____

Update:

The good French guys from Sorbonne University and Inria have extracted sentences from Common Crawl for all languages and made them available at

https://oscar-corpus.com/

I have no words to thank them. The paper is available here:

http://corpora.ids-mannheim.de/CMLC7-final/CMLC-7_2019-Oritz_et_al.pdf

Without this data, it would have been very difficult to extract reliable data for languages like Marathi and Gujarati.
